Extract business intelligence from screen recordings

Leverage multimodal AI to extract and organize data displayed on computer screens

Despite the tremendous advances made by web scraping tools (e.g., Gumloop, Apify, Scrapy, Oxylabs, Octoparse), I find that there are many situations where we need to automate data collection from unpublished sources.

Such sources may include:

  • Chats or contact details from third-party messaging tools, such as WhatsApp, Telegram, LinkedIn, or Slack.

  • Visuals from webinars or unrecorded video conferences with customers, suppliers, or competitors.

  • Web pages that actively detect and ban web scrapers, such as Twitter/X and Amazon.

With the emergence of multimodal AI tools, it is now possible to ask GenAI services like Google’s Gemini API to analyze videos and extract information according to specific business requirements.

Video recordings are especially useful in situations where the information is spread across multiple pages, slides or interfaces that can’t be easily accessed by a standard web scraper.

Of course, it’s possible to use screenshots instead of screen recordings, but taking successive screenshots requires more manual actions than simply recording the screen.

A straightforward workflow is as follows:

  1. Equip your team members with screen recording tools and automatically upload the recordings to a central repository.

  2. Associate each recording with an appropriate data extraction prompt, depending on its source and purpose, or on specific employee instructions.

  3. Use a GenAI API to process each video and save the desired information in a specified format.

Step 1 - Screen recording tool and central repository

Various tools and services exist to achieve this.

Both Windows and Mac operating systems have built-in screen recording tools. For Mac users, Cleanshot is a well-designed app. More advanced services include OBS and Loom.

Once the recording is created, you can also use any number of tools to upload it automatically to a central repository.

I use Cleanshot for screen recordings, Dropbox to save the recordings, Zapier to detect new recordings, and Airtable as the central repository to save the recordings as well as their data extraction prompts.
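
If you prefer a scripted alternative to Zapier for the upload step, the sketch below watches a local recordings folder and pushes new video files to Dropbox. The folder path and the DROPBOX_TOKEN environment variable are illustrative choices, not part of my actual setup.

# Minimal sketch: watch a recordings folder and upload new videos to Dropbox.
# Assumes: pip install watchdog dropbox, and a Dropbox access token stored in
# the DROPBOX_TOKEN environment variable (placeholder names).
import os
import time

import dropbox
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

RECORDINGS_DIR = os.path.expanduser("~/Recordings")  # placeholder path

class RecordingHandler(FileSystemEventHandler):
    def __init__(self):
        self.dbx = dropbox.Dropbox(os.environ["DROPBOX_TOKEN"])

    def on_created(self, event):
        # Only react to finished video files, not folders or temp artifacts.
        if event.is_directory or not event.src_path.endswith((".mp4", ".mov")):
            return
        time.sleep(2)  # naive wait for the recorder to finish writing the file
        with open(event.src_path, "rb") as f:
            name = os.path.basename(event.src_path)
            self.dbx.files_upload(f.read(), f"/recordings/{name}")

observer = Observer()
observer.schedule(RecordingHandler(), RECORDINGS_DIR)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()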

Step 2 - Associate each recording with a relevant prompt

Depending on its source and purpose, each recording must be associated with a prompt that describes what information must be extracted and how it should be organized.

You should structure the prompt as follows:

  • Context: Who is the AI assistant and what is their job?

  • Inputs: What does the input contain, and how can it be accessed?

  • Output: What exactly should be the format of the output, with an example?

  • Final remarks: Use this section to tweak the behavior of the AI assistant when you notice undesired outputs.
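
Put together, a skeleton of such a prompt looks like this (the bracketed parts are placeholders to fill in):

You are an AI assistant tasked with [job]. (Context)

You are given [input], which contains [description of the data]. (Inputs)

Format your response as [format]. For example: [example output]. (Output)

[Corrections for undesired behaviors you have observed.] (Final remarks)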

To demonstrate the approach, let’s use a recording of multiple pages from the Polymarket website. Polymarket is a well-known prediction market. Its website can admittedly be scraped easily, but it’s a more appropriate choice for a demo than, say, the recording of a webinar that contains genuinely private information.

We use the following prompt:

You are an AI assistant tasked with extracting information from a video recording, which is provided to you. 

The video recording contains web pages from a website that allows users to bet on future events. Some pages show an overview of multiple bets available in a specific category, and the current odds of each outcome. Other pages show more detail about a specific bet, its possible outcomes and their odds.
When the bet is just a yes/no question, the odds are displayed in the adjoining text "x% chance" where x is the percentage chance of Yes. You are interested in collecting the list of all available bets, the category that they belong to, the list of possible outcomes and their current odds.

Make sure that your response includes every single bet shown in the video, whether it is shown on an overview page or on a detail page. If you can only see partial data regarding a bet's outcomes and current odds, just include the data that you can. If a bet appears on multiple video frames, you should only include it once.

You must format your response as a list of bets in JSON format. Each bet should be a JSON object in the following format:
<example_object>
{
    "name": "Example Bet",
    "category": "Politics",
    "outcomes": [
        {
            "name": "Outcome 1",
            "probability": 0.5
        },
        {
            "name": "Outcome 2",
            "probability": 0.3
        },
        {
            "name": "Outcome 3",
            "probability": 0.2
        }
        # Etc.
    ]
}
</example_object>

Ensure that your output is a list of objects in the format of example_object, with nothing else.

Step 3 - Process each video

In the demo, we use Google Cloud’s Gemini 1.5 Pro model (gemini-1.5-pro) to analyze video recordings.
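
As an illustration, here is a minimal sketch of that call using the google-generativeai Python package; the file name and the way the prompt is loaded are placeholders, not the exact code from the notebook.

# Upload a recording through the Gemini File API and run the extraction prompt.
# Assumes: pip install google-generativeai, and an API key in GOOGLE_API_KEY.
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

extraction_prompt = "..."  # the full prompt from Step 2

# Upload the video, then poll until the File API has finished processing it.
video = genai.upload_file(path="polymarket_recording.mp4")  # placeholder file
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([video, extraction_prompt])
print(response.text)  # the raw JSON list of bets (cleaned up below)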

While running the demo, I noticed that the extraction prompt from Step 2 doesn’t always produce clean data, so a clean-up step must be added, which can also be performed by an LLM.

Here is an example of a data clean-up prompt:

You are an AI assistant tasked with cleaning up the output of a tool that extracts information from a video recording.

You are given an input between the <input> tags below, which is a JSON list of bets. Each bet is an object containing the name of the bet (name), the category of the bet (category), and a list of possible outcomes (outcomes).
Each possible outcome is an object containing the name of the outcome (name) and the current odds of the outcome as a number between 0 and 1 (probability).

Please perform the following clean-up tasks:
* If a single bet is included multiple times, merge the data in the way you see fit.
* If a bet does not have any outcomes, remove it from the list.

Ensure that your output is a list of objects in the same JSON format as the input, with nothing else.

Here is the input:
<input>
...
</input>
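
In code, this clean-up pass can be another call to the same model. A minimal sketch, where CLEANUP_PROMPT stands for the prompt above with a {bets_json} placeholder between the <input> tags:

# Run the clean-up prompt on the raw extraction output and parse the result.
import json

import google.generativeai as genai

CLEANUP_PROMPT = "..."  # the clean-up prompt above, with {bets_json} in <input>

def clean_up(raw_output: str) -> list:
    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content(CLEANUP_PROMPT.format(bets_json=raw_output))
    # Models sometimes wrap JSON in a Markdown fence; strip it defensively.
    text = response.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)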

In our demo, the final output looks like this:

[
    {
        "name": "Next Prime Minister of Canada after the election",
        "category": "Canadian Election",
        "outcomes": [
            {
                "name": "Mark Carney",
                "probability": 0.78
            },
            {
                "name": "Pierre Poilievre",
                "probability": 0.23
            }
        ]
    },
    {
        "name": "Romania Presidential Election Winner",
        "category": "Politics",
        "outcomes": [
            {
                "name": "Crin Antonescu",
                "probability": 0.33
            },
            {
                "name": "Nicusor Dan",
                "probability": 0.3
            },
            {
                "name": "George Simion",
                "probability": 0.26
            },
            {
                "name": "Victor Ponta",
                "probability": 0.1
            },
            {
                "name": "Elena Lasconi",
                "probability": 0.02
            },
            {
                "name": "Calin Georgescu",
                "probability": 0.01
            }
        ]
    },
...
]
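
Once parsed, saving the result in whatever format you need is straightforward. For example, here is a minimal sketch that flattens the bets into a CSV file, one row per outcome (the file name is a placeholder):

# Flatten the cleaned-up list of bets into a CSV file, one row per outcome.
import csv

def save_bets(bets: list, path: str = "bets.csv") -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["bet", "category", "outcome", "probability"])
        for bet in bets:
            for outcome in bet["outcomes"]:
                writer.writerow([bet["name"], bet["category"],
                                 outcome["name"], outcome["probability"]])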

You can find the full demo and code in this notebook.

Takeaway messages

Multimodal AI tools like Gemini 1.5 Pro now make it possible to reliably extract structured data from screen recordings. This opens up new possibilities for businesses to automate data collection from sources that were previously difficult or impossible to scrape.

Using proper workflows and tools, teams can create valuable datasets for business intelligence and decision-making.

Note: Before capturing data from screen recordings, check whether the terms and conditions of the information source explicitly prohibit this practice.
