Extract data behind authentication
Overview
This recipe demonstrates how to use Airtop to extract data from a website using a prompt. With Airtop’s live view, your users can log into any of their accounts inside a browser session, giving your agents access to content that requires authentication. Airtop profiles persist a user’s login state across sessions, so users do not need to log in again.
The instructions below will walk through creating a script that connects to Airtop, provides a live view for a user to log into their Glassdoor account if necessary, and retrieves a list of relevant job postings from the Glassdoor website. Similar logic can be applied to any website that requires authentication.
The full source code is available on GitHub for TypeScript and Python.
Prerequisites
To get started, ensure you have:
- An Airtop API key. You can get one for free.
and the following packages installed:
Getting Started
- Clone the repository
  Start by cloning the source code from GitHub:
- Install dependencies
  Run the following command to install the necessary dependencies, including the Airtop SDK:
- Configure your environment
  You will need to provide your Airtop API key in a .env file. First, copy the provided example .env file:
  Now edit the .env file to add your Airtop API key:
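As a rough illustration of how a script might pick up this key, the sketch below reads it from the process environment and falls back to the .env file, using only the Python standard library. The variable name AIRTOP_API_KEY and the helper names parse_env_text and load_api_key are assumptions made for this example; the repository may load its configuration differently.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_path: str = ".env", name: str = "AIRTOP_API_KEY") -> str:
    """Return the key from the environment, falling back to the .env file.

    The default variable name is an assumption for this sketch.
    """
    value = os.environ.get(name)
    if not value:
        path = Path(env_path)
        if path.is_file():
            value = parse_env_text(path.read_text(encoding="utf-8")).get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to {env_path}")
    return value
```

Failing early with a clear error when the key is missing tends to be easier to debug than an authentication failure later in the run.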
Script Walkthrough
The script (index.ts for TypeScript or extract_data_login.py for Python) performs the following steps:
- Initialize the Airtop Client
  First, we initialize the AirtopClient using your provided API key. This client will be used to create browser sessions and interact with the page content.
- Create a Browser Session
  Creating a browser session allows us to connect to and control a cloud-based browser. The API accepts an optional profileName parameter, which can be used to reuse a user’s previously saved login state. If no profileName is given, the user will be prompted to log in at the provided live view URL (see Step 4). If the given profile name does not exist, it will be created and saved on session termination for future use.
- Connect to the Browser
  The script opens a new page and navigates to the initial URL, in this case Glassdoor’s user profile page.
- Handle Log-in Status
  If the user is not logged in, the script waits for them to log in at the provided live view URL. If the user is already logged in, the script proceeds directly to the target URL for data extraction.
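The waiting behaviour in this step can be sketched as a simple polling loop. This is a minimal sketch, not the recipe's actual implementation: wait_for_login is a hypothetical helper, and the is_logged_in callback stands in for whatever check the script performs (for example, asking the page whether the user is signed in). No Airtop SDK calls are shown.

```python
import time
from typing import Callable


def wait_for_login(
    is_logged_in: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_logged_in() until it returns True or the timeout elapses.

    Returns True if the user logged in before the deadline, False otherwise.
    The sleep function is injectable so the loop can be exercised without
    real delays.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_logged_in():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

A script using this pattern would typically print the live view URL first, then call wait_for_login and stop with a clear message if it returns False.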
- Navigate to the Target URL
  Once the user is logged in, the script navigates to the target URL, which in this case is a Glassdoor search page for software engineering jobs in San Francisco.
- Query the AI to Extract Data
  We construct a prompt that asks the AI to extract data about job postings related to AI companies, and we define a JSON schema for the output. An effective prompt begins by giving context about the webpage and what the model is viewing, and the description fields of the JSON schema carry further guidance for the model’s output.
  Using Airtop’s prompt feature, the script requests the matching job postings, formatted according to the provided JSON schema. The AI agent can follow pagination links to gather more results on sites with multiple pages or from a feed with infinite scrolling.
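To make the prompt and schema advice concrete, here is one possible shape for them. The field names, the prompt wording, and the parse_jobs helper are illustrative assumptions rather than the recipe's actual source, and the call that sends the prompt to Airtop is deliberately left out.

```python
import json

# Hypothetical output schema. The "description" fields guide the model.
JOB_LISTINGS_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "jobs": {
            "type": "array",
            "description": "Job postings from companies that work on AI.",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string", "description": "Company name as shown on the posting."},
                    "title": {"type": "string", "description": "Job title as shown on the posting."},
                    "location": {"type": "string", "description": "Job location as listed."},
                },
                "required": ["company", "title"],
                "additionalProperties": False,
            },
        },
        "error": {"type": "string", "description": "Set only if the postings could not be read."},
    },
    "additionalProperties": False,
}

# The prompt opens with context about the page before stating the task.
PROMPT = (
    "You are viewing a Glassdoor search results page for software engineering "
    "jobs in San Francisco. Extract the job postings that are related to AI "
    "companies. If the postings cannot be read, fill in the error field instead."
)


def parse_jobs(model_response: str) -> list[dict]:
    """Parse the model's JSON reply and check the fields the schema requires."""
    data = json.loads(model_response)
    if data.get("error"):
        raise ValueError(f"extraction failed: {data['error']}")
    jobs = data.get("jobs", [])
    for job in jobs:
        if not job.get("company") or not job.get("title"):
            raise ValueError(f"posting is missing required fields: {job}")
    return jobs
```

Including an explicit error field gives the model a structured way to report that it could not see the expected content, instead of returning an empty or invented list.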
- Clean Up
  Finally, the script closes the browser and terminates the session.
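Because a new profile is saved on session termination, it helps to make sure clean-up runs even when an earlier step fails. One way to sketch that in Python is a small context manager. The managed_session helper and its create and terminate callbacks are hypothetical placeholders for the script's real session calls, which are not shown here.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

S = TypeVar("S")


@contextmanager
def managed_session(
    create: Callable[[], S],
    terminate: Callable[[S], None],
) -> Iterator[S]:
    """Create a session and always terminate it, even if the body raises."""
    session = create()
    try:
        yield session
    finally:
        terminate(session)
```

With this pattern, the walkthrough steps would run inside a `with managed_session(...) as session:` block, and termination happens exactly once on both success and failure.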
Running the Script
To run the script, execute the following command in your terminal:
Summary
This recipe showcases how Airtop can be used to automate tasks that require authentication and data extraction from dynamic content. By combining Airtop’s live view feature for manual login with automated data extraction via natural language prompts, you can interact with and extract data from complex websites that require user credentials.