Scrape the SEC database for recent S-1 filings
Overview
This recipe demonstrates how to use Airtop to scrape the SEC database for recent S-1 filings. It extracts company names and CIK numbers from the filings and sends the results via Telegram. This project showcases the power of combining web scraping, AI-assisted data extraction, and automated notifications.
Prerequisites
To get started, ensure you have:
- Node.js installed on your system.
- An Airtop API key. You can get one for free.
Getting Started
- Clone the repository
Start by opening your terminal and cloning the source code from GitHub:
- Install dependencies
Run the following command to install the necessary dependencies, including TelegramBot and the Airtop SDK:
- Get your Telegram keys
  - Open the Telegram application and search for @BotFather, or click this link: @BotFather.
  - Click Start.
  - Click Menu -> /newbot, or type /newbot and hit Send.
  - Follow the instructions until you receive a message like this:
  - Get your user ID by talking to UserInfoBot; click this link: @UserInfoBot.
  - Send /start to it, and you should get a message like this:
- Configure your environment
You will need to provide your Airtop API key, Telegram bot token, and user ID in a .env file. First, copy the provided example.env file:
Now edit the .env file to add your keys:
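As a rough illustration of how a script might check these values before doing any work, the sketch below validates a set of environment variables and reports every missing key at once. The variable names AIRTOP_API_KEY, TELEGRAM_BOT_TOKEN, and TELEGRAM_USER_ID and the readConfig helper are assumptions made for this example; use whatever names appear in the provided example.env file.

```typescript
// Hypothetical variable names; check example.env for the real ones.
const REQUIRED_KEYS = ["AIRTOP_API_KEY", "TELEGRAM_BOT_TOKEN", "TELEGRAM_USER_ID"] as const;

type RequiredKey = (typeof REQUIRED_KEYS)[number];
type Config = Record<RequiredKey, string>;

// Returns the trimmed required values, or throws an error listing every missing key.
function readConfig(env: Record<string, string | undefined>): Config {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as Config;
  for (const key of REQUIRED_KEYS) {
    config[key] = env[key]!.trim();
  }
  return config;
}

// In a Node.js script this would typically be called as readConfig(process.env)
// once the .env file has been loaded (for example with the dotenv package).
```

Failing early with the full list of missing keys avoids discovering a missing Telegram token only after the scrape has already run.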
Script Walkthrough
The script in index.ts performs the following steps:
- Initialize the Airtop Client
First, we initialize the AirtopClient using your provided API key. This client will be used to create browser sessions and interact with the page content.
- Initialize the Telegram Bot
We initialize the Telegram bot using the token and user ID obtained in the "Get your Telegram keys" step. This bot will be used to send notifications to the user.
- Create a function to send a message to the user
This function receives a message formatted as HTML and sends it to the user via Telegram.
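The recipe sends messages through the TelegramBot library. As a library-free sketch of the same idea, the example below builds a request for the Telegram Bot API's sendMessage method with parse_mode set to HTML. The helper name buildSendMessageRequest and the placeholder token and user ID are assumptions for illustration, not code taken from index.ts.

```typescript
interface TelegramRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds (but does not send) a sendMessage request for the Telegram Bot API.
function buildSendMessageRequest(token: string, chatId: string, html: string): TelegramRequest {
  return {
    url: `https://api.telegram.org/bot${token}/sendMessage`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // parse_mode "HTML" asks Telegram to render tags such as <b> and <i>.
      body: JSON.stringify({ chat_id: chatId, text: html, parse_mode: "HTML" }),
    },
  };
}

// Placeholder values; substitute the token and user ID from your .env file.
const exampleRequest = buildSendMessageRequest("<bot-token>", "<user-id>", "<b>New S-1 filings</b>");
// To send it from Node.js 18 or later: await fetch(exampleRequest.url, exampleRequest.init);
```

Keeping request construction separate from the network call makes the message format easy to check without contacting Telegram.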
- Create a Browser Session
Creating a browser session will allow us to connect to and control a cloud-based browser.
- Navigate to the SEC website
Next, the script navigates to the SEC’s website and waits for the page to load.
- Prompt the AI to Extract Data
The page lists many results, so the prompt must state exactly which data we need. Because these results are returned as plain text, the prompt also asks for the extracted information in a structured JSON format, which makes the data easy to parse and process in later steps.
To achieve this, we’ll create a detailed prompt that outlines:
- The specific information to extract (company names and CIK numbers)
- The criteria for selecting entries (only S-1 filings, not S-1/A)
- How to handle special characters in company names
We also provide a JSON schema that outlines the desired output format. This will help the AI model understand the exact data we need and format the output accordingly.
By providing clear and precise instructions, we ensure that the AI model can accurately extract and format the data we need from the page in a repeatable way.
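A minimal sketch of what such a prompt and schema could look like is shown below. The prompt wording, the choice to return special characters unchanged, and the field names companies, companyName, and cik are all assumptions for illustration; the actual prompt and schema in index.ts may differ.

```typescript
// Illustrative prompt covering the three points listed above (wording is an assumption).
const PROMPT = [
  "You are looking at a list of recent SEC filings.",
  "Extract the company name and CIK number for each filing.",
  "Only include filings whose form type is exactly S-1; skip S-1/A amendments.",
  // Assumed handling of special characters: keep them as plain, unescaped text.
  "Return company names as plain text, keeping characters such as & and quotes unchanged.",
].join("\n");

// Illustrative JSON schema describing the desired output (field names are assumptions).
const OUTPUT_SCHEMA = {
  $schema: "http://json-schema.org/draft-07/schema#",
  type: "object",
  properties: {
    companies: {
      type: "array",
      items: {
        type: "object",
        properties: {
          companyName: { type: "string" },
          cik: { type: "string" },
        },
        required: ["companyName", "cik"],
        additionalProperties: false,
      },
    },
  },
  required: ["companies"],
  additionalProperties: false,
} as const;
```

Typing the CIK as a string rather than a number preserves any leading zeros in the identifier.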
Utilizing Airtop’s AI prompt feature, the script requests data about SEC filings, formatted as per the provided JSON schema. Because the SEC database may contain multiple pages of filings, the AI agent can follow pagination links to gather more results when the followPaginationLinks parameter is passed to the pageQuery method.
- Process the extracted content
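The sketch below shows one way this step could work: it parses the JSON text returned by the AI, checks its shape, and formats the companies as an HTML message suitable for Telegram. The response shape (a companies array with companyName and cik fields), the helper names, and the sample data are assumptions made for this example, not the recipe's actual code.

```typescript
interface Company {
  companyName: string;
  cik: string;
}

// Escapes the three characters that Telegram's HTML parse mode treats as markup.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Parses the model's JSON text and keeps only well-formed entries.
function parseCompanies(rawJson: string): Company[] {
  const parsed: unknown = JSON.parse(rawJson);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Expected a JSON object");
  }
  const list = (parsed as { companies?: unknown }).companies;
  if (!Array.isArray(list)) {
    throw new Error("Expected a 'companies' array");
  }
  return list.filter(
    (item): item is Company =>
      typeof item === "object" && item !== null &&
      typeof item.companyName === "string" && typeof item.cik === "string",
  );
}

// Formats the companies as one HTML message, one line per filing.
function formatMessage(companies: Company[]): string {
  if (companies.length === 0) {
    return "No new S-1 filings found.";
  }
  const lines = companies.map(
    (c) => `- <b>${escapeHtml(c.companyName)}</b> (CIK: ${escapeHtml(c.cik)})`,
  );
  return ["<b>Recent S-1 filings</b>", ...lines].join("\n");
}

// Made-up sample input; the company and CIK are not real data.
const sampleJson = '{"companies":[{"companyName":"Smith & Sons, Inc.","cik":"0001234567"}]}';
const sampleMessage = formatMessage(parseCompanies(sampleJson));
// sampleMessage is:
// <b>Recent S-1 filings</b>
// - <b>Smith &amp; Sons, Inc.</b> (CIK: 0001234567)
```

Escaping company names matters here because characters such as & or < would otherwise be interpreted as markup when the message is sent with HTML formatting.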
- Clean Up
Finally, the script closes the browser and terminates the session.
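One common way to make sure cleanup always happens, even when an earlier step throws, is a try/finally block. The sketch below shows that pattern with a hypothetical Closable interface standing in for the browser session; it does not use the Airtop SDK's actual method names, and withCleanup is a helper invented for this example.

```typescript
// Hypothetical stand-in for a cloud browser session; not the Airtop SDK's API.
interface Closable {
  close(): Promise<void>;
}

// Runs `work` with the resource and always closes the resource afterwards,
// whether `work` resolves or throws.
async function withCleanup<T extends Closable, R>(
  resource: T,
  work: (resource: T) => Promise<R>,
): Promise<R> {
  try {
    return await work(resource);
  } finally {
    await resource.close();
  }
}
```

Wrapping the scraping steps this way means a failed extraction or a Telegram error does not leave a cloud browser session running.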
Running the Script
To run the script, execute the following command in your terminal:
Summary
This recipe demonstrates how Airtop can be utilized to automate tasks involving plain-text data extraction. By leveraging Airtop’s AI prompt feature and pagination handling capabilities, you can efficiently retrieve and process information about S-1 filings from the SEC database. The script showcases how to interact with complex data sources, parse JSON responses, and format the extracted information for easy readability and further analysis.