Scrape the SEC database for recent S-1 filings

Overview

This recipe demonstrates how to use Airtop to scrape the SEC database for recent S-1 filings. It extracts company names and CIK numbers from the filings and sends the results via Telegram. This project showcases the power of combining web scraping, AI-assisted data extraction, and automated notifications.

Prerequisites

To get started, ensure you have:

  • Node.js installed on your system.
  • An Airtop API key. You can get one for free.

Getting Started

  1. Clone the repository

    Start by opening your terminal and cloning the source code from GitHub:

    $ git clone https://github.com/airtop-ai/recipe-sec-bot
    $ cd recipe-sec-bot
  2. Install dependencies

    Run the following command to install the necessary dependencies, including TelegramBot and the Airtop SDK:

    $ npm install
  3. Get your Telegram keys

    • Open the Telegram application and search for @BotFather.

    • Click Start.

    • Click Menu -> /newbot, or type /newbot and hit Send.

    • Follow the instructions until you receive a message like this:

      Done! Congratulations on your new bot. You will find it at t.me/new_bot.
      You can now add a description.....
      Use this token to access the HTTP API:
      63xxxxxx71:AAFoxxxxn0hwA-2TVSxxxNf4c
      Keep your token secure and store it safely, it can be used by anyone to control your bot.
      For a description of the Bot API, see this page: https://core.telegram.org/bots/api

    Next, get your user ID from @UserInfoBot.

    • Send /start to it and you should get a message like this:

      @your_user_name
      Id: 9999999999
      First: Marcos
      Lang: en
  4. Configure your environment

    You will need to provide your Airtop API key, Telegram bot token, and Telegram user ID in a .env file. First, copy the provided example .env file:

    $ cp .env.example .env

    Now edit the .env file to add your keys:

    AIRTOP_API_KEY=<YOUR_API_KEY>
    TELEGRAM_BOT_TOKEN=<YOUR_TELEGRAM_BOT_TOKEN>
    TELEGRAM_USER_ID=<YOUR_TELEGRAM_USER_ID>
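
The walkthrough below reads these three values with non-null assertions (`!`), so a missing key only surfaces later as a failed API call. A minimal sketch of a stricter startup check, assuming the values arrive as a plain key/value record; the `loadConfig` helper and the `Config` and `Env` types are hypothetical names, not part of the recipe:

```typescript
// Hypothetical helper and types: illustrative names, not taken from the recipe.
interface Config {
  airtopApiKey: string;
  telegramBotToken: string;
  telegramUserId: string;
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): Config {
  const required = ['AIRTOP_API_KEY', 'TELEGRAM_BOT_TOKEN', 'TELEGRAM_USER_ID'];
  // Collect every missing or empty variable so the error lists them all at once.
  const missing = required.filter((name) => (env[name] ?? '').trim() === '');
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    airtopApiKey: env.AIRTOP_API_KEY!,
    telegramBotToken: env.TELEGRAM_BOT_TOKEN!,
    telegramUserId: env.TELEGRAM_USER_ID!,
  };
}
```

In a Node.js script this could be called as `loadConfig(process.env)` once the .env file has been loaded.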

Script Walkthrough

The script in index.ts performs the following steps:

  1. Initialize the Airtop Client

    First, we initialize the AirtopClient using your provided API key. This client will be used to create browser sessions and interact with the page content.

    const airtopClient = new AirtopClient({
      apiKey: AIRTOP_API_KEY,
    });
  2. Initialize the Telegram Bot

    We initialize the Telegram bot using the token obtained in step 3 of Getting Started. The bot sends notifications to the user ID obtained in the same step.

    const bot = new TelegramBot(TELEGRAM_BOT_TOKEN!, { polling: false });
  3. Create a function to send a message to the user

    This function receives a message formatted as HTML and sends it to the user via Telegram.

    async function sendTelegramMessage(message: string) {
      try {
        await bot.sendMessage(TELEGRAM_USER_ID!, message, {
          parse_mode: 'HTML',
          disable_web_page_preview: true,
        });
      } catch (error) {
        console.error('Error sending telegram message:', error);
        throw error;
      }
    }
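
Telegram caps the text of a single message at 4096 characters, and a long list of filings can approach that limit. A rough sketch of splitting a message on blank lines before sending; `splitMessage` is a hypothetical helper, not part of the recipe, and it assumes no single entry exceeds the limit on its own:

```typescript
// Hypothetical helper, not part of the recipe.
// Telegram's Bot API caps message text at 4096 characters.
const TELEGRAM_MAX_LENGTH = 4096;

function splitMessage(message: string, maxLength: number = TELEGRAM_MAX_LENGTH): string[] {
  const chunks: string[] = [];
  let current = '';
  // Entries are separated by blank lines, so splitting there never cuts an HTML tag in half.
  for (const entry of message.split('\n\n')) {
    const candidate = current === '' ? entry : `${current}\n\n${entry}`;
    if (candidate.length > maxLength && current !== '') {
      // Adding this entry would overflow: close the current chunk and start a new one.
      chunks.push(current);
      current = entry;
    } else {
      current = candidate;
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}
```

Each chunk could then be passed to sendTelegramMessage in turn. Measuring the raw HTML length is conservative, since markup characters count toward it here.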
  4. Create a Browser Session

    Creating a browser session will allow us to connect to and control a cloud-based browser.

    const createSessionResponse = await airtopClient.sessions.create();

    sessionId = createSessionResponse.data.id;
  5. Navigate to the SEC website

    Next, the script navigates to the SEC’s website and waits for the page to load.

    const windowResponse = await airtopClient.windows.create(sessionId, {
      url: 'https://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=S-1&owner=include&count=80&action=getcurrent',
    });

    const windowInfo = await airtopClient.windows.getWindowInfo(sessionId, windowResponse.data.windowId);
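
The search settings live in the URL's query string. Reading the parameter names at face value, `type` selects the form type and `count` the number of rows requested. A small sketch of rebuilding the same URL from its parts with the standard `URLSearchParams` class; `buildEdgarUrl` is a hypothetical helper, and the meaning of each parameter is an assumption based on its name:

```typescript
// Hypothetical helper: rebuilds the URL used above from its query parameters.
function buildEdgarUrl(formType: string, count: number): string {
  const params = new URLSearchParams({
    company: '',
    CIK: '',
    type: formType,
    owner: 'include',
    count: String(count),
    action: 'getcurrent',
  });
  return `https://www.sec.gov/cgi-bin/browse-edgar?${params.toString()}`;
}
```

Calling `buildEdgarUrl('S-1', 80)` reproduces the URL passed to `windows.create` in this step.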
  6. Prompt the AI to Extract Data

    The page lists many results, so the prompt must state exactly which data we need. Since the results are returned as plain text, the prompt also asks for the extracted information in a structured JSON format, which makes it easy to parse and process in later steps.

    To achieve this, we’ll create a detailed prompt that outlines:

    • The specific information to extract (company names and CIK numbers)
    • The criteria for selecting entries (only S-1 filings, not S-1/A)
    • How to handle special characters in company names

    We also provide a JSON schema that outlines the desired output format. This will help the AI model understand the exact data we need and format the output accordingly.

    By providing clear and precise instructions, we ensure that the AI model can accurately extract and format the data we need from the page in a repeatable way.

    const PROMPT = `You are on the SEC website looking at a search for the latest filings.
    Please extract the company names and their corresponding CIK numbers (which follow the company name in parentheses) from the search results table.
    Get only the ones where the form type is S-1 and not S-1/A.
    Company names might contain characters like backslashes, which should always be escaped.

    Examples:

    - "S-1 | Some Company Inc (0001234567)" should produce a result '{ "companyName": "Some Company Inc", "cik": "0001234567", "formType": "S-1" }'.
    - "S-1/A | Another Company Inc (0009876543)" should not be included because the form type is S-1/A.
    - "S-1 | Foo Inc \\D\\E (0002468024)" should produce a result with the backslashes in the company name escaped: '{ "companyName": "Foo Inc \\\\D\\\\E", "cik": "0002468024", "formType": "S-1" }'.`;

    const SCHEMA = {
      $schema: 'http://json-schema.org/draft-07/schema#',
      type: 'object',
      properties: {
        results: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              companyName: {
                type: 'string',
              },
              cik: {
                type: 'string',
              },
              formType: {
                type: 'string',
              },
            },
            required: ['companyName', 'cik', 'formType'],
            additionalProperties: false,
          },
        },
        failure: {
          type: 'string',
          description: 'If you cannot fulfill the request, use this field to report the problem.',
        },
      },
      additionalProperties: false,
    };

    const extractedContent = await airtopClient.windows.pageQuery(sessionId, windowInfo.data.windowId, {
      prompt: PROMPT,
      configuration: {
        outputSchema: SCHEMA,
      },
    });

    Using Airtop’s AI prompt feature, the script requests data about SEC filings, formatted according to the provided JSON schema. Because the SEC database may return multiple pages of filings, the AI agent can also follow pagination links to gather more results when the followPaginationLinks parameter is passed to the pageQuery method.
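
The schema asks the model for either a `results` array or a `failure` string, but the response still arrives as a JSON string that has to be parsed. A minimal sketch, assuming that shape, of parsing it defensively and dropping any S-1/A rows that slip through despite the prompt; `parseFilings`, `isFiling` and the `Filing` type are hypothetical names, not part of the recipe:

```typescript
// Hypothetical types and helpers mirroring the JSON schema above.
interface Filing {
  companyName: string;
  cik: string;
  formType: string;
}

function isFiling(value: unknown): value is Filing {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.companyName === 'string' &&
    typeof v.cik === 'string' &&
    typeof v.formType === 'string'
  );
}

function parseFilings(raw: string): Filing[] {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Model response is not a JSON object');
  }
  const response = parsed as { results?: unknown; failure?: unknown };
  if (typeof response.failure === 'string') {
    // The model reported that it could not fulfill the request.
    throw new Error(response.failure);
  }
  const results: unknown[] = Array.isArray(response.results) ? response.results : [];
  // Keep well-formed rows only, and enforce the "S-1, not S-1/A" rule in code as well.
  return results.filter(isFiling).filter((filing) => filing.formType === 'S-1');
}
```

Filtering in code as well as in the prompt is a design choice: the prompt narrows what the model returns, and the filter guards against the occasional row that does not follow it.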

  7. Process the extracted content

    modelResponse = JSON.parse(extractedContent.data.modelResponse);

    if (modelResponse.failure) {
      console.log(`Airtop AI reported failure: ${modelResponse.failure}`);
      throw new Error(modelResponse.failure);
    }

    // Format the results as a list instead of a table
    const formattedResults = modelResponse.results
      .map(
        (item: { companyName: string; cik: string }, index: number) =>
          `${index + 1}. <b>${item.companyName}</b>\n CIK: <code>${item.cik}</code>`,
      )
      .join('\n\n');

    const message = `<b>SEC EDGAR S-1 Results</b>\n\n${formattedResults}`;

    await sendTelegramMessage(message);
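
Because the message is sent with `parse_mode: 'HTML'`, Telegram expects any `<`, `>` and `&` that are not part of a tag or entity to be escaped, and company names can contain `&`. A minimal sketch of escaping the extracted values before interpolating them; `escapeHtml` and `formatFiling` are hypothetical helpers, not part of the recipe:

```typescript
// Hypothetical helpers, not part of the recipe.
function escapeHtml(text: string): string {
  // Replace & first so the entities produced below are not escaped twice.
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Same layout as the list entries above, with both values escaped.
function formatFiling(item: { companyName: string; cik: string }, index: number): string {
  return `${index + 1}. <b>${escapeHtml(item.companyName)}</b>\n CIK: <code>${escapeHtml(item.cik)}</code>`;
}
```

`formatFiling` could be passed directly to `.map(...)` in place of the inline arrow function.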
  8. Clean Up

    Finally, the script closes the browser and terminates the session.

    try {
      await browser.close();
    } catch (err) {
      // Ignore errors while closing the browser; the session is terminated below.
    }
    if (sessionId) {
      await airtopClient.sessions.terminate(sessionId);
    }
    console.log('Session deleted');
    process.exit(0);
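
If an earlier step throws, cleanup code is only reached when it sits in a `finally` block. A sketch of that pattern with a hypothetical `withSession` wrapper; the `create` and `terminate` callbacks stand in for `airtopClient.sessions.create` and `airtopClient.sessions.terminate`, and how index.ts actually structures its error handling is not shown in this walkthrough:

```typescript
// Hypothetical wrapper: guarantees terminate() runs even when the work fails.
async function withSession<T>(
  create: () => Promise<string>,
  terminate: (sessionId: string) => Promise<void>,
  work: (sessionId: string) => Promise<T>,
): Promise<T> {
  const sessionId = await create();
  try {
    return await work(sessionId);
  } finally {
    // Runs on both success and failure, so the cloud browser is not left running.
    await terminate(sessionId);
  }
}
```

With the client from step 1, `create` could call `airtopClient.sessions.create()` and return `data.id`, while `work` would hold steps 5 through 7.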

Running the Script

To run the script, execute the following command in your terminal:

$ npm run start

Summary

This recipe demonstrates how Airtop can be utilized to automate tasks involving plain-text data extraction. By leveraging Airtop’s AI prompt feature and pagination handling capabilities, you can efficiently retrieve and process information about S-1 filings from the SEC database. The script showcases how to interact with complex data sources, parse JSON responses, and format the extracted information for easy readability and further analysis.
