Smart Scraping
Enhance your scraping with AI
You can use the scrapeContent method to intelligently scrape the content of a page. This method extracts the content of the page and formats it as markdown, which is easy to read and ingest into your application. It will extract headers, formatting, tables, and more and present them in a structured manner.
Additionally, this method will correctly scrape content from Office 365 and Google Workspace documents. These applications are notoriously difficult to scrape because they use virtualized DOMs, which require more sophisticated methods. Airtop correctly parses not only text content but also table content from Microsoft Excel and Google Sheets, which it presents in CSV format.
Usage example
First, you’ll need to create a session.
Next, you’ll need to create a window and load a URL.
Finally, you can request a scrape of the page.
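The three steps above can be sketched as follows. This is a minimal illustration, not the SDK's actual surface: the ScrapeApi interface, its createSession and createWindow methods, and the argument order of scrapeContent are hypothetical stand-ins chosen so the example is self-contained. Only the scrapeContent name and the content.data.modelResponse.scrapedContent path come from this page, so check the SDK reference for the real signatures.

```typescript
// Hypothetical response and client shapes, for illustration only.
// Only `scrapeContent` and the `data.modelResponse.scrapedContent` path
// are taken from this page; everything else is an assumption.
interface ScrapedContent {
  text: string;
  contentType: string;
}

interface ScrapeResponse {
  data: { modelResponse: { scrapedContent: ScrapedContent } };
}

interface ScrapeApi {
  createSession(): Promise<{ sessionId: string }>;
  createWindow(sessionId: string, url: string): Promise<{ windowId: string }>;
  scrapeContent(sessionId: string, windowId: string): Promise<ScrapeResponse>;
}

// Runs the three steps in order and returns the scraped payload.
async function scrapePage(api: ScrapeApi, url: string): Promise<ScrapedContent> {
  // 1. Create a session.
  const { sessionId } = await api.createSession();
  // 2. Create a window in that session and load the URL.
  const { windowId } = await api.createWindow(sessionId, url);
  // 3. Request a scrape of the loaded page.
  const content = await api.scrapeContent(sessionId, windowId);
  return content.data.modelResponse.scrapedContent;
}
```

Passing the client in as a parameter keeps the sketch independent of any particular SDK version and makes it easy to exercise with a stub.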
If you inspect content.data.modelResponse.scrapedContent.text, you’ll see the result of the scrape. Additionally, content.data.modelResponse.scrapedContent.contentType will be the MIME type of the content, which you can use to determine how to parse the content. It is typically text/plain, but could also be text/csv if the page is a Google Sheets document.
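One way to act on the MIME type is to branch on contentType before using text. The sketch below is only an illustration: ScrapedContent mirrors the two fields named above, while ParsedScrape, parseCsv, and parseScrapedContent are hypothetical helpers that are not part of the SDK. The CSV parser is deliberately small; a production application would more likely use an established CSV library.

```typescript
// Mirrors the two fields described above; not the SDK's own type.
interface ScrapedContent {
  text: string;
  contentType: string;
}

// Hypothetical result type for this example.
type ParsedScrape =
  | { kind: "csv"; rows: string[][] }
  | { kind: "text"; text: string };

// Minimal CSV parser: handles quoted fields, escaped quotes (""),
// and commas or newlines inside quoted fields.
function parseCsv(input: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (inQuotes) {
      if (ch === '"' && input.charAt(i + 1) === '"') {
        field += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch;
    }
  }
  // Flush the final row when the input does not end with a newline.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Chooses a parsing strategy from the reported MIME type.
function parseScrapedContent(scraped: ScrapedContent): ParsedScrape {
  // Ignore MIME parameters such as "; charset=utf-8".
  const mime = scraped.contentType.split(";")[0].trim().toLowerCase();
  if (mime === "text/csv") {
    return { kind: "csv", rows: parseCsv(scraped.text) };
  }
  return { kind: "text", text: scraped.text };
}
```

Returning a tagged union lets calling code handle spreadsheet rows and ordinary page text separately without re-checking the MIME string.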
Result Comparison
Here is a quick comparison of the first ~20 lines of a raw text scrape and a smart scrape for this Wikipedia page.
Raw Scrape
Smart Scrape
Here’s another example of a smart scrape, this time for a Google Doc.
Raw Scrape
Smart Scrape
The entire document is too large to fit in the snippet, but you get the point. In fact, you won’t find any of the document’s content in the raw scrape, because that content is never present in the DOM.