Smart Scraping

Enhance your scraping with AI

You can use the scrapeContent method to intelligently scrape the content of a page. This method extracts the content of the page and formats it as markdown, which is easy to read and ingest into your application. It will extract headers, formatting, tables, and more and present them in a structured manner.

Additionally, this method will correctly scrape content from Office365 and Google Workspace documents. These applications are notoriously difficult to scrape because they use virtualized DOMs, which require more sophisticated methods. Airtop will correctly parse not only text content but also table content from Microsoft Excel and Google Sheets, which it presents in CSV format.
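
As an illustration of how markdown output can be ingested, the sketch below splits a markdown string into sections keyed by heading. The helper splitMarkdownSections and the MarkdownSection type are hypothetical names introduced here and are not part of the Airtop SDK. The sample input is made up, and the splitting is deliberately simple: it does not account for # characters inside fenced code blocks.

```typescript
// Hypothetical helper for illustration only; not part of the Airtop SDK.
interface MarkdownSection {
  heading: string; // heading text without the leading '#', empty for a preamble
  level: number; // 1-6 for headings, 0 for text before the first heading
  body: string; // text between this heading and the next one
}

function splitMarkdownSections(markdown: string): MarkdownSection[] {
  const sections: MarkdownSection[] = [];
  let heading = "";
  let level = 0;
  let body: string[] = [];

  // Store the section collected so far. An empty preamble is skipped,
  // but a heading with an empty body is kept.
  const flush = (): void => {
    const text = body.join("\n").trim();
    if (level > 0 || text.length > 0) {
      sections.push({ heading, level, body: text });
    }
  };

  for (const line of markdown.split(/\r?\n/)) {
    // ATX-style headings: one to six '#' followed by whitespace.
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const [, hashes = "", title = ""] = match;
      level = hashes.length;
      heading = title.trim();
      body = [];
    } else {
      body.push(line);
    }
  }
  flush();
  return sections;
}

// Made-up sample input, not real scrape output.
const sample = "intro\n# Title\ntext\n## Sub\n| a | b |";
for (const section of splitMarkdownSections(sample)) {
  console.log(section.level, section.heading, JSON.stringify(section.body));
}
```

Splitting by heading is only one possible way to chunk the result; depending on your application, passing the markdown through unchanged may be enough.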

Usage example

First, you’ll need to create a session. The examples below assume that client is an already-initialized Airtop SDK client.

const session = await client.sessions.create();

Next, you’ll need to create a window and load a URL.

const window = await client.windows.create(session.data.id, { url: "https://en.wikipedia.org/wiki/Margrit_Waltz" });

Finally, you can request a scrape of the page.

const content = await client.windows.scrapeContent(session.data.id, window.data.windowId);

If you inspect content.data.modelResponse.scrapedContent.text, you’ll see the result of the scrape. Additionally, content.data.modelResponse.scrapedContent.contentType holds the MIME type of the content, which you can use to decide how to parse it. It is typically text/plain, but could also be text/csv if the page is a Google Sheets document.
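
A minimal sketch of branching on the content type might look like the following. The ScrapedContent interface models only the two fields named above (the real SDK type may carry more), and parseScrapedContent, parseCsvNaive, and ParsedScrape are hypothetical names introduced for illustration. The CSV handling is intentionally naive and assumes no quoted fields; a dedicated CSV parser is likely the better choice for real spreadsheets.

```typescript
// Assumed shape: only the two fields described above; not the SDK's own type.
interface ScrapedContent {
  text: string;
  contentType: string;
}

// Hypothetical result type for this sketch.
type ParsedScrape =
  | { kind: "table"; rows: string[][] }
  | { kind: "text"; text: string };

// Naive CSV split: does NOT handle quoted fields containing commas or newlines.
function parseCsvNaive(csv: string): string[][] {
  return csv
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map((line) => line.split(","));
}

function parseScrapedContent(scraped: ScrapedContent): ParsedScrape {
  // Drop any MIME parameters such as "; charset=utf-8" before comparing.
  const mime = (scraped.contentType.split(";")[0] ?? "").trim().toLowerCase();
  if (mime === "text/csv") {
    return { kind: "table", rows: parseCsvNaive(scraped.text) };
  }
  // Treat everything else (typically text/plain) as plain text.
  return { kind: "text", text: scraped.text };
}

// Made-up sample values, not real scrape output.
const parsed = parseScrapedContent({
  text: "name,age\nAda,36\n",
  contentType: "text/csv",
});
if (parsed.kind === "table") {
  console.log(`rows: ${parsed.rows.length}`);
} else {
  console.log(parsed.text);
}
```

With the scrape from the previous step, the call would be parseScrapedContent(content.data.modelResponse.scrapedContent), assuming that object exposes text and contentType as described.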

Result Comparison

Here is a quick comparison of the first ~20 lines of a raw text scrape versus a smart scrape of this Wikipedia page.

Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
Contribute
HelpLearn to editCommunity portalRecent changesUpload file
Search
Search
Donate
Appearance
Appearance
move to sidebar
hide
TextSmallStandardLargeThis page always uses small font sizeWidthStandardWideThe content is as wide as possible for your browser window.Color (beta)AutomaticLightDarkThis page is always in light mode.

Here’s another example, this time for a Google Doc.

<div class="docs-butterbar-container"><div class="docs-butterbar-wrap"><div class="jfk-butterBar jfk-butterBar-shown jfk-butterBar-warning">JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.</div></div><br /></div>(function() {function setIframeSrcdoc(iframe) {var coreJsIframePolicy; var contentsString = "\x3clink rel\x3d\x22stylesheet\x22 href\x3d\x22https:\/\/fonts.googleapis.com\/css?family\x3dGoogle+Sans:bold,normal\x26lang\x3den\x22 nonce\x3d\x22L42FF1JdHJPKWYWFGRXMNA\x22\x3e\x3clink rel\x3d\x22stylesheet\x22 href\x3d\x22https:\/\/fonts.googleapis.com\/css?family\x3dRoboto:normal\x26lang\x3den\x22 nonce\x3d\x22L42FF1JdHJPKWYWFGRXMNA\x22\x3e\x3cstyle nonce\x3d\x22L42FF1JdHJPKWYWFGRXMNA\x22\x3e:root \x7b--brand-color: #1a73e8\x7dli:nth-child(1) h4:before \x7bcontent: \x22Step 1\x22;\x7dli:nth-child(2) h4:before \x7bcontent: \x22Step 2\x22;\x7d\x3c\/style\x3e\x3cstyle\x3ebutton.text-button.brand-color\x7bbackground-color:var(--brand-color);color:#fff;border:none\x7dbody\x7bfont-family:Roboto;font-size:14px;font-weight:400;margin:0;padding:24px;overflow-x:hidden\x7dh3,h4,li,ol,p\x7bmargin:0;padding:0\x7dh3\x7bfont-family:Google Sans;font-size:22px;font-weight:400;margin:0;padding:0\x7dli\x7bmargin-top:16px\x7dp\x7bmargin-top:5px\x7dli h4:before\x7bfont-weight:700\x7dp#chrome-url-box\x7bborder:1px solid #bdc1c6;-moz-box-sizing:border-box;box-sizing:border-box;border-radius:4px;padding:6px;height:36px\x7dbutton#chrome-settings-url-copy\x7bcolor:var(--brand-color);text-decoration:none;display:inline-block;border:none;padding:0;float:right;background:none;height:24px\x7dp#chrome-url-box code\x7bfont-family:inherit;display:block;float:left;height:24px;vertical-align:middle\x7dp#buttons-row\x7btext-align:right\x7dp#buttons-row button\x7bmargin-left:16px\x7dbutton.text-button\x7bfont-family:Google Sans;font-size:14px;text-decoration:none;display:inline-block;border-radius:4px;border:1px solid 
transparent;height:36px;padding-left:24px;padding-right:24px;cursor:pointer\x7dbutton.text-button:disabled\x7bborder-width:1px;border-color:#bdc1c6;border-style:solid;background-color:#fff;color:gray\x7d.sr-only\x7bposition:absolute;width:1px;height:1px;margin:-1px;clip:rect(0,0,0,0)\x7dbutton#chrome-settings-url-copy svg path\x7bfill:var(--brand-color)\x7dol\x7blist-style:none;padding-left:0\x7d\n\/*# sourceMappingURL\x3dcorejserror_ltr.css.map *\/\x3c\/style\x3e\x3cbody role\x3d\x22dialog\x22 aria-labelledby\x3d\x22heading\x22 aria-describedby\x3d\x22description\x22\x3e\x3ch3 id\x3d\x22heading\x22\x3eLoading issue\x3c\/h3\x3e\x3cp id\x3d\x22description\x22\x3eTroubleshoot this issue by clearing application resources\x3c\/p\x3e\x3col\x3e\x3cli\x3e\x3ch4 class\x3d\x22step-header\x22\x3e\x3c\/h4\x3e\x3cp\x3eFollow \x3ca href\x3d\x22https:\/\/support.google.com\/accounts\/answer\/32050\x22 target\x3d\x22_blank\x22\x3ethese instructions\x3c\/a\x3e to clear your cache and cookies\x3c\/p\x3e\x3c\/li\x3e\x3cli\x3e\x3ch4 class\x3d\x22step-header\x22\x3e\x3c\/h4\x3e\x3cp\x3eThen, reload this page\x3c\/p\x3e\x3c\/li\x3e\x3c\/ol\x3e\x3cp id\x3d\x22buttons-row\x22\x3e\x3cbutton class\x3d\x22text-button\x22 id\x3d\x22send-feedback\x22 disabled\x3eSend feedback\x3c\/button\x3e\x3cbutton class\x3d\x22text-button brand-color\x22 id\x3d\x22reload-now\x22\x3eReload now\x3c\/button\x3e\x3c\/p\x3e\x3cscript nonce\x3d\x22uBfdd5ElM-eQ4cCUZ50mBw\x22\x3efunction _F_toggles_initialize(a)\x7b(typeof globalThis!\x3d\x3d\x22undefined\x22?globalThis:typeof self!\x3d\x3d\x22undefined\x22?self:this)._F_toggles\x3da||\x5b\x5d\x7d_F_toggles_initialize(\x5b\x5d);\nvar d\x3ddocument.getElementById(\x22chrome-settings-url-copy\x22);function e()\x7bvar a\x3ddocument.getElementById(\x22chrome-settings-url\x22),b\x3dnew Range;b.setStart(a,0);b.setEnd(a,1);a\x3dwindow.getSelection();a.empty();a.addRange(b);document.execCommand(\x22copy\x22);setTimeout(function()\x7bvar 
c\x3ddocument.createElement(\x22p\x22);c.setAttribute(\x22role\x22,\x22alert\x22);c.style.position\x3d\x22absolute\x22;c.style.top\x3d\x22-10000px\x22;c.appendChild(document.createTextNode(\x22Link copied\x22));document.body.appendChild(c)\x7d,500);d\x26\x26d.focus()\x7dd\x26\x26(d.onclick\x3de);\ndocument.getElementById(\x22reload-now\x22).onclick\x3dfunction()\x7bwindow.parent.location.reload()\x7d;document.addEventListener(\x22keydown\x22,function(a)\x7bvar b\x3ddocument.querySelectorAll(\x22a\x5bhref\x5d:not(\x5bdisabled\x5d), button:not(\x5bdisabled\x5d)\x22),c\x3db\x5b0\x5d;b\x3db\x5bb.length-1\x5d;if(a.key\x3d\x3d\x3d\x22Tab\x22||a.keyCode\x3d\x3d\x3d9)a.shiftKey\x26\x26document.activeElement\x3d\x3dc?(b.focus(),a.preventDefault()):a.shiftKey||document.activeElement!\x3db||(c.focus(),a.preventDefault())\x7d);\nwindow.onload\x3dfunction()\x7bvar a\x3dwindow.parent.document.getElementById(\x22core-js-error-dialog\x22);if(a)\x7bvar b\x3ddocument.body.scrollHeight;a.style.height\x3db+\x22px\x22;a.style\x5b\x22margin-top\x22\x5d\x3d-Math.round(b\/2)+\x22px\x22\x7d\x7d;\n\/\/ Google Inc.\n\n\/\/# sourceMappingURL\x3dcorejserror_corejserror_chunk.sourcemap\n\x3c\/script\x3e\x3cscript src\x3d\x22\/static\/document\/client\/js\/898020166-corejserrorfeedback_corejserrorfeedback_chunk.js\x22 nonce\x3d\x22uBfdd5ElM-eQ4cCUZ50mBw\x22\x3e\x3c\/script\x3e\x3c\/body\x3e"; if (self.trustedTypes && self.trustedTypes.createPolicy) {coreJsIframePolicy = trustedTypes.createPolicy( 'docsCoreJsIframePolicy',{createHTML: function(ignored) {return contentsString;}});}var contentsTt = coreJsIframePolicy ? 
coreJsIframePolicy.createHTML('ignored') : contentsString; if ('srcdoc' in iframe) {iframe.srcdoc = contentsTt; return;}iframe.contentWindow.document.open(); iframe.contentWindow.document.write(contentsTt); iframe.contentWindow.document.close(); if ( true && window.navigator && window.navigator.sendBeacon) {window.navigator.sendBeacon( '\/document\/jserror?jobset\x3dprod\x26error\x3dJS+binary+load+failure&context.coreJsNoSrcdoc=true&context.serviceWorkerControlled=' + !!(navigator.serviceWorker && navigator.serviceWorker.controller) + '\x26context.actionName\x3dEdit');}}function enterCoreJsErrorDialog() {if (!setIframeSrcdoc) {return;}var overlay = document.getElementById('core-js-error-dialog-overlay');var overlayPolicy; if (self.trustedTypes && self.trustedTypes.createPolicy) {overlayPolicy = trustedTypes.createPolicy( 'docsCoreJsOverlayPolicy',{createHTML: function(ignored) {return "\x3cdiv style\x3d\x22position: absolute; left: 0; top: 0; width: 100%; height: 100%; background: rgb(0, 0, 0, 0.6)\x22\x3e\x3c\/div\x3e\x3ciframe id\x3d\x22core-js-error-dialog\x22 style\x3d\x22position: absolute; left: 50%; top: 50%; width: 512px; height: 430px; margin-left: -256px; margin-top: -215px; background: white; border: none; border-radius: 8px\x22\x3e\x3c\/iframe\x3e";}});}var overlayTt = overlayPolicy ? 
overlayPolicy.createHTML('ignored') : "\x3cdiv style\x3d\x22position: absolute; left: 0; top: 0; width: 100%; height: 100%; background: rgb(0, 0, 0, 0.6)\x22\x3e\x3c\/div\x3e\x3ciframe id\x3d\x22core-js-error-dialog\x22 style\x3d\x22position: absolute; left: 50%; top: 50%; width: 512px; height: 430px; margin-left: -256px; margin-top: -215px; background: white; border: none; border-radius: 8px\x22\x3e\x3c\/iframe\x3e"; overlay.innerHTML = overlayTt; var iframe = document.getElementById('core-js-error-dialog'); overlay.onmousedown = function(e) {iframe.focus(); e.preventDefault();}; setIframeSrcdoc(iframe); setIframeSrcdoc = null; overlay.style.display = 'block'; iframe.focus();}window.enterCoreJsErrorDialog = enterCoreJsErrorDialog;})();Explore the Random Cats Request edit access Sign inDOCS_timing['sdb']=new Date().getTime();FileEditViewToolsHelpAccessibilityDebugDOCS_timing['edb']=new Date().getTime();DOCS_timing['che'] = new Date().getTime();DOCS_timing['chv'] = new Date().getTime();KX_resize = function() {if (KX_kixApp) {KX_kixApp.resize();}}; Gemini created these notes. They can contain errors so should be double-checked. How Gemini takes notes OutlineOutlineDocument tabs Headings you add to the document will appear here. Changes bythis.gbar_=this.gbar_||{};(function(_){var window=this;
try{
_.Pd=function(a,b,c){if(!a.j)if(c instanceof Array)for(var d of c)_.Pd(a,b,d);else{d=(0,_.z)(a.C,a,b);const e=a.v+c;a.v++;b.dataset.eqid=e;a.B[e]=d;b&&b.addEventListener?b.addEventListener(c,d,!1):b&&b.attachEvent?b.attachEvent("on"+c,d):a.o.log(Error("B`"+b))}};
}catch(e){_._DumpException(e)}
try{
var Qd=document.querySelector(".gb_I .gb_A"),Rd=document.querySelector("#gb.gb_Sc");Qd&&!Rd&&_.Pd(_.zd,Qd,"click");
}catch(e){_._DumpException(e)}
try{

The entire document is too large to fit in the snippet, but you get the point. In fact, you won’t find any of the document’s content in the raw scrape, because that content is never present in the DOM.
