Querying a Page

Interact with a page using LLMs

You can use the pageQuery method to interact with a page using LLMs. You might want to use this method to scrape a page for specific information, or even ask a more general question about the page.

Examples:

  • On a company’s website, ask if they have a certain job opening.
  • On a foreign news website, ask for a translation.
  • On a product page, ask for a list of products with their prices converted to different currencies.
  • etc.

Usage example

First, you’ll need to create a session.

const session = await client.sessions.create();

Next, you’ll need to create a window and load a URL.

const window = await client.windows.create(session.data.id, { url: "https://news.ycombinator.com/news" });

Finally, you can query the page.

const result = await client.windows.pageQuery(session.data.id, window.data.windowId, {
  prompt:
    'Give me a list of all the titles of the articles on this page along with the number of comments each article has.',
});
const content = result.data.modelResponse;

Example output:

1. California bans legacy admissions at private universities - 71 comments
2. Bop Spotter - 209 comments
3. Apple No Longer in Talks to Invest in ChatGPT Maker OpenAI - 7 comments
4. Paramotorists soar across remote Peru desert to collect threatened plants - 13 comments
5. Launch HN: Inkeep (YC W23) – Copilot for Support (think Cursor for help desks) - 39 comments
6. Phrase matching in Marginalia Search - 16 comments
7. Engineers investigate another malfunction on SpaceX's Falcon 9 rocket - 11 comments
8. Gavin Newsom vetoes SB 1047 - 414 comments
9. EasyPost (YC S13) Is Hiring - No comments listed
10. The Physics of Colliding Balls - 13 comments
11. Show HN: A macOS app to prevent sound quality degradation on AirPods - 78 comments
12. Product Hunt isn't dying, it's becoming gentrified - 16 comments
13. Two new books on John Calhoun and his rodent experiments - 41 comments
14. GnuCash 5.9 Released - 6 comments
15. The fight to save Chile's white strawberry - 10 comments
16. Keep Track: 3D Satellite Toolkit - 29 comments
17. Normans and Slavery: Breaking the Bonds - 51 comments
18. Screenpipe: 24/7 local AI screen and mic recording - 86 comments
19. Peer Calls: WebRTC peer to peer calls for everyone - 11 comments
20. How we built ngrok's data platform - 31 comments
21. The best browser bookmarking system is files - 66 comments
22. No such thing as exactly-once delivery - 2 comments
23. Tips for Building and Deploying Robots - 15 comments
24. Map with public fruit trees - 130 comments
25. NotebookLM's automatically generated podcasts are surprisingly effective - 389 comments
26. Generate pip requirements.txt file based on imports of any project - 55 comments
27. New research on anesthesia and microtubules gives new clues about consciousness - 139 comments
28. Liquid Foundation Models: Our First Series of Generative AI Models - 125 comments
29. Sitina1 Open-Source Camera - 116 comments
30. Do AI companies work? - 170 comments
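
Because this response is free text, any downstream code has to parse it. The sketch below shows one way to turn the list into records. It assumes the model keeps the "N. Title - X comments" line shape seen in the example output, which is not guaranteed, and parseArticleList is a hypothetical helper rather than part of the SDK.

```javascript
// Matches lines shaped like "12. Some title - 34 comments".
// Assumption: the line format is taken from the example output above;
// the model is free to answer in a different shape.
const LINE_PATTERN = /^(\d+)\.\s+(.+) - (?:(\d+) comments?|No comments listed)$/;

// Hypothetical helper: converts the free-text answer into an array of records.
function parseArticleList(modelResponse) {
  const articles = [];
  for (const rawLine of modelResponse.split("\n")) {
    const match = LINE_PATTERN.exec(rawLine.trim());
    if (!match) continue; // skip lines that do not fit the expected shape
    articles.push({
      rank: Number(match[1]),
      title: match[2],
      // null when the model reported that no comment count was listed
      comments: match[3] === undefined ? null : Number(match[3]),
    });
  }
  return articles;
}

const sample = [
  "1. California bans legacy admissions at private universities - 71 comments",
  "9. EasyPost (YC S13) Is Hiring - No comments listed",
].join("\n");

console.log(parseArticleList(sample));
```

Parsing like this is fragile, since any change in the model's wording breaks the pattern. When the output feeds automated processing, a JSON schema is usually the more robust choice.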

Paginated Results

If you’re scraping a paginated page, Airtop can follow the pagination links for you. Pass the followPaginationLinks: true option and state in your prompt how many pages or results you want to scrape.

const result = await client.windows.pageQuery(session.data.id, window.data.windowId, {
  prompt:
    'You are on the Hacker News website. Please scan articles 1-100, referencing the article numbers on the left side of the page, and provide a list of the articles that have over 100 comments. You may need to page through the articles using the pagination controls at the bottom of the page to get to article number 100. You should ignore articles beyond number 100, but be sure to scan the final partial list of articles up to number 100 when you reach it. Return the article number, title, number of comments, and the user who posted the article.',
  followPaginationLinks: true,
});

Example output:

1. **Article Number:** 2
**Title:** Too much efficiency makes everything worse (2022)
**Comments:** 204
**User:** feyman_r
2. **Article Number:** 8
**Title:** SpaceX launches mission for 2 NASA astronauts who are stuck on the ISS
**Comments:** 250
**User:** JumpCrisscross
3. **Article Number:** 23
**Title:** The perils of transition to 64-bit time_t
**Comments:** 177
**User:** todsacerdoti
4. **Article Number:** 36
**Title:** Floating megabomb heaves to near the English coast
**Comments:** 176
**User:** itronitron
5. **Article Number:** 51
**Title:** Legalizing sports gambling was a mistake
**Comments:** 1109
**User:** jimbob45
6. **Article Number:** 52
**Title:** Automatic Content Recognition Tracking in Smart TVs
**Comments:** 133
**User:** some_furry
7. **Article Number:** 53
**Title:** Amusing Ourselves to Death (2014)
**Comments:** 293
**User:** yamrzou
8. **Article Number:** 68
**Title:** Everything you need to know about Python 3.13 – JIT and GIL went up the hill
**Comments:** 190
**User:** chmaynard
9. **Article Number:** 84
**Title:** Notion's mid-life crisis
**Comments:** 130
**User:** krishna2
10. **Article Number:** 91
**Title:** I Am Tired of AI
**Comments:** 1086
**User:** Liriel
11. **Article Number:** 94
**Title:** Hacking Kia: Remotely controlling cars with just a license plate
**Comments:** 355
**User:** speckx
12. **Article Number:** 95
**Title:** If WordPress is to survive, Matt Mullenweg must be removed
**Comments:** 224
**User:** graeme
13. **Article Number:** 98
**Title:** CNN and USA Today have fake websites, I believe Forbes Marketplace runs them
**Comments:** 259
**User:** greg_V
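
If you issue paginated queries often, it may help to generate the stop limit instead of retyping it in every prompt. The following is a minimal sketch under stated assumptions: buildPaginatedQuery and its parameters (siteDescription, lastItem, task) are hypothetical names invented for illustration, and only prompt and followPaginationLinks come from the example above.

```javascript
// Hypothetical helper: builds the options object for a paginated pageQuery call.
// Only `prompt` and `followPaginationLinks` appear in the documented example;
// every other name here is an assumption made for this sketch.
function buildPaginatedQuery({ siteDescription, lastItem, task }) {
  if (!Number.isInteger(lastItem) || lastItem < 1) {
    throw new RangeError("lastItem must be a positive integer");
  }
  const prompt = [
    `You are on ${siteDescription}.`,
    `Scan items 1-${lastItem}, using the pagination controls to load more items if needed.`,
    `Stop once you reach item ${lastItem} and ignore anything after it.`,
    task,
  ].join(" ");
  return { prompt, followPaginationLinks: true };
}

const options = buildPaginatedQuery({
  siteDescription: "the Hacker News website",
  lastItem: 100,
  task: "List the articles that have over 100 comments.",
});
console.log(options.prompt);

// The options object would then be passed as the third argument:
// const result = await client.windows.pageQuery(sessionId, windowId, options);
```

Validating the limit before building the prompt means a missing or malformed value fails early, rather than producing a query with no stopping point.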

Using JSON Schemas

You can use JSON schemas to guide the AI’s response and force it to return JSON. This can be useful if you want a structured response that is more suitable for automated processing.

const jsonSchema = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "description": "Article title"
          },
          "score": {
            "type": "number",
            "description": "Number of points the article has received"
          },
          "author": {
            "type": "string",
            "description": "Author of the post"
          }
        },
        "required": ["title", "score", "author"],
        "additionalProperties": false
      }
    },
    "error": {
      "type": "string",
      "description": "Error message in case the request cannot be fulfilled.",
      "minLength": 1
    }
  }
};

const result = await client.windows.pageQuery(session.data.id, window.data.windowId, {
  prompt:
    'Give me a list of all the titles of the articles on this page along with the number of points each article has received and the author of the article. Use the error field to report if you cannot fulfill the request.',
  configuration: {
    outputSchema: jsonSchema,
  },
});
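
Once a schema is in place, the calling code still has to tell a successful answer apart from a reported failure. The sketch below shows one possible way to do that. It assumes modelResponse arrives as a JSON string matching the schema above (check what your SDK version actually returns), and readStructuredResponse is a hypothetical helper, not an SDK function.

```javascript
// Hypothetical helper: interprets a modelResponse produced with the schema above.
// Assumption: modelResponse is a JSON string; adapt this if you receive an object.
function readStructuredResponse(modelResponse) {
  let parsed;
  try {
    parsed = JSON.parse(modelResponse);
  } catch (err) {
    return { ok: false, error: `Response was not valid JSON: ${err.message}` };
  }
  if (parsed === null || typeof parsed !== "object") {
    return { ok: false, error: "Response was not a JSON object" };
  }
  // The schema lets the model report failure through the error field.
  if (typeof parsed.error === "string") {
    return { ok: false, error: parsed.error };
  }
  if (!Array.isArray(parsed.results)) {
    return { ok: false, error: "Response has neither results nor an error" };
  }
  return { ok: true, results: parsed.results };
}

// Made-up sample payload, for illustration only.
const outcome = readStructuredResponse(
  '{"results":[{"title":"Example title","score":1,"author":"example_user"}]}'
);
console.log(outcome.ok ? outcome.results : outcome.error);
```

Returning a tagged result instead of throwing keeps the "model could not fulfill the request" case on the same code path as malformed output, so callers handle both in one place.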

Prompting Tips

Like any LLM-based tool, the quality of the results depends heavily on the quality of the prompt. Here are some tips to get the best results:

Basic Prompting Tips

  1. Provide the AI with some context by telling it a little bit about the web page or content it’s looking at.

  2. Be clear about your goals and what you want the AI to do.

  3. If completing the request requires loading more content than is initially visible (e.g. paginated results, infinite scrolling, or “Load More” controls), state a clear limit on when the AI should stop. It can also help to be explicit about how more content should be loaded.

  4. Include a draft-07 JSON schema if you want a structured response that is more suitable for automated processing.

  5. Include a few guiding examples of how you would like the AI to respond in different scenarios. Good examples can significantly improve effectiveness and consistency.

  6. If you feel like the LLM isn’t being as diligent as you’d like about evaluating a particular content field, try explicitly including that field in your output schema even if you don’t need it. This tends to force the LLM to pay more attention.
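
The tips on context, goals, limits, and guiding examples can be combined into a single prompt. Here is a minimal sketch of one way to assemble it. The buildPrompt function and its field names are hypothetical and exist only to illustrate the structure.

```javascript
// Hypothetical helper: assembles a prompt from the pieces recommended above.
// None of these names belong to an SDK; they are assumptions for this sketch.
function buildPrompt({ context, goal, stopCondition, examples = [] }) {
  const parts = [context, goal, stopCondition];
  examples.forEach((example, index) => {
    parts.push(`Example ${index + 1}: ${example}`);
  });
  // Drop missing or empty pieces so they do not leave stray blank lines.
  return parts
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .join("\n");
}

const prompt = buildPrompt({
  context: "You are on the Hacker News front page, which lists numbered articles.",
  goal: "Return the title and number of comments for each article.",
  stopCondition: "Stop after article number 30.",
  examples: ["Bop Spotter - 209 comments"],
});
console.log(prompt);
```

Keeping each piece in its own field makes it easier to see at a glance whether a prompt is missing context or a stopping condition.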

Tips for Using JSON Schemas

  1. Take advantage of the description fields to give the AI additional clarity and instructions about how to populate the property, and consider even adding examples (especially if you would prefer certain formatting).

  2. Don’t mark a property as required unless you’re certain it will always be possible to provide. If the AI feels compelled to provide data that doesn’t exist, it’s very likely to hallucinate it.

  3. Be sure to include a valid way for the AI to report back failure if it cannot fulfill the objective. If your schema does not allow the AI to report failure and something happens, it may feel compelled to return a natural language response instead, or even hallucinate results in order to honor the schema and your request to use it for responses.

  4. You can sometimes use schema constraints to guide the AI response. For example, if you find that it includes an empty string or array for an optional property when unavailable, and you’d rather see that property omitted instead, you can add a constraint of minLength: 1 (or minItems: 1 for an array). Of course, make sure those properties are not marked as required.

  5. Most major LLMs are quite good at generating JSON schemas from examples (or even natural language descriptions) if you’d rather not write them by hand.

  6. Note that some JSON schema features are not supported by the structured outputs API. For example, the oneOf keyword is not supported. If you receive an error that the AI response does not match the output schema, you may need to revise your schema.
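
Several of these tips can be checked before a schema is ever sent. The sketch below scans a schema for unsupported keywords. It flags only oneOf, since that is the only keyword named above, and findUnsupportedKeywords is a hypothetical helper. The sample schema also illustrates a description containing an example, an optional property with minLength: 1, and an error field.

```javascript
// Only "oneOf" is named as unsupported in the text above; extend this list
// yourself if you run into other rejected keywords.
const UNSUPPORTED_KEYWORDS = ["oneOf"];

// Hypothetical helper: recursively collects the paths of unsupported keywords.
// Limitation: a property that is literally named "oneOf" would also be flagged.
function findUnsupportedKeywords(schema, path = "$") {
  if (schema === null || typeof schema !== "object") return [];
  const hits = [];
  for (const [key, value] of Object.entries(schema)) {
    const childPath = `${path}.${key}`;
    if (UNSUPPORTED_KEYWORDS.includes(key)) hits.push(childPath);
    hits.push(...findUnsupportedKeywords(value, childPath));
  }
  return hits;
}

const schema = {
  type: "object",
  properties: {
    title: { type: "string", description: 'Article title, e.g. "Bop Spotter"' },
    // Optional: minLength 1 nudges the model to omit the field rather than return "".
    subtitle: { type: "string", minLength: 1 },
    price: { oneOf: [{ type: "number" }, { type: "string" }] },
    error: { type: "string", minLength: 1, description: "Why the request failed." },
  },
  required: ["title"],
};

console.log(findUnsupportedKeywords(schema)); // → [ '$.properties.price.oneOf' ]
```

Running a check like this locally gives a quicker signal than waiting for a schema-mismatch error from a live query.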
