Skip to main content

What is extract()?

page.extract("extract the name of the repository");
extract grabs structured data from a webpage. You can define your schema with zod (TypeScript) or pydantic (Python). If you do not want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.

Why use extract()?

For TypeScript, the extract schemas are defined using zod schemas.For Python, the extract schemas are defined using pydantic models.

Using extract()

Single object Extraction

Here is how an extract call might look for a single object:
const item = await page.extract({
  instruction: "extract the price of the item",
  schema: z.object({
    price: z.number(),
  }),
});
Your output schema will look like:
{ price: number }

List of objects Extraction

Here is how an extract call might look for a list of objects.
const apartments = await page.extract({
  instruction:
    "Extract ALL the apartment listings and their details, including address, price, and square feet.",
  schema: z.object({
    list_of_apartments: z.array(
      z.object({
        address: z.string(),
        price: z.string(),
        square_feet: z.string(),
      }),
    ),
  })
})

console.log("the apartment list is: ", apartments);
Your output schema will look like:
list_of_apartments: [
    {
      address: "street address here",
      price: "$1234.00",
      square_feet: "700"
    },
    {
        address: "another address here",
        price: "1010.00",
        square_feet: "500"
    },
    ...
]

Prompt-only Extraction

You can call extract with just a natural language prompt:
const result = await page.extract("extract the name of the repository");
When you call extract with just a prompt, your output schema will look like:
{ extraction: string }

Extract with no parameters

Here is how you can call extract with no parameters.
const pageText = await page.extract();
Output schema:
{ page_text: string }
Calling extract with no parameters will return hierarchical tree representation of the root DOM. This will not be passed through an LLM. It will look something like this:
Accessibility Tree:
[0-2] RootWebArea: What is Stagehand? - 🤘 Stagehand
  [0-37] scrollable
    [0-118] body
      [0-241] scrollable
        [0-242] div
          [0-244] link: 🤘 Stagehand home page light logo
            [0-245] span
              [0-246] StaticText: 🤘 Stagehand
              [0-247] StaticText: home page

Best practices

Extract with Context

You can provide additional context to your schema to help the model extract the data more accurately.
const apartments = await page.extract({
 instruction:
   "Extract ALL the apartment listings and their details, including address, price, and square feet.",
 schema: z.object({
   list_of_apartments: z.array(
     z.object({
       address: z.string().describe("the address of the apartment"),
       price: z.string().describe("the price of the apartment"),
       square_feet: z.string().describe("the square footage of the apartment"),
     }),
   ),
 })
})
To extract links or URLs, in the TypeScript version of Stagehand, you’ll need to define the relevant field as z.string().url(). In Python, you’ll need to define it as HttpUrl.
Here is how an extract call might look for extracting a link or URL. This also works for image links.
const extraction = await page.extract({
  instruction: "extract the link to the 'contact us' page",
  schema: z.object({
    link: z.string().url(), // note the usage of z.string().url() here
  }),
});

console.log("the link to the contact us page is: ", extraction.link);
Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final ExtractResult.

Troubleshooting

Problem: extract() returns empty or incomplete dataSolutions:
  • Check your instruction clarity: Make sure your instruction is specific and describes exactly what data you want to extract
  • Verify the data exists: Use page.observe() first to confirm the data is present on the page
  • Wait for dynamic content: If the page loads content dynamically, use page.act("wait for the content to load") before extracting
Solution: Wait for content before extracting
// Wait for content before extracting
await page.act("wait for the product listings to load");
const products = await page.extract({
  instruction: "extract all product names and prices",
  schema: z.object({
    products: z.array(z.object({
      name: z.string(),
      price: z.string()
    }))
  })
});
Problem: Getting schema validation errors or type mismatchesSolutions:
  • Use optional fields: Make fields optional with z.optional() (TypeScript) or Optional[type] (Python) if the data might not always be present
  • Use flexible types: Consider using z.string() instead of z.number() for prices that might include currency symbols
  • Add descriptions: Use .describe() (TypeScript) or Field(description="...") (Python) to help the model understand field requirements
Solution: More flexible schema
const schema = z.object({
  price: z.string().describe("price including currency symbol, e.g., '$19.99'"),
  availability: z.string().optional().describe("stock status if available"),
  rating: z.number().optional()
});
Problem: Extraction results vary between runsSolutions:
  • Be more specific in instructions: Instead of “extract prices”, use “extract the numerical price value for each item”
  • Use context in schema descriptions: Add field descriptions to guide the model
  • Combine with observe: Use page.observe() to understand the page structure first
Solution: Validate with observe first
// First observe to understand the page structure
const elements = await page.observe("find all product listings");
console.log("Found elements:", elements.map(e => e.description));

// Then extract with specific targeting
const products = await page.extract({
  instruction: "extract name and price from each product listing shown on the page",
  schema: z.object({
    products: z.array(z.object({
      name: z.string().describe("the product title or name"),
      price: z.string().describe("the price as displayed, including currency")
    }))
  })
});
Problem: Extraction is slow or timing outSolutions:
  • Reduce scope: Extract smaller chunks of data in multiple calls rather than everything at once
  • Use targeted instructions: Be specific about which part of the page to focus on
  • Consider pagination: For large datasets, extract one page at a time
  • Increase timeout: Use timeoutMs parameter for complex extractions
Solution: Break down large extractions
// Instead of extracting everything at once
const allData = [];
const pageNumbers = [1, 2, 3, 4, 5];

for (const pageNum of pageNumbers) {
  await page.act(`navigate to page ${pageNum}`);
  
  const pageData = await page.extract({
    instruction: "extract product data from the current page only",
    schema: ProductPageSchema,
    timeoutMs: 60000 // 60 second timeout
  });
  
  allData.push(...pageData.products);
}

Next steps

I