Extract

What is `extract()`?

page.extract("extract the name of the repository");

extract grabs structured data from a webpage. You can define your schema with zod (TypeScript) or pydantic (Python). If you do not want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.

Why use `extract()`?

Structured

Turn messy webpage data into clean objects that follow a schema.

Resilient

Build resilient extractions that don’t break when the website changes

For TypeScript, the extract schemas are defined using zod schemas.For Python, the extract schemas are defined using pydantic models.

Using `extract()`

Single object Extraction

Here is how an extract call might look for a single object:

const item = await page.extract({
  instruction: "extract the price of the item",
  schema: z.object({
    price: z.number(),
  }),
});

Your output schema will look like:

{ price: number }

List of objects Extraction

Here is how an extract call might look for a list of objects.

const apartments = await page.extract({
  instruction:
    "Extract ALL the apartment listings and their details, including address, price, and square feet.",
  schema: z.object({
    list_of_apartments: z.array(
      z.object({
        address: z.string(),
        price: z.string(),
        square_feet: z.string(),
      }),
    ),
  })
})

console.log("the apartment list is: ", apartments);

Your output schema will look like:

list_of_apartments: [
    {
      address: "street address here",
      price: "$1234.00",
      square_feet: "700"
    },
    {
        address: "another address here",
        price: "1010.00",
        square_feet: "500"
    },
    ...
]

Prompt-only Extraction

You can call extract with just a natural language prompt:

const result = await page.extract("extract the name of the repository");

When you call extract with just a prompt, your output schema will look like:

{ extraction: string }

Extract with no parameters

Here is how you can call extract with no parameters.

const pageText = await page.extract();

Output schema:

{ page_text: string }

Calling extract with no parameters will return hierarchical tree representation of the root DOM. This will not be passed through an LLM. It will look something like this:

Accessibility Tree:
[0-2] RootWebArea: What is Stagehand? - 🤘 Stagehand
  [0-37] scrollable
    [0-118] body
      [0-241] scrollable
        [0-242] div
          [0-244] link: 🤘 Stagehand home page light logo
            [0-245] span
              [0-246] StaticText: 🤘 Stagehand
              [0-247] StaticText: home page

Best practices

Extract with Context

You can provide additional context to your schema to help the model extract the data more accurately.

const apartments = await page.extract({
 instruction:
   "Extract ALL the apartment listings and their details, including address, price, and square feet.",
 schema: z.object({
   list_of_apartments: z.array(
     z.object({
       address: z.string().describe("the address of the apartment"),
       price: z.string().describe("the price of the apartment"),
       square_feet: z.string().describe("the square footage of the apartment"),
     }),
   ),
 })
})

Link Extraction

To extract links or URLs, in the TypeScript version of Stagehand, you’ll need to define the relevant field as z.string().url(). In Python, you’ll need to define it as HttpUrl.

Here is how an extract call might look for extracting a link or URL. This also works for image links.

const extraction = await page.extract({
  instruction: "extract the link to the 'contact us' page",
  schema: z.object({
    link: z.string().url(), // note the usage of z.string().url() here
  }),
});

console.log("the link to the contact us page is: ", extraction.link);

Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final ExtractResult.

Troubleshooting

Empty or partial results

Problem: extract() returns empty or incomplete dataSolutions:

Check your instruction clarity: Make sure your instruction is specific and describes exactly what data you want to extract
Verify the data exists: Use page.observe() first to confirm the data is present on the page
Wait for dynamic content: If the page loads content dynamically, use page.act("wait for the content to load") before extracting

Solution: Wait for content before extracting

// Wait for content before extracting
await page.act("wait for the product listings to load");
const products = await page.extract({
  instruction: "extract all product names and prices",
  schema: z.object({
    products: z.array(z.object({
      name: z.string(),
      price: z.string()
    }))
  })
});

Schema validation errors

Problem: Getting schema validation errors or type mismatchesSolutions:

Use optional fields: Make fields optional with z.optional() (TypeScript) or Optional[type] (Python) if the data might not always be present
Use flexible types: Consider using z.string() instead of z.number() for prices that might include currency symbols
Add descriptions: Use .describe() (TypeScript) or Field(description="...") (Python) to help the model understand field requirements

Solution: More flexible schema

const schema = z.object({
  price: z.string().describe("price including currency symbol, e.g., '$19.99'"),
  availability: z.string().optional().describe("stock status if available"),
  rating: z.number().optional()
});

Inconsistent results

Problem: Extraction results vary between runsSolutions:

Be more specific in instructions: Instead of “extract prices”, use “extract the numerical price value for each item”
Use context in schema descriptions: Add field descriptions to guide the model
Combine with observe: Use page.observe() to understand the page structure first

Solution: Validate with observe first

// First observe to understand the page structure
const elements = await page.observe("find all product listings");
console.log("Found elements:", elements.map(e => e.description));

// Then extract with specific targeting
const products = await page.extract({
  instruction: "extract name and price from each product listing shown on the page",
  schema: z.object({
    products: z.array(z.object({
      name: z.string().describe("the product title or name"),
      price: z.string().describe("the price as displayed, including currency")
    }))
  })
});

Performance issues

Problem: Extraction is slow or timing outSolutions:

Reduce scope: Extract smaller chunks of data in multiple calls rather than everything at once
Use targeted instructions: Be specific about which part of the page to focus on
Consider pagination: For large datasets, extract one page at a time
Increase timeout: Use timeoutMs parameter for complex extractions

Solution: Break down large extractions

// Instead of extracting everything at once
const allData = [];
const pageNumbers = [1, 2, 3, 4, 5];

for (const pageNum of pageNumbers) {
  await page.act(`navigate to page ${pageNum}`);
  
  const pageData = await page.extract({
    instruction: "extract product data from the current page only",
    schema: ProductPageSchema,
    timeoutMs: 60000 // 60 second timeout
  });
  
  allData.push(...pageData.products);
}

First Steps

The Basics

Configuration

Best Practices

Integrations

Reference

What is `extract()`?

Why use `extract()`?

Structured

Resilient

Using `extract()`

Single object Extraction

List of objects Extraction

Prompt-only Extraction

Extract with no parameters

Best practices

Extract with Context

Link Extraction

Troubleshooting

Next steps

Act

Observe

First Steps

The Basics

Configuration

Best Practices

Integrations

Reference

​What is extract()?

​Why use extract()?

Structured

Resilient

​Using extract()

​Single object Extraction

​List of objects Extraction

​Prompt-only Extraction

​Extract with no parameters

​Best practices

​Extract with Context

​Link Extraction

​Troubleshooting

​Next steps

Act

Observe

What is `extract()`?

Why use `extract()`?

Using `extract()`

Single object Extraction

List of objects Extraction

Prompt-only Extraction

Extract with no parameters

Best practices

Extract with Context

Link Extraction

Troubleshooting

Next steps