What is extract()?
extract() grabs structured data from a webpage. You can define your schema with zod (TypeScript) or pydantic (Python). If you do not want to define a schema, you can also call extract() with just a natural language prompt, or with no parameters at all.
Why use extract()?
Structured
Turn messy webpage data into clean objects that follow a schema.
Resilient
Build resilient extractions that don’t break when the website changes.
For TypeScript, extract schemas are defined with zod. For Python, they are defined as pydantic models.
Using extract()
Single object Extraction
Here is how an extract() call might look for a single object:
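A minimal sketch of a single-object extraction in Python. The Item model, its fields, and the instruction string are illustrative. The call form page.extract(instruction, schema=...) is an assumption about the Python client, so confirm it against the Stagehand version you have installed.

```python
from pydantic import BaseModel


class Item(BaseModel):
    """Schema for one extracted object (illustrative fields)."""
    name: str
    price: float


async def extract_item(page) -> Item:
    # Assumption: `page` is a Stagehand page whose extract() accepts an
    # instruction string plus a pydantic model passed as `schema`.
    return await page.extract(
        "extract the name and price of the featured item",
        schema=Item,
    )
```

The result is expected to be an instance of the model, so fields are read as attributes (item.name, item.price).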
List of objects Extraction
Here is how an extract() call might look for a list of objects.
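For a list, one common pattern is to wrap the item model in an outer model with a list field. The sketch below follows that pattern. The model names, field names, and instruction are illustrative, and the page.extract(instruction, schema=...) call form is again an assumption about the Python client.

```python
from typing import List

from pydantic import BaseModel


class Apartment(BaseModel):
    """One listing (illustrative fields)."""
    address: str
    price: str
    square_feet: str


class Apartments(BaseModel):
    """Wrapper model: the list lives in a named field."""
    list_of_apartments: List[Apartment]


async def extract_apartments(page) -> Apartments:
    # Assumption: extract() accepts an instruction and a `schema` model.
    return await page.extract(
        "extract every apartment listing with its address, price, and square footage",
        schema=Apartments,
    )
```

Keeping price and square_feet as strings avoids validation failures when the page shows values such as "$2,100/mo".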
Prompt-only Extraction
You can call extract() with just a natural language prompt:
When you call extract() with just a prompt, your output schema will look like:
Extract with no parameters
Here is how you can call extract() with no parameters.
Calling extract() with no parameters returns a hierarchical tree representation of the root DOM. The result is not passed through an LLM. It will look something like this:
Best practices
Extract with Context
You can provide additional context to your schema to help the model extract the data more accurately.
Link Extraction
To extract links or URLs in the TypeScript version of Stagehand, define the relevant field as z.string().url(). In Python, define it as HttpUrl.
Here is how an extract() call might look for extracting a link or URL. This also works for image links.
Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final ExtractResult.
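The ID lookup described above can be pictured roughly as follows. This is a conceptual sketch, not Stagehand's internal code: the function name resolve_link_ids, the mapping contents, and the field names are all hypothetical.

```python
from typing import Dict, List


def resolve_link_ids(
    model_output: Dict[str, object],
    id_to_url: Dict[int, str],
    url_fields: List[str],
) -> Dict[str, object]:
    """Return a copy of the model output with link IDs swapped for URLs."""
    resolved = dict(model_output)
    for field in url_fields:
        link_id = resolved.get(field)
        # Leave the field untouched when the ID is missing or unknown.
        if isinstance(link_id, int) and link_id in id_to_url:
            resolved[field] = id_to_url[link_id]
    return resolved


# Hypothetical mapping built while the page was processed.
id_to_url = {0: "https://example.com/docs", 1: "https://example.com/logo.png"}

# What an LLM trace would show: IDs rather than URLs.
model_output = {"title": "Docs", "link": 0, "image": 1}

final_result = resolve_link_ids(model_output, id_to_url, ["link", "image"])
```

Because the model only picks an ID, it never has to reproduce a long URL character by character, which is why the trace and the final result differ.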
Troubleshooting
Empty or partial results
Problem: extract() returns empty or incomplete data
Solutions:
- Check your instruction clarity: Make sure your instruction is specific and describes exactly what data you want to extract
- Verify the data exists: Use page.observe() first to confirm the data is present on the page
- Wait for dynamic content: If the page loads content dynamically, use page.act("wait for the content to load") before extracting
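For dynamically loaded pages, another option is to retry the extraction a few times until it returns data. The helper below is a sketch under assumptions: extract_once is any zero-argument coroutine function you write around your own extract() call, and the helper name, attempt count, and delay are illustrative rather than part of Stagehand.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def extract_with_retry(
    extract_once: Callable[[], Awaitable[Any]],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    is_complete: Callable[[Any], bool] = bool,
) -> Any:
    """Call extract_once until is_complete accepts the result or attempts run out."""
    result = None
    for attempt in range(attempts):
        result = await extract_once()
        # By default, empty lists, dicts, strings, and None count as "not ready".
        if is_complete(result):
            return result
        if attempt < attempts - 1:
            await asyncio.sleep(delay_seconds)
    return result  # last result, possibly still empty
```

A pydantic model instance is always truthy, so when your wrapper returns a model, pass an is_complete function that inspects the fields you care about.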
Schema validation errors
Problem: Getting schema validation errors or type mismatches
Solutions:
- Use optional fields: Make fields optional with z.optional() (TypeScript) or Optional[type] (Python) if the data might not always be present
- Use flexible types: Consider using z.string() instead of z.number() for prices that might include currency symbols
- Add descriptions: Use .describe() (TypeScript) or Field(description="...") (Python) to help the model understand field requirements
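If you follow the flexible-types advice and extract prices as strings, you can convert them to numbers afterwards. A minimal sketch: parse_price is a hypothetical helper, and it assumes a dot as the decimal separator and commas as thousands separators.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Pull a numeric amount out of text such as '$1,299.00' or '45 USD'."""
    if not raw:
        return None
    # First run of digits, with optional thousands commas and decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))
```

Returning None for missing or non-numeric text pairs naturally with an optional field in the schema.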
Inconsistent results
Problem: Extraction results vary between runs
Solutions:
- Be more specific in instructions: Instead of “extract prices”, use “extract the numerical price value for each item”
- Use context in schema descriptions: Add field descriptions to guide the model
- Combine with observe: Use page.observe() to understand the page structure first
Performance issues
Problem: Extraction is slow or timing out
Solutions:
- Reduce scope: Extract smaller chunks of data in multiple calls rather than everything at once
- Use targeted instructions: Be specific about which part of the page to focus on
- Consider pagination: For large datasets, extract one page at a time
- Increase timeout: Use the timeoutMs parameter for complex extractions
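The pagination advice can be sketched as a loop that extracts one page at a time and stops when a page comes back empty. Both callables are placeholders you would implement around your own extract() and navigation calls. The names and the max_pages cap are illustrative.

```python
from typing import Any, Awaitable, Callable, List


async def extract_all_pages(
    extract_page: Callable[[], Awaitable[List[Any]]],
    go_to_next_page: Callable[[], Awaitable[bool]],
    max_pages: int = 10,
) -> List[Any]:
    """Collect items page by page instead of in one large extraction."""
    items: List[Any] = []
    for _ in range(max_pages):
        page_items = await extract_page()
        if not page_items:
            break  # an empty page is treated as the end of the data
        items.extend(page_items)
        # go_to_next_page returns False when there is no further page.
        if not await go_to_next_page():
            break
    return items
```

The max_pages cap keeps a misbehaving "next" control from turning the loop into an unbounded crawl.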