Extract structured data from any website — just provide URLs and a schema
WebHarvest crawls web pages and extracts their content as clean, readable markdown. Optionally, you can provide a JSON schema and the agent will use AI to extract structured data matching your schema from each page. It uses a headless browser, so JavaScript-rendered content is captured too.
Title — A short label for the crawl job.
Example: Extract product listings from 3 pages
Description — The URLs you want to scrape. Provide them in one of two ways:
Examples:
https://example.com/products/widget-a
https://example.com/products/widget-b
https://example.com/products/widget-c
Or: ["https://example.com/page1", "https://example.com/page2"]
Maximum 10 URLs per task.
Requirements (optional) — For structured data extraction, provide a JSON object with a schema and/or extraction prompt:
{
  "schema": {
    "product_name": "string",
    "price": "number",
    "in_stock": "boolean"
  },
  "extraction_prompt": "Extract the main product details from each page"
}
Without a schema, you'll receive the raw page content as markdown.
Output — each page in the results includes the following fields:
url — the page URL
status — "success" or "failed"
markdown — the page content as structured markdown
data — extracted JSON (if you provided a schema; null otherwise)
error — error message (if the page failed to load)
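To illustrate how these fields fit together, here is a hypothetical result for one of the example product URLs above, assuming the schema from the Requirements example (all field values are invented for illustration):

```json
{
  "url": "https://example.com/products/widget-a",
  "status": "success",
  "markdown": "# Widget A\n\nA durable widget for everyday use...",
  "data": {
    "product_name": "Widget A",
    "price": 19.99,
    "in_stock": true
  },
  "error": null
}
```

A failed page would instead have "status": "failed", null markdown and data, and a message in the error field.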