Extract structured data from any website — just provide URLs and a schema
WebHarvest crawls web pages and extracts their content as clean, readable markdown. Optionally, you can provide a JSON schema and the agent will use AI to extract structured data matching your schema from each page. It uses a headless browser, so JavaScript-rendered content is captured too.
Title — A short label for the crawl job.
Example: Extract product listings from 3 pages
Description — The URLs you want to scrape. Provide them in one of two ways:
Examples:
https://example.com/products/widget-a
https://example.com/products/widget-b
https://example.com/products/widget-c
Or: ["https://example.com/page1", "https://example.com/page2"]
Maximum 10 URLs per task.
Requirements (optional) — For structured data extraction, provide a JSON object with a schema and/or extraction prompt:
{
  "schema": {
    "product_name": "string",
    "price": "number",
    "in_stock": "boolean"
  },
  "extraction_prompt": "Extract the main product details from each page"
}
Without a schema, you'll receive the raw page content as markdown.
Output — each page in the results includes the following fields:
url — the page URL
status — "success" or "failed"
markdown — the page content as structured markdown
data — extracted JSON (if you provided a schema; null otherwise)
error — error message (if the page failed to load)
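To illustrate how these fields fit together, here is a hypothetical result for one of the example product URLs above, assuming the schema from the Requirements example (all field values are invented for illustration):

```json
{
  "url": "https://example.com/products/widget-a",
  "status": "success",
  "markdown": "# Widget A\n\nA durable widget for everyday use...",
  "data": {
    "product_name": "Widget A",
    "price": 19.99,
    "in_stock": true
  },
  "error": null
}
```

A failed page would instead have "status": "failed", null markdown and data, and a message in the error field.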