Why the shift happened
Three forces aligned in 2024–2025. First, LLMs got good enough at structured extraction: the NEXT-EVAL benchmark showed F1 > 0.95 when the input is properly formatted. Second, token costs dropped, and Markdown output uses about 67% fewer tokens than raw HTML, a saving that compounds across thousands of pages. Third, MCP (Model Context Protocol) shipped, letting Claude, Cursor, and Codex scrape via tool calls with no custom code on the LLM side. The result is a workflow where you describe the data you want and the pipeline adapts when sites redesign.
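To make the compounding concrete, here is a rough back-of-the-envelope calculation. Only the 67% reduction comes from the text above; the page count, the tokens per HTML page, and the price per million tokens are illustrative assumptions, not measured figures.

```python
# Illustrative figures only: everything except the 67% reduction is an assumption.
PAGES = 10_000
HTML_TOKENS_PER_PAGE = 12_000          # assumed average for a raw HTML page
MARKDOWN_REDUCTION = 0.67              # "about 67% fewer tokens" from the text
USD_PER_MILLION_INPUT_TOKENS = 3.00    # hypothetical input price


def input_cost(pages: int, tokens_per_page: float, usd_per_million: float) -> float:
    """Total input cost in USD for feeding every page to the model once."""
    return pages * tokens_per_page * usd_per_million / 1_000_000


markdown_tokens_per_page = HTML_TOKENS_PER_PAGE * (1 - MARKDOWN_REDUCTION)

html_cost = input_cost(PAGES, HTML_TOKENS_PER_PAGE, USD_PER_MILLION_INPUT_TOKENS)
markdown_cost = input_cost(PAGES, markdown_tokens_per_page, USD_PER_MILLION_INPUT_TOKENS)

print(f"raw HTML: ${html_cost:.2f}")
print(f"Markdown: ${markdown_cost:.2f}")
print(f"saved:    ${html_cost - markdown_cost:.2f}")
```

Under these assumed numbers the same crawl costs roughly a third as much when pages are converted to Markdown first, and the gap scales linearly with page count.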
The leading tools
Firecrawl — managed and self-hostable. URL in → clean Markdown or JSON out. The FIRE-1 agent autonomously navigates JS-heavy sites; an /interact endpoint clicks and fills forms. Native LangChain and LlamaIndex integrations. 500 free scrapes/month. Used by SAP, Zapier, Deloitte.
Crawl4AI — open source, Apache 2.0 licensed, the "Scrapy of the LLM era". Runs on your infrastructure, supports Ollama for local models. Adaptive crawling learns selectors over time. Full data sovereignty.
ScrapeGraphAI — describe what you want, an LLM builds and executes a graph-based extraction pipeline. Self-healing: when site structure changes, re-describe and it adapts.
The production pattern
Raw LLM extraction is unreliable in production: ask one for a price across 10,000 articles and you get $40, 40 dollars, "forty", and occasionally invented values. The fix is schema-validated extraction with Pydantic + Instructor. Define a Pydantic model for what you want, then pass it to the LLM via Instructor, which patches the SDK to return typed objects instead of strings. When the output fails validation, Instructor re-prompts the model with the validation error, up to a configured retry limit, so malformed output is rejected before it enters your pipeline. If the LLM hallucinates "competitive" for a salary field, validation fails and the call retries. With the field declared optional, you end up with either a well-typed number or None, not a free-text string. Validation guarantees the shape of a value, not its truth, so a plausible but wrong number still needs spot checks.
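The validate-and-retry loop can be sketched with the standard library alone. This is a stand-in for illustration, not Pydantic's or Instructor's actual API: `Listing`, `parse_price`, `validate`, `extract`, and the scripted model replies are all hypothetical names, and the choice to normalize "$40" and "40 dollars" to a float is an assumption about what the schema should accept.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, Optional


class ValidationFailure(ValueError):
    """Raised when a model reply does not satisfy the schema."""


@dataclass(frozen=True)
class Listing:
    title: str
    price_usd: Optional[float]  # None means "no price stated"


# Accepts "40", "$40", "40 dollars", "40.50 USD"; rejects words like "forty".
_NUMERIC = re.compile(r"^\$?\s*(\d+(?:\.\d+)?)\s*(?:usd|dollars)?$", re.IGNORECASE)


def parse_price(raw: object) -> Optional[float]:
    """Normalize a price to a non-negative float, or reject it."""
    if raw is None:
        return None
    if isinstance(raw, bool):
        raise ValidationFailure(f"price {raw!r} is not numeric")
    if isinstance(raw, (int, float)):
        value = float(raw)
    elif isinstance(raw, str):
        match = _NUMERIC.match(raw.strip())
        if match is None:
            raise ValidationFailure(f"price {raw!r} is not numeric")
        value = float(match.group(1))
    else:
        raise ValidationFailure(f"price {raw!r} has an unsupported type")
    if value < 0:
        raise ValidationFailure("price must be non-negative")
    return value


def validate(reply: str) -> Listing:
    """Parse the raw reply and enforce the schema."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValidationFailure(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("title"), str):
        raise ValidationFailure("missing string field 'title'")
    return Listing(title=data["title"], price_usd=parse_price(data.get("price_usd")))


def extract(ask_llm: Callable[[str], str], prompt: str, max_retries: int = 2) -> Listing:
    """Call the model, validate, and re-prompt with the error on failure."""
    feedback = ""
    for _ in range(max_retries + 1):
        reply = ask_llm(prompt + feedback)
        try:
            return validate(reply)
        except ValidationFailure as exc:
            feedback = f"\nPrevious reply was rejected: {exc}. Return valid JSON."
    raise ValidationFailure(f"no valid reply after {max_retries + 1} attempts")


# Scripted stand-in for a model: the first reply is rejected, the second passes.
replies = iter([
    '{"title": "Desk", "price_usd": "competitive"}',
    '{"title": "Desk", "price_usd": "$40"}',
])
listing = extract(lambda prompt: next(replies), "Extract title and price_usd as JSON.")
print(listing)  # Listing(title='Desk', price_usd=40.0)
```

In a real pipeline the dataclass and hand-written checks would be a Pydantic model, and the loop would be handled by Instructor's retry setting; the point of the sketch is that a rejected reply never reaches downstream code, and that exhausting the retries raises rather than returning a guess.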
When classical NLP still wins: at scale on a fixed schema (10M+ documents, e-commerce / classifieds), spaCy NER + dependency parsing has effectively zero marginal cost after model load and runs with sub-millisecond latency. The hybrid production pattern uses classical NLP to pre-filter and tag every document, and calls the LLM only for the ambiguous cases.
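A minimal sketch of that routing logic follows. To keep it dependency-free, a regular expression stands in for the spaCy NER step, and `llm_fallback` is a hypothetical callable standing in for a schema-validated LLM call; both are assumptions made for illustration.

```python
import re
from typing import Callable, Optional, Tuple

# Stand-in for the classical tagger: matches dollar amounts such as "$40" or "$ 12.50".
PRICE = re.compile(r"\$\s?(\d+(?:\.\d{1,2})?)")


def classical_price(text: str) -> Optional[float]:
    """Cheap path: return a price only when exactly one distinct amount is found."""
    matches = set(PRICE.findall(text))
    if len(matches) == 1:
        return float(matches.pop())
    return None  # zero or conflicting amounts: treat as ambiguous


def hybrid_price(
    text: str, llm_fallback: Callable[[str], Optional[float]]
) -> Tuple[Optional[float], str]:
    """Route to the LLM only when the cheap path cannot decide."""
    value = classical_price(text)
    if value is not None:
        return value, "classical"
    return llm_fallback(text), "llm"


def stub_llm(text: str) -> Optional[float]:
    """Hypothetical placeholder for a real, validated LLM extraction call."""
    return None


docs = [
    "Oak desk, $40, pickup only",
    "Was $60, now $45 or best offer",
    "Price on request",
]
for doc in docs:
    print(hybrid_price(doc, stub_llm), "<-", doc)
```

Only the second and third documents reach the fallback, which is the property that keeps LLM spend proportional to the ambiguous share of the corpus rather than to its total size.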
