Why the shift happened
Three things came together in 2024–2025. First, LLMs got good enough at structured extraction — that is, reliably turning messy text into clean fields like price or title. The NEXT-EVAL benchmark showed F1 > 0.95 (F1 is an accuracy score from 0 to 1) when the input is properly formatted. Second, token costs dropped — you pay LLMs per token (a token is roughly a word-piece), and Markdown output uses about 67% fewer tokens than raw HTML, which adds up fast across thousands of pages. Third, MCP (Model Context Protocol) shipped — a standard way to hand tools to an AI, so Claude, Cursor, and Codex can scrape directly with no code on the LLM side. The result is a workflow where you describe the data once and the pipeline keeps working even when a site is redesigned.
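To see how the token saving adds up, here is a rough back-of-the-envelope sketch. Only the 67% reduction comes from the text above; the page count, the tokens per HTML page, and the price per million tokens are made-up illustrative figures, not real pricing from any provider.

```python
def input_cost_usd(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Input-token cost in USD for sending `pages` pages to an LLM."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
PAGES = 10_000
HTML_TOKENS_PER_PAGE = 12_000
USD_PER_MILLION_TOKENS = 1.00

# The one figure taken from the article: Markdown uses about 67% fewer tokens.
MARKDOWN_REDUCTION = 0.67

markdown_tokens_per_page = round(HTML_TOKENS_PER_PAGE * (1 - MARKDOWN_REDUCTION))

html_cost = input_cost_usd(PAGES, HTML_TOKENS_PER_PAGE, USD_PER_MILLION_TOKENS)
markdown_cost = input_cost_usd(PAGES, markdown_tokens_per_page, USD_PER_MILLION_TOKENS)

print(f"raw HTML: ${html_cost:.2f}")
print(f"Markdown: ${markdown_cost:.2f}")
```

Under these assumed numbers the same crawl costs roughly a third as much when pages are converted to Markdown first. The ratio holds whatever the real price is, since only the token count changes.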
The leading tools
Firecrawl — a hosted service you can also run yourself. You give it a URL and it returns clean Markdown or JSON. Its FIRE-1 agent navigates JavaScript-heavy sites on its own, and an /interact endpoint can click buttons and fill forms. It plugs straight into LangChain and LlamaIndex (popular AI app frameworks). 500 free scrapes/month. Used by SAP, Zapier, Deloitte.
Crawl4AI — open source under the Apache 2.0 license, often called the "Scrapy of the LLM era". You run it on your own servers, and it supports Ollama so the AI model runs locally too. Its adaptive crawling learns a site's selectors over time. You keep full control of your data.
ScrapeGraphAI — you describe what you want, and an LLM builds and runs a graph-based extraction pipeline (a series of connected steps) to get it. It is self-healing: when a site's structure changes, you just re-describe what you need and it adapts.
The production pattern
Asking an LLM for data with no constraints on the output is too unreliable for production use. Ask one for a price across 10,000 articles and you get $40, 40 dollars, "forty", and occasionally numbers it simply made up. The fix is schema-validated extraction with Pydantic + Instructor. A schema is just a definition of the exact shape you expect. You define that shape as a Pydantic model (Pydantic is a Python library that validates data against declared types), then pass it to the LLM through Instructor, which makes the LLM return a typed object instead of free text. Instructor retries when the output does not match the schema, so malformed results are rejected before they reach your pipeline. So if the LLM puts "competitive" in a salary field, validation fails, the call retries, and, provided the field is declared optional, you end up with either a real number or None, never garbage.
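The validate-and-retry loop described above can be sketched with the standard library alone. This is not Pydantic or Instructor code: a dataclass and a hand-written validator stand in for the Pydantic model, and a plain loop stands in for Instructor's retry behaviour. The names Listing, parse_price, extract, and fake_llm are all hypothetical, and fake_llm is a stub rather than a real model call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Listing:
    """The exact shape we expect back (stand-in for a Pydantic model)."""
    title: str
    price_usd: Optional[float]


_PRICE = re.compile(r"^\$?\s*(\d+(?:\.\d{1,2})?)\s*(?:usd|dollars)?$", re.IGNORECASE)


def parse_price(raw) -> Optional[float]:
    """Accept None, a number, or strings like '$40' and '40 dollars'; reject the rest."""
    if raw is None:
        return None
    if isinstance(raw, bool):
        raise ValueError(f"not a price: {raw!r}")
    if isinstance(raw, (int, float)):
        value = float(raw)
    else:
        match = _PRICE.match(str(raw).strip())
        if match is None:
            raise ValueError(f"not a price: {raw!r}")
        value = float(match.group(1))
    if value < 0:
        raise ValueError(f"negative price: {raw!r}")
    return value


def validate(raw: dict) -> Listing:
    """Turn an untrusted dict into a Listing or raise ValueError."""
    title = raw.get("title")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    return Listing(title=title.strip(), price_usd=parse_price(raw.get("price")))


def extract(
    text: str,
    ask_llm: Callable[[str, Optional[str]], dict],
    max_retries: int = 2,
) -> Optional[Listing]:
    """Call the model, validate, and retry with the error message on failure."""
    error: Optional[str] = None
    for _ in range(max_retries + 1):
        raw = ask_llm(text, error)
        try:
            return validate(raw)
        except ValueError as exc:
            error = str(exc)  # fed back to the model on the next attempt
    return None  # retries exhausted: drop the record rather than pass garbage on


def fake_llm(text: str, error: Optional[str]) -> dict:
    """Hypothetical stub: wrong on the first attempt, corrected once it sees the error."""
    return {"title": "Desk", "price": "$40" if error else "competitive"}


listing = extract("Solid oak desk, price competitive, around $40", fake_llm)
print(listing)
```

In this sketch a record that never validates is dropped as a whole by returning None. Whether to drop it, raise an error, or keep the record with an empty field is a design choice that depends on how the downstream pipeline treats missing data.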
Sometimes the old-school approach still wins. At large scale on a fixed schema (10M+ documents, e-commerce / classifieds), classical NLP costs effectively nothing once the model is loaded, and it runs in under a millisecond per item. The usual stack is spaCy NER (named entity recognition: spotting things like names, prices, dates) plus dependency parsing. The common production setup is a hybrid: use classical NLP to pre-filter and tag everything, and call the LLM only for the ambiguous cases.
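The hybrid routing can be sketched as follows. To keep the example free of dependencies, a regular expression stands in for the classical NLP stage (a real setup would use spaCy NER there), and the LLM is a stub function. The names cheap_extract, hybrid_extract, and stub_llm are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# Stand-in for the classical stage: matches prices written like $40 or $12.99.
_PRICE = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")


def cheap_extract(text: str) -> Optional[float]:
    """Return a price only when the text contains exactly one distinct match."""
    unique = set(_PRICE.findall(text))
    if len(unique) == 1:
        return float(unique.pop())
    return None  # zero or several candidates: ambiguous


def hybrid_extract(
    texts: List[str],
    llm_fallback: Callable[[str], Optional[float]],
) -> Tuple[List[Optional[float]], int]:
    """Use the cheap extractor first and call the LLM only for ambiguous items."""
    results: List[Optional[float]] = []
    llm_calls = 0
    for text in texts:
        price = cheap_extract(text)
        if price is None:
            price = llm_fallback(text)
            llm_calls += 1
        results.append(price)
    return results, llm_calls


# Hypothetical stub standing in for a schema-validated LLM call.
_STUB_ANSWERS = {
    "Was $60, now $45": 45.0,
    "Chair for forty dollars": 40.0,
}


def stub_llm(text: str) -> Optional[float]:
    return _STUB_ANSWERS.get(text)


texts = ["Oak desk, $40", "Was $60, now $45", "Chair for forty dollars"]
prices, llm_calls = hybrid_extract(texts, stub_llm)
print(prices, llm_calls)
```

Here the first item is resolved without the model, while the two ambiguous ones (two candidate prices, and a price spelled out in words) are routed to the fallback. At real scale the share of items that need the fallback is what determines the LLM bill.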
