Why raw LLM extraction fails in production
Ask GPT-4 or Claude for a salary across 10,000 job-board pages and you will get $40,000, 40 dollars, 40k USD, "forty thousand", and occasionally null or a wholly invented number. Your database cannot ingest that. Add a date field and you discover the more dangerous failure: when a scraped article does not contain a publication date, the LLM sometimes invents one that fits the tone of the article. The fabrication passes into your pipeline as a fact.
The fix is not "better prompting" — it is structural. Separate semantic understanding (what the LLM does well) from structural guarantees (what schema validation does well).
How Instructor adds value over raw API
Anthropic and OpenAI both support structured outputs via JSON schema directly. Instructor wraps those primitives with three things that matter in production:
- Automatic retries on validation failure. If the LLM returns a string where you specified an int, Instructor re-prompts with the validation error message until either it gets valid output or hits
max_retries. You do not write retry logic. - Multi-provider abstraction. Same Pydantic schema works with OpenAI, Anthropic, Mistral, Cohere, Gemini, and local Ollama. Swap providers without rewriting extraction logic.
- Streaming + partial validation. For large schemas, Instructor streams partial Pydantic objects as the LLM generates them. Useful for low-latency UIs.
When classical NLP still wins
LLM extraction costs roughly $0.001–$0.01 per article on modern Claude or GPT-class models. Classical NLP (spaCy NER, dependency parsing) costs effectively zero after model load. Use classical NLP when: scraping millions of consistent documents with a fixed schema, latency budget under 5 ms, cost matters more than edge-case nuance. Use LLM + Instructor when: heterogeneous sources, context disambiguation matters ("Apple" the company vs. the fruit), schema may evolve, semantic equivalences need resolving ("FTE" = "full-time" = "permanent" = "direct hire").
The production pattern Bloomberg, Reuters Refinitiv, and FactSet actually use: classical NLP as a fast pre-filter that tags 95% of documents cheaply, LLM only for the ambiguous 5%. That hybrid is the difference between $50 and $5,000 in extraction cost on a million-document corpus.
