Why raw LLM extraction fails in production
Ask GPT-4 or Claude for a salary across 10,000 job-board pages and the same field comes back in many shapes: $40,000, 40 dollars, 40k USD, "forty thousand", and sometimes null or a number it simply made up. A database cannot store that mess. Add a date field and you hit a worse problem: when a scraped article has no publication date, the LLM may invent one that fits the article's tone. That fabrication then flows into your pipeline as if it were a real fact.
The fix is not "better prompting" — it is structural. Keep the two jobs separate: semantic understanding (reading the page, which the LLM is good at) and structural guarantees (enforcing the exact shape, which schema validation is good at).
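As a rough illustration of the structural half of that split, the sketch below validates one already-decoded LLM result against a fixed shape. It uses only the Python standard library as a stand-in for a schema library such as Pydantic, and the `JobPosting` class, its field names, and `parse_posting` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class JobPosting:
    """Target shape for one scraped posting (hypothetical schema)."""
    salary_usd: Optional[int]   # whole dollars, or None if the page has no salary
    published: Optional[date]   # None when the page shows no publication date

    def __post_init__(self) -> None:
        if self.salary_usd is not None:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(self.salary_usd, bool) or not isinstance(self.salary_usd, int):
                raise ValueError(f"salary_usd must be an int, got {self.salary_usd!r}")
            if self.salary_usd < 0:
                raise ValueError("salary_usd must be non-negative")
        if self.published is not None and not isinstance(self.published, date):
            raise ValueError(f"published must be a date, got {self.published!r}")


def parse_posting(raw: dict) -> JobPosting:
    """Validate one JSON-decoded LLM result; raise ValueError on any mismatch."""
    published = raw.get("published")
    if published is not None:
        if not isinstance(published, str):
            raise ValueError(f"published must be an ISO date string, got {published!r}")
        published = date.fromisoformat(published)  # ValueError on malformed dates
    return JobPosting(salary_usd=raw.get("salary_usd"), published=published)
```

With this in place, `{"salary_usd": 40000, "published": "2024-03-01"}` is accepted, while `{"salary_usd": "40k USD"}` is rejected before it reaches the database. Shape checks alone cannot tell a real date from a plausible invented one; making the field optional at least gives the model a legitimate way to report that the page has no date.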
How Instructor adds value over the raw API
Anthropic and OpenAI both let you request structured output directly using a JSON schema (a description of the expected fields and types). Instructor wraps those built-in features and adds three things that matter in production:
- Automatic retries on validation failure. If the LLM returns a string where you asked for an int, Instructor re-asks the model (sending the validation error back as a hint) until it gets valid output or hits max_retries. You do not write retry logic.
- Multi-provider abstraction. The same Pydantic schema works with OpenAI, Anthropic, Mistral, Cohere, Gemini, and local Ollama. Switch providers without rewriting your extraction code.
- Streaming + partial validation. For large schemas, Instructor streams partial Pydantic objects as the LLM produces them — handy for low-latency UIs that show results as they arrive.
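The retry behaviour in the first bullet can be pictured as a small loop. The following is a conceptual sketch only, not Instructor's actual implementation: `ask_model`, `validate_salary`, `extract_with_retries`, and `fake_model` are hypothetical names, and the model is a stub so the example runs without any API key.

```python
import json
from typing import Callable, List


def validate_salary(payload: str) -> int:
    """Parse the model's JSON reply and insist on an integer salary_usd."""
    data = json.loads(payload)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    value = data.get("salary_usd")
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"salary_usd must be an integer, got {value!r}")
    return value


def extract_with_retries(
    ask_model: Callable[[List[str]], str],
    prompt: str,
    max_retries: int = 2,
) -> int:
    """Ask once, then re-ask up to max_retries times with the error as a hint."""
    messages = [prompt]
    for _ in range(max_retries + 1):
        reply = ask_model(messages)
        try:
            return validate_salary(reply)
        except ValueError as err:
            # Feed the failed reply and the validation error back to the model
            messages.append(reply)
            messages.append(f"Validation failed: {err}. Return corrected JSON only.")
    raise RuntimeError(f"no valid output after {max_retries} retries")


def fake_model(messages: List[str]) -> str:
    """Stub model: wrong type first, corrected once it sees a validation hint."""
    if any(m.startswith("Validation failed") for m in messages):
        return '{"salary_usd": 40000}'
    return '{"salary_usd": "40k USD"}'


salary = extract_with_retries(fake_model, "Extract the salary as JSON.")
```

Here the first reply fails validation, the error text is appended to the conversation, and the second reply passes, so `salary` ends up as the integer 40000. With `max_retries=0` the same stub would raise `RuntimeError` instead.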
When classical NLP still wins
LLM extraction costs roughly $0.001–$0.01 per article on modern Claude or GPT-class models. Classical NLP — older, rule- and statistics-based text tools like spaCy NER (named-entity recognition) and dependency parsing (working out grammatical structure) — costs effectively zero once the model is loaded. Use classical NLP when you are scraping millions of consistent documents with a fixed schema, your latency budget is under 5 ms, or cost matters more than handling edge cases. Use LLM + Instructor when sources vary, when meaning depends on context ("Apple" the company vs. the fruit), when the schema may change, or when you need to resolve equivalent phrasings ("FTE" = "full-time" = "permanent" = "direct hire").
The pattern Bloomberg, Reuters Refinitiv, and FactSet actually use is a hybrid: cheap classical NLP as a fast pre-filter that tags 95% of documents, with the LLM reserved for the ambiguous 5%. On a million-document corpus at the per-article prices above, sending every document to the LLM costs roughly $1,000–$10,000, while sending only the ambiguous 5% costs roughly $50–$500.
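A minimal sketch of that hybrid routing, together with the cost arithmetic, might look like the following. Everything here is illustrative: `classical_salary` is a toy regex standing in for a real classical pipeline such as spaCy, the `llm` callable is whatever LLM extractor you already have, and the 0.9 confidence threshold is an assumed value, not a recommendation.

```python
import re
from typing import Callable, Optional, Tuple

Extraction = Tuple[Optional[int], float]  # (value, confidence in [0, 1])

# Toy pattern: matches amounts written like "$40,000"
_SALARY = re.compile(r"\$(\d{1,3}(?:,\d{3})+)")


def classical_salary(doc: str) -> Extraction:
    """Cheap pre-filter (stand-in for a classical NLP extractor)."""
    match = _SALARY.search(doc)
    if match is None:
        return None, 0.0
    return int(match.group(1).replace(",", "")), 1.0


def route(
    doc: str,
    classical: Callable[[str], Extraction],
    llm: Callable[[str], Optional[int]],
    min_confidence: float = 0.9,
) -> Tuple[Optional[int], str]:
    """Use the classical result when it is confident; otherwise call the LLM."""
    value, confidence = classical(doc)
    if value is not None and confidence >= min_confidence:
        return value, "classical"
    return llm(doc), "llm"


def llm_spend(num_docs: int, llm_share: float, cost_per_doc: float) -> float:
    """Total LLM cost when only llm_share of the corpus reaches the model."""
    return num_docs * llm_share * cost_per_doc


# Per-article price range quoted above, applied to a million documents
hybrid_low = llm_spend(1_000_000, 0.05, 0.001)
hybrid_high = llm_spend(1_000_000, 0.05, 0.01)
full_low = llm_spend(1_000_000, 1.0, 0.001)
full_high = llm_spend(1_000_000, 1.0, 0.01)
```

The router only pays for an LLM call when the cheap extractor declines to answer, so the LLM bill scales with the ambiguous share of the corpus and not with its total size.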
