What Is Schema-Validated LLM Extraction?

Schema-validated LLM extraction is the production pattern for AI scraping: define a Pydantic schema for what you want, pass it to the LLM via the Instructor library, and get back a validated, typed Python object instead of a string. LLMs are excellent at understanding messy HTML but terrible at returning data in a consistent shape. Schema validation rejects malformed output and type mismatches. With field descriptions and custom validators it also normalises currencies and units and can flag values the source does not support. When validation fails, Instructor re-prompts automatically, so you do not write retry logic.

Quick facts

Stack: Pydantic (schemas) + Instructor (LLM-output validation) + any LLM
Providers supported: Anthropic, OpenAI, Mistral, Cohere, Gemini, Ollama (local)
What it catches: Type mismatches, hallucinated dates, malformed JSON, missing required fields
Failure mode handled: LLM returns "competitive" for an int field → Instructor retries automatically
Cost overhead: ~0–2× base LLM cost depending on retry rate

Why raw LLM extraction fails in production

Ask GPT-4 or Claude for a salary across 10,000 job-board pages and you will get $40,000, 40 dollars, 40k USD, "forty thousand", and occasionally null or a wholly invented number. Your database cannot ingest that. Add a date field and you discover the more dangerous failure: when a scraped article does not contain a publication date, the LLM sometimes invents one that fits the tone of the article. The fabrication passes into your pipeline as a fact.
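
The salary variants above can be collapsed into one integer by a Pydantic "before" validator. The following is a minimal sketch, assuming Pydantic v2; the `Salary` model, its `amount_usd` field, and the parsing rules are illustrative assumptions, not part of Instructor or of any particular pipeline.

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class Salary(BaseModel):
    """Hypothetical model: one salary figure, always stored as whole USD."""

    amount_usd: int

    @field_validator("amount_usd", mode="before")
    @classmethod
    def parse_amount(cls, value):
        # Integers pass straight through to Pydantic's normal int check.
        if isinstance(value, int):
            return value
        if not isinstance(value, str):
            raise ValueError(f"unsupported salary value: {value!r}")
        # Strip the decorations seen in scraped text: "$", ",", "USD".
        text = value.lower().replace(",", "").replace("usd", "").replace("$", "").strip()
        match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(k?)", text)
        if match is None:
            # Words such as "competitive" or "forty thousand" are rejected,
            # not guessed at.
            raise ValueError(f"cannot parse salary: {value!r}")
        number = float(match.group(1))
        if match.group(2) == "k":
            number *= 1000
        return int(number)


for raw in ("$40,000", "40k USD", "competitive"):
    try:
        print(raw, "->", Salary(amount_usd=raw).amount_usd)
    except ValidationError as exc:
        print(raw, "-> rejected:", exc.errors()[0]["msg"])
```

Ambiguous inputs such as "40 dollars" are deliberately rejected by this sketch rather than coerced, so they surface as validation errors instead of silently entering the database.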

The fix is not "better prompting" — it is structural. Separate semantic understanding (what the LLM does well) from structural guarantees (what schema validation does well).

How Instructor adds value over the raw API

Anthropic and OpenAI both support structured outputs via JSON schema directly. Instructor wraps those primitives with three things that matter in production:

  1. Automatic retries on validation failure. If the LLM returns a string where you specified an int, Instructor re-prompts with the validation error message until either it gets valid output or hits max_retries. You do not write retry logic.
  2. Multi-provider abstraction. Same Pydantic schema works with OpenAI, Anthropic, Mistral, Cohere, Gemini, and local Ollama. Swap providers without rewriting extraction logic.
  3. Streaming + partial validation. For large schemas, Instructor streams partial Pydantic objects as the LLM generates them. Useful for low-latency UIs.
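
The retry behaviour in item 1 can be pictured as a loop like the one below. This is a conceptual sketch, not Instructor's actual implementation: `validate_job`, `extract_with_retries`, `max_attempts`, and the stubbed `fake_llm` are hypothetical names, and validation is hand-rolled with the standard library so the example runs without an API key.

```python
import json
from typing import Callable


def validate_job(raw: str) -> dict:
    """Parse JSON and require an integer years_experience_min field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    years = data.get("years_experience_min")
    if not isinstance(years, int) or isinstance(years, bool):
        raise ValueError(f"years_experience_min must be an int, got {years!r}")
    return data


def extract_with_retries(call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, validate, and re-prompt with the error until valid."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            return validate_job(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the model can correct itself.
            current_prompt = f"{prompt}\n\nYour previous answer was rejected: {exc}. Please correct it."
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stub model: wrong type on the first call, valid JSON on the second.
replies = iter(['{"years_experience_min": "competitive"}', '{"years_experience_min": 3}'])
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


result = extract_with_retries(fake_llm, "Extract the job posting.")
print(result, "after", len(prompts_seen), "calls")
```

Each failed attempt costs another model call, which is where the cost overhead from retries comes from.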

When classical NLP still wins

LLM extraction costs roughly $0.001–$0.01 per article on modern Claude or GPT-class models. Classical NLP (spaCy NER, dependency parsing) costs effectively zero after model load. Use classical NLP when you are scraping millions of consistent documents with a fixed schema, when the latency budget is under 5 ms, or when cost matters more than edge-case nuance. Use LLM + Instructor when sources are heterogeneous, when context disambiguation matters ("Apple" the company vs. the fruit), when the schema may evolve, or when semantic equivalences need resolving ("FTE" = "full-time" = "permanent" = "direct hire").

The production pattern reportedly used by Bloomberg, Reuters Refinitiv, and FactSet is a hybrid: classical NLP as a fast pre-filter that tags 95% of documents cheaply, with the LLM reserved for the ambiguous 5%. At the per-article prices above, that hybrid cuts LLM spend on a million-document corpus from roughly $1,000–$10,000 to roughly $50–$500.
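
The cost arithmetic behind the hybrid approach is simple enough to check directly. The sketch below uses only the figures already quoted (a million documents, $0.001–$0.01 per article, 5% routed to the LLM); the `llm_cost` helper is a hypothetical name, and classical-NLP cost is treated as zero, as stated above.

```python
def llm_cost(documents: int, cost_per_doc: float, llm_fraction: float = 1.0) -> float:
    """LLM spend in USD when only llm_fraction of the documents reach the model."""
    return documents * llm_fraction * cost_per_doc


DOCS = 1_000_000

for cost_per_doc in (0.001, 0.01):
    llm_only = llm_cost(DOCS, cost_per_doc)
    hybrid = llm_cost(DOCS, cost_per_doc, llm_fraction=0.05)
    print(f"${cost_per_doc}/doc: LLM-only ${llm_only:,.0f}, hybrid ${hybrid:,.0f}")
```

Retries are not included here; multiply by your observed retry overhead to get a closer estimate.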

Code example

python
# pip install instructor pydantic anthropic
from pydantic import BaseModel, Field
import instructor, anthropic

class JobPosting(BaseModel):
    title: str
    company: str
    salary_min_usd: int | None = Field(
        description="Floor of salary range in USD. Convert from other currencies if needed."
    )
    salary_max_usd: int | None
    years_experience_min: int
    location: str
    remote: bool

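# anthropic.Anthropic() reads the API key from the ANTHROPIC_API_KEY env var.
client = instructor.from_anthropic(anthropic.Anthropic())

# Placeholder input; in practice this is the HTML returned by your scraper.
scraped_html = "<div class='job'>...</div>"

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=JobPosting,
    messages=[{"role": "user", "content": f"Extract from:\n\n{scraped_html}"}],
    max_retries=3,
)
# result is a validated JobPosting object: Instructor re-prompts on any
# validation failure, up to max_retries, and raises if none succeeds.
print(result.salary_min_usd, type(result.salary_min_usd))  # e.g. 95000 <class 'int'>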

Frequently asked questions

Is Instructor a wrapper around the LLM API?

Yes — it patches the official SDK (Anthropic, OpenAI, etc.) to accept a response_model argument that points at a Pydantic class. The library handles prompt construction, JSON-schema injection, response parsing, validation, and retry. Your code stays as if you were calling the SDK directly.

What if the LLM keeps failing validation?

Instructor retries up to max_retries, which you can set per call. The default depends on the installed version, so set it explicitly. Once retries are exhausted it raises an exception carrying the last validation error, so you can fall back to manual handling, queue for human review, or skip the record. In practice, retry rates above ~2% suggest your schema is too strict for the source data or your prompt is unclear.
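
A fallback for exhausted retries can be as small as the wrapper below. It is a sketch under assumptions: `extract_or_queue`, `review_queue`, and the stand-in extractor are hypothetical names, and the broad `except Exception` should be narrowed to whatever exception your installed Instructor version raises.

```python
from typing import Any, Callable, Optional


def extract_or_queue(
    extract: Callable[[str], Any],
    document: str,
    review_queue: list,
) -> Optional[Any]:
    """Run an extraction call; on failure, park the document for human review."""
    try:
        return extract(document)
    except Exception:  # assumption: narrow this to the library's retry/validation error
        review_queue.append(document)
        return None


def always_fails(document: str) -> dict:
    # Stand-in for an Instructor call whose retries were exhausted.
    raise ValueError("validation failed after retries")


review_queue = []
record = extract_or_queue(always_fails, "<html>job page</html>", review_queue)
print(record, len(review_queue))
```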

Does this work with local LLMs?

Yes. Instructor supports Ollama and other local-model runners. Smaller local models have higher retry rates than Claude or GPT-4, so the extra compute and latency spent on retries can offset the savings from avoiding API fees. Benchmark for your specific use case.

Why not just use the OpenAI structured outputs feature directly?

You can — but Instructor adds the retry layer, multi-provider abstraction, and streaming on top. For one-off extractions the direct API is fine; for production pipelines the retry logic and provider portability pay off quickly.

Last updated: 2026-05-26