What Is Schema-Validated LLM Extraction?

By the Scrappey Research Team

Schema-validated LLM extraction is the standard production pattern for AI scraping: you describe the data you want as a Pydantic schema (a Python class that defines field names and types), hand it to a large language model (LLM) through the Instructor library, and get back a checked, typed Python object instead of a loose string. LLMs are great at reading messy HTML but unreliable at returning data in a consistent shape. Validating against the schema rejects malformed output and wrong types, gives you one place to normalise currencies and units, and flags values that break your rules. Instructor then retries automatically, so you do not have to write that retry code yourself.

Stack	Pydantic (schemas) + Instructor (LLM-output validation) + any LLM
Providers supported	Anthropic, OpenAI, Mistral, Cohere, Gemini, Ollama (local)
What it catches	Type mismatches, hallucinated dates, malformed JSON, missing required fields
Failure mode handled	LLM returns "competitive" for an int field → Instructor retries automatically
Cost overhead	~0–2× base LLM cost depending on retry rate

Why raw LLM extraction fails in production

Ask GPT-4 or Claude for a salary across 10,000 job-board pages and the same field comes back in many shapes: $40,000, 40 dollars, 40k USD, "forty thousand", and sometimes null or a number it simply made up. A database cannot store that mess. Add a date field and you hit a worse problem: when a scraped article has no publication date, the LLM may invent one that fits the article's tone. That fabrication then flows into your pipeline as if it were a real fact.

The fix is not "better prompting" — it is structural. Keep the two jobs separate: semantic understanding (reading the page, which the LLM is good at) and structural guarantees (enforcing the exact shape, which schema validation is good at).
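
To make the structural half concrete, here is a minimal sketch of a Pydantic validator that normalises a few of the salary shapes listed above and rejects the rest. The `Salary` class, its `amount_usd` field, and the regular expression are hypothetical names chosen for this illustration; a real pipeline would need broader rules (other currencies, ranges, spelled-out numbers).

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class Salary(BaseModel):
    """Hypothetical one-field schema, used only for this illustration."""

    amount_usd: int

    @field_validator("amount_usd", mode="before")
    @classmethod
    def parse_amount(cls, value):
        # Integers pass straight through to Pydantic's own int check.
        if isinstance(value, int):
            return value
        # Lower-case, drop thousands separators, trim whitespace.
        text = str(value).lower().replace(",", "").strip()
        # Accepts shapes such as "$40000", "40k usd", "40000 dollars".
        match = re.fullmatch(r"\$?\s*(\d+(?:\.\d+)?)\s*(k)?\s*(?:usd|dollars)?", text)
        if match is None:
            raise ValueError(f"not a recognisable USD amount: {value!r}")
        amount = float(match.group(1))
        if match.group(2):  # a trailing "k" means thousands
            amount *= 1_000
        return int(amount)


for raw in ["$40,000", "40k USD", 40000, "forty thousand", "competitive"]:
    try:
        print(repr(raw), "->", Salary(amount_usd=raw).amount_usd)
    except ValidationError as exc:
        print(repr(raw), "-> rejected:", exc.errors()[0]["msg"])
```

The LLM never has to get the format right on its own here: anything the validator cannot turn into an integer is rejected before it reaches the database.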

How Instructor adds value over raw API

Anthropic and OpenAI both let you request structured output directly using a JSON schema (a description of the expected fields and types). Instructor wraps those built-in features and adds three things that matter in production:

Automatic retries on validation failure. If the LLM returns a string where you asked for an int, Instructor re-asks the model — sending the validation error back as a hint — until it gets valid output or hits max_retries. You do not write retry logic.
Multi-provider abstraction. The same Pydantic schema works with OpenAI, Anthropic, Mistral, Cohere, Gemini, and local Ollama. Switch providers without rewriting your extraction code.
Streaming + partial validation. For large schemas, Instructor streams partial Pydantic objects as the LLM produces them — handy for low-latency UIs that show results as they arrive.
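
The retry behaviour described above can be pictured as a small loop. The sketch below is a conceptual illustration, not Instructor's actual implementation: `fake_llm` is a hypothetical stand-in for a model call, and `Posting`, `extract_with_retries`, and `max_attempts` are names invented for this example.

```python
import json

from pydantic import BaseModel, ValidationError


class Posting(BaseModel):
    title: str
    years_experience_min: int


def fake_llm(prompt: str) -> str:
    """Hypothetical model stub: wrong on the first try, correct once it sees the error."""
    if "Validation errors" in prompt:
        return json.dumps({"title": "Data Engineer", "years_experience_min": 3})
    return json.dumps({"title": "Data Engineer", "years_experience_min": "several"})


def extract_with_retries(prompt: str, max_attempts: int = 3) -> tuple[Posting, int]:
    """Return the validated object and the number of attempts it took."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = fake_llm(current_prompt)
        try:
            return Posting.model_validate_json(raw), attempt
        except ValidationError as exc:
            last_error = exc
            # Send the validation error back as a hint for the next attempt.
            current_prompt = (
                f"{prompt}\n\nValidation errors from your last answer:\n{exc}"
            )
    # Every attempt failed: surface the last validation error to the caller.
    raise last_error


posting, attempts = extract_with_retries("Extract the job posting.")
print(posting, "after", attempts, "attempts")
```

Each failed attempt costs one more model call, which is where the retry-rate-dependent cost overhead comes from.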

When classical NLP still wins

LLM extraction costs roughly $0.001–$0.01 per article on modern Claude or GPT-class models. Classical NLP — older, rule- and statistics-based text tools like spaCy NER (named-entity recognition) and dependency parsing (working out grammatical structure) — costs effectively zero once the model is loaded. Use classical NLP when you are scraping millions of consistent documents with a fixed schema, your latency budget is under 5 ms, or cost matters more than handling edge cases. Use LLM + Instructor when sources vary, when meaning depends on context ("Apple" the company vs. the fruit), when the schema may change, or when you need to resolve equivalent phrasings ("FTE" = "full-time" = "permanent" = "direct hire").

The pattern Bloomberg, Reuters Refinitiv, and FactSet actually use is a hybrid: cheap classical NLP as a fast pre-filter that tags 95% of documents, with the LLM reserved for the ambiguous 5%. On a million-document corpus at the per-article prices above, that cuts LLM spend from roughly $1,000–$10,000 to roughly $50–$500.
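
The arithmetic behind the hybrid saving is easy to check. This is a rough sketch that uses the per-article price range and the 95/5 split quoted above, and assumes classical NLP costs nothing per document; `extraction_cost` and `llm_share` are names invented for this example.

```python
def extraction_cost(documents: int, price_per_doc: float, llm_share: float = 1.0) -> float:
    """LLM spend in USD when only `llm_share` of the documents reach the LLM."""
    return documents * llm_share * price_per_doc


DOCS = 1_000_000
for price in (0.001, 0.01):  # low and high end of the per-article range
    llm_only = extraction_cost(DOCS, price)
    hybrid = extraction_cost(DOCS, price, llm_share=0.05)  # LLM sees the ambiguous 5%
    print(f"${price}/doc: LLM-only ${llm_only:,.0f}, hybrid ${hybrid:,.0f}")
```

Retries are not modelled here; a non-zero retry rate scales both columns up by the same factor.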

Code example

python

# pip install instructor pydantic anthropic
from pydantic import BaseModel, Field
import instructor, anthropic

class JobPosting(BaseModel):
    title: str
    company: str
    salary_min_usd: int | None = Field(
        description="Floor of salary range in USD. Convert from other currencies if needed."
    )
    salary_max_usd: int | None
    years_experience_min: int
    location: str
    remote: bool

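# Placeholder input; in a real pipeline this is the HTML you scraped.
scraped_html = "<h1>Data Engineer</h1><p>Acme Corp, Austin TX. $95,000-$120,000. 3+ years. Remote OK.</p>"

# The Anthropic client reads ANTHROPIC_API_KEY from the environment.
client = instructor.from_anthropic(anthropic.Anthropic())

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=JobPosting,
    messages=[{"role": "user", "content": f"Extract from:\n\n{scraped_html}"}],
    max_retries=3,
)
# result is a validated JobPosting object: Instructor caught any type
# mismatches and retried until the output was valid (or raised an error).
print(result.salary_min_usd, type(result.salary_min_usd))  # e.g. 95000 <class 'int'>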

Frequently asked questions

Is Instructor a wrapper around the LLM API?

Yes. It patches the official SDK (Anthropic, OpenAI, and others) so you can pass a response_model argument that points at a Pydantic class. The library then handles building the prompt, injecting the JSON schema, parsing the response, validating it, and retrying. Your code still looks almost exactly like a direct SDK call.

What if the LLM keeps failing validation?

Instructor retries up to max_retries (default 3, configurable). After that it raises an exception (recent releases wrap the final Pydantic ValidationError in an InstructorRetryException), so you can catch it and fall back to manual handling, queue the item for human review, or skip the record. In practice, a retry rate above about 2% usually means your schema is too strict for the source data or your prompt is unclear.
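
A minimal sketch of that fallback for a batch job follows. The `extract` function is a hypothetical stand-in for an Instructor call that has already used up its retries; it raises Pydantic's ValidationError, whereas with the real library you would catch whichever exception your installed version raises. The check also measures the share of records that still fail after retries, which is a coarser signal than the retry rate itself.

```python
from pydantic import BaseModel, ValidationError


class Price(BaseModel):
    amount_usd: int


def extract(record: str) -> Price:
    """Hypothetical stub standing in for an LLM extraction call."""
    return Price.model_validate({"amount_usd": record})


def run_batch(records: list[str], alert_threshold: float = 0.02):
    """Validate each record, divert failures to review, and flag a high failure rate."""
    accepted, review_queue = [], []
    for record in records:
        try:
            accepted.append(extract(record))
        except ValidationError:
            # Hand the raw record to a human reviewer or a fallback parser.
            review_queue.append(record)
    failure_rate = len(review_queue) / len(records) if records else 0.0
    return accepted, review_queue, failure_rate > alert_threshold


accepted, review_queue, alert = run_batch(["95000", "competitive", "120000"])
print(len(accepted), "accepted;", review_queue, "queued; alert:", alert)
```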

Does this work with local LLMs?

Yes. Instructor supports Ollama and other tools that run models on your own hardware. Smaller local models fail validation more often than Claude or GPT-4, so the extra cost of those retries can cancel out the savings from not paying for API calls. Benchmark it for your specific use case.

Why not just use the OpenAI structured outputs feature directly?

You can — but Instructor layers the automatic retries, multi-provider support, and streaming on top. For one-off extractions the direct API is fine; for production pipelines the retry logic and the freedom to swap providers pay off quickly.

Last updated: 2026-05-31