What Is a Self-Healing Scraper?

By the Scrappey Research Team

A self-healing scraper is a scraper that notices, while it is running, that the rules it uses to find data on a page have stopped working — and then fixes those rules on its own. When it breaks, it sends the page's HTML to an LLM (a large language model like Claude Haiku or GPT-4o-mini, both cheap to run) and asks for corrected selectors — the small instructions, like "the price is inside this tag", that tell the scraper where each value lives. The new selectors are written automatically, with no code deployment. This matters because changes to selectors are the single largest failure mode for long-running production spiders: sites get redesigned and the old rules silently miss everything. Pair this with a Pydantic + Instructor extraction layer downstream (a step that checks the extracted data matches the shape you expect) and you have a pipeline that survives most site redesigns on its own.

What it fixes	CSS / XPath selector changes (~80% of spider breakages)
What it does not fix	Anti-bot escalation (site upgraded to Cloudflare), schema-level changes
Cost per heal	~$0.0003 with Claude Haiku, ~$0.01 with Sonnet
Detection trigger	Item count drops to zero (or below threshold) after a run
Required guardrail	Pydantic schema validation on healed output before persisting

The architecture

Five parts work together:

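Scrapy extension hook. It listens for Scrapy signals (events Scrapy fires, such as item_scraped when the spider yields an item and spider_closed when a run ends). It counts the items found and keeps one full page of HTML in memory in case healing is needed. That page has to be captured from a response rather than from a scraped item, because a page whose selectors are broken yields no items.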
Failure detector. When a run ends with zero items (or fewer than a threshold you set), it triggers the heal flow.
LLM call. Send the old selectors plus a trimmed copy of the page HTML (the first 8K characters is usually enough) to the LLM. The prompt asks it to return corrected selectors as JSON.
Selector updater. Read the LLM's answer and write the new selectors straight into the spider's YAML or JSON config. The config lives in Git, so every heal is auditable and easy to revert.
Validation. Re-run the spider. If items come back, it healed — notify Slack with the diff. If the heal returns items that look plausible but are wrong (Pydantic flags the wrong data type), escalate to a human instead of blindly trusting the fix.
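
Two of the parts above, the failure detector and the first half of the selector updater, can be sketched as small pure functions. This is a minimal sketch under stated assumptions: the field names (title, price, image), the min_items default, and the function names should_heal and parse_selector_reply are illustrative and not part of Scrapy or any LLM SDK.

```python
import json
import re

EXPECTED_FIELDS = {"title", "price", "image"}  # assumed field names
FENCE = "`" * 3  # a Markdown code fence, which models sometimes wrap JSON in
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def should_heal(item_count: int, min_items: int = 1) -> bool:
    """Failure detector: heal when a finished run yielded fewer items than expected."""
    return item_count < min_items


def parse_selector_reply(reply: str) -> dict:
    """Turn the model's reply into a selector dict, or raise ValueError.

    Rejecting anything that is not exactly the requested JSON object keeps
    a chatty or malformed reply from ever reaching the config file.
    """
    text = reply.strip()
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        selectors = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(selectors, dict) or set(selectors) != EXPECTED_FIELDS:
        raise ValueError(f"expected keys {sorted(EXPECTED_FIELDS)}, got {selectors!r}")
    if not all(isinstance(v, str) and v.strip() for v in selectors.values()):
        raise ValueError("every selector must be a non-empty string")
    return selectors
```

Only a dict that survives parse_selector_reply would be written to the config. A ValueError is a reasonable point to stop and alert a human rather than retry indefinitely.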

Why it works for selector changes

Selector changes are the textbook case where LLMs beat plain regex (pattern matching on raw text). Real-world HTML is messy: inconsistent, minified, sometimes deliberately scrambled. An LLM reads it the way a person would: "the title is the text inside the first <h1> whose class contains 'product'". The model hands back h1[class*='product']::text and you keep scraping.

The cost math is favourable. A Claude Haiku heal costs roughly $0.0003 per call. Even if each of 50 spiders breaks once per quarter, that is 200 heals a year, or about $0.06 in LLM spend, and nobody gets paged at 3am. Set that against one engineer-hour of manual selector debugging and the ROI is obvious.
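
The arithmetic behind that claim is short enough to check directly. The per-heal prices are the figures quoted in this article; the sketch assumes one LLM call per heal and ignores the cost of the validation re-run.

```python
SPIDERS = 50
BREAKS_PER_SPIDER_PER_YEAR = 4  # once per quarter
COST_PER_HEAL_USD = {"haiku": 0.0003, "sonnet": 0.01}  # figures quoted in this article

heals_per_year = SPIDERS * BREAKS_PER_SPIDER_PER_YEAR
annual_cost = {model: heals_per_year * cost for model, cost in COST_PER_HEAL_USD.items()}

for model, cost in annual_cost.items():
    print(f"{model}: {heals_per_year} heals/year -> ${cost:.2f}")
```

At 200 heals a year this comes to about $0.06 with Haiku and $2.00 with Sonnet.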

What this pattern does not do

Anti-bot upgrades. If the spider broke because the site added Cloudflare protection where there was none, no selector change helps — the spider needs a new TLS or browser layer (a way to look like a real browser, TLS being the encryption behind https). The heal flow should spot this case (an HTTP 403 instead of HTML, or a Cloudflare challenge page in the response) and send it to a different alert rather than asking the LLM to write selectors.
Schema-level changes. If the site renamed price to current_price in its JSON-LD (structured product data embedded in the page), the selector may still find an element, but the field itself has changed. Selector healing plus Pydantic-validated extraction catch this together: the selector finds the element, the LLM call extracts what looks like a price, the schema checks the type, and a normalisation step renames the field. Three layers.
LLM hallucinations. Without schema validation on healed output, you can ingest made-up data. Always validate. If the heal returns strings where you expected integers, fail the heal and escalate.
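
The validation guardrail can be sketched with a small Pydantic model. This assumes pydantic is installed; the Product schema and the validate_heal helper are hypothetical, so substitute the fields your spider actually emits.

```python
from typing import List, Tuple

from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    # Hypothetical schema for illustration only
    title: str
    price: float
    image: str


def validate_heal(items: List[dict]) -> Tuple[bool, List[str]]:
    """Return (accepted, errors). Any invalid item rejects the whole heal."""
    if not items:
        return False, ["healed spider returned no items"]
    errors = []
    for index, raw in enumerate(items):
        try:
            Product(**raw)
        except ValidationError as exc:
            errors.append(f"item {index}: {exc}")
    return not errors, errors
```

A numeric string such as "19.99" is coerced to a float and passes, while button text captured by a wrong price selector fails. Rejecting the whole heal on a single bad item is a deliberately strict choice; a tolerance threshold is an equally reasonable design.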

Code example

python

# Scrapy extension that heals selectors when a run yields zero items
import json

import anthropic
import yaml
from scrapy import signals
from scrapy.http import HtmlResponse

HEAL_PROMPT = """You are a web scraping expert. A Scrapy spider broke because the site
changed its HTML.

Old selectors (no longer working):
  title: {title}
  price: {price}
  image: {image}

New page HTML (truncated):
{html}

Return ONLY a JSON object with corrected CSS selectors:
{{"title": "...", "price": "...", "image": "..."}}
"""

class SelfHeal:
    def __init__(self):
        self.item_count = 0
        self.broken_page = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        # item_scraped never fires on a broken page, so capture HTML from responses
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def item_scraped(self, item, response, spider):
        self.item_count += 1

    def response_received(self, response, request, spider):
        # Keep the first HTML response as the sample page sent to the LLM
        if self.broken_page is None and isinstance(response, HtmlResponse):
            self.broken_page = response.text

    def spider_closed(self, spider, reason):
        # Heal only when the run produced no items and a page was captured
        if self.item_count > 0 or not self.broken_page:
            return
        path = f"selectors/{spider.name}.yml"
        with open(path) as f:
            old = yaml.safe_load(f)
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": HEAL_PROMPT.format(
                **old, html=self.broken_page[:8000]
            )}],
        )
        new = json.loads(resp.content[0].text)
        # Refuse to overwrite the config unless the reply has exactly the old keys
        if not isinstance(new, dict) or set(new) != set(old):
            spider.logger.error("Heal rejected: unexpected reply %r", new)
            return
        with open(path, "w") as f:
            yaml.safe_dump(new, f)
        # Slack-notify, then trigger re-run
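
Scrapy only loads an extension that is listed in the project's EXTENSIONS setting. A minimal sketch of that registration follows; the module path myproject.extensions is a hypothetical location for the SelfHeal class above, and 500 is an arbitrary load order.

```python
# settings.py
EXTENSIONS = {
    "myproject.extensions.SelfHeal": 500,  # hypothetical module path
}
```

The extension also expects a selectors/<spider name>.yml file with title, price and image keys to exist before the first run, since it reads the old selectors from there.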

Frequently asked questions

What if the LLM writes wrong selectors?

The Pydantic validation layer downstream catches it. If the healed selectors return strings where you expected integers, null where a value is required, or anything that fails your schema, the validation step rejects the heal, the spider is marked broken, and a human reviews it. Without validation, the heal is dangerous. With validation, it is safe.

Can I do this without Scrapy?

Yes — the same pattern works with any scraping framework. You need four building blocks: a way to count items per run, a way to hold one broken page in memory, an LLM client, and a selector config file you can rewrite. Crawlee (Node), Crawl4AI, even a hand-rolled requests + BeautifulSoup spider can do it.

How often does this trigger?

It depends on the target. E-commerce and news sites redesign often (every few months); enterprise SaaS portals change slowly (years). Across a 50-spider fleet, expect a handful of heals per quarter. Most are fixed in minutes by the LLM, with no human ever paged.

What about anti-bot escalations?

A site that goes from "no anti-bot" to "Cloudflare" returns an HTTP 403 or a challenge page rather than HTML with missing selectors. The heal flow should detect this (by the response status and body pattern) and route to a different alert: "spider needs anti-bot upgrade", not "spider needs new selectors". Two failure modes, two playbooks.
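
The routing decision described above can be sketched as a single function that runs before any LLM call. The marker strings are illustrative assumptions about what a challenge page may contain, not an exhaustive or official list, and classify_failure is a hypothetical name.

```python
from enum import Enum


class Failure(Enum):
    ANTI_BOT = "spider needs anti-bot upgrade"
    SELECTORS = "spider needs new selectors"


BLOCK_STATUSES = {403}
# Assumed markers; real challenge pages vary and change over time
CHALLENGE_MARKERS = ("just a moment", "challenge-platform", "attention required")


def classify_failure(status: int, body: str) -> Failure:
    """Decide which playbook a zero-item run belongs to."""
    lowered = body[:20000].lower()
    if status in BLOCK_STATUSES or any(marker in lowered for marker in CHALLENGE_MARKERS):
        return Failure.ANTI_BOT
    return Failure.SELECTORS
```

Only a Failure.SELECTORS result should reach the heal flow. Anything classified as Failure.ANTI_BOT goes to the other alert and never spends an LLM call.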

Last updated: 2026-05-31