Web Scraping APIs

What Is a Self-Healing Scraper?

What Is a Self-Healing Scraper? — conceptual illustration
On this page

A self-healing scraper detects mid-run that its selectors stopped working, sends the broken page HTML to an LLM (typically Claude Haiku or GPT-4o-mini for cost), and writes corrected selectors automatically — without a code deployment. The pattern works because selector changes are the single largest failure mode for long-running production spiders. Pair this with a Pydantic+Instructor extraction layer downstream and you have a pipeline that survives most site redesigns autonomously.

Quick facts

What it fixesCSS / XPath selector changes (~80% of spider breakages)
What it does not fixAnti-bot escalation (site upgraded to Cloudflare), schema-level changes
Cost per heal~$0.0003 with Claude Haiku, ~$0.01 with Sonnet
Detection triggerItem count drops to zero (or below threshold) after a run
Required guardrailPydantic schema validation on healed output before persisting

The architecture

Five components compose:

  1. Scrapy extension hook. Listens for item_scraped and spider_closed signals. Counts items. Keeps one full page of broken HTML in memory in case healing is needed.
  2. Failure detector. When the spider closes with zero items (or below a configured threshold), the heal flow fires.
  3. LLM call. Send the old selectors + truncated page HTML (typically the first 8K characters is enough) to the LLM. Prompt asks for corrected selectors as JSON.
  4. Selector updater. Parse the LLM response, write the new selectors to the spider's YAML or JSON config in place. Version-controlled in Git so heals are auditable and revertable.
  5. Validation. Re-run the spider. If items come back, healed. Notify Slack with the diff. If the heal returns plausible-looking but wrong items (Pydantic validation fails on type mismatches), escalate to human review instead of trusting the heal blindly.

Why it works for selector changes

Selector changes are the textbook case where LLMs outperform regex. The HTML is heterogeneous, minified, sometimes obfuscated. Claude or GPT reads it more like a human reads it: "the title is the text inside the first <h1> with class containing 'product'". The model returns h1[class*='product']::text and you keep scraping.

The cost math is favourable. A Claude Haiku heal is roughly $0.0003 per call. Even if your fleet of 50 spiders each break once per quarter, you spend less than a dollar a year on healing — and you do not page a human at 3am. Compare that to one engineer-hour of manual selector debugging and the ROI is obvious.

What this pattern does not do

  • Anti-bot upgrades. If the spider broke because the site went from no protection to Cloudflare, no selector change will help — the spider needs a new TLS or browser layer. The heal flow should detect this (HTTP 403 instead of HTML, Cloudflare challenge page in response) and route to a different alert path rather than asking the LLM to write selectors.
  • Schema-level changes. If the site renamed price to current_price in their JSON-LD, the selector might still find an element but the field has changed structurally. Selector healing plus Pydantic-validated extraction together catch this: the selector finds the element, the LLM call extracts what looks like a price, the schema validates the type, and a normalisation step renames the field. Three layers.
  • LLM hallucinations. Without schema validation on healed output, you can ingest fabricated data. Always validate. If the heal produces strings where you expected integers, fail the heal and escalate.

Code example

python
# Scrapy extension that heals selectors when items drop to zero
from scrapy import signals
import anthropic, json, yaml

HEAL_PROMPT = """You are a web scraping expert. A Scrapy spider broke because the site
changed its HTML.

Old selectors (no longer working):
  title: {title}
  price: {price}
  image: {image}

New page HTML (truncated):
{html}

Return ONLY a JSON object with corrected CSS selectors:
{{"title": "...", "price": "...", "image": "..."}}
"""

class SelfHeal:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def __init__(self):
        self.item_count = 0
        self.broken_page = None

    def item_scraped(self, item, response, spider):
        self.item_count += 1
        if self.broken_page is None:
            self.broken_page = response.text

    def spider_closed(self, spider, reason):
        if self.item_count > 0 or not self.broken_page:
            return
        old = yaml.safe_load(open(f"selectors/{spider.name}.yml"))
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": HEAL_PROMPT.format(
                **old, html=self.broken_page[:8000]
            )}],
        )
        new = json.loads(resp.content[0].text)
        yaml.safe_dump(new, open(f"selectors/{spider.name}.yml", "w"))
        # Slack-notify, then trigger re-run

Related terms

Concept map

How Self-Healing Scraper connects

The terms most directly tied to this one. Hover a node to see its neighbours, click to preview, drag to rearrange.

0 terms · 0 connections
You are here · Web Scraping APIs
Building map…

Frequently asked questions

What if the LLM writes wrong selectors?

The Pydantic validation layer downstream catches this. If the healed selectors return strings where you expected integers, null where required, or values that fail your schema, the validation step rejects the heal, the spider is marked broken, and you escalate to human review. Without validation, the heal is dangerous. With validation, it is safe.

Can I do this without Scrapy?

Yes — the same pattern works with any scraping framework. The four primitives you need are: a way to count items per run, a way to retain one broken page in memory, an LLM client, and a config file for selectors that you can rewrite. Crawlee (Node), Crawl4AI, even a hand-rolled requests + BeautifulSoup spider can implement it.

How often does this trigger?

Depends on the target. E-commerce and news sites redesign frequently (months); enterprise SaaS portals slowly (years). Across a 50-spider fleet, expect a handful of heals per quarter. Most are fixed in minutes by the LLM without a human ever being paged.

What about anti-bot escalations?

A site that went from "no anti-bot" to "Cloudflare" will return HTTP 403 or a challenge page rather than HTML with missing selectors. The heal flow should detect this (response status, response body pattern) and route to a different alert: "spider needs anti-bot upgrade", not "spider needs new selectors". Two failure modes, two playbooks.

Last updated: 2026-05-26