The architecture
Five components compose:
- Scrapy extension hook. Listens for
item_scrapedandspider_closedsignals. Counts items. Keeps one full page of broken HTML in memory in case healing is needed. - Failure detector. When the spider closes with zero items (or below a configured threshold), the heal flow fires.
- LLM call. Send the old selectors + truncated page HTML (typically the first 8K characters is enough) to the LLM. Prompt asks for corrected selectors as JSON.
- Selector updater. Parse the LLM response, write the new selectors to the spider's YAML or JSON config in place. Version-controlled in Git so heals are auditable and revertable.
- Validation. Re-run the spider. If items come back, healed. Notify Slack with the diff. If the heal returns plausible-looking but wrong items (Pydantic validation fails on type mismatches), escalate to human review instead of trusting the heal blindly.
Why it works for selector changes
Selector changes are the textbook case where LLMs outperform regex. The HTML is heterogeneous, minified, sometimes obfuscated. Claude or GPT reads it more like a human reads it: "the title is the text inside the first <h1> with class containing 'product'". The model returns h1[class*='product']::text and you keep scraping.
The cost math is favourable. A Claude Haiku heal is roughly $0.0003 per call. Even if your fleet of 50 spiders each break once per quarter, you spend less than a dollar a year on healing — and you do not page a human at 3am. Compare that to one engineer-hour of manual selector debugging and the ROI is obvious.
What this pattern does not do
- Anti-bot upgrades. If the spider broke because the site went from no protection to Cloudflare, no selector change will help — the spider needs a new TLS or browser layer. The heal flow should detect this (HTTP 403 instead of HTML, Cloudflare challenge page in response) and route to a different alert path rather than asking the LLM to write selectors.
- Schema-level changes. If the site renamed
pricetocurrent_pricein their JSON-LD, the selector might still find an element but the field has changed structurally. Selector healing plus Pydantic-validated extraction together catch this: the selector finds the element, the LLM call extracts what looks like a price, the schema validates the type, and a normalisation step renames the field. Three layers. - LLM hallucinations. Without schema validation on healed output, you can ingest fabricated data. Always validate. If the heal produces strings where you expected integers, fail the heal and escalate.
