The architecture
Five parts work together:
- Scrapy extension hook. It listens for the item_scraped and spider_closed signals (events Scrapy fires when it grabs an item and when a run ends). It counts the items found and keeps one full page of the broken HTML in memory, in case healing is needed.
- Failure detector. When a run ends with zero items (or fewer than a threshold you set), it triggers the heal flow.
- LLM call. Send the old selectors plus a trimmed copy of the page HTML (the first 8K characters is usually enough) to the LLM. The prompt asks it to return corrected selectors as JSON.
- Selector updater. Read the LLM's answer and write the new selectors straight into the spider's YAML or JSON config. The config lives in Git, so every heal is auditable and easy to revert.
- Validation. Re-run the spider. If items come back, it healed — notify Slack with the diff. If the heal returns items that look plausible but are wrong (Pydantic flags the wrong data type), escalate to a human instead of blindly trusting the fix.
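The first three parts can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a finished extension: HealTracker, remember_page, build_prompt, parse_selectors and the min_items setting are hypothetical names, the prompt wording is invented, and "8K" is read as 8,000 characters. It is kept free of Scrapy imports so it runs on its own; in a real extension the handlers would be connected in from_crawler with crawler.signals.connect, and how the page HTML reaches remember_page is left open.

```python
import json
from dataclasses import dataclass
from typing import Optional

HTML_LIMIT = 8000  # "first 8K characters", read here as 8,000 (assumption)


@dataclass
class HealTracker:
    """Counts scraped items and keeps one page of HTML for a possible heal."""
    min_items: int = 1                 # hypothetical threshold setting
    item_count: int = 0
    sample_html: Optional[str] = None

    def item_scraped(self, item, response=None, spider=None):
        # Same arguments Scrapy passes to item_scraped handlers.
        self.item_count += 1

    def remember_page(self, html: str) -> None:
        # Keep only the first page seen, so memory use stays bounded.
        if self.sample_html is None:
            self.sample_html = html

    def needs_heal(self) -> bool:
        # Intended to be checked from the spider_closed handler.
        return self.item_count < self.min_items


def build_prompt(old_selectors: dict, html: str) -> str:
    """Combine the old selectors with a trimmed copy of the page."""
    return (
        "These CSS selectors stopped matching the page below.\n"
        f"Old selectors: {json.dumps(old_selectors)}\n"
        "Return corrected selectors as a JSON object with the same keys.\n"
        f"HTML:\n{html[:HTML_LIMIT]}"
    )


def parse_selectors(reply: str, expected_keys: set) -> dict:
    """Parse the LLM reply; reject anything that is not a complete mapping."""
    data = json.loads(reply)  # raises ValueError on malformed JSON
    if not isinstance(data, dict) or set(data) != expected_keys:
        raise ValueError("reply does not match the expected selector keys")
    if not all(isinstance(v, str) and v for v in data.values()):
        raise ValueError("every selector must be a non-empty string")
    return data
```

Rejecting a reply whose keys differ from the old config keeps a malformed answer from ever being written to the Git-tracked file.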
Why it works for selector changes
Selector changes are the textbook case where LLMs beat plain regex (pattern matching on raw text). Real-world HTML is messy: inconsistent, minified, sometimes deliberately scrambled. An LLM reads it the way a person would: "the title is the text inside the first <h1> whose class contains 'product'". The model hands back h1[class*='product']::text and you keep scraping.
The cost math is favourable. A Claude Haiku heal costs roughly $0.0003 per call. Even if every spider in a fleet of 50 breaks once per quarter, you spend less than a dollar a year on healing, and nobody gets paged at 3am. Set that against one engineer-hour of manual selector debugging and the ROI is obvious.
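To make the arithmetic concrete, here is the same estimate as a few lines of Python. The per-call price and fleet size are the figures quoted above; the variable names are illustrative, and retries or validation re-runs are not counted.

```python
COST_PER_HEAL_USD = 0.0003        # per-call figure quoted in the text
SPIDERS = 50
BREAKS_PER_SPIDER_PER_YEAR = 4    # once per quarter

heals_per_year = SPIDERS * BREAKS_PER_SPIDER_PER_YEAR
annual_cost_usd = heals_per_year * COST_PER_HEAL_USD

print(f"{heals_per_year} heals/year, about ${annual_cost_usd:.2f}")
```

That works out to 200 heals and roughly six cents a year, well inside the "less than a dollar" bound.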
What this pattern does not do
- Anti-bot upgrades. If the spider broke because the site added Cloudflare protection where there was none, no selector change helps — the spider needs a new TLS or browser layer (a way to look like a real browser, TLS being the encryption behind https). The heal flow should spot this case (an HTTP 403 instead of HTML, or a Cloudflare challenge page in the response) and send it to a different alert rather than asking the LLM to write selectors.
- Schema-level changes. If the site renamed price to current_price in its JSON-LD (structured product data embedded in the page), the selector may still find an element, but the field itself has changed. Selector healing plus Pydantic-validated extraction catch this together: the selector finds the element, the LLM call extracts what looks like a price, the schema checks the type, and a normalisation step renames the field. Three layers.
- LLM hallucinations. Without schema validation on healed output, you can ingest made-up data. Always validate. If the heal returns strings where you expected integers, fail the heal and escalate.
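The three guard rails above could look roughly like this. It is a sketch with several assumptions: the challenge-page markers are examples and not an exhaustive list, the item schema and the rename map are hypothetical, and plain isinstance checks stand in for Pydantic so the example needs only the standard library.

```python
# Assumed markers of a Cloudflare challenge page; not exhaustive.
CHALLENGE_MARKERS = ("just a moment", "/cdn-cgi/challenge-platform/")

EXPECTED_TYPES = {"title": str, "price": float}   # hypothetical item schema
FIELD_RENAMES = {"current_price": "price"}         # normalisation step


def classify_failure(status: int, body: str) -> str:
    """Return 'blocked' for likely anti-bot responses, else 'selectors'."""
    lowered = body.lower()
    if status == 403 or any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "blocked"       # send to a separate alert, skip the LLM
    return "selectors"         # candidate for selector healing


def normalise(item: dict) -> dict:
    """Map renamed fields back to the names the schema expects."""
    return {FIELD_RENAMES.get(key, key): value for key, value in item.items()}


def heal_is_valid(items: list) -> bool:
    """Fail the heal if it yields nothing or any field has the wrong type."""
    if not items:
        return False
    for item in map(normalise, items):
        for field, expected in EXPECTED_TYPES.items():
            if not isinstance(item.get(field), expected):
                return False   # e.g. a string where a number was expected
    return True
```

A heal that fails heal_is_valid would be escalated to a human, and the new selectors would not be kept.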
