Why pure Scrapy fails on protected sites
Scrapy is built on Twisted, which uses OpenSSL with default cipher suites and a stock HTTP/2 implementation. The JA4 fingerprint is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the pseudo-header order is not Chrome's. Anti-bot vendors score the handshake before any HTML is served, so even a perfectly correct User-Agent and a well-rotated proxy pool can't help: the request is already classified as a bot.
Adding curl_cffi to a Scrapy spider via a custom downloader works for medium-strength deployments. For Akamai's harder customers it fails because curl_cffi's impersonation profiles lag the latest Chrome by a few versions, and Akamai's sensor detects the gap. The Go path uses utls — a TLS library actively maintained against Chrome master — and stays current within days of a Chrome release.
The architecture
Three processes, one network:
| Component | Role | Lives where |
|---|---|---|
| Scrapy spider | URL queue, retries, item pipeline, deduplication | Worker container (Python) |
| Go TLS sidecar | Issues actual HTTPS requests with Chrome TLS via utls | Same pod / container, localhost:8080 |
| Proxy pool | ISP/residential IPs; sticky per Scrapy session ID | External (Bright Data, Decodo, custom) |
Scrapy's spider issues a request as normal. A custom DOWNLOAD_HANDLERS entry replaces the default HTTPS handler with one that POSTs {url, method, headers, body, session_id, proxy} to localhost:8080/fetch. The Go process maintains a pool of sessions keyed by session_id — each session has its own utls connection, cookie jar, and proxy binding. Scrapy receives the response body and headers as if it had fetched directly.
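The hand-off described above can be sketched with the standard library alone. The six payload fields and the localhost:8080/fetch endpoint come from the text. Everything else is an assumption made for illustration: the function names, the base64 encoding of bodies, and the status/headers/body shape of the sidecar's reply. A real Scrapy download handler would also issue the POST without blocking the reactor, which this blocking helper does not attempt.

```python
import base64
import json
import urllib.request

SIDECAR_URL = "http://localhost:8080/fetch"  # endpoint named in the text


def build_fetch_payload(url, method="GET", headers=None, body=b"",
                        session_id="", proxy=""):
    """Serialize one request into the JSON document POSTed to the sidecar."""
    return json.dumps({
        "url": url,
        "method": method,
        "headers": headers or {},
        # Assumption: bodies are base64-encoded so binary data survives JSON.
        "body": base64.b64encode(body).decode("ascii"),
        "session_id": session_id,
        "proxy": proxy,
    }).encode("utf-8")


def parse_fetch_response(raw):
    """Decode the sidecar's reply. The status/headers/body shape is assumed."""
    doc = json.loads(raw)
    body = base64.b64decode(doc.get("body", ""))
    return doc["status"], doc.get("headers", {}), body


def fetch_via_sidecar(url, session_id, proxy, timeout=30.0):
    """Blocking illustration only; requires a sidecar listening on SIDECAR_URL."""
    request = urllib.request.Request(
        SIDECAR_URL,
        data=build_fetch_payload(url, session_id=session_id, proxy=proxy),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_fetch_response(resp.read())
```

Keeping serialization and parsing separate from the network call means the payload format can be unit-tested without a running sidecar.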
The session pool matters: Akamai's _abck cookie accumulates trust across requests on the same TLS connection. A new TLS connection per request resets the trust score. Each Scrapy spider instance is mapped to one Go session, which holds its TLS connection and cookies for the entire crawl.
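The sidecar in this design is a Go process, but the get-or-create logic of its session pool is easy to model in a few lines of Python. The sketch below is an illustration of the data structure only: the Session and SessionPool names are hypothetical, and the long-lived utls connection that each real session holds is not modelled, only the cookie jar and proxy binding.

```python
import threading
from dataclasses import dataclass, field
from http.cookiejar import CookieJar


@dataclass
class Session:
    """State that must persist for the whole crawl of one spider instance."""
    session_id: str
    proxy: str  # sticky: fixed when the session is first created
    cookies: CookieJar = field(default_factory=CookieJar)
    requests_served: int = 0


class SessionPool:
    """Hypothetical model of the sidecar's pool, keyed by session_id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def get(self, session_id, proxy):
        """Return the existing session for this ID, creating it on first use."""
        with self._lock:
            session = self._sessions.get(session_id)
            if session is None:
                session = Session(session_id, proxy)
                self._sessions[session_id] = session
            # Later calls ignore the proxy argument so the binding stays sticky.
            session.requests_served += 1
            return session

    def __len__(self):
        with self._lock:
            return len(self._sessions)
```

Because a second call with the same session_id returns the same object, cookies set by an earlier response are still present on the next request, which is the continuity the paragraph above depends on.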
Why Go specifically
Three reasons Go is the default choice for the sidecar:
- utls is the gold standard for Chrome TLS impersonation. The Go ecosystem maintains it, the Chinese scraping community contributes Chrome-version-specific profiles upstream weekly, and the library tracks Chrome master closer than any Python equivalent (curl_cffi wraps a forked curl, which lags Chrome by a few versions).
- Concurrency for free. A Scrapy worker issues bursts of requests; goroutines absorb the load with no thread-pool tuning. A Python sidecar would need asyncio or threading; both work, but neither is as cheap as goroutines.
- HTTP/2 framing is exposed. Libraries like azuretls let you specify the SETTINGS frame values, the WINDOW_UPDATE delta, and the pseudo-header order directly. Python's httpx hides these behind its own HTTP/2 implementation.
The TypeScript ecosystem has caught up with cycle-tls (Node.js, bundles Go under the hood). Rust's webclaw is the newer alternative. Both work; Go remains the default because the production case studies that prove the pattern were written in Go.
Tradeoffs and when to skip this pattern
The sidecar is operational overhead. Two processes per worker, one more deployment artifact, one more thing that can crash. For most scraping problems this is unnecessary. Skip it when:
- The target uses no anti-bot or only Cloudflare Bot Fight Mode — pure Scrapy with a residential proxy works.
- The target uses DataDome: per-request scoring means session continuity matters less than IP quality. curl_cffi + Scrapy is enough.
- The volume is below ~10k requests/day: a managed scraping API is cheaper than running this infrastructure.
Use it when the target is Akamai, Cloudflare Bot Management Enterprise, or PerimeterX, and the volume justifies running your own infrastructure (~50k+ requests/day). Below that volume, a managed API at $0.50–$3 per 1000 requests is cheaper than the engineering time to build and operate this pattern.
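To put the volume thresholds in monthly terms, the managed-API side of the comparison is simple arithmetic. The sketch below uses the per-1000-request prices quoted above and assumes a 30-day month; the function name is illustrative, and the cost of building and operating the sidecar is deliberately left out because it depends on the team.

```python
def managed_api_monthly_cost(requests_per_day, price_per_1000, days=30):
    """Monthly spend on a managed scraping API, in dollars."""
    return requests_per_day * days / 1000 * price_per_1000


# Price band quoted in the text: $0.50 to $3 per 1000 requests.
for volume in (10_000, 50_000):
    low = managed_api_monthly_cost(volume, 0.50)
    high = managed_api_monthly_cost(volume, 3.00)
    print(f"{volume} req/day: ${low:,.0f} to ${high:,.0f} per month")
```

At 10k requests/day the managed API comes to $150 to $900 per month; at 50k requests/day it comes to $750 to $4,500 per month, which is the range where running your own infrastructure starts to be worth weighing.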
