Why pure Scrapy fails on protected sites
Scrapy is built on Twisted, which uses OpenSSL with default cipher suites and a stock HTTP/2 implementation. The problem is the fingerprints this produces. The JA4 fingerprint (a short signature derived from the TLS handshake) is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the pseudo-header order is not Chrome's. Anti-bot vendors score the handshake before any HTML is served, so even a perfect User-Agent and a freshly rotated proxy can't help: the request is already classified as a bot.
Adding curl_cffi to a Scrapy spider via a custom downloader works for medium-strength deployments. For Akamai's harder customers it fails because curl_cffi's impersonation profiles lag the latest Chrome by a few versions, and Akamai's sensor detects the gap. The Go path uses utls — a TLS library actively maintained against Chrome master — and stays current within days of a Chrome release.
The architecture
Three processes share one network. Here is who does what:
| Component | Role | Lives where |
|---|---|---|
| Scrapy spider | URL queue, retries, item pipeline, deduplication | Worker container (Python) |
| Go TLS sidecar | Issues actual HTTPS requests with Chrome TLS via utls | Same pod / container, localhost:8080 |
| Proxy pool | ISP/residential IPs; sticky per Scrapy session ID | External (Bright Data, Decodo, custom) |
The flow is straightforward. Scrapy's spider issues a request as normal. A custom DOWNLOAD_HANDLERS entry replaces the default HTTPS handler with one that POSTs {url, method, headers, body, session_id, proxy} to localhost:8080/fetch (a service on the same machine). The Go process keeps a pool of sessions keyed by session_id — each session has its own utls connection, cookie jar, and proxy binding. Scrapy receives the response body and headers as if it had fetched directly.
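As a rough illustration of the handler's half of that exchange, the sketch below builds (but does not send) the POST to the sidecar using only the standard library. The field names and the localhost:8080/fetch endpoint come from the description above; the function name `build_fetch_request`, the base64 encoding of the body, and the example URLs are assumptions made for illustration, not a documented wire format.

```python
import base64
import json
import urllib.request

SIDECAR_FETCH_URL = "http://localhost:8080/fetch"  # endpoint named in the text


def build_fetch_request(url, method="GET", headers=None, body=b"",
                        session_id="", proxy=""):
    """Build (but do not send) the POST a download handler would issue."""
    payload = {
        "url": url,
        "method": method,
        "headers": dict(headers or {}),
        # Assumption: raw body bytes are base64-encoded so they survive JSON.
        "body": base64.b64encode(body).decode("ascii"),
        "session_id": session_id,
        "proxy": proxy,
    }
    return urllib.request.Request(
        SIDECAR_FETCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical example values; nothing is sent over the network here.
request = build_fetch_request(
    "https://example.com/products",
    headers={"Accept": "text/html"},
    session_id="spider-1",
    proxy="http://user:pass@proxy.example:8000",
)
print(request.get_method(), request.full_url)
```

In a real handler this request would be sent asynchronously and the sidecar's reply converted into a Scrapy response; both steps are left out because the text does not specify the reply format.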
The session pool matters because of how Akamai builds trust. Its _abck cookie accumulates trust across requests sent over the same TLS connection. Opening a new TLS connection for every request resets that trust score. So each Scrapy spider instance is mapped to one Go session, which holds its TLS connection and cookies for the entire crawl.
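On the Scrapy side, that mapping can be as small as one object per spider instance. The sketch below is a minimal illustration under stated assumptions: the names `SessionBinding` and `pick_sticky_proxy` are hypothetical, and choosing the proxy by hashing the session ID is just one simple way to keep the binding sticky, not necessarily how any particular proxy vendor does it.

```python
import hashlib
import uuid


def pick_sticky_proxy(session_id, proxies):
    """Map a session_id to the same proxy every time via a stable hash."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(proxies)
    return proxies[index]


class SessionBinding:
    """Gives one spider instance a single session_id and a sticky proxy."""

    def __init__(self, spider_name, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        # One ID per spider instance, reused for every request in the crawl,
        # so the sidecar can keep routing to the same connection and cookies.
        self.session_id = f"{spider_name}-{uuid.uuid4().hex}"
        self.proxy = pick_sticky_proxy(self.session_id, proxies)


# Hypothetical proxy URLs for illustration.
binding = SessionBinding("shop", ["http://proxy-a:8000", "http://proxy-b:8000"])
print(binding.session_id, binding.proxy)
```

Every request the spider issues would carry `binding.session_id` and `binding.proxy`, so the sidecar sees one stable identity for the whole crawl.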
Why Go specifically
Three reasons Go is the default choice for the sidecar:
- utls is the gold standard for Chrome TLS impersonation. The Go ecosystem maintains it, the Chinese scraping community contributes Chrome-version-specific profiles upstream weekly, and the library tracks Chrome master more closely than any Python equivalent (curl_cffi wraps a forked curl, which lags Chrome by a few versions).
- Concurrency for free. A Scrapy worker fires bursts of requests; goroutines (Go's lightweight threads) absorb the load with no thread-pool tuning. A Python sidecar would need asyncio or threading; both work, but neither is as cheap as goroutines.
- HTTP/2 framing is exposed. Libraries like azuretls let you set the SETTINGS frame values, the WINDOW_UPDATE delta, and the pseudo-header order directly, all low-level details that Chrome sends in a specific way. Python's httpx hides these behind its own HTTP/2 implementation.
The TypeScript ecosystem has caught up with cycle-tls (Node.js, bundles Go under the hood). Rust's webclaw is the newer alternative. Both work; Go remains the default because the production case studies that prove the pattern were written in Go.
Tradeoffs and when to skip this pattern
The sidecar is operational overhead. You run two processes per worker, ship one more deployment artifact, and add one more thing that can crash. For most scraping problems this is unnecessary. Skip it when:
- The target uses no anti-bot or only Cloudflare Bot Fight Mode: pure Scrapy with a residential proxy works.
- The target uses DataDome: per-request scoring means session continuity matters less than IP quality. curl_cffi + Scrapy is enough.
- The volume is below ~10k requests/day: a managed scraping API is cheaper than running this infrastructure.
Use it when the target is Akamai, Cloudflare Bot Management Enterprise, or PerimeterX, and the volume justifies running your own infrastructure (~50k+ requests/day). Below that volume, a managed API at $0.50–$3 per 1000 requests is cheaper than the engineering time to build and operate this pattern.
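To put those thresholds in monthly terms, the arithmetic can be sketched as follows. The prices and volumes are the figures quoted above; the 30-day month and the function name `managed_api_monthly_cost` are assumptions for illustration.

```python
def managed_api_monthly_cost(requests_per_day, price_per_1000, days=30):
    """Monthly managed-API bill in dollars for a given daily volume."""
    return requests_per_day * days / 1000 * price_per_1000


# Price range quoted in the text: $0.50 to $3 per 1000 requests.
for volume in (10_000, 50_000):
    low = managed_api_monthly_cost(volume, 0.50)
    high = managed_api_monthly_cost(volume, 3.00)
    print(f"{volume} requests/day: ${low:.0f} to ${high:.0f} per month")
# 10000 requests/day: $150 to $900 per month
# 50000 requests/day: $750 to $4500 per month
```

Under these assumptions, the sidecar starts to pay for itself only once the monthly API bill exceeds what it costs in engineering time to build and operate it.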
