
Scrapy + Go TLS Sidecar — Production Architecture for Hard Targets


The Scrapy + Go TLS sidecar architecture is the most common production pattern for scraping Akamai- and Cloudflare-protected sites at scale. Scrapy provides orchestration, queueing, retries, and pipelines, but its underlying HTTP stack (Twisted) cannot impersonate Chrome's TLS handshake. A small Go HTTP service runs as a sidecar, exposes a POST /fetch endpoint, and uses a Chrome-mimicking TLS library (utls) to issue the actual request. Scrapy talks to the sidecar over local HTTP via a custom download handler registered in DOWNLOAD_HANDLERS. The result is Chrome-matching JA4 and HTTP/2 fingerprints with all the productivity of Scrapy's framework.

Quick facts

Components: Scrapy (Python) + Go HTTP sidecar with utls / azuretls
Bridge: Custom download handler (registered in DOWNLOAD_HANDLERS) that POSTs to the sidecar
Session model: Pool of N sidecar connections, sticky session ID → connection map
Typical throughput: 20–50 req/min per session against Akamai; 200+ req/min against unprotected sites
When to use: Akamai, Cloudflare BM, PerimeterX at scale. Skip for unprotected sites, where pure Scrapy is simpler.

Why pure Scrapy fails on protected sites

Scrapy is built on Twisted, which uses OpenSSL with default cipher suites and a stock HTTP/2 implementation. As a result, the JA4 fingerprint, the HTTP/2 SETTINGS frame, and the pseudo-header order all differ from Chrome's. Anti-bot vendors score the handshake before any HTML is served, so even a perfectly chosen User-Agent and a well-rotated proxy pool can't help: the request is already classified as a bot.

Adding curl_cffi to a Scrapy spider via a custom download handler works against medium-strength deployments. Against Akamai's more heavily protected customers it fails because curl_cffi's impersonation profiles lag the latest Chrome by a few versions, and Akamai's sensor detects the gap. The Go path uses utls, a TLS library actively maintained against Chrome master, and stays current within days of a Chrome release.

The architecture

Three processes, one network:

| Component | Role | Lives where |
|---|---|---|
| Scrapy spider | URL queue, retries, item pipeline, deduplication | Worker container (Python) |
| Go TLS sidecar | Issues actual HTTPS requests with Chrome TLS via utls | Same pod / container, localhost:8080 |
| Proxy pool | ISP/residential IPs; sticky per Scrapy session ID | External (Bright Data, Decodo, custom) |

Scrapy's spider issues a request as normal. A custom DOWNLOAD_HANDLERS entry replaces the default HTTPS handler with one that POSTs {url, method, headers, body, session_id, proxy} to localhost:8080/fetch. The Go process maintains a pool of sessions keyed by session_id — each session has its own utls connection, cookie jar, and proxy binding. Scrapy receives the response body and headers as if it had fetched directly.
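The JSON contract described above can be written down as a pair of small data classes. This is a minimal sketch, not the sidecar's actual schema: the request fields come from the paragraph above, the response fields (`final_url`, `status`, `headers`, `body`) are taken from the handler in the Code example section, and the class names `FetchRequest` and `FetchResponse` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class FetchRequest:
    """Body that Scrapy POSTs to the sidecar's /fetch endpoint."""
    url: str
    method: str = "GET"
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""
    session_id: str = "default"      # sticky key for the sidecar's session pool
    proxy: Optional[str] = None      # applied by the sidecar, not by Scrapy

    def to_json(self) -> str:
        return json.dumps(asdict(self))


@dataclass
class FetchResponse:
    """Reply from the sidecar (field names assumed from the Scrapy handler)."""
    final_url: str
    status: int
    headers: Dict[str, str]
    body: str

    @classmethod
    def from_json(cls, raw: str) -> "FetchResponse":
        data = json.loads(raw)
        return cls(
            final_url=data["final_url"],
            status=int(data["status"]),
            headers=dict(data["headers"]),
            body=data["body"],
        )
```

Pinning the contract in one place on the Python side makes it easier to notice when the Go side adds or renames a field.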

The session pool matters: Akamai's _abck cookie accumulates trust across requests on the same TLS connection. A new TLS connection per request resets the trust score. Each Scrapy spider instance is mapped to one Go session, which holds its TLS connection and cookies for the entire crawl.
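The bookkeeping behind the session pool can be modelled in a few lines. The sketch below is a Python model of what the Go sidecar is described as doing, not its real implementation: `Session` and `SessionPool` are hypothetical names, and the model omits the utls connection that a real session would own alongside the cookie jar and proxy binding.

```python
import threading
from dataclasses import dataclass, field
from http.cookiejar import CookieJar
from typing import Dict, Optional


@dataclass
class Session:
    session_id: str
    proxy: Optional[str]                       # bound once, at creation
    cookies: CookieJar = field(default_factory=CookieJar)
    requests_served: int = 0


class SessionPool:
    """Maps a sticky session_id to one long-lived Session."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sessions: Dict[str, Session] = {}

    def get(self, session_id: str, proxy: Optional[str] = None) -> Session:
        with self._lock:
            session = self._sessions.get(session_id)
            if session is None:
                # First request for this ID: create the session and bind its proxy.
                session = Session(session_id=session_id, proxy=proxy)
                self._sessions[session_id] = session
            # Later calls reuse the session; a different proxy argument is ignored.
            session.requests_served += 1
            return session

    def retire(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)
```

Ignoring the proxy argument after the first call is a deliberate choice in this sketch: it keeps the IP stable for as long as the session's cookies are in use.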

Why Go specifically

Three reasons Go is the default choice for the sidecar:

  1. utls is the gold standard for Chrome TLS impersonation. The Go ecosystem maintains it, the Chinese scraping community contributes Chrome-version-specific profiles upstream weekly, and the library tracks Chrome master more closely than any Python equivalent (curl_cffi wraps curl-impersonate, a patched curl build whose profiles lag behind Chrome releases).
  2. Concurrency for free. A Scrapy worker issues bursts of requests; goroutines absorb the load with no thread-pool tuning. A Python sidecar would need asyncio or threading. Both work, but neither is as cheap as goroutines.
  3. HTTP/2 framing is exposed. Libraries like azuretls let you specify the SETTINGS frame values, the WINDOW_UPDATE delta, and the pseudo-header order directly. Python's httpx hides these behind its own HTTP/2 implementation.
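To make the three HTTP/2 knobs from point 3 concrete, the sketch below parses a pipe-delimited HTTP/2 fingerprint string of the form `SETTINGS|WINDOW_UPDATE|PRIORITY|pseudo-header order`, a convention that follows Akamai's passive HTTP/2 fingerprinting work. The sample string in the usage note is illustrative only, not an authoritative Chrome profile, and the function and class names are hypothetical. Verify real values against a live capture.

```python
from dataclasses import dataclass
from typing import Dict, List

# SETTINGS identifiers as defined by the HTTP/2 specification.
SETTINGS_NAMES = {
    1: "HEADER_TABLE_SIZE",
    2: "ENABLE_PUSH",
    3: "MAX_CONCURRENT_STREAMS",
    4: "INITIAL_WINDOW_SIZE",
    5: "MAX_FRAME_SIZE",
    6: "MAX_HEADER_LIST_SIZE",
}
PSEUDO_HEADERS = {"m": ":method", "a": ":authority", "s": ":scheme", "p": ":path"}


@dataclass
class H2Fingerprint:
    settings: Dict[str, int]          # SETTINGS frame, in the order sent
    window_update: int                # connection-level WINDOW_UPDATE increment
    priority: str                     # kept verbatim; "0" means no PRIORITY frames
    pseudo_header_order: List[str]


def parse_h2_fingerprint(fp: str) -> H2Fingerprint:
    settings_part, window_part, priority_part, pseudo_part = fp.split("|")
    settings: Dict[str, int] = {}
    # Assumes ";" separates SETTINGS pairs; some tools emit "," instead.
    for pair in settings_part.split(";"):
        key, value = pair.split(":")
        settings[SETTINGS_NAMES.get(int(key), f"UNKNOWN_{key}")] = int(value)
    return H2Fingerprint(
        settings=settings,
        window_update=int(window_part),
        priority=priority_part,
        pseudo_header_order=[PSEUDO_HEADERS[c] for c in pseudo_part.split(",")],
    )
```

For example, `parse_h2_fingerprint("1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p")` yields a structure that can be compared field by field against what the sidecar actually sends, which is a quick way to spot a mismatched SETTINGS value or header order.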

The TypeScript ecosystem has caught up with cycle-tls (Node.js, bundles Go under the hood). Rust's webclaw is the newer alternative. Both work; Go remains the default because the production case studies that prove the pattern were written in Go.

Tradeoffs and when to skip this pattern

The sidecar is operational overhead. Two processes per worker, one more deployment artifact, one more thing that can crash. For most scraping problems this is unnecessary. Skip it when:

  • The target uses no anti-bot or only Cloudflare Bot Fight Mode — pure Scrapy with a residential proxy works.
  • The target uses DataDome — per-request scoring means session continuity matters less than IP quality. curl_cffi + Scrapy is enough.
  • The volume is below ~10k requests/day — a managed scraping API is cheaper than running this infrastructure.

Use it when the target is Akamai, Cloudflare Bot Management Enterprise, or PerimeterX, and the volume justifies running your own infrastructure (~50k+ requests/day). At low volumes, a managed API at $0.50–$3 per 1,000 requests is cheaper than the engineering time to build and operate this pattern.
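The volume thresholds translate into a monthly managed-API bill with simple arithmetic. The sketch below uses the $0.50–$3 per 1,000 requests range quoted above. The 30-day month and the function name are assumptions, and the cost of building and running the sidecar is not modelled.

```python
def monthly_api_cost(requests_per_day: int, price_per_1000: float, days: int = 30) -> float:
    """Managed-API spend for one month at a flat per-1,000-request price."""
    return requests_per_day * days / 1000 * price_per_1000


for volume in (10_000, 50_000):
    low = monthly_api_cost(volume, 0.50)
    high = monthly_api_cost(volume, 3.00)
    print(f"{volume:>6} req/day: ${low:,.0f} to ${high:,.0f} per month")

# Prints:
#  10000 req/day: $150 to $900 per month
#  50000 req/day: $750 to $4,500 per month
```

Comparing these figures with the engineering and hosting cost of the sidecar is what decides which side of the threshold a project falls on.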

Code example

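```python
# myproject/handlers.py: custom Scrapy download handler that bridges Scrapy to the Go sidecar
import requests
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.http import HtmlResponse
from twisted.internet.threads import deferToThread

SIDECAR = "http://localhost:8080/fetch"


class GoTLSDownloadHandler(HTTPDownloadHandler):
    def download_request(self, request, spider):
        # Everything the sidecar needs to replay the request with Chrome's TLS profile.
        payload = {
            "url": request.url,
            "method": request.method,
            "headers": dict(request.headers.to_unicode_dict()),
            # Text bodies only; binary payloads would need e.g. base64 on both sides.
            "body": request.body.decode("utf-8", "replace"),
            # Sticky key: the sidecar maps it to one TLS connection and cookie jar.
            "session_id": request.meta.get("session_id", "default"),
            # The proxy is applied by the sidecar, not by Scrapy.
            "proxy": request.meta.get("proxy"),
        }

        def _go():
            # Blocking call; deferToThread runs it in Twisted's thread pool.
            r = requests.post(SIDECAR, json=payload, timeout=60)
            r.raise_for_status()
            out = r.json()
            return HtmlResponse(
                url=out["final_url"],
                status=out["status"],
                headers=out["headers"],
                # The body arrives as a JSON string, so re-encode it as UTF-8 and
                # tell Scrapy so, whatever charset the response headers claim.
                body=out["body"].encode("utf-8"),
                encoding="utf-8",
                request=request,
            )

        return deferToThread(_go)


# settings.py
DOWNLOAD_HANDLERS = {
    "https": "myproject.handlers.GoTLSDownloadHandler",
    "http": "myproject.handlers.GoTLSDownloadHandler",
}
```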

Frequently asked questions

Can I do this with curl_cffi in Python instead of Go?

For medium-strength targets, yes: curl_cffi as a Scrapy download handler is a valid, simpler architecture. Production teams pick Go because utls tracks Chrome master more closely than curl_cffi's impersonation profiles do, and Akamai's detection model rewards the freshest TLS profile. If you can accept being a few Chrome versions behind, the Python-only path is fine.

Why not just use a real headless browser like Camoufox?

Throughput. A Camoufox instance uses 200–400 MB of RAM and handles ~5 requests/min with the kind of warm-up and pacing Akamai expects. A Go TLS sidecar handles up to ~50 requests/min per session in 20 MB of RAM. For high-volume scraping where the protection is TLS-and-HTTP/2-centric (not JS-heavy), the sidecar is roughly 10× cheaper to operate.

How do I rotate sessions without losing _abck trust?

Pre-warm. Before retiring a session, spin up the next session with the same proxy IP and warm it by visiting the homepage, waiting a few seconds, and visiting one product page. The new session's _abck cookie will reach the ~0~ state (the marker generally taken to mean the cookie has been validated) before you need it. Maintain a small ring of pre-warmed sessions ahead of the queue and throughput stays smooth.
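A ring of pre-warmed sessions can be sketched with a deque. This is an illustration under stated assumptions: `WarmSessionRing` is a hypothetical name, and the `warm` callback stands in for whatever drives the sidecar through the homepage visit, the pause, and the product-page visit on the same proxy IP. Here it is injected so the ring itself makes no network calls.

```python
from collections import deque
from typing import Callable, Deque


class WarmSessionRing:
    """Keeps `size` warmed session IDs ready ahead of the request queue."""

    def __init__(
        self,
        make_session_id: Callable[[], str],
        warm: Callable[[str], None],
        size: int = 3,
    ) -> None:
        self._make = make_session_id
        self._warm = warm
        self._ready: Deque[str] = deque()
        for _ in range(size):
            self._add_warm_session()

    def _add_warm_session(self) -> None:
        session_id = self._make()
        self._warm(session_id)          # hypothetical: homepage, wait, product page
        self._ready.append(session_id)

    def rotate(self) -> str:
        """Hand out the oldest warmed session and start warming a replacement."""
        session_id = self._ready.popleft()
        self._add_warm_session()
        return session_id
```

In this sketch the replacement is warmed synchronously inside `rotate`. A production version would more likely warm it in the background so that rotation never blocks the crawl.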

Where does the proxy go — Scrapy side or Go side?

Go side. Scrapy passes the proxy URL through the request meta, and the Go sidecar uses it for both the TLS handshake and the upstream connection. Putting the proxy on the Scrapy side defeats the whole architecture because the TLS handshake then originates from Scrapy, not from utls.

Last updated: 2026-05-27