Scrapy + Go TLS Sidecar — Production Architecture for Hard Targets

The Scrapy + Go TLS sidecar architecture is a common production pattern for scraping Akamai- and Cloudflare-protected sites at scale. The idea is a division of labor. Scrapy (the Python scraping framework) handles orchestration, queueing, retries, and pipelines (its chain of post-processing steps), but its underlying HTTP stack, built on Twisted, cannot impersonate Chrome's TLS handshake. TLS is the encryption layer behind https, and the handshake is the opening negotiation that gives away whether you are a real browser. So a small Go HTTP service runs alongside Scrapy as a sidecar (a helper process that does one job), exposes a POST /fetch endpoint, and uses a Chrome-exact TLS library (utls) to make the actual request. Scrapy talks to the sidecar over local HTTP through a custom download handler. The result is Chrome-matching JA4 and HTTP/2 fingerprints with all the productivity of Scrapy's framework.

Quick facts

Components: Scrapy (Python) + Go HTTP sidecar with utls / azuretls
Bridge: Custom download handler (registered in DOWNLOAD_HANDLERS) that POSTs to the sidecar
Session model: Pool of N sidecar connections, sticky session ID → connection map
Typical throughput: 20–50 req/min per session against Akamai; 200+ req/min against unprotected sites
When to use: Akamai, Cloudflare Bot Management, PerimeterX at scale. Skip for unprotected sites, where pure Scrapy is simpler.

Why pure Scrapy fails on protected sites

Scrapy is built on Twisted, which uses OpenSSL with default cipher suites and a stock HTTP/2 implementation. The problem is the fingerprints this produces. The JA4 fingerprint (a short signature derived from the TLS handshake) is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the pseudo-header order is not Chrome's. Anti-bot vendors score the handshake before any HTML is served, so even a perfect User-Agent and a freshly rotated proxy can't help: the request is already classified as a bot.

Adding curl_cffi to a Scrapy spider via a custom download handler works for medium-strength deployments. For Akamai's harder customers it fails because curl_cffi's impersonation profiles lag the latest Chrome by a few versions, and Akamai's sensor detects the gap. The Go path uses utls, a TLS library actively maintained against Chrome master, which stays current within days of a Chrome release.

The architecture

Three components cooperate: two run side by side on the worker, and the third is external. Here is who does what:

Component | Role | Lives where
Scrapy spider | URL queue, retries, item pipeline, deduplication | Worker container (Python)
Go TLS sidecar | Issues actual HTTPS requests with Chrome TLS via utls | Same pod / container, localhost:8080
Proxy pool | ISP/residential IPs; sticky per Scrapy session ID | External (Bright Data, Decodo, custom)

The flow is straightforward. Scrapy's spider issues a request as normal. A custom DOWNLOAD_HANDLERS entry replaces the default HTTPS handler with one that POSTs {url, method, headers, body, session_id, proxy} to localhost:8080/fetch (a service on the same machine). The Go process keeps a pool of sessions keyed by session_id — each session has its own utls connection, cookie jar, and proxy binding. Scrapy receives the response body and headers as if it had fetched directly.

The session pool matters because of how Akamai builds trust. Its _abck cookie accumulates trust across requests sent over the same TLS connection. Opening a new TLS connection for every request resets that trust score. So each Scrapy spider instance is mapped to one Go session, which holds its TLS connection and cookies for the entire crawl.
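The get-or-create behavior of that session pool can be modeled in a few lines. This is a minimal Python sketch of the bookkeeping only, not the sidecar's real Go code: the names Session and SessionPool are hypothetical, a CookieJar stands in for the per-session cookie store, and the utls connection itself is left out.

```python
from dataclasses import dataclass, field
from http.cookiejar import CookieJar
from typing import Dict, Optional


@dataclass
class Session:
    """One long-lived session: a proxy binding plus its own cookie jar."""
    session_id: str
    proxy: Optional[str]
    cookies: CookieJar = field(default_factory=CookieJar)
    requests_sent: int = 0


class SessionPool:
    """Registry where the same session_id always yields the same Session."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def get(self, session_id: str, proxy: Optional[str] = None) -> Session:
        session = self._sessions.get(session_id)
        if session is None:
            # First request for this ID: bind the proxy for the session's lifetime.
            session = Session(session_id=session_id, proxy=proxy)
            self._sessions[session_id] = session
        session.requests_sent += 1
        return session

    def retire(self, session_id: str) -> None:
        """Drop a session; its cookies (and any trust they carry) go with it."""
        self._sessions.pop(session_id, None)


pool = SessionPool()
a = pool.get("spider-1", proxy="http://proxy-a.example:8000")
# Same ID again: the existing session is reused and the new proxy is ignored.
b = pool.get("spider-1", proxy="http://proxy-b.example:8000")
```

The point of the design is that a session is never rebuilt implicitly: only an explicit retire discards the cookies. A real sidecar serving concurrent requests would also need a lock around the map, which this sketch omits.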

Why Go specifically

Three reasons Go is the default choice for the sidecar:

  1. utls is the gold standard for Chrome TLS impersonation. The Go ecosystem maintains it, the Chinese scraping community regularly contributes Chrome-version-specific profiles upstream, and the library tracks Chrome master more closely than any Python equivalent (curl_cffi wraps curl-impersonate, a patched curl build whose profiles trail new Chrome versions).
  2. Concurrency for free. A Scrapy worker fires bursts of requests; goroutines (Go's lightweight threads) absorb the load with no thread-pool tuning. A Python sidecar would need asyncio or threading — both work but neither is as cheap as goroutines.
  3. HTTP/2 framing is exposed. Libraries like azuretls let you set the SETTINGS frame values, the WINDOW_UPDATE delta, and the pseudo-header order directly — all low-level details Chrome sends a specific way. Python's httpx hides these behind its own HTTP/2 implementation.

The TypeScript ecosystem has caught up with cycle-tls (Node.js, bundles Go under the hood). Rust's webclaw is the newer alternative. Both work; Go remains the default because the production case studies that prove the pattern were written in Go.

Tradeoffs and when to skip this pattern

The sidecar is operational overhead. You run two processes per worker, ship one more deployment artifact, and add one more thing that can crash. For most scraping problems this is unnecessary. Skip it when:

  • The target uses no anti-bot or only Cloudflare Bot Fight Mode — pure Scrapy with a residential proxy works.
  • The target uses DataDome — per-request scoring means session continuity matters less than IP quality. curl_cffi + Scrapy is enough.
  • The volume is below ~10k requests/day — a managed scraping API is cheaper than running this infrastructure.

Use it when the target is Akamai, Cloudflare Bot Management Enterprise, or PerimeterX, and the volume justifies running your own infrastructure (~50k+ requests/day). At lower volumes, a managed API at $0.50–$3 per 1,000 requests usually costs less than the engineering time to build and operate this pattern.
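To put the two volume thresholds in money terms, the arithmetic can be sketched as follows. The price range comes from the paragraph above; the 30-day month and the function name are assumptions for illustration, and the cost of running your own sidecar is not modeled.

```python
def managed_api_monthly_cost(requests_per_day: int, price_per_1000: float,
                             days: int = 30) -> float:
    """Monthly spend on a managed scraping API at a flat per-1,000 price."""
    return requests_per_day * days * price_per_1000 / 1000


for volume in (10_000, 50_000):
    low = managed_api_monthly_cost(volume, 0.50)
    high = managed_api_monthly_cost(volume, 3.00)
    print(f"{volume:,} req/day: ${low:,.0f} to ${high:,.0f} per month")

# 10,000 req/day: $150 to $900 per month
# 50,000 req/day: $750 to $4,500 per month
```

At 10k requests/day the managed bill is small next to the engineering time; at 50k requests/day it starts to approach the cost of operating the sidecar yourself.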

Code example

python
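# Scrapy custom download handler: bridges Scrapy → Go sidecar
import requests
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.http import HtmlResponse
from twisted.internet.threads import deferToThread

SIDECAR = "http://localhost:8080/fetch"

# Assumption: the sidecar returns an already-decoded body, so these upstream
# headers no longer describe it and would mislead Scrapy's decompression step.
STALE_HEADERS = {"content-encoding", "content-length", "transfer-encoding"}


class GoTLSDownloadHandler(HTTPDownloadHandler):
    def download_request(self, request, spider):
        payload = {
            "url": request.url,
            "method": request.method,
            "headers": dict(request.headers.to_unicode_dict()),
            # Text bodies only: undecodable bytes are replaced, so binary
            # uploads would need another encoding (for example base64).
            "body": request.body.decode("utf-8", "replace"),
            "session_id": request.meta.get("session_id", "default"),
            "proxy": request.meta.get("proxy"),
        }

        def _go():
            # Blocking call; deferToThread runs it in Twisted's thread pool
            # so the reactor stays free.
            r = requests.post(SIDECAR, json=payload, timeout=60)
            # Assumes the sidecar answers 2xx and reports the target's own
            # status inside the JSON; anything else is a sidecar failure.
            r.raise_for_status()
            out = r.json()
            # Assumes "headers" is a JSON object of name → value.
            headers = {
                k: v for k, v in out["headers"].items()
                if k.lower() not in STALE_HEADERS
            }
            return HtmlResponse(
                url=out["final_url"],
                status=out["status"],
                headers=headers,
                body=out["body"].encode("utf-8"),
                encoding="utf-8",
                request=request,
            )

        return deferToThread(_go)


# settings.py
DOWNLOAD_HANDLERS = {
    "https": "myproject.handlers.GoTLSDownloadHandler",
    "http": "myproject.handlers.GoTLSDownloadHandler",
}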

Frequently asked questions

Can I do this with curl_cffi in Python instead of Go?

For medium-strength targets, yes. Using curl_cffi as a Scrapy download handler is a valid, simpler architecture. The reason production teams pick Go is that utls tracks Chrome master more closely than curl_cffi's impersonation profiles do (curl_cffi builds on curl-impersonate, which links against BoringSSL, the TLS library Chrome uses), and Akamai's detection model rewards the freshest TLS profile. If you can accept being a few Chrome versions behind, the Python-only path is fine.

Why not just use a real headless browser like Camoufox?

Throughput. A Camoufox instance uses 200–400 MB of RAM and handles ~5 requests/min with the kind of warm-up and pacing Akamai expects. A Go TLS sidecar handles 20–50 requests/min per session against Akamai in about 20 MB of RAM. For high-volume scraping where the protection is TLS-and-HTTP/2-centric (not JS-heavy), the sidecar is roughly an order of magnitude cheaper to operate.
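A quick capacity estimate makes the comparison concrete. This is a back-of-the-envelope sketch, assuming a crawl that runs around the clock at 50,000 requests/day, 20 requests/min per sidecar session (the low end of the range in the quick facts), and 20 MB of RAM counted per session; workers_needed is a hypothetical helper.

```python
import math

MINUTES_PER_DAY = 24 * 60


def workers_needed(requests_per_day: int, req_per_min_per_worker: float) -> int:
    """Workers required to sustain the daily volume when crawling 24 hours a day."""
    required_per_min = requests_per_day / MINUTES_PER_DAY
    return math.ceil(required_per_min / req_per_min_per_worker)


daily = 50_000
browsers = workers_needed(daily, 5)    # Camoufox at ~5 req/min
sessions = workers_needed(daily, 20)   # sidecar session at 20 req/min

print(f"Camoufox: {browsers} instances, {browsers * 200}-{browsers * 400} MB RAM")
print(f"Sidecar:  {sessions} sessions, ~{sessions * 20} MB RAM")

# Camoufox: 7 instances, 1400-2800 MB RAM
# Sidecar:  2 sessions, ~40 MB RAM
```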

How do I rotate sessions without losing _abck trust?

Pre-warm them. Before retiring a session, spin up the next one with the same proxy IP and warm it by visiting the homepage, waiting a few seconds, then visiting one product page. That way the new session's _abck cookie is already trusted before you need it. Keep a small ring of pre-warmed sessions ahead of the queue and throughput stays smooth.
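The ring of pre-warmed sessions can be sketched as a small queue. This is an illustrative outline rather than a tested production component: WarmRing, the fetch callable, and the example URLs are all hypothetical, and in a real setup fetch would POST to the sidecar with the given session ID and proxy.

```python
import time
from collections import deque
from typing import Callable, Deque, List

# Hypothetical transport: (session_id, proxy, url) -> HTTP status code.
Fetch = Callable[[str, str, str], int]


class WarmRing:
    """Keeps `size` pre-warmed session IDs queued ahead of demand for one proxy."""

    def __init__(self, fetch: Fetch, proxy: str, warmup_urls: List[str],
                 size: int = 2, pause_s: float = 3.0) -> None:
        self._fetch = fetch
        self._proxy = proxy
        self._warmup_urls = warmup_urls
        self._pause_s = pause_s
        self._created = 0
        self._ready: Deque[str] = deque()
        for _ in range(size):
            self._ready.append(self._warm_new())

    def _warm_new(self) -> str:
        """Create a fresh session ID and walk it through the warm-up URLs."""
        self._created += 1
        session_id = f"warm-{self._created}"
        for i, url in enumerate(self._warmup_urls):
            if i > 0:
                time.sleep(self._pause_s)  # pause between page views
            self._fetch(session_id, self._proxy, url)
        return session_id

    def rotate(self) -> str:
        """Return the oldest warmed session, then warm a replacement (blocking)."""
        session_id = self._ready.popleft()
        self._ready.append(self._warm_new())
        return session_id


# Usage with a stand-in fetch that only records calls (no network traffic).
calls = []


def fake_fetch(session_id: str, proxy: str, url: str) -> int:
    calls.append((session_id, url))
    return 200


ring = WarmRing(
    fake_fetch,
    proxy="http://proxy.example:8000",
    warmup_urls=["https://shop.example/", "https://shop.example/product/1"],
    pause_s=0.0,
)
current = ring.rotate()  # hand this ID to the spider as its next session_id
```

Warming happens synchronously here to keep the sketch short. A production version would warm replacements in the background so that rotate never blocks the crawl.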

Where does the proxy go — Scrapy side or Go side?

Go side. Scrapy passes the proxy URL through the request meta, and the Go sidecar uses it for both the TLS handshake and the upstream connection. Putting the proxy on the Scrapy side defeats the whole architecture, because the TLS handshake would then originate from Scrapy, not from utls.
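On the Scrapy side, that means each request only needs to carry the session ID and proxy URL in its meta. One way to keep the pairing sticky is to derive the proxy from the session ID with a stable hash, as in the sketch below. The helper session_meta, the ID format, and the proxy URLs are assumptions for illustration; the meta keys session_id and proxy match the ones the download handler in the code example reads.

```python
import hashlib
from typing import Dict, List


def session_meta(spider_name: str, worker_index: int, proxies: List[str]) -> Dict[str, str]:
    """Build request meta with a stable session ID and a proxy pinned to it."""
    session_id = f"{spider_name}-{worker_index}"
    # hashlib is used instead of hash() because hash() of a str changes
    # between interpreter runs, which would break stickiness after a restart.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    proxy = proxies[int.from_bytes(digest[:8], "big") % len(proxies)]
    return {"session_id": session_id, "proxy": proxy}


PROXIES = [
    "http://user:pass@isp-1.example:8000",
    "http://user:pass@isp-2.example:8000",
    "http://user:pass@isp-3.example:8000",
]

meta = session_meta("products", 0, PROXIES)
# Inside a spider: yield scrapy.Request(url, meta=meta)
```

The mapping only stays stable while the proxy list keeps the same length and order, so a changed pool should be treated as a reason to rotate sessions.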

Last updated: 2026-05-31