Scrapy + Go TLS Sidecar — Production Architecture for Hard Targets

By the Scrappey Research Team

The Scrapy + Go TLS sidecar architecture is the most common production pattern for scraping Akamai- and Cloudflare-protected sites at scale. The idea is a division of labor. Scrapy (the Python scraping framework) handles orchestration, queueing, retries, and pipelines (its chain of post-processing steps), but its underlying HTTP stack, Twisted, cannot impersonate Chrome's TLS handshake. TLS is the encryption layer behind https, and the handshake is the opening negotiation that gives away whether you are a real browser. So a small Go HTTP service runs alongside Scrapy as a sidecar (a helper process that does one job), exposes a POST /fetch endpoint, and uses a Chrome-exact TLS library (utls) to make the actual request. Scrapy talks to the sidecar over local HTTP through a custom download handler registered in DOWNLOAD_HANDLERS. The result is Chrome-matching JA4 and HTTP/2 fingerprints with all the productivity of Scrapy's framework.

Components	Scrapy (Python) + Go HTTP sidecar with utls / azuretls
Bridge	Custom download handler (registered in DOWNLOAD_HANDLERS) that POSTs to the sidecar
Session model	Pool of N sidecar connections, sticky session ID → connection map
Typical throughput	20–50 req/min per session against Akamai; 200+ req/min against unprotected sites
When to use	Akamai, Cloudflare Bot Management, PerimeterX at scale. Skip for unprotected sites, where pure Scrapy is simpler.

Why pure Scrapy fails on protected sites

Scrapy is built on Twisted, which uses OpenSSL with default cipher suites and a stock HTTP/2 implementation. The problem is the fingerprints this produces. The JA4 fingerprint (a short signature derived from the TLS handshake) is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the pseudo-header order is not Chrome's. Anti-bot vendors score the handshake before any HTML is served, so even a perfect User-Agent and a freshly rotated proxy can't help: the request is already classified as a bot.

Adding curl_cffi to a Scrapy spider via a custom download handler works for medium-strength deployments. For Akamai's harder customers it fails because curl_cffi's impersonation profiles lag the latest Chrome by a few versions, and Akamai's sensor detects the gap. The Go path uses utls, a TLS library actively maintained against Chrome master, which stays current within days of a Chrome release.

The architecture

Three processes share one network. Here is who does what:

Component	Role	Lives where
Scrapy spider	URL queue, retries, item pipeline, deduplication	Worker container (Python)
Go TLS sidecar	Issues actual HTTPS requests with Chrome TLS via utls	Same pod / container, localhost:8080
Proxy pool	ISP/residential IPs; sticky per Scrapy session ID	External (Bright Data, Decodo, custom)

The flow is straightforward. Scrapy's spider issues a request as normal. A custom DOWNLOAD_HANDLERS entry replaces the default HTTPS handler with one that POSTs {url, method, headers, body, session_id, proxy} to localhost:8080/fetch (a service on the same machine). The Go process keeps a pool of sessions keyed by session_id, and each session has its own utls connection, cookie jar, and proxy binding. The sidecar replies with JSON carrying the final URL, status, headers, and body; the handler wraps that in a normal Scrapy response, so the spider sees it as if Scrapy had fetched the page directly.

The session pool matters because of how Akamai builds trust. Its _abck cookie accumulates trust across requests sent over the same TLS connection. Opening a new TLS connection for every request resets that trust score. So each Scrapy spider instance is mapped to one Go session, which holds its TLS connection and cookies for the entire crawl.
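
The bookkeeping behind that mapping can be sketched in a few lines. The real pool lives in the Go sidecar and also owns the utls connection; this Python model only illustrates the data structure. `Session`, `SessionPool`, `get_or_create`, and `retire` are hypothetical names, and the proxy URL is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    """One long-lived identity: a proxy binding plus a cookie jar.

    In the real sidecar this would also hold the TLS connection; the request
    counter here merely stands in for state that must not be reset.
    """
    session_id: str
    proxy: Optional[str]
    cookies: Dict[str, str] = field(default_factory=dict)
    requests_sent: int = 0


class SessionPool:
    """Maps session_id -> Session so repeated requests reuse the same state."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def get_or_create(self, session_id: str, proxy: Optional[str]) -> Session:
        session = self._sessions.get(session_id)
        if session is None:
            session = Session(session_id=session_id, proxy=proxy)
            self._sessions[session_id] = session
        elif proxy is not None and proxy != session.proxy:
            # Changing the exit IP mid-session would break the sticky binding.
            raise ValueError(f"session {session_id!r} is bound to another proxy")
        return session

    def retire(self, session_id: str) -> None:
        """Drop a session, discarding its cookies and proxy binding."""
        self._sessions.pop(session_id, None)


pool = SessionPool()
first = pool.get_or_create("spider-1", "http://proxy-a.example:8000")
first.cookies["_abck"] = "opaque-value"
first.requests_sent += 1
again = pool.get_or_create("spider-1", "http://proxy-a.example:8000")
```

Here `again` is the same object as `first`, so the cookie set on the first request is still present on the second. Rejecting a proxy change for a live session is a design choice of this sketch, not something the architecture mandates.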

Why Go specifically

Three reasons Go is the default choice for the sidecar:

utls is the gold standard for Chrome TLS impersonation. The Go ecosystem maintains it, the Chinese scraping community contributes Chrome-version-specific profiles upstream weekly, and the library tracks Chrome master more closely than any Python equivalent (curl_cffi wraps curl-impersonate, a patched curl build whose profiles lag behind new Chrome releases).
Concurrency for free. A Scrapy worker fires bursts of requests; goroutines (Go's lightweight threads) absorb the load with no thread-pool tuning. A Python sidecar would need asyncio or threading — both work but neither is as cheap as goroutines.
HTTP/2 framing is exposed. Libraries like azuretls let you set the SETTINGS frame values, the WINDOW_UPDATE delta, and the pseudo-header order directly — all low-level details Chrome sends a specific way. Python's httpx hides these behind its own HTTP/2 implementation.
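
To make the HTTP/2 knobs concrete, the sketch below renders them in the pipe-separated summary format commonly used for HTTP/2 fingerprints (settings, window update, priority, pseudo-header order). The numeric values are figures often reported for recent desktop Chrome and are assumptions here; verify them against a live capture of the version you target. `Http2Profile` and `fingerprint` are hypothetical names, not part of azuretls.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Http2Profile:
    settings: Tuple[Tuple[int, int], ...]  # (SETTINGS id, value), in send order
    window_update: int                     # connection-level WINDOW_UPDATE increment
    pseudo_header_order: Tuple[str, ...]   # e.g. (":method", ":authority", ...)

    def fingerprint(self) -> str:
        settings = ";".join(f"{key}:{value}" for key, value in self.settings)
        # Abbreviate ":method" -> "m", ":authority" -> "a", and so on.
        order = ",".join(name[1] for name in self.pseudo_header_order)
        # The third field (stream priorities) is fixed to "0" in this sketch.
        return f"{settings}|{self.window_update}|0|{order}"


# Assumed values; confirm against a capture of the Chrome version you target.
CHROME_LIKE = Http2Profile(
    settings=((1, 65536), (2, 0), (4, 6291456), (6, 262144)),
    window_update=15663105,
    pseudo_header_order=(":method", ":authority", ":scheme", ":path"),
)

# Same settings, different pseudo-header order: the string no longer matches.
OTHER_CLIENT = Http2Profile(
    settings=CHROME_LIKE.settings,
    window_update=CHROME_LIKE.window_update,
    pseudo_header_order=(":method", ":path", ":authority", ":scheme"),
)

print(CHROME_LIKE.fingerprint())   # 1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p
print(OTHER_CLIENT.fingerprint())  # 1:65536;2:0;4:6291456;6:262144|15663105|0|m,p,a,s
```

The point of the comparison is that a single reordered pseudo-header is enough to produce a different fingerprint, which is why a library that exposes these fields directly matters.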

The TypeScript ecosystem has caught up with cycle-tls (Node.js, bundles Go under the hood). Rust's webclaw is the newer alternative. Both work; Go remains the default because the production case studies that prove the pattern were written in Go.

Tradeoffs and when to skip this pattern

The sidecar is operational overhead. You run two processes per worker, ship one more deployment artifact, and add one more thing that can crash. For most scraping problems this is unnecessary. Skip it when:

The target uses no anti-bot or only Cloudflare Bot Fight Mode — pure Scrapy with a residential proxy works.
The target uses DataDome — per-request scoring means session continuity matters less than IP quality. curl_cffi + Scrapy is enough.
The volume is below ~10k requests/day — a managed scraping API is cheaper than running this infrastructure.

Use it when the target is Akamai, Cloudflare Bot Management Enterprise, or PerimeterX, and the volume justifies running your own infrastructure (~50k+ requests/day). Well below that volume, a managed API at $0.50–$3 per 1000 requests is cheaper than the engineering time to build and operate this pattern; between roughly 10k and 50k requests/day the answer depends on how much engineering time you can spare.
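
As a rough worked example of the managed-API side of that comparison, the sketch below multiplies the quoted price range by the two volume thresholds. The 30-day month and the function name `monthly_api_cost` are assumptions, and the sketch ignores the proxy and engineering costs of the self-hosted option.

```python
def monthly_api_cost(requests_per_day: int, usd_per_1000: float, days: int = 30) -> float:
    """Managed-API spend for one month at a flat per-1000-requests price."""
    return requests_per_day * days / 1000 * usd_per_1000


for volume in (10_000, 50_000):
    low = monthly_api_cost(volume, 0.50)
    high = monthly_api_cost(volume, 3.00)
    print(f"{volume:>6} req/day: ${low:,.0f} to ${high:,.0f} per month")

# Prints:
#  10000 req/day: $150 to $900 per month
#  50000 req/day: $750 to $4,500 per month
```

At 10k requests/day the API bill stays in the hundreds of dollars a month, which is hard to beat with custom infrastructure; at 50k it reaches the low thousands, where operating a sidecar may start to pay off.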

Code example

python

# Scrapy custom download handler: bridges Scrapy to the Go sidecar
import requests
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.http import HtmlResponse
from twisted.internet.threads import deferToThread

SIDECAR = "http://localhost:8080/fetch"

class GoTLSDownloadHandler(HTTPDownloadHandler):
    def download_request(self, request, spider):
        payload = {
            "url": request.url,
            "method": request.method,
            "headers": dict(request.headers.to_unicode_dict()),
            "body": request.body.decode("utf-8", "replace"),
            "session_id": request.meta.get("session_id", "default"),
            "proxy": request.meta.get("proxy"),
        }
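        def _go():
            # Blocking HTTP call; deferToThread below keeps it off the reactor thread.
            r = requests.post(SIDECAR, json=payload, timeout=60)
            r.raise_for_status()
            out = r.json()
            # Assumes the sidecar returns an already-decoded text body, so drop
            # headers that describe the original wire encoding; otherwise
            # Scrapy's HttpCompressionMiddleware may try to decompress it again.
            headers = {
                k: v for k, v in out["headers"].items()
                if k.lower() not in ("content-encoding", "content-length")
            }
            return HtmlResponse(
                url=out["final_url"],
                status=out["status"],
                headers=headers,
                body=out["body"].encode("utf-8"),
                encoding="utf-8",
                request=request,
            )
        return deferToThread(_go)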

# settings.py
DOWNLOAD_HANDLERS = {
    "https": "myproject.handlers.GoTLSDownloadHandler",
    "http":  "myproject.handlers.GoTLSDownloadHandler",
}
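
For exercising the handler locally without the Go binary, a stand-in that speaks the same JSON contract can be handy. The sketch below is only a stub, not a TLS-impersonating sidecar: it returns a canned page using the Python standard library. The response keys (`final_url`, `status`, `headers`, `body`) mirror what the handler above reads; `StubSidecar`, `start_stub`, and `fetch_via` are hypothetical names.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubSidecar(BaseHTTPRequestHandler):
    """Answers POST /fetch with a canned response in the sidecar's JSON shape."""

    def do_POST(self):
        if self.path != "/fetch":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        job = json.loads(self.rfile.read(length))
        reply = json.dumps({
            "final_url": job["url"],
            "status": 200,
            "headers": {"Content-Type": "text/html; charset=utf-8"},
            "body": f"<html><body>stub for {job['session_id']}</body></html>",
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep console output quiet


def start_stub(port: int = 0) -> HTTPServer:
    """Start the stub on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), StubSidecar)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_via(server: HTTPServer, url: str, session_id: str) -> dict:
    """POST one job to the stub, the same way the download handler would."""
    payload = json.dumps({"url": url, "method": "GET", "headers": {},
                          "body": "", "session_id": session_id, "proxy": None})
    req = urllib.request.Request(
        f"http://127.0.0.1:{server.server_port}/fetch",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any system proxy settings so the call stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.load(resp)
```

Calling `start_stub(8080)` makes the stub reachable at the `SIDECAR` URL used by the handler, so a spider can be run end to end before the Go service exists.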

Frequently asked questions

Can I do this with curl_cffi in Python instead of Go?

For medium-strength targets, yes: using curl_cffi as a Scrapy download handler is a valid, simpler architecture. The reason production teams pick Go is that utls tracks Chrome master more closely than curl_cffi's impersonation profiles do (curl_cffi builds on BoringSSL, the TLS engine Chrome itself uses, but its profiles trail new Chrome releases), and Akamai's detection model rewards the freshest TLS profile. If you can accept being a few Chrome versions behind, the Python-only path is fine.

Why not just use a real headless browser like Camoufox?

Throughput. A Camoufox instance uses 200–400 MB of RAM and handles ~5 requests/min with the kind of warm-up and pacing Akamai expects. A Go TLS sidecar handles up to 50 requests/min per session in about 20 MB of RAM. For high-volume scraping where the protection is TLS-and-HTTP/2-centric (not JS-heavy), the sidecar is roughly 10× cheaper to operate.

How do I rotate sessions without losing _abck trust?

Pre-warm them. Before retiring a session, spin up the next one with the same proxy IP and warm it by visiting the homepage, waiting a few seconds, then visiting one product page. That way the new session's _abck cookie is already trusted before you need it. Keep a small ring of pre-warmed sessions ahead of the queue and throughput stays smooth.
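
A minimal sketch of such a ring, assuming a synchronous `warm` callback that performs the homepage and product-page visits (and binds the proxy) for a given session ID. `WarmRing`, `rotate`, and the `session-N` naming are hypothetical; in practice the warm-up would more likely run in the background.

```python
from collections import deque
from typing import Callable, Deque


class WarmRing:
    """Keeps `size` pre-warmed session IDs ready ahead of demand."""

    def __init__(self, warm: Callable[[str], None], size: int = 3) -> None:
        self._warm = warm
        self._size = size
        self._counter = 0
        self._ready: Deque[str] = deque()
        self._refill()

    def _refill(self) -> None:
        while len(self._ready) < self._size:
            self._counter += 1
            session_id = f"session-{self._counter}"
            self._warm(session_id)  # e.g. homepage, pause, one product page
            self._ready.append(session_id)

    def rotate(self) -> str:
        """Hand out the oldest warm session and warm a replacement."""
        session_id = self._ready.popleft()
        self._refill()
        return session_id


warmed = []  # records which sessions were warmed, in order
ring = WarmRing(warm=warmed.append, size=2)
current = ring.rotate()
```

After this runs, `current` is `"session-1"` and `warmed` holds three IDs: the two warmed up front plus the replacement triggered by the rotation.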

Where does the proxy go — Scrapy side or Go side?

Go side. Scrapy passes the proxy URL through the request meta, and the Go sidecar uses it for both the TLS handshake and the upstream connection. Putting the proxy on the Scrapy side defeats the whole architecture, because the TLS handshake would then originate from Scrapy, not from utls.

Last updated: 2026-05-31