What Scrapy gives you that a script can't
For a 100-URL scrape, a single Python script with curl_cffi and a loop is fine. Past ~1000 URLs the problems pile up: what to retry, how to avoid scraping the same URL twice (dedupe), where to write results, how to pace requests per site, and how to pick up again after a crash. Scrapy handles all of this out of the box:
- Built-in queue with priority, depth tracking, and disk-backed persistence (so you can resume a crawl after killing it).
- Per-domain throttling via AUTOTHROTTLE: automatically slows down or speeds up based on how fast the site responds.
- Request deduplication: Scrapy fingerprints each request (URL, method, and body) so it never fetches the same one twice within a crawl, and across restarts too when the crawl state is persisted to disk with JOBDIR.
- Item pipelines — chain steps like validators, deduplicators, and database writers together with a single declaration.
- Settings layering — project defaults can be overridden per spider, which can be overridden again by command-line flags.
- The downloader-middleware abstraction — the hook where every modern stealth tool plugs in, including the Go TLS sidecar pattern.
Rebuilding all this for any non-trivial crawl is weeks of work. Scrapy is mature, BSD-licensed, and one pip install away.
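As a rough illustration of how little configuration several of these features need, here is a minimal settings.py sketch. The setting names are Scrapy's own; the values are arbitrary starting points, and the three pipeline class paths under myproject.pipelines are hypothetical placeholders rather than anything Scrapy ships.

```python
# settings.py (sketch). Values are illustrative; tune them per target site.

# Persist the scheduler queue and the set of seen request fingerprints to
# disk, so a stopped crawl can be resumed by rerunning with the same JOBDIR.
JOBDIR = "crawls/run-1"

# AutoThrottle adjusts the download delay from observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site

# Item pipelines run in ascending order of their integer value.
# These class paths are hypothetical; they stand in for your own pipelines.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidateItemPipeline": 100,
    "myproject.pipelines.DropDuplicatesPipeline": 200,
    "myproject.pipelines.DatabaseWriterPipeline": 300,
}
```

Any of these can be overridden for one spider through its custom_settings class attribute, and overridden again for a single run with the -s flag, for example: scrapy crawl myspider -s AUTOTHROTTLE_MAX_DELAY=60 (myspider is a placeholder name).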
Why bare Scrapy fails on protected sites
Scrapy's built-in downloader (Twisted-based, HTTP/1.1 by default with opt-in HTTP/2) has never looked like Chrome, and that is exactly what gives it away. TLS is the encryption behind https, and a JA4 fingerprint is a label derived from how a client opens that connection, so it acts like a signature. Scrapy's JA4 fingerprint isn't Chrome's, its HTTP/2 SETTINGS frame isn't Chrome's, and its default User-Agent literally says "Scrapy/X.Y". Anti-bot vendors can block this at Layer 1 (see the four-layer model) before a single line of HTML is served.
The fix lives in the downloader-middleware system. Two production patterns:
- scrapy-impersonate / scrapy-curl-cffi — swaps Scrapy's downloader for curl_cffi, which reproduces a real browser's TLS handshake. Works with medium-strength anti-bot configurations and is easy to set up.
- Scrapy + Go TLS sidecar — full Chrome impersonation via utls in a separate Go service. Produces a Chrome-consistent handshake at the network layer. More moving parts to run, but worth it for high-volume authorized scraping of protected sites you are permitted to access. See the dedicated entry.
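For the first pattern, the wiring is typically a few lines of settings. The sketch below follows the scrapy-impersonate README as best recalled; treat the handler path and the meta key as assumptions and check them against the version you install.

```python
# settings.py (sketch) for scrapy-impersonate.
# Assumption: the handler path matches the installed scrapy-impersonate release.

# Route both schemes through the curl_cffi-backed download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

# The handler is asyncio-based, so Scrapy needs the asyncio Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Individual requests then opt in by naming a browser profile in their meta, for example meta={"impersonate": "chrome110"}, where the profile name is one of the targets curl_cffi supports.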
For sites that need JavaScript to run, scrapy-playwright or scrapy-camoufox swap the downloader for a real browser on a per-request basis. Browsers are expensive, so apply browser middleware only to the specific requests that need it via meta={"playwright": True}.
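One way to keep browser use selective is to decide per URL whether the playwright flag goes into the request's meta. The helper below is a hypothetical sketch: request_meta and JS_PATH_PREFIXES are invented names, the prefixes are made-up examples, and it assumes scrapy-playwright's download handler is already configured in settings.

```python
from urllib.parse import urlparse

# Hypothetical list of URL path prefixes that only render with JavaScript.
JS_PATH_PREFIXES = ("/search", "/app/")


def request_meta(url: str) -> dict:
    """Return the meta dict for a request to `url`.

    Only paths known to need JavaScript get {"playwright": True};
    everything else returns an empty dict and uses the plain downloader.
    """
    path = urlparse(url).path
    # str.startswith accepts a tuple and matches any of its prefixes.
    if path.startswith(JS_PATH_PREFIXES):
        return {"playwright": True}
    return {}
```

Inside a spider this would be used as scrapy.Request(url, meta=request_meta(url)), so the cheap HTTP path stays the default and the browser is the exception.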
Scaling Scrapy beyond one machine
By default Scrapy runs in a single process. Three ways to scale out:
- scrapy-redis — pulls URLs from a shared Redis queue. Multiple workers across machines draw from the same queue and write to the same dedup set. The simplest way to distribute Scrapy.
- Scrapyd — a daemon that deploys packaged spiders (eggs) and runs them through an HTTP API. Handy for cron-driven crawls and as a stepping stone toward Kubernetes.
- Zyte (Scrapy Cloud) — managed Scrapy hosting from the original Scrapy team. You deploy a spider with one command and the platform handles queueing, retries, and monitoring.
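To make the scrapy-redis option concrete, here is a minimal settings.py sketch. The class paths and setting names are the ones scrapy-redis documents; the Redis URL is an assumption for a local instance and would point at your shared server in practice.

```python
# settings.py (sketch) for scrapy-redis. Every worker uses the same values.

# Replace Scrapy's in-process scheduler with the Redis-backed one,
# so all workers pull requests from a single shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis, so deduplication is shared too.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the dedup set in Redis when a spider closes,
# which lets workers stop and rejoin without losing crawl state.
SCHEDULER_PERSIST = True

# Assumed local Redis instance; replace with the shared server's address.
REDIS_URL = "redis://localhost:6379"
```

Running the same spider with these settings on several machines is then enough for them to split the work, since the queue and the dedup set both live in Redis.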
At enterprise scale, the more common choices are estela (a Kubernetes-native Scrapy orchestrator) or a self-hosted scrapy-cluster (backed by Kafka). The framework itself scales fine — the real work is wiring up the surrounding queue and storage infrastructure to match.
