What Scrapy gives you that a script can't
For 100-URL scrapes a single Python script with curl_cffi and a loop is fine. Past ~1000 URLs the operational surface explodes — what to retry, what to dedupe, where to write the results, how to throttle per-domain, how to resume after a crash. Scrapy gives all of this out of the box:
- Built-in scheduler queue with priorities, depth tracking, and disk-backed persistence: set JOBDIR and a killed crawl resumes where it stopped.
- Per-domain throttling via the AutoThrottle extension (AUTOTHROTTLE_ENABLED), which adapts the request rate to response latency.
- Request deduplication on request fingerprints; with JOBDIR set, the seen-set survives crawl restarts.
- Item pipelines — validators, deduplicators, database writers chain together with one declaration.
- Settings layering — project defaults overridden by spider-specific settings overridden by command-line flags.
- The downloader-middleware abstraction — every modern stealth tool plugs in here, including the Go TLS sidecar pattern.
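As a rough illustration, most of the features in the list above are switched on through project settings. The sketch below is a hypothetical settings.py fragment: the setting names are Scrapy's own, but the values, the crawl directory, and the myproject.pipelines.* class paths are assumptions chosen for the example.

```python
# settings.py: illustrative values only; tune them for the target site.

# Disk-backed scheduler queue and dupefilter state. Re-running the spider
# with the same JOBDIR resumes the crawl. (The path is a made-up example.)
JOBDIR = "crawls/products-run1"

# AutoThrottle adapts the per-domain download delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0           # ceiling when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote server

# Retry and depth controls.
RETRY_TIMES = 3
DEPTH_LIMIT = 5

# Pipelines run in ascending order of their number.
# The class paths are hypothetical placeholders.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidateItemPipeline": 100,
    "myproject.pipelines.DropDuplicatesPipeline": 200,
    "myproject.pipelines.DatabaseWriterPipeline": 300,
}
```

A spider can override any of these through its custom_settings attribute, and a one-off run can override both from the command line, for example scrapy crawl products -s AUTOTHROTTLE_MAX_DELAY=60 (products being a hypothetical spider name).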
Reimplementing this for any non-trivial crawl is the kind of work that takes weeks. Scrapy is mature, BSD-licensed, and one pip install away.
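To make the item-pipeline idea concrete, here is a minimal sketch of a database-writer stage. It relies only on the hooks Scrapy calls on pipeline objects (open_spider, process_item, close_spider, in their classic form taking a spider argument). The class name, the items.db file, the table layout, and the assumption that items are dict-like with url and title keys are all invented for the example.

```python
import sqlite3


class SQLiteWriterPipeline:
    """Hypothetical pipeline stage: writes each item to a SQLite table."""

    def __init__(self, db_path="items.db"):
        # db_path is an assumed default; Scrapy can build the object with no args.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and ensure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # The primary key on url plus INSERT OR IGNORE keeps the first copy
        # of an item and silently skips later duplicates.
        self.conn.execute(
            "INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)",
            (item["url"], item.get("title")),
        )
        self.conn.commit()
        return item  # hand the item to the next pipeline stage

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

Registering the class under ITEM_PIPELINES with a number higher than any validation stage would place it at the end of the chain.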
Why bare Scrapy fails on protected sites
Scrapy ships with a Twisted-based HTTP/1.1 + HTTP/2 implementation that has never been Chrome-shaped. The JA4 TLS fingerprint is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the default User-Agent is "Scrapy/X.Y (+https://scrapy.org)". Anti-bot vendors typically block this at Layer 1 (see the four-layer model) before any HTML is served.
The fix is to replace Scrapy's download layer through its download-handler and downloader-middleware hooks. Two production patterns:
- scrapy-impersonate / scrapy-curl-cffi: replaces the downloader with curl_cffi. Defeats medium-strength anti-bot systems. Easy to set up.
- Scrapy + Go TLS sidecar: full Chrome impersonation via uTLS in a Go service. Defeats Akamai Bot Manager and Cloudflare Bot Management. Adds operational overhead, but worth it for high-volume protected scraping. See the dedicated entry.
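The first pattern can be sketched as follows. This is modeled on the configuration shown in the scrapy-impersonate README and should be checked against the installed release; IMPERSONATE_SETTINGS and impersonate_meta are names made up for this example, and "chrome110" is just one of the browser profiles curl_cffi offers.

```python
# Spider-level settings for scrapy-impersonate (verify against the
# version you install; the dict name is hypothetical).
IMPERSONATE_SETTINGS = {
    # The handler is asyncio-based, so Scrapy must run on the asyncio reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    # Route both schemes through the curl_cffi-backed download handler.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_impersonate.ImpersonateDownloadHandler",
        "https": "scrapy_impersonate.ImpersonateDownloadHandler",
    },
}


def impersonate_meta(browser: str = "chrome110") -> dict:
    """Build the per-request meta that selects a browser profile."""
    return {"impersonate": browser}
```

In a spider, the dict would be assigned to custom_settings and each request yielded as scrapy.Request(url, meta=impersonate_meta()).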
For sites that require JS execution, scrapy-playwright or scrapy-camoufox route requests through a real browser instead of the plain HTTP downloader. Browsers are expensive, so send only the requests that need one through the browser, for example with meta={"playwright": True} in scrapy-playwright.
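One way to keep browser usage selective with scrapy-playwright is sketched below. The handler paths and the "playwright" meta key come from that project; the JS_PATH_PREFIXES tuple and the request_meta helper are hypothetical and stand in for whatever rule identifies JS-only pages on the target site.

```python
from urllib.parse import urlparse

# Settings that register scrapy-playwright's download handler. Requests
# without the "playwright" meta key still use Scrapy's regular downloader.
PLAYWRIGHT_SETTINGS = {
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
}

# Hypothetical: path prefixes on the target site that only render with JS.
JS_PATH_PREFIXES = ("/search", "/reviews")


def request_meta(url: str) -> dict:
    """Return browser meta only for URLs assumed to need JS rendering."""
    path = urlparse(url).path
    if path.startswith(JS_PATH_PREFIXES):
        return {"playwright": True}
    return {}
```

A spider would then yield scrapy.Request(url, meta=request_meta(url)), so product pages stay on the cheap HTTP path while search and review pages get a browser.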
Scaling Scrapy beyond one machine
Scrapy is single-process by default. Three scaling patterns:
- scrapy-redis — pulls URLs from a Redis queue. Multiple workers across machines pull from the same queue and write to the same dedup set. Simplest distributed-Scrapy pattern.
- Scrapyd — daemon that deploys spider eggs and runs them with an HTTP API. Useful for cron-driven crawls and as a stepping stone to Kubernetes.
- Zyte (Scrapy Cloud) — managed Scrapy hosting from the original Scrapy team. Spiders deploy with one command; the platform handles queueing, retries, and monitoring.
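For the scrapy-redis pattern, the switch is mostly a settings change. The fragment below uses the class paths documented by scrapy-redis; the Redis URL is a placeholder for a local instance and would differ in any real deployment.

```python
# settings.py additions for scrapy-redis (values are illustrative).

# Replace Scrapy's in-process scheduler with the Redis-backed one,
# so every worker pulls requests from the same shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the request-fingerprint set across workers through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dedup set in Redis when a spider closes,
# so a later run can continue instead of starting over.
SCHEDULER_PERSIST = True

# Placeholder connection string; point this at the shared Redis server.
REDIS_URL = "redis://localhost:6379/0"
```

Workers typically subclass scrapy_redis.spiders.RedisSpider and read their start URLs from a Redis list named by the spider's redis_key attribute, which lets new URLs be pushed while the crawl is running.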
At enterprise scale, estela (Kubernetes-native Scrapy orchestrator) or self-hosted scrapy-cluster (Kafka-backed) are the more common patterns. The framework scales — the work is wiring up the surrounding queue and storage infrastructure to match.
