What Is Scrapy?

Scrapy is the industry-default crawler framework for Python. It provides everything around the actual HTTP request: a URL queue, retry logic, request deduplication, item-pipeline processing, throttling, concurrency, and a middleware system for plugging in proxies, fingerprinting, and stealth tooling. The bare HTTP layer (Twisted-based) is unsuitable for protected sites in 2026, but the surrounding framework is hard to replace for crawls bigger than a few thousand URLs.

Quick facts

Vendor: Scrapy project (maintained by Zyte, formerly Scrapinghub, together with community contributors); BSD 3-Clause license
Language: Python (>= 3.9)
Built-in: Queue, retries, dedup, pipelines, throttling, concurrency, settings layering
Ecosystem: 200+ middleware packages, including scrapy-playwright, scrapy-camoufox, scrapy-redis, scrapy-stealth
Where it loses: Twisted-based default HTTP layer fails against modern anti-bot systems

What Scrapy gives you that a script can't

For 100-URL scrapes a single Python script with curl_cffi and a loop is fine. Past ~1000 URLs the operational surface explodes — what to retry, what to dedupe, where to write the results, how to throttle per-domain, how to resume after a crash. Scrapy gives all of this out of the box:

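  • Built-in queue with priority, depth tracking, and optional disk-backed persistence (set JOBDIR to resume a crawl after it is killed).
  • Per-domain throttling via the AutoThrottle extension (AUTOTHROTTLE_ENABLED), which adapts the request delay to response latency.
  • Request deduplication on request fingerprints (method, canonicalized URL, body); the seen-set survives crawl restarts when JOBDIR is set.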
  • Item pipelines — validators, deduplicators, database writers chain together with one declaration.
  • Settings layering — project defaults overridden by spider-specific settings overridden by command-line flags.
  • The downloader-middleware and download-handler abstractions: most modern stealth tooling plugs in here, including the Go TLS sidecar pattern.
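
To make the throttling bullet concrete, here is a simplified sketch of the delay-update rule that the Scrapy documentation describes for AutoThrottle. It is not the library's code: the function name next_delay is hypothetical, and the keyword arguments stand in for the AUTOTHROTTLE_TARGET_CONCURRENCY, DOWNLOAD_DELAY, and AUTOTHROTTLE_MAX_DELAY settings.

```python
def next_delay(prev_delay, latency, status, *,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the next per-domain download delay in seconds (simplified sketch).

    prev_delay: the delay currently in use for this domain
    latency:    how long the last response took to arrive
    status:     HTTP status code of the last response
    """
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target = latency / target_concurrency
    # Move halfway from the current delay towards the target.
    candidate = (prev_delay + target) / 2.0
    # Clamp between the configured floor and ceiling.
    candidate = min(max(candidate, min_delay), max_delay)
    # Error responses often return quickly; never let them speed the crawl up.
    if status != 200 and candidate <= prev_delay:
        return prev_delay
    return candidate


delay = 5.0  # mirrors the documented default AUTOTHROTTLE_START_DELAY
for latency in (1.0, 1.0, 1.0):
    delay = next_delay(delay, latency, 200)
```

With a steady one-second latency the delay converges towards one second; a slow server pushes it up, capped at the maximum.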

Reimplementing this for any non-trivial crawl is the kind of work that takes weeks. Scrapy is mature, BSD-licensed, and one pip install away.
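
As a rough sense of what "reimplementing this" involves, the sketch below hand-rolls only the deduplication piece using the standard library. It is an illustration under stated assumptions, not Scrapy's implementation: the fingerprint function, the SeenFilter class, and the one-hash-per-line file format are all hypothetical.

```python
import hashlib
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url, body=b""):
    """Hash the parts of a request that decide whether it is a repeat."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment so equivalent URLs collide.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha1()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk)
        digest.update(b"\x00")  # separator so adjacent fields cannot blur together
    return digest.hexdigest()


class SeenFilter:
    """Remember request fingerprints in memory and in an append-only file."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload fingerprints written by a previous run, if any.
        self.seen = set(self.path.read_text().split()) if self.path.exists() else set()

    def is_duplicate(self, method, url, body=b""):
        fp = fingerprint(method, url, body)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        with self.path.open("a") as fh:
            fh.write(fp + "\n")
        return False
```

Even this small piece needs decisions about URL canonicalization and crash-safe persistence, before retries, throttling, or queue priorities enter the picture.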

Why bare Scrapy fails on protected sites

Scrapy ships with a Twisted-based HTTP/1.1 downloader (HTTP/2 is available as an opt-in download handler), and neither has ever been Chrome-shaped. The JA4 TLS fingerprint is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the default User-Agent is "Scrapy/VERSION (+https://scrapy.org)". Most commercial anti-bot vendors block this at Layer 1 (see the four-layer model) before any HTML is served.

The fix is Scrapy's pluggable download path: downloader middlewares and download handlers. Two production patterns:

  • scrapy-impersonate / scrapy-curl-cffi: replaces the default download handler with curl_cffi. Defeats medium-strength anti-bot. Easy to set up.
  • Scrapy + Go TLS sidecar: full Chrome impersonation via utls in a Go service. Defeats Akamai and Cloudflare Bot Management. Carries operational overhead, but worth it for high-volume protected scraping. See the dedicated entry.
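
Both patterns hook into the same request path. The sketch below shows only the shape of that hook, using a hypothetical BrowserHeadersMiddleware and a hypothetical module path myproject.middlewares. Note the limits of the example: setting headers does not change the TLS or HTTP/2 fingerprint, which is exactly why the tools above replace the transport instead.

```python
class BrowserHeadersMiddleware:
    """Hypothetical downloader middleware that fills in browser-like headers."""

    # Illustrative values only; a real deployment would match a specific browser.
    DEFAULTS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request before it is downloaded.
        for name, value in self.DEFAULTS.items():
            # setdefault keeps any header the spider set explicitly.
            request.headers.setdefault(name, value)
        # Returning None lets the request continue through the remaining chain.
        return None


# settings.py fragment: the number sets the middleware's position in the chain.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.BrowserHeadersMiddleware": 543,
}
```

A sidecar or curl_cffi integration follows the same registration pattern but hands the request to a different transport rather than editing it in place.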

For sites that require JS execution, scrapy-playwright or scrapy-camoufox route selected requests through a real browser instead of the default HTTP downloader. Browsers are expensive, so enable browser rendering only on the requests that need it; with scrapy-playwright that is meta={"playwright": True}.
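
A minimal sketch of that selective routing with scrapy-playwright follows. The two settings reflect the project's documented setup at the time of writing and should be checked against the installed version; the JS_RENDERED pattern and the request_meta helper are hypothetical.

```python
import re

# settings.py fragment: register scrapy-playwright as the download handler.
# It only takes over requests that carry meta={"playwright": True}.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# Hypothetical rule: only these URL shapes need a browser to render.
JS_RENDERED = re.compile(r"/(product|search)/")


def request_meta(url):
    """Return the meta dict for a request, enabling the browser only when needed."""
    return {"playwright": True} if JS_RENDERED.search(url) else {}
```

Inside a spider this would be used as scrapy.Request(url, meta=request_meta(url)), so listing pages stay on the cheap HTTP path while product pages get a browser.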

Scaling Scrapy beyond one machine

Scrapy is single-process by default. Three scaling patterns:

  • scrapy-redis — pulls URLs from a Redis queue. Multiple workers across machines pull from the same queue and write to the same dedup set. Simplest distributed-Scrapy pattern.
  • Scrapyd — daemon that deploys spider eggs and runs them with an HTTP API. Useful for cron-driven crawls and as a stepping stone to Kubernetes.
  • Zyte (Scrapy Cloud) — managed Scrapy hosting from the original Scrapy team. Spiders deploy with one command; the platform handles queueing, retries, and monitoring.
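
For the scrapy-redis option, the switch is mostly configuration. The fragment below is a sketch based on the setting names in the scrapy-redis documentation; the Redis URL is a placeholder assumption for a local instance.

```python
# settings.py fragment for a shared, Redis-backed crawl (sketch).

# Pull requests from a Redis queue instead of the in-process scheduler.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis so all workers share one dedup set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dedup set when a worker stops, so the crawl can resume.
SCHEDULER_PERSIST = True

# Placeholder: point every worker at the same Redis instance.
REDIS_URL = "redis://localhost:6379/0"
```

Every worker runs the same spider with these settings; adding capacity means starting another process against the same Redis instance.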

At enterprise scale, estela (Kubernetes-native Scrapy orchestrator) or self-hosted scrapy-cluster (Kafka-backed) are the more common patterns. The framework scales — the work is wiring up the surrounding queue and storage infrastructure to match.
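
Of the options above, Scrapyd is the easiest to drive from a scheduler because a crawl starts with a single HTTP call. The sketch below builds that call with the standard library, assuming Scrapyd's default port 6800 and its schedule.json endpoint; the helper name and the project and spider names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, base_url="http://localhost:6800",
                           **spider_args):
    """Build (but do not send) a POST to Scrapyd's schedule.json endpoint."""
    # Extra keyword arguments are forwarded to the spider as spider arguments.
    payload = {"project": project, "spider": spider, **spider_args}
    return Request(
        f"{base_url.rstrip('/')}/schedule.json",
        data=urlencode(payload).encode(),
        method="POST",
    )


request = build_schedule_request("myproject", "products", category="widgets")
# To actually schedule the job against a running Scrapyd instance:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode())
```

A cron entry or Kubernetes CronJob can run this script on a timer, which is what makes Scrapyd a convenient stepping stone.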

Code example

```python
# A minimal Scrapy spider using the default downloader.
# It relies on a DedupePipeline class defined in the project's pipelines module.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://target.com/category/widgets"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # minimum delay between requests to one domain
        "AUTOTHROTTLE_ENABLED": True,  # adapt the delay to response latency
        "ITEM_PIPELINES": {"myproject.pipelines.DedupePipeline": 300},
    }

    def parse(self, response):
        # Follow every product tile on the category page.
        for link in response.css("a.product-tile::attr(href)").getall():
            yield response.follow(link, callback=self.parse_product)
        # Follow pagination until there is no next page.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
# For protected sites: add a curl_cffi or Go-sidecar downloader middleware.
```
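
The spider registers myproject.pipelines.DedupePipeline but the entry does not show it. One plausible minimal version is sketched below, assuming items are dicts keyed by "url" as in parse_product. The fallback DropItem class is there only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class DedupePipeline:
    """Drop any item whose URL has already been seen during this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item["url"]
        if url in self.seen_urls:
            # Raising DropItem tells Scrapy to discard the item and skip later pipelines.
            raise DropItem(f"duplicate item: {url}")
        self.seen_urls.add(url)
        return item
```

The in-memory set is per process, so a distributed crawl would need a shared store for the same check.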

Frequently asked questions

When should I use Scrapy vs just a Python script?

Use a script for one-off scrapes under ~1000 URLs. Use Scrapy when the crawl is recurring, multi-thousand URLs, needs retries and dedup, or will outlive your interest in maintaining the queue logic. The boilerplate is heavier; the operational payoff is huge.

Can Scrapy use a headless browser?

Yes, via scrapy-playwright or scrapy-camoufox. These plug a browser into Scrapy's download path (scrapy-playwright does so as a download handler), so you can mark specific requests as needing browser rendering and let the rest go through the cheap HTTP path. Mixing browser and non-browser requests in one spider is the typical production pattern.

Is Scrapy still maintained?

Yes. Zyte (founded by the original Scrapy team) sponsors active development, Python 3.13 support landed in 2024, and major releases continue on a roughly annual cadence. The Twisted dependency raises eyebrows but is stable and well-tested.

Last updated: 2026-05-27