What Is Scrapy?

Scrapy is the de facto standard crawling framework for Python. It handles everything around the actual HTTP request so you don't have to: it keeps a queue of URLs to visit, retries failures, skips duplicate URLs, runs the scraped data through processing steps (item pipelines), paces requests (throttling), runs many requests at once (concurrency), and offers a middleware system where you plug in proxies, fingerprinting, and stealth tools. The bare HTTP layer underneath (built on Twisted, a Python networking library) is too easy for protected sites to detect in 2026, but the framework wrapped around it is hard to replace once a crawl grows past a few thousand URLs.

Quick facts

Vendor: Scrapy project (backed by Zyte, formerly Scrapinghub); BSD-3 license
Language: Python (>= 3.9)
Built-in: Queue, retries, dedup, pipelines, throttling, concurrency, settings layering
Ecosystem: 200+ middleware packages, including scrapy-playwright, scrapy-camoufox, scrapy-redis, scrapy-stealth
Where it loses: Twisted-based default HTTP layer fails against modern anti-bot

What Scrapy gives you that a script can't

For a 100-URL scrape, a single Python script with curl_cffi and a loop is fine. Past ~1000 URLs the problems pile up: what to retry, how to avoid scraping the same URL twice (dedupe), where to write results, how to pace requests per site, and how to pick up again after a crash. Scrapy handles all of this out of the box:

  • Built-in queue with priority, depth tracking, and optional disk-backed persistence: set JOBDIR and you can resume a crawl after killing it.
  • Per-domain throttling via the AutoThrottle extension (AUTOTHROTTLE_ENABLED), which automatically slows down or speeds up based on how fast the site responds.
  • Request deduplication: Scrapy fingerprints each request (URL, method, and body) so it never fetches the same one twice within a run, and across restarts when JOBDIR is set.
  • Item pipelines: chain steps like validators, deduplicators, and database writers together with a single declaration.
  • Settings layering: project defaults can be overridden per spider, which can be overridden again by command-line flags.
  • The downloader-middleware and download-handler hooks, where modern stealth tools plug in, including the Go TLS sidecar pattern.
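
As a rough illustration of how several of the features above are switched on, here is a sketch of a settings.py fragment. The setting names are built-in Scrapy settings; the values, the JOBDIR path, and the two pipeline paths marked hypothetical are assumptions chosen for the example, not recommendations.

```python
# settings.py (sketch): setting names are Scrapy built-ins, values are illustrative.

# Queue persistence: scheduler state and seen-request fingerprints are written
# to this directory, so an interrupted crawl can resume from it.
JOBDIR = "crawls/products-run-1"  # assumed path

# Depth tracking: do not follow links more than 5 hops from the start URLs.
DEPTH_LIMIT = 5

# Retries: up to 3 extra attempts for responses with these status codes.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# Concurrency and throttling.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Item pipelines: lower numbers run first.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidatePipeline": 100,  # hypothetical
    "myproject.pipelines.DedupePipeline": 300,
    "myproject.pipelines.DatabasePipeline": 800,  # hypothetical
}
```

With JOBDIR set, stopping the crawl gracefully and starting it again with the same directory continues from the saved queue rather than from the start URLs.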

Rebuilding all this for any non-trivial crawl is weeks of work. Scrapy is mature, BSD-licensed, and one pip install away.
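
The settings layering mentioned above can be pictured with a toy model. This is not Scrapy's implementation (Scrapy tracks a numeric priority per setting); it only mimics the lookup order using the standard library, and the three dictionaries are made-up values.

```python
from collections import ChainMap

# Toy model of settings precedence; the values are assumptions for illustration.
project_settings = {"DOWNLOAD_DELAY": 0.5, "CONCURRENT_REQUESTS": 16}  # settings.py
spider_settings = {"DOWNLOAD_DELAY": 1.0}                              # custom_settings
cmdline_settings = {"CONCURRENT_REQUESTS": 4}                          # -s NAME=VALUE flags

# ChainMap searches left to right, so the command line wins over the spider,
# which wins over the project defaults.
effective = ChainMap(cmdline_settings, spider_settings, project_settings)

print(effective["DOWNLOAD_DELAY"])       # taken from the spider layer
print(effective["CONCURRENT_REQUESTS"])  # taken from the command-line layer
```

On a real project the command-line layer corresponds to something like `scrapy crawl products -s CONCURRENT_REQUESTS=4`.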

Why bare Scrapy fails on protected sites

Scrapy's built-in downloader (Twisted-based, supporting HTTP/1.1 and HTTP/2) has never looked like Chrome, and that is exactly what gives it away. TLS is the encryption behind https, and JA4 is a label derived from how a client opens that connection, so it acts like a signature. Scrapy's JA4 TLS fingerprint isn't Chrome's, its HTTP/2 SETTINGS frame isn't Chrome's, and its default User-Agent literally says "Scrapy/X.Y". Most anti-bot vendors block this at Layer 1 (see the four-layer model) before a single line of HTML is served.

The fix lives in Scrapy's pluggable download layer (downloader middlewares and download handlers). Two production patterns:

  • scrapy-impersonate / scrapy-curl-cffi — swaps Scrapy's downloader for curl_cffi, which reproduces a real browser's TLS handshake. Works with medium-strength anti-bot configurations and is easy to set up.
  • Scrapy + Go TLS sidecar — full Chrome impersonation via utls in a separate Go service. Produces a Chrome-consistent handshake at the network layer. More moving parts to run, but worth it for high-volume authorized scraping of protected sites you are permitted to access. See the dedicated entry.
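
For the first pattern, a settings fragment along these lines is typical. It follows the scrapy-impersonate README as best understood at the time of writing, so treat the handler path and the "impersonate" meta key as things to verify against the version you install; the browser label is an example taken from curl_cffi's target list.

```python
# settings.py (sketch) for scrapy-impersonate; verify names against the project's README.

# Route http and https downloads through the curl_cffi-based download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

# The handler is asyncio-based, so Scrapy must use the asyncio Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Per-request opt-in: pass this dict as `meta` when building a scrapy.Request.
# "chrome110" is an example browser label; pick one your curl_cffi version supports.
IMPERSONATE_META = {"impersonate": "chrome110"}  # hypothetical helper constant
```

In a spider this would be used as `scrapy.Request(url, meta=IMPERSONATE_META)`; requests without the key would be expected to go out without impersonation.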

For sites that need JavaScript to run, scrapy-playwright or scrapy-camoufox swap the downloader for a real browser on a per-request basis. Browsers are expensive, so route only the specific requests that need rendering through the browser via request meta (for scrapy-playwright, meta={"playwright": True}).
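
One way to keep browser usage selective is a small helper that decides, per URL, whether to set the flag. This is a sketch under assumptions: it presumes scrapy-playwright's download handler is already configured, and the path prefixes and the function name `request_meta` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical: paths on the target site that only render with JavaScript.
JS_PATH_PREFIXES = ("/search", "/reviews")

def request_meta(url: str) -> dict:
    """Return the meta dict for a request: browser-rendered only when needed."""
    path = urlparse(url).path
    if path.startswith(JS_PATH_PREFIXES):
        return {"playwright": True}  # handled by the browser
    return {}                        # stays on the cheap HTTP path
```

Inside a spider callback this could be used as `yield scrapy.Request(url, meta=request_meta(url))`, so product pages stay on plain HTTP while search and review pages go through the browser.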

Scaling Scrapy beyond one machine

By default Scrapy runs in a single process. Three ways to scale out:

  • scrapy-redis — pulls URLs from a shared Redis queue. Multiple workers across machines draw from the same queue and write to the same dedup set. The simplest way to distribute Scrapy.
  • Scrapyd — a daemon that deploys packaged spiders (eggs) and runs them through an HTTP API. Handy for cron-driven crawls and as a stepping stone toward Kubernetes.
  • Zyte (Scrapy Cloud) — managed Scrapy hosting from the original Scrapy team. You deploy a spider with one command and the platform handles queueing, retries, and monitoring.
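
For the scrapy-redis option, the switch is mostly configuration. The fragment below is a sketch based on the scrapy-redis README; the Redis URL is an assumed local instance.

```python
# settings.py (sketch) for scrapy-redis; every worker uses the same values.

# Replace the in-process scheduler with one backed by a shared Redis queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the set of seen-request fingerprints across all workers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dedup set in Redis when a worker stops, so crawls can resume.
SCHEDULER_PERSIST = True

# Assumed address of the shared Redis instance.
REDIS_URL = "redis://localhost:6379"
```

Each machine then runs the same spider with `scrapy crawl`, and the workers divide the queue between them.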

At enterprise scale, the more common choices are estela (a Kubernetes-native Scrapy orchestrator) or a self-hosted scrapy-cluster (backed by Kafka). The framework itself scales fine — the real work is wiring up the surrounding queue and storage infrastructure to match.

Code example

python
# A minimal Scrapy spider using the default downloader.
# TLS impersonation is configured separately (see the note at the end).
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://target.com/category/widgets"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
        "ITEM_PIPELINES": {"myproject.pipelines.DedupePipeline": 300},
    }

    def parse(self, response):
        for link in response.css("a.product-tile::attr(href)").getall():
            yield response.follow(link, callback=self.parse_product)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
# For protected sites: add a curl_cffi or Go-sidecar downloader middleware.
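
The spider references myproject.pipelines.DedupePipeline without showing it. A minimal sketch of what such a pipeline might look like follows; keying on the item's "url" field and keeping the seen-set in memory are assumptions, and the fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
import hashlib

try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


class DedupePipeline:
    """Drop items whose URL has already been seen in this run."""

    def __init__(self):
        # In-memory only: the set is lost when the process exits.
        self.seen = set()

    def process_item(self, item, spider):
        # Hash the URL so the set stores fixed-size keys.
        key = hashlib.sha1(item["url"].encode("utf-8")).hexdigest()
        if key in self.seen:
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen.add(key)
        return item
```

Raising DropItem stops the item from reaching later pipelines, while returning it passes it on to the next step.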

Related terms

What Is the Scrapy + Go TLS Sidecar Architecture?
The Scrapy + Go TLS sidecar architecture is the most common production pattern for scraping Akamai- and Cloudflare-protected sites at scale.…
What Is a Web Scraping API?
A web scraping API is a hosted HTTP service that visits a web page for you and hands back the result — rendered HTML, JSON, or already-parse…
What Is curl_cffi?
curl_cffi is a Python HTTP client whose TLS fingerprint looks exactly like real Chrome, Firefox, or Safari. TLS is the encryption layer behi…
What Is the Web Scraping Decision Flow?
The web scraping decision flow is a six-step checklist, ordered cheapest-first, that experienced engineers run through on every new target t…
Web Scraping Tools 2026 — A Comparison
"Web scraping tools" is the whole family of software you use to pull data off websites — and in 2026 that family is big but neatly sorted in…
What Is Playwright?
Playwright is a cross-browser automation framework from Microsoft that drives Chromium, Firefox, and WebKit through a single API. An automat…
Synchronous vs Asynchronous Web Scraping
Synchronous web scraping sends one request at a time and waits ("blocks") until each one finishes before starting the next; asynchronous scr…
What Is a CSS Selector?
A CSS selector is a pattern that picks out specific elements in an HTML document by matching their tag, class, id, attributes, or position. …
What Is an XPath Selector?
XPath (XML Path Language) is a query language for navigating the tree structure of an HTML or XML document to select elements by their path,…

Frequently asked questions

When should I use Scrapy vs just a Python script?

Use a plain script for one-off scrapes under ~1000 URLs. Reach for Scrapy when the crawl recurs, spans many thousands of URLs, needs retries and dedupe, or will outlive your patience for maintaining the queue logic yourself. There's more boilerplate up front, but the operational payoff is huge.

Can Scrapy use a headless browser?

Yes, via scrapy-playwright or scrapy-camoufox. These plug a browser into Scrapy's download layer (scrapy-playwright does so as a download handler), so you can flag the specific requests that need browser rendering and let everything else take the cheap HTTP path. Mixing browser and non-browser requests in one spider is the typical production setup.

Is Scrapy still maintained?

Yes. Zyte (founded by the original Scrapy team) sponsors active development, Python 3.13 support landed in 2024, and major releases keep coming on a roughly annual cadence. The Twisted dependency raises eyebrows, but it's stable and well-tested.

Last updated: 2026-05-31