Scrapy vs Playwright: When to Use Each

By the Scrappey Research Team

Scrapy and Playwright solve different halves of web scraping: Scrapy is an asynchronous crawl framework that fetches and parses HTML over plain HTTP at high throughput, while Playwright drives a real browser (Chromium, Firefox, or WebKit) so JavaScript actually runs. Reach for Scrapy when you need to crawl many static pages cheaply with built-in queueing, retries, deduplication, and item pipelines. Reach for Playwright when content only appears after client-side JavaScript executes. They are not strictly competitors - the scrapy-playwright plugin lets one spider use cheap HTTP for most requests and a browser only for the pages that need it.

Quick facts

| Fact | Detail |
| --- | --- |
| Scrapy is | An async HTTP crawl framework (Twisted) |
| Playwright is | A real-browser automation library |
| JavaScript | Scrapy: no JS; Playwright: full JS rendering |
| Throughput | Scrapy: thousands/min; Playwright: page-bound, heavy |
| Combine them | scrapy-playwright (meta={"playwright": True} per request) |

Scrapy: an async crawl framework, not a browser

Scrapy is built for crawling lots of pages over plain HTTP, fast. It never opens a browser - it sends raw requests through an asynchronous engine (built on Twisted, a Python networking library) so many requests are in flight at once instead of one after another. Around that engine it ships the parts you would otherwise rebuild for every project:

  • A scheduler queue with priorities, depth tracking, and optional disk-backed persistence so a killed crawl can resume.
  • Request deduplication - Scrapy fingerprints each request (URL, method, and body) and drops duplicates within a crawl; the set of seen requests survives a restart only when disk persistence (JOBDIR) is enabled.
  • Retry middleware and the AutoThrottle extension, which re-send failed requests and automatically pace per-domain load based on response latency.
  • Item pipelines - chained steps for validating, cleaning, deduping, and writing records to a file or database.
  • Downloader middleware - the hook where proxies, custom headers, and TLS-impersonation tools plug in.
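
As a rough illustration of the item-pipeline bullet above, the sketch below chains a cleaning step and a dedup step. Scrapy pipelines are plain classes exposing process_item(item, spider); the class names and the "url" and "price" fields are hypothetical, and the fallback DropItem exists only so the snippet runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class CleanPricePipeline:
    """Validate and normalise a hypothetical 'price' field."""

    def process_item(self, item, spider):
        raw = (item.get("price") or "").strip()
        if not raw:
            raise DropItem("missing price")
        # "$1,299.50" -> 1299.5
        item["price"] = float(raw.lstrip("$").replace(",", ""))
        return item


class DedupePipeline:
    """Drop records whose hypothetical 'url' field was already seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen_urls:
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen_urls.add(item["url"])
        return item
```

In a real project these classes would be listed in the ITEM_PIPELINES setting with integer priorities, and Scrapy would call them in ascending priority order for every item a spider yields.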

The catch: Scrapy parses the HTML that the server returns verbatim. If a site builds its content with client-side JavaScript - prices, listings, or reviews injected after load - Scrapy sees an empty shell. It also can't click buttons, scroll, or wait for network-driven updates. For static or server-rendered HTML at scale, that is a feature, not a limitation: skipping the browser is exactly what makes it cheap and fast.
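
To make the "empty shell" concrete, here is a minimal sketch that parses two hypothetical HTML responses without running any JavaScript. The standard-library parser stands in for Scrapy's selectors, and both HTML snippets and the "price" class are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical responses: one server-rendered, one filled in by a script.
SERVER_RENDERED = '<html><body><div class="price">$19.99</div></body></html>'
CLIENT_RENDERED = (
    '<html><body><div class="price"></div>'
    "<script>document.querySelector('.price').textContent = '$19.99'</script>"
    "</body></html>"
)


class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_in_raw_html(html, css_class):
    """Return the text a non-browser parser sees for css_class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(repr(text_in_raw_html(SERVER_RENDERED, "price")))
print(repr(text_in_raw_html(CLIENT_RENDERED, "price")))
```

The first call finds the price in the markup; the second finds an empty string, because the value only exists after a browser executes the script. This toy parser ignores void elements and nested edge cases, so treat it as a demonstration rather than a selector engine.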

Playwright: a real browser for JavaScript-driven pages

Playwright (from Microsoft) automates a real browser engine through a single API, with official bindings for Python, JavaScript/TypeScript, Java, and .NET. It can drive Chromium, Firefox, and WebKit in headed or headless mode, which means JavaScript actually executes and the DOM you read matches what a human would see. That makes it the right tool when:

  • Content is rendered client-side by a single-page app and is absent from the initial HTML.
  • You need to interact - click, type into forms, scroll to trigger lazy loading, or follow a multi-step flow.
  • You must wait on real events with page.wait_for_selector() or page.wait_for_load_state("networkidle") instead of guessing at timing.
  • You need a screenshot, a PDF, or to run in-page JavaScript via page.evaluate().

Playwright's auto-waiting (it waits for elements to be actionable before interacting) makes scripts far less flaky than older sleep-based automation. The cost is weight: each browser instance consumes significant CPU and memory, page loads take seconds rather than milliseconds, and you cannot run thousands of concurrent browser contexts on one machine the way Scrapy runs thousands of concurrent HTTP requests. Use a browser only where the page genuinely requires one.
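
A minimal sketch of the event-based waiting described above, using Playwright's synchronous Python API. It assumes Playwright and a Chromium build are installed (pip install playwright, then playwright install chromium); the function name, default timeout, and any URL or selector you pass in are illustrative choices, not part of the library.

```python
def fetch_rendered_text(url, selector, timeout_ms=10_000):
    """Load url in headless Chromium and return the text of selector."""
    # Imported lazily so this module still imports where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait for the element instead of sleeping for a guessed duration.
            page.wait_for_selector(selector, timeout=timeout_ms)
            return page.inner_text(selector)
        finally:
            # Always release the browser process, even if the wait times out.
            browser.close()
```

A call such as fetch_rendered_text("https://example.com/product/1", ".price") would start a browser, load the page, and shut the browser down again, which is exactly the per-page overhead the paragraph above warns about.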

Combining them, cost, and how to decide

The most efficient production pattern is often both, via the scrapy-playwright plugin. It registers a download handler so you keep Scrapy's queue, retries, dedup, and pipelines, but route only the requests that need rendering through a browser by setting meta={"playwright": True} on those requests; requests without that key stay on Scrapy's regular HTTP path. You pick the engine with PLAYWRIGHT_BROWSER_TYPE, pass launch arguments such as headless mode through PLAYWRIGHT_LAUNCH_OPTIONS, and cap concurrent pages per browser context with PLAYWRIGHT_MAX_PAGES_PER_CONTEXT. Page interactions run through PageMethod objects passed in playwright_page_methods (for example a scroll or a wait), so most of your spider stays plain HTTP and only the expensive pages pay the browser tax. Note that Scrapy must use the asyncio reactor, and on Windows Playwright runs in a separate thread for subprocess compatibility.
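
One way to keep the routing decision in a single place is a small helper that returns the meta dict for each request. This is only a sketch: the helper name and the URL prefixes are hypothetical, and the only piece taken from scrapy-playwright is the "playwright" meta key.

```python
from urllib.parse import urlparse

# Hypothetical URL prefixes whose pages are assumed to need JavaScript.
JS_RENDERED_PREFIXES = ("/product/", "/reviews/")


def request_meta_for(url):
    """Return the meta dict for a request: browser only where needed."""
    path = urlparse(url).path
    if path.startswith(JS_RENDERED_PREFIXES):
        return {"playwright": True}  # rendered by the browser handler
    return {}  # stays on the plain HTTP path
```

Inside a spider this could be used as scrapy.Request(url, meta=request_meta_for(url)), so moving a page type onto or off the browser path means editing one tuple rather than every callback.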

| Factor | Scrapy (HTTP) | Playwright (browser) |
| --- | --- | --- |
| Renders JavaScript | No | Yes |
| Throughput per machine | Very high (async HTTP) | Low (CPU/RAM bound) |
| Cost per page | Cheap | Expensive |
| Interaction (click/scroll/forms) | No | Yes |
| Built-in queue/retry/dedup | Yes | No (you build it) |

Decision guide: default to Scrapy and only add a browser where the page truly needs one. If you are managing proxies, browser pools, TLS handshakes, and retries yourself across many sites, a managed web-data API such as Scrappey can fold proxies, a real browser, and retries into a single call so you skip the browser-pool plumbing entirely. See web scraping tools 2026 for the broader landscape, and crawling vs scraping for why the crawl loop and the extraction step are separate problems.

Code example

```python
# One Scrapy spider, two paths: plain HTTP for listings,
# Playwright only for the JS-rendered product detail pages.
import scrapy
from scrapy_playwright.page import PageMethod

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/category/widgets"]
    custom_settings = {
        # Scrapy must use the asyncio reactor for scrapy-playwright
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
        "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 8,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        # Listing pages are static HTML -> cheap, no browser needed.
        for href in response.css("a.product-tile::attr(href)").getall():
            yield response.follow(
                href,
                callback=self.parse_product,
                meta={
                    "playwright": True,  # render this request in a browser
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", ".price"),
                    ],
                },
            )
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        # By now JS has run; .price exists in the rendered DOM.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```

Frequently asked questions

Is Scrapy or Playwright faster for web scraping?

For static or server-rendered HTML, Scrapy is dramatically faster because it sends asynchronous HTTP requests with no browser overhead, and can often reach thousands of pages per minute on one machine. Playwright is slower per page because it loads each page in a real browser and waits for JavaScript to execute, but it is usually the practical choice when the data simply isn't in the raw HTML.
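
A back-of-the-envelope calculation shows where the gap comes from. The concurrency and per-page timings below are illustrative assumptions, not benchmarks; real numbers depend on the target site, your throttling settings, and your hardware.

```python
def pages_per_minute(concurrency, seconds_per_page):
    """Steady-state throughput when `concurrency` fetches overlap."""
    return concurrency / seconds_per_page * 60


# Assumed: 32 HTTP requests in flight, about 0.5 s per response.
http_rate = pages_per_minute(concurrency=32, seconds_per_page=0.5)

# Assumed: 4 browser pages open at once, about 2 s per rendered page.
browser_rate = pages_per_minute(concurrency=4, seconds_per_page=2.0)

print(f"HTTP: {http_rate:.0f}/min, browser: {browser_rate:.0f}/min")
```

Under these assumptions the HTTP path handles 3,840 pages per minute against 120 for the browser path. The ratio matters more than the absolute figures: a browser lowers the concurrency you can afford and raises the time each page takes.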

Can I use Scrapy and Playwright together?

Yes. The scrapy-playwright plugin adds a download handler so a single spider keeps Scrapy's queue, retries, and pipelines while routing only the requests that need rendering through a browser via meta={"playwright": True}. This hybrid setup keeps most requests on the cheap HTTP path and reserves the expensive browser for the few pages that genuinely require it.

When should I choose Playwright over Scrapy?

Choose Playwright when content is rendered client-side by JavaScript, when you must interact with the page (clicking, typing into forms, scrolling to trigger lazy loading, or stepping through a multi-step flow), or when you need screenshots or to run in-page JavaScript. If the page returns complete HTML on the first request, Scrapy alone is the cheaper and faster choice.

Does Playwright include a crawling framework like Scrapy?

No. Playwright is a browser-automation library, so it has no built-in URL queue, deduplication, retry middleware, or item pipelines - you would build those yourself. Scrapy provides all of that out of the box, which is why many teams run Scrapy as the crawl framework and call Playwright only for the pages that need a browser.
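
For a sense of what "build those yourself" involves, here is a deliberately small sketch of a URL frontier with deduplication and a retry cap. The class and method names are hypothetical, and it leaves out what Scrapy's scheduler adds on top: priorities, persistence, and per-domain pacing.

```python
from collections import deque
from urllib.parse import urldefrag


class Frontier:
    """FIFO URL queue that skips URLs it has already accepted."""

    def __init__(self, max_retries=2):
        self.queue = deque()
        self.seen = set()
        self.retries = {}
        self.max_retries = max_retries

    def add(self, url):
        """Queue url unless it was seen before; return True if queued."""
        url = urldefrag(url).url  # treat /page and /page#top as the same URL
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None

    def retry(self, url):
        """Re-queue a failed URL until max_retries is exhausted."""
        attempts = self.retries.get(url, 0) + 1
        if attempts > self.max_retries:
            return False
        self.retries[url] = attempts
        self.queue.append(url)
        return True
```

A Playwright-only crawler would pop a URL with next(), render it, pass newly discovered links to add(), and call retry() on failures. Every feature beyond that is more code to write and maintain.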

Last updated: 2026-06-16 · Facts last verified: 2026-06-16