Scrapy: an async crawl framework, not a browser
Scrapy is built for crawling lots of pages over plain HTTP, fast. It never opens a browser - it sends raw requests through an asynchronous engine (built on Twisted, a Python networking library) so many requests are in flight at once instead of one after another. Around that engine it ships the parts you would otherwise rebuild for every project:
- A scheduler queue with priorities, depth tracking, and optional disk-backed persistence so a killed crawl can resume.
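- Request deduplication - Scrapy fingerprints each request (method, URL, and body) and drops duplicates within a crawl; the seen-set survives restarts only when a job directory (JOBDIR) is configured.
- Retry middleware that re-sends failed requests, plus the AutoThrottle extension, which (when enabled) adjusts the per-domain download delay based on observed response latency.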
- Item pipelines - chained steps for validating, cleaning, deduping, and writing records to a file or database.
- Downloader middleware - the hook where proxies, custom headers, and TLS-impersonation tools plug in.
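Most of the components listed above are switched on or tuned through project settings. The following is a minimal settings.py sketch rather than a recommended configuration: the setting names are Scrapy's own, but the values, the job directory path, and the myproject.pipelines.CleanItemPipeline class are assumptions made up for illustration.

```python
# settings.py (sketch): values are illustrative, not recommendations.

# Scheduler persistence: keep the request queue and the seen-request set on
# disk so an interrupted crawl can resume from the same directory.
JOBDIR = "crawls/example-run-1"  # hypothetical path

# Depth tracking: stop following links more than 5 hops from a start URL.
DEPTH_LIMIT = 5

# Concurrency: how many requests may be in flight at once.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Retry middleware: re-send requests that fail or return these statuses.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# AutoThrottle: disabled by default; adapts the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Item pipelines: lower numbers run earlier in the chain.
ITEM_PIPELINES = {
    "myproject.pipelines.CleanItemPipeline": 300,  # hypothetical class
}
```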
The catch: Scrapy parses the HTML exactly as the server returns it and does not execute JavaScript. If a site builds its content with client-side JavaScript - prices, listings, or reviews injected after load - Scrapy sees an empty shell. It also can't click buttons, scroll, or wait for network-driven updates. For static or server-rendered HTML at scale, skipping the browser is a feature rather than a limitation: it is exactly what makes Scrapy cheap and fast.
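A quick way to tell which case a page falls into is to check whether the value you want is already present in the raw response body. The sketch below uses only the standard library; the two HTML strings and the helper names (visible_text, present_in_raw_html) are invented for illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html):
    """Return the text a non-JavaScript client could read from raw_html."""
    parser = _TextCollector()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def present_in_raw_html(raw_html, expected):
    """True if `expected` is readable without running any JavaScript."""
    return expected in visible_text(raw_html)


# Hypothetical responses: one server-rendered, one a client-side shell.
SERVER_RENDERED = '<html><body><span class="price">$19.99</span></body></html>'
JS_SHELL = (
    '<html><body><div id="root"></div>'
    '<script>render({"price": "$19.99"})</script></body></html>'
)

print(present_in_raw_html(SERVER_RENDERED, "$19.99"))
print(present_in_raw_html(JS_SHELL, "$19.99"))
```

In the second string the price exists only as an argument inside a script, so an HTTP-only client has nothing to select. That is the situation where a browser, or the JSON endpoint the script calls, becomes necessary.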
Playwright: a real browser for JavaScript-driven pages
Playwright (from Microsoft) automates a real browser engine through a single API, with official bindings for Python, JavaScript/TypeScript, Java, and .NET. It can drive Chromium, Firefox, and WebKit in headed or headless mode, which means JavaScript actually executes and the DOM you read matches what a human would see. That makes it the right tool when:
- Content is rendered client-side by a single-page app and is absent from the initial HTML.
- You need to interact - click, type into forms, scroll to trigger lazy loading, or follow a multi-step flow.
- You must wait on real events with page.wait_for_selector() or page.wait_for_load_state("networkidle") instead of guessing at timing.
- You need a screenshot, a PDF, or to run in-page JavaScript via page.evaluate().
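The cases in the list above can be combined in one short script. This is a minimal sketch using Playwright's synchronous Python API, assuming Playwright and a Chromium build are installed; the function name fetch_rendered_listing, its parameters, and the scroll distance are hypothetical choices for illustration.

```python
def fetch_rendered_listing(url, item_selector, screenshot_path=None):
    """Load `url` in headless Chromium and return (html, item_count)."""
    # Imported lazily so this module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait for a real event instead of sleeping for a fixed time.
            page.wait_for_selector(item_selector)
            # Scroll to trigger lazy loading, then let the network settle.
            page.mouse.wheel(0, 4000)
            page.wait_for_load_state("networkidle")
            # Run JavaScript inside the page to count the rendered items.
            item_count = page.evaluate(
                "selector => document.querySelectorAll(selector).length",
                item_selector,
            )
            if screenshot_path is not None:
                page.screenshot(path=screenshot_path, full_page=True)
            return page.content(), item_count
        finally:
            browser.close()
```

A call such as fetch_rendered_listing("https://example.com/products", "div.product") would return the rendered HTML and the number of matching elements; the URL and selector here are placeholders, not a real target.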
Playwright's auto-waiting (it waits for elements to be actionable before interacting) makes scripts far less flaky than older sleep-based automation. The cost is weight: each browser instance consumes significant CPU and memory, page loads are seconds rather than milliseconds, and you cannot run thousands of concurrent browser contexts on one machine the way Scrapy runs thousands of concurrent HTTP requests. Use a browser only where the page genuinely requires one.
Combining them, cost, and how to decide
The most efficient production pattern is often both, via the scrapy-playwright plugin. It registers a downloader handler so you keep Scrapy's queue, retries, dedup, and pipelines, and route only the requests that need rendering through a browser by setting meta={"playwright": True} on them. You pick the engine with PLAYWRIGHT_BROWSER_TYPE, pass browser launch arguments such as headless mode through PLAYWRIGHT_LAUNCH_OPTIONS, and cap the number of concurrent pages per browser context with PLAYWRIGHT_MAX_PAGES_PER_CONTEXT. Page interactions run through PageMethod objects passed in the playwright_page_methods meta key (for example a scroll or a wait), so most of your spider stays plain HTTP and only the expensive pages pay the browser tax. Note that Scrapy must run on the asyncio-based Twisted reactor (the TWISTED_REACTOR setting), and that on Windows the plugin runs Playwright in a separate thread for subprocess compatibility.
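Put together, the wiring might look like the sketch below. The setting names and the PageMethod import follow the scrapy-playwright README as far as I know it, so check them against the version you install; the page cap, the browser_meta helper, and the scroll expression are assumptions for illustration. The two halves would normally live in separate files.

```python
# --- settings.py (sketch) ---
# Send http/https downloads through scrapy-playwright's handler. Requests
# without the "playwright" meta key still go out as plain HTTP.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # illustrative cap


# --- spider helper (sketch) ---
def browser_meta(wait_selector):
    """Build request meta that routes one request through the browser."""
    # Imported lazily so this sketch loads without the plugin installed.
    from scrapy_playwright.page import PageMethod

    return {
        "playwright": True,
        "playwright_page_methods": [
            # Wait until the rendered content exists, then scroll once.
            PageMethod("wait_for_selector", wait_selector),
            PageMethod(
                "evaluate",
                "window.scrollBy(0, document.body.scrollHeight)",
            ),
        ],
    }
```

Inside a spider, a page that needs rendering would be requested with scrapy.Request(url, meta=browser_meta("div.review")), while every other page uses a plain scrapy.Request(url) and never touches the browser. The selector is a placeholder.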
| Factor | Scrapy (HTTP) | Playwright (browser) |
|---|---|---|
| Renders JavaScript | No | Yes |
| Throughput per machine | Very high (async HTTP) | Low (CPU/RAM bound) |
| Cost per page | Cheap | Expensive |
| Interaction (click/scroll/forms) | No | Yes |
| Built-in queue/retry/dedup | Yes | No (you build it) |
Decision guide: default to Scrapy and only add a browser where the page truly needs one. If you are managing proxies, browser pools, TLS handshakes, and retries yourself across many sites, a managed web-data API such as Scrappey can fold proxies, a real browser, and retries into a single call so you skip the browser-pool plumbing entirely. See web scraping tools 2026 for the broader landscape, and crawling vs scraping for why the crawl loop and the extraction step are separate problems.
