What Is Playwright?

Playwright is a cross-browser automation framework from Microsoft that drives Chromium, Firefox, and WebKit through a single API. An automation framework means your code can control a real browser - opening pages, clicking, typing - instead of just downloading raw HTML. Released in 2020 as a Puppeteer successor, it added auto-waiting (it waits for elements to be ready so you don't have to guess), parallel browser contexts (multiple isolated sessions at once), and first-class support for Python, .NET, and Java alongside Node.js. In scraping it is the default browser-automation choice when JavaScript execution is required - but it ships with default fingerprints (identifying traits a site can read) that anti-bot vendors detect immediately, so production scrapers run a patched variant (Camoufox, PatchRight, CloakBrowser) rather than vanilla Playwright.

Quick facts

Vendor: Microsoft (open-source, Apache 2.0)
Languages: Python, Node.js / TypeScript, .NET, Java
Browsers: Chromium, Firefox, WebKit (via patched binaries it ships)
Protocol: Chrome DevTools Protocol (CDP) for Chromium; bidirectional WebSocket
Default detection: Block-grade on Akamai, Kasada, Cloudflare BM out of the box

Where Playwright fits in scraping

Playwright is the right tool when the data is rendered client-side - built by JavaScript in the browser - so a plain HTTP client can't reach it: single-page apps that fetch via XHR (background requests) after the first paint, infinite-scroll lists, OAuth login flows, and anything else that requires real DOM events like clicks. The trade-off is weight: it uses roughly 200MB of RAM per browser context - far more than a lightweight HTTP client like curl_cffi - so use it only when the lighter approach doesn't work.
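One way to apply the "lighter approach first" rule is to try the plain HTTP path and fall back to a browser only when the expected content is missing. The sketch below is illustrative only: `fetch_with_fallback`, `marker`, and the two fetcher callables are hypothetical names, and the marker check assumes you know a string that appears in the page only once the data has rendered.

```python
from typing import Callable, Tuple


def fetch_with_fallback(
    url: str,
    marker: str,
    http_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (html, path), where path is "http" or "browser".

    http_fetch and browser_fetch are caller-supplied functions that take a
    URL and return HTML. marker is a string expected only in rendered data.
    """
    html = http_fetch(url)
    if marker in html:
        # The cheap path already contains the data; no browser needed.
        return html, "http"
    # Data is built client-side, so pay the cost of a real browser.
    return browser_fetch(url), "browser"
```

In practice `http_fetch` could wrap an HTTP client and `browser_fetch` a Playwright routine; logging the returned path shows how often the browser is actually needed.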

The Python API is the most common in scraping. async_playwright integrates cleanly with asyncio (Python's async system), and scrapy-playwright plugs it into Scrapy as a download handler, so a crawl uses a real browser only on the specific pages that need one. The Node.js version is the original and slightly ahead on features, but the Python bindings are stable and cover what scraping needs.
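As a rough sketch of how "a browser only on the pages that need one" can look with scrapy-playwright: the project settings route requests through its download handler, and only requests whose `meta` contains `playwright: True` are rendered in a browser. The URL patterns and the `request_meta` helper are hypothetical; the settings values follow the scrapy-playwright README and should be checked against the version you install.

```python
import re

# Hypothetical patterns for pages that are rendered client-side.
BROWSER_PATTERNS = [re.compile(r"/app/"), re.compile(r"/search\?")]

# Settings that scrapy-playwright documents for a Scrapy project.
SCRAPY_PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


def request_meta(url: str) -> dict:
    """Build Scrapy request meta: opt in to Playwright only for matching URLs."""
    needs_browser = any(p.search(url) for p in BROWSER_PATTERNS)
    return {"playwright": True} if needs_browser else {}
```

Inside a spider this would be used as `scrapy.Request(url, meta=request_meta(url))`; requests with empty meta stay on Scrapy's normal HTTP path.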

Why default Playwright gets blocked

Vanilla (unmodified) Playwright is detected on multiple surfaces at once - each one a separate giveaway:

  • navigator.webdriver === true: the most-checked flag; it openly announces "a browser is being automated" and is set by Playwright and Selenium alike.
  • CDP connection signal: the channel Playwright uses to control Chrome leaves traces; anti-bot scripts probe for side effects of the Runtime domain being enabled and for Runtime.evaluate timing artifacts. (The window.cdc_ properties they also look for are left by Selenium's ChromeDriver, not by Playwright.)
  • Headless mode tells: running without a visible window leaves gaps a real browser wouldn't have, such as missing chrome.runtime, missing plugins, a languages array of length 1, and no permissions API.
  • Function.toString() inspection: a site can ask a browser function to print its own source; any stealth plugin that patches methods at the JS level fails this check (see the toString inspection entry).
  • Default User-Agent: in headless mode it includes "HeadlessChrome" unless explicitly overridden, which flags the request instantly.

Setting headless: false and overriding the User-Agent removes the cheapest signals, but the CDP signal and toString inspection still fire. Presenting a consistent fingerprint in production generally requires a patched fork rather than runtime configuration.
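The cheaper signals from the list above can be read from inside a page, which makes it possible to see what a given configuration exposes. This is a minimal sketch, not a full detection suite: `PROBE_JS`, `flag_signals`, and `probe` are hypothetical names, and it does not cover the CDP or toString checks.

```python
from typing import List

# JavaScript evaluated in the page; it reads the properties named in the list above.
PROBE_JS = """() => ({
    webdriver: navigator.webdriver === true,
    userAgent: navigator.userAgent,
    languages: navigator.languages.length,
    plugins: navigator.plugins.length,
    hasChromeRuntime: !!(window.chrome && window.chrome.runtime),
})"""


def flag_signals(report: dict) -> List[str]:
    """Turn the probe result into a list of human-readable giveaways."""
    flags = []
    if report.get("webdriver"):
        flags.append("navigator.webdriver is true")
    if "HeadlessChrome" in report.get("userAgent", ""):
        flags.append("User-Agent contains HeadlessChrome")
    if report.get("languages", 0) <= 1:
        flags.append("languages array has at most one entry")
    if report.get("plugins", 0) == 0:
        flags.append("no plugins reported")
    if not report.get("hasChromeRuntime", False):
        flags.append("chrome.runtime is missing")
    return flags


async def probe(page) -> List[str]:
    """Run the probe on an already-open Playwright page (async API)."""
    return flag_signals(await page.evaluate(PROBE_JS))
```

Calling `await probe(page)` before and after a configuration change shows which of these signals the change actually removed.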

Playwright vs Puppeteer vs Selenium

Picking between the three:

  • Playwright: multi-browser, multi-language, modern auto-wait API. Default choice for new scrapers in Python or Node, and the quickest of the three to learn.
  • Puppeteer: Node-only, Chromium-only. Smaller API surface, mature ecosystem, slightly faster startup. Pick it if you're Node-only and don't need Firefox/WebKit.
  • Selenium: widest browser support (Safari, Edge, even mobile WebDriver), oldest API. Pick it if you need Safari testing or have an existing Selenium codebase.

All three are about equally easy to detect on a default install. Patched variants exist for Playwright (Camoufox, PatchRight) and for Selenium (undetected-chromedriver, SeleniumBase UC), so the stealth ecosystem is the practical tiebreaker.

Code example

python
# Async Playwright with a residential proxy and User-Agent override
import asyncio

from playwright.async_api import async_playwright

async def scrape(url, proxy_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,                          # headed mode avoids the headless-only tells
            proxy={"server": proxy_url},
        )
        try:
            ctx = await browser.new_context(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/131.0.0.0 Safari/537.36",
                locale="en-US",
                viewport={"width": 1920, "height": 1080},
            )
            page = await ctx.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            await page.wait_for_timeout(2000)         # fixed pause to let XHRs settle
            return await page.content()
        finally:
            await browser.close()                     # release the browser even if a step fails

# Run with: html = asyncio.run(scrape(url, proxy_url))
# Works on simple sites; recognized by stronger systems like Akamai/Kasada. Camoufox or PatchRight present more consistent fingerprints.
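Residential proxies usually require credentials. Playwright's `proxy` option accepts `username` and `password` as separate fields next to `server`, so a proxy URL that embeds them can be split first. The helper below is a sketch under that assumption; `playwright_proxy` is a hypothetical name and the proxy host is a placeholder.

```python
from urllib.parse import unquote, urlsplit


def playwright_proxy(proxy_url: str) -> dict:
    """Split scheme://user:pass@host:port into Playwright's proxy dict."""
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        raise ValueError(f"proxy URL needs a scheme and host: {proxy_url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    proxy = {"server": server}
    # Credentials are percent-encoded inside URLs; decode before passing them on.
    if parts.username:
        proxy["username"] = unquote(parts.username)
    if parts.password:
        proxy["password"] = unquote(parts.password)
    return proxy
```

The result can be passed directly as `proxy=playwright_proxy(proxy_url)` in the `launch` call.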

Frequently asked questions

Is Playwright better than Puppeteer for scraping?

For Node-only Chromium scraping they're interchangeable - pick by team familiarity. Playwright wins if you need Python or Firefox/WebKit. Both lose to anti-bot systems on default settings, and both have patched variants that fix it.

How does Playwright interact with Cloudflare-protected sites?

On free-tier Cloudflare and Bot Fight Mode it often works with a residential proxy (a real-looking home IP address). Against Cloudflare Bot Management Enterprise it is typically flagged: the JA4 (a TLS handshake fingerprint) plus CDP signals are recognized. Production setups for authorized access tend to use a patched fork such as Camoufox or a managed API.

Why use scrapy-playwright instead of just Playwright?

When the crawl is bigger than ~1000 URLs and you want Scrapy's queue, retries, deduplication, and item pipelines, but only some pages need a browser. scrapy-playwright lets you mark specific requests as needing a browser; the rest go through the cheap, fast HTTP path.

Last updated: 2026-05-31