How to Scrape JavaScript-Heavy Websites

Pim · Scrappey Research

June 16, 2026 · 5 min read

JavaScript-heavy websites build their content in the browser after the first response, so a plain HTTP request returns an almost-empty HTML shell. To scrape them you either call the hidden JSON API the page uses to fetch its data, or render the page in a headless browser such as Playwright and read the DOM once it has loaded. The hidden-API route is almost always faster and cleaner, so the first move is to inspect the network tab and check whether the data arrives as JSON. Only when that endpoint is unavailable, or requires signed, short-lived tokens that are impractical to reproduce, do you reach for a real browser, where the hard part becomes waiting for the right content to appear before you read it.

Symptom	requests/curl returns an empty shell; data appears only in a real browser
Best route	Find the background JSON/XHR or GraphQL endpoint and call it directly
Fallback	Headless browser (Playwright, Puppeteer, Selenium) that renders the JS
Key skill	Waiting correctly: wait_for_selector or networkidle, not fixed sleeps
Frameworks	React, Vue, Angular, Svelte SPAs and lazy-loaded / infinite-scroll feeds

Detecting when JavaScript rendering is actually needed

Before reaching for a browser, confirm you even need one - many sites that feel dynamic still ship usable data in the first response. The quickest test is to compare what your HTTP client sees against what the browser shows. Fetch the URL with requests or curl and search the raw response for a value you can see on the page (a product name, a price). If it is there, no rendering is needed and you can parse the HTML directly. If the response is a near-empty <div id="root"></div> shell with a bundle of scripts, the content is built client-side.
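The raw-response test described above can be sketched as a small helper. This is a minimal illustration, not a definitive detector: the function name, the sample markup, and the normalisation choices (decoding HTML entities, collapsing whitespace, ignoring case) are assumptions made for the example.

```python
import html
import re


def _normalise(text: str) -> str:
    """Decode entities (&amp; -> &), collapse whitespace, and ignore case."""
    return re.sub(r"\s+", " ", html.unescape(text)).strip().casefold()


def value_in_raw_response(raw_html: str, needle: str) -> bool:
    """True if a value you can see in the browser is already in the raw HTML."""
    return _normalise(needle) in _normalise(raw_html)


# Hypothetical responses: a client-rendered shell and a server-rendered page.
SHELL = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
SERVER_RENDERED = "<html><body><h1>Blue  Widget</h1><span>Tom &amp; Co</span></body></html>"

for label, body in (("shell", SHELL), ("server-rendered", SERVER_RENDERED)):
    verdict = "parse the HTML directly" if value_in_raw_response(body, "Blue Widget") else "content is built client-side"
    print(f"{label}: {verdict}")
```

In practice raw_html would be the body you fetched with requests or curl, and the needle a product name or price copied from the rendered page.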

A second confirmation: in Chrome DevTools, open the Command Menu (Ctrl+Shift+P, or Cmd+Shift+P on macOS), run "Disable JavaScript", and reload. If the page goes blank or shows a "Please enable JavaScript" notice, the content is JS-rendered. Single-page apps built with React, Vue, Angular, or Svelte are the common case - they serve a thin shell and populate it after the bundle executes. "View Page Source" shows the original server HTML, while "Inspect" shows the live DOM after scripts run; a large gap between the two is the clearest signal that rendering happens in the browser.
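One rough way to put a number on the gap between "View Page Source" and the live DOM is to compare how much visible text each contains. The sketch below uses only the standard library; the function names and the idea of expressing the gap as a fraction are assumptions for illustration, and the live DOM markup would have to come from somewhere else (for example DevTools "Copy outerHTML", or a headless browser).

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(markup: str) -> int:
    """Number of characters of human-visible text in the markup."""
    parser = _VisibleText()
    parser.feed(markup)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def render_gap(source_html: str, live_dom_html: str) -> float:
    """Fraction (0.0 to 1.0) of visible text that exists only after scripts ran."""
    live = visible_text_length(live_dom_html)
    if live == 0:
        return 0.0
    return max(0.0, (live - visible_text_length(source_html)) / live)
```

A value near 1.0 suggests almost everything is built client-side; a value near 0.0 suggests the server already shipped the content. The threshold at which you decide to render is a judgment call.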

The hidden-API approach: inspect the network tab first

The data a SPA renders almost always arrives over the wire as JSON, and calling that endpoint directly is the best route - no browser, no DOM parsing, far less to break. Open DevTools, go to the Network tab, filter to Fetch/XHR, and reload (or scroll, click, or paginate to trigger the action you care about). Watch for requests that return JSON - often paths like /api/products, /graphql, or a versioned /v2/... endpoint. Click one, check the Response tab, and you will usually find clean structured data with the exact fields you want.
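If clicking through requests by hand gets tedious, the same inspection can be done offline. The sketch below assumes you exported the Network tab as a HAR file and loaded it with json.load; it lists every entry whose response MIME type mentions JSON. The function name and the shape of the returned dictionaries are illustrative choices, while the log.entries, request, and response.content.mimeType fields come from the HAR format itself.

```python
def json_endpoints_from_har(har: dict) -> list:
    """Return method, URL, status, and size for each JSON response in a HAR log."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        content = response.get("content", {})
        mime = content.get("mimeType", "")
        # Matches application/json and variants such as
        # application/graphql-response+json.
        if "json" in mime.lower():
            found.append({
                "method": request.get("method"),
                "url": request.get("url"),
                "status": response.get("status"),
                "size": content.get("size"),
            })
    return found
```

Typical use would be loading the export with json.load(open("shop.har", encoding="utf-8")), where shop.har is a hypothetical file name, and printing the result to shortlist candidate endpoints before replicating one in code.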

From there, replicate the request from your code. Copy the request as cURL (right-click - Copy - Copy as cURL) to capture its method, query parameters, and headers, then translate that into Python. Pay attention to headers the server requires: Accept: application/json, an X-CSRF-Token or Authorization bearer token, a Referer, and sometimes an API key embedded in the page. Many of these endpoints are paginated with a page, offset, or cursor parameter, so you loop through them to collect everything. This is faster and more stable than scraping rendered HTML, because JSON field names change far less often than CSS class names and layout.
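Cursor-based pagination follows the same loop as page numbers, except the server tells you where to resume. The sketch below keeps the HTTP call out of the loop by accepting a fetch_page callable, so it can be exercised without a network. The field names "items" and "next_cursor" are hypothetical; substitute whatever the endpoint you found in the network tab actually returns.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_pages(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield every record from a cursor-paginated endpoint.

    fetch_page(cursor) must return the decoded JSON body for one page;
    it is called with None for the first page.
    """
    cursor = None
    for _ in range(max_pages):
        payload = fetch_page(cursor)
        yield from payload.get("items", [])       # hypothetical field name
        cursor = payload.get("next_cursor")       # hypothetical field name
        if not cursor:
            return
    # Safety cap: a server that always returns a cursor would loop forever.
    raise RuntimeError(f"stopped after {max_pages} pages; cursor never ran out")


# In-memory stand-in for the real endpoint, keyed by cursor.
FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "abc"},
    "abc": {"items": [{"id": 3}], "next_cursor": None},
}

records = list(iter_cursor_pages(lambda cursor: FAKE_PAGES[cursor]))
print(len(records), "records")
```

With a real endpoint, fetch_page would wrap a requests.get call that passes the cursor as a query parameter along with the headers copied from the cURL command, calls raise_for_status(), and returns r.json().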

When you must render: headless browsers and waiting strategies

If the endpoint is signed, short-lived, hidden behind WebSockets, or only returns pre-rendered HTML fragments, render the page in a headless browser. Playwright is a strong default - it drives Chromium, Firefox, and WebKit from one API, with mature Python, Node, and .NET bindings; Puppeteer (Chrome-focused, Node) and Selenium (the widest language and legacy-browser support) are reasonable alternatives depending on your stack.

The part people get wrong is waiting. A fixed sleep(5) is both slow and flaky: it wastes time when the page is fast and still fails when the page is slow. Prefer event-driven waits. page.wait_for_selector('.product-card') blocks until the specific element you need exists, while page.wait_for_load_state('networkidle') waits until there have been no network connections for at least 500 ms. The latter is useful for AJAX-driven pages, but it can stall on sites that poll continuously and Playwright's own documentation marks it as discouraged, so treat it as a fallback and set an explicit timeout. For interactive content, trigger the action (click "Load more", scroll a feed) and then wait for the new node.

Rendering is resource-heavy at scale, so a managed web-data API such as Scrappey can render the page and return the final HTML or a screenshot in a single call, handling the browser, proxies, and retries for you when running your own headless fleet is more than you want to maintain.
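The trigger-then-wait pattern for interactive content might look like the sketch below, written against Playwright's sync Python API. The selectors (".product-card", "text=Load more"), the URL, and the function names are assumptions for illustration. The helper takes an already-open page so the waiting logic stays separate from browser setup, and Playwright is imported lazily inside run() for the same reason.

```python
def load_all_cards(page, card_selector=".product-card",
                   button_selector="text=Load more",
                   max_clicks=20, timeout_ms=15000):
    """Click a 'Load more' button until it disappears, waiting for new cards each time."""
    # Wait for the first batch instead of sleeping for a fixed time.
    page.wait_for_selector(card_selector, timeout=timeout_ms)
    for _ in range(max_clicks):
        button = page.locator(button_selector)
        if button.count() == 0 or not button.first.is_visible():
            break  # nothing left to load
        before = page.locator(card_selector).count()
        button.first.click()
        # Event-driven wait: continue as soon as one card beyond the previous
        # count is attached; raises Playwright's TimeoutError if none arrives.
        page.locator(card_selector).nth(before).wait_for(
            state="attached", timeout=timeout_ms
        )
    return page.locator(card_selector).all_inner_texts()


def run(url="https://shop.example.com/products"):
    """Open the (hypothetical) listing page and collect the text of every card."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return load_all_cards(page)
        finally:
            browser.close()
```

The max_clicks cap is a design choice to keep a misbehaving page from looping indefinitely; an infinite-scroll feed would replace the click with a scroll action but keep the same wait on the next card.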

Code example

python

import requests
from playwright.sync_api import sync_playwright

# --- Route 1 (preferred): call the hidden JSON API directly ---
# Found in DevTools -> Network -> Fetch/XHR while the SPA loads its data.
def scrape_via_hidden_api():
    headers = {
        "Accept": "application/json",
        "Referer": "https://shop.example.com/",
        # "X-CSRF-Token": "..."  # add tokens/keys the endpoint requires
    }
    items = []
    page = 1
    while True:
        r = requests.get(
            "https://shop.example.com/api/products",
            params={"page": page, "limit": 50},
            headers=headers,
            timeout=20,
        )
        r.raise_for_status()
        data = r.json()
        items.extend(data["results"])
        if not data.get("has_next"):
            break
        page += 1
    return items


# --- Route 2 (fallback): render the SPA when there is no clean API ---
def scrape_via_browser():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto("https://shop.example.com/products", wait_until="domcontentloaded")
            # Wait for the actual content, not a fixed sleep.
            page.wait_for_selector(".product-card", timeout=15000)
            cards = page.query_selector_all(".product-card")
            # Assumes every card has .name and .price children;
            # query_selector returns None when one is missing.
            return [{
                "name": c.query_selector(".name").inner_text(),
                "price": c.query_selector(".price").inner_text(),
            } for c in cards]
        finally:
            browser.close()  # release the browser even if a wait times out


if __name__ == "__main__":
    try:
        records = scrape_via_hidden_api()   # try the fast path first
    except (requests.RequestException, KeyError, ValueError):  # network, schema, or JSON errors
        records = scrape_via_browser()      # render only if needed
    print(len(records), "records")

Frequently asked questions

How do I know if a website needs JavaScript rendering?

Fetch the page with a plain HTTP client like requests or curl and search the raw response for a value you can see in the browser. If the value is present, you can parse the HTML directly; if you only get an empty shell with script tags, the content is built client-side and you will need the hidden API or a headless browser. Disabling JavaScript in DevTools and reloading is a fast confirmation - if the content vanishes, it is JS-rendered.

Why is calling the hidden API better than using a browser?

The background JSON endpoint returns clean, structured data with the exact fields you want, so there is no DOM to render and no fragile CSS selectors to maintain. It is faster, uses far less memory and CPU than spinning up a browser, and tends to be more stable because JSON field names change less often than page layout. Use a headless browser only when the endpoint is signed, encrypted, short-lived, or simply not exposed.

What is the right way to wait for content in Playwright?

Use event-driven waits instead of fixed sleeps. page.wait_for_selector targets the specific element you need and continues the moment it appears, while page.wait_for_load_state('networkidle') waits for background requests to settle, which suits AJAX-heavy pages. Always set a timeout so your script fails fast rather than hanging if the condition is never met.

Can a managed scraping API handle JavaScript-heavy sites for me?

Yes. A managed web-data API renders the page in a real browser on its own infrastructure and returns the final HTML, JSON, or a screenshot, while handling proxies, browser sessions, and retries in a single request. This is convenient when running and scaling your own headless browser fleet is more overhead than you want, though for simple cases the direct hidden-API approach is still the cheapest and fastest.

Last updated: 2026-06-16 · Facts last verified: 2026-06-16