The 6-Step Web Scraping Decision Flow

By the Scrappey Research Team

The web scraping decision flow is a six-step checklist, preceded by a recon step (step 0) and ordered cheapest-first, that experienced engineers run through on every new target they are permitted to access. Try the steps in order and stop at the first one that works. Each step further down costs more engineering effort, more infrastructure, and more money per request. In practice, most production scraping is handled by steps 1–3 (the mobile API, an XHR endpoint, or JSON already sitting inside the HTML). The general principle: start with the simplest option, because the mobile app often reaches the same data through a more direct, less heavily guarded endpoint.

Step 0	Identify the anti-bot vendor (Wappalyzer, wafw00f, Burp + MCP)
Step 1	Find the mobile API (HTTPToolkit on rooted AVD)
Step 2	Find the XHR / GraphQL endpoint (DevTools, Burp)
Step 3	Look for JSON embedded in HTML (__NEXT_DATA__, chompjs)
Step 4	HTTP scraping with curl_cffi + residential proxy
Step 5	Browser with C++ patches (Camoufox / CloakBrowser / PatchRight)
Step 6	Managed scraping API (Scrappey / Bright Data / Zyte)
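The "try in order, stop at the first one that works" rule can be sketched as a small driver loop. This is a minimal illustration, not a reference implementation: `run_flow` and the three stand-in step functions are hypothetical names, and each real step would contain its own request logic.

```python
from typing import Callable, Optional

# Each step is a (label, attempt) pair; attempt returns data, or None on failure.
Step = tuple[str, Callable[[], Optional[dict]]]

def run_flow(steps: list[Step]) -> tuple[Optional[str], Optional[dict]]:
    """Try steps in order and return the first label whose attempt yields data."""
    for label, attempt in steps:
        try:
            data = attempt()
        except Exception:
            data = None  # treat any error as "this step did not work"
        if data is not None:
            return label, data
    return None, None

# Hypothetical stand-ins for real step implementations.
def mobile_api() -> Optional[dict]:
    return None  # pretend the app's traffic could not be captured

def xhr_endpoint() -> Optional[dict]:
    return {"price": 19.99}

def json_in_html() -> Optional[dict]:
    return {"price": 19.99}

steps = [
    ("1 mobile API", mobile_api),
    ("2 XHR", xhr_endpoint),
    ("3 JSON-in-HTML", json_in_html),
]
label, data = run_flow(steps)
print(label, data)
```

Because the loop returns on the first success, the more expensive steps further down the list are never attempted.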

Why the order matters

The steps get harder as you go down, so the earlier you stop the easier your life is. Step 1 (mobile API) hands you clean JSON over a relaxed HTTP endpoint, and the only price is an afternoon learning HTTPToolkit (a tool that lets you watch an app's network traffic). Step 2 (XHR — the background requests a page makes to fetch data) also gives you JSON, but from an endpoint that might be guarded. Step 3 (JSON-in-HTML) is the same data as a plain string you parse, with no browser at all. Steps 4-6 each pile on more infrastructure and budget.

The cost ladder is real. Step 4 needs residential proxies (~$2–10/GB). Step 5 needs a patched-browser binary plus 200MB RAM per instance plus proxies. Step 6 is per-request pricing on managed APIs ($0.20–$3 per 1,000 requests). Starting at step 5 when step 1 would have worked is a recurring waste of engineering time, and it is common when teams don't consciously walk the flow.
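To compare a per-GB proxy price with a per-1,000-requests API price, convert both to the same unit. The sketch below does that arithmetic; the page size and the $5/GB rate are assumed values chosen for illustration, and `proxy_cost_per_1000` is a hypothetical helper name.

```python
def proxy_cost_per_1000(page_kb: float, usd_per_gb: float) -> float:
    """Bandwidth cost in USD for 1,000 requests of page_kb kilobytes each."""
    gb_per_1000 = page_kb * 1000 / 1_000_000  # KB -> GB, decimal units
    return gb_per_1000 * usd_per_gb

# Assumed values, for illustration only.
page_kb = 500      # average response size per request
usd_per_gb = 5.0   # a mid-range residential proxy price

cost = proxy_cost_per_1000(page_kb, usd_per_gb)
print(f"Step 4 proxy bandwidth: ${cost:.2f} per 1,000 requests")
```

Under these assumptions the bandwidth alone comes to $2.50 per 1,000 requests, so response size matters as much as the per-GB rate when deciding whether step 4 is actually cheaper than step 6.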

Step-by-step walkthrough

Step 0 — Recon. Before anything, identify the stack. Install Wappalyzer (a Chrome extension) and visit the target; it names the anti-bot vendor in one click. Or run wafw00f https://target.com from the command line. With Burp Suite MCP attached to Claude Code, one prompt traces the cookie lifecycle and recommends which step to use.
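One quick recon check is to look at the cookie names a site sets, since several vendors use recognisable ones. The following is a rough sketch under that assumption: the hint table is illustrative and incomplete, `guess_vendors` is a hypothetical helper, and a match is a hint rather than proof.

```python
# Cookie-name prefixes commonly associated with anti-bot vendors.
# Illustrative and incomplete; treat a match as a hint, not a confirmation.
VENDOR_COOKIE_HINTS = {
    "Akamai Bot Manager": ("_abck", "bm_sz"),
    "Cloudflare": ("__cf_bm", "cf_clearance"),
    "DataDome": ("datadome",),
    "HUMAN (PerimeterX)": ("_px",),
    "Imperva": ("incap_ses_", "visid_incap_"),
}

def guess_vendors(cookie_names: list[str]) -> list[str]:
    """Return vendors whose known cookie prefixes appear in cookie_names."""
    found = []
    for vendor, prefixes in VENDOR_COOKIE_HINTS.items():
        if any(name.startswith(p) for name in cookie_names for p in prefixes):
            found.append(vendor)
    return found

# Cookie names as you might copy them from DevTools > Application > Cookies.
print(guess_vendors(["_abck", "bm_sz", "datadome", "session_id"]))
```

The cookie list can come from any source, such as DevTools or the response headers of a single plain request.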

Step 1 — Mobile API. Run the app inside a rooted Android Studio emulator (AVD) and capture its traffic with HTTPToolkit. The mobile app often talks to a separate backend with a different configuration. For example, a retailer's mobile app may use a direct GraphQL endpoint that is served by a different backend than the web frontend's Akamai + DataDome stack.

Step 2 — XHR. Open Chrome DevTools → Network → Fetch/XHR. Many single-page apps load everything from one undocumented JSON endpoint you can request directly.
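Scanning the Network tab by eye gets tedious on busy pages. DevTools can export the captured traffic as a HAR file, and a few lines of Python can list the responses that returned JSON. This is a sketch, assuming the standard HAR layout (`log.entries[].request.url` and `response.content.mimeType`); the sample data and the file name in the comment are made up.

```python
import json

def json_endpoints(har: dict) -> list[str]:
    """Return unique request URLs whose response MIME type mentions JSON."""
    seen: list[str] = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        url = entry.get("request", {}).get("url", "")
        if "json" in mime.lower() and url and url not in seen:
            seen.append(url)
    return seen

# Tiny stand-in for a DevTools export.
# Real use (hypothetical file name): har = json.load(open("target.har"))
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://target.com/"},
   "response": {"content": {"mimeType": "text/html"}}},
  {"request": {"url": "https://target.com/api/products?page=1"},
   "response": {"content": {"mimeType": "application/json; charset=utf-8"}}},
  {"request": {"url": "https://target.com/graphql"},
   "response": {"content": {"mimeType": "application/json"}}}
]}}
""")

print(json_endpoints(sample_har))
```

Each URL it prints is a candidate endpoint to try requesting directly.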

Step 3 — JSON in HTML. Many sites ship their data right inside the page source. Next.js sites embed full state in __NEXT_DATA__; React SPAs often expose window.__INITIAL_STATE__. For example, some product pages ship 100KB+ of product data in __NEXT_DATA__, which can be read straight from the raw HTML without executing any JavaScript.
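For the window.__INITIAL_STATE__ case, the state is a JavaScript assignment rather than a dedicated script tag. When the assigned value happens to be valid JSON, the standard library can read it. The sketch below assumes exactly that; `extract_initial_state` is a hypothetical helper, the sample HTML is made up, and state written as a non-JSON JavaScript object would need a more tolerant parser such as chompjs.

```python
import json
import re
from typing import Any, Optional

def extract_initial_state(html: str, var: str = "window.__INITIAL_STATE__") -> Optional[Any]:
    """Parse the JSON value assigned to var, or return None if absent or not JSON."""
    m = re.search(re.escape(var) + r"\s*=\s*", html)
    if not m:
        return None
    try:
        # raw_decode parses one JSON value starting at m.end() and ignores the rest.
        obj, _end = json.JSONDecoder().raw_decode(html, m.end())
    except json.JSONDecodeError:
        return None
    return obj

sample_html = (
    "<html><body><script>"
    'window.__INITIAL_STATE__ = {"product": {"id": 123, "name": "Widget"}};'
    "</script></body></html>"
)
state = extract_initial_state(sample_html)
print(state["product"]["name"])
```

Using `raw_decode` avoids having to guess where the object ends with a regular expression, which tends to break on nested braces.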

Step 4 — HTTP + curl_cffi. Send plain HTTP requests with a TLS handshake (the encryption setup behind https) that matches a real browser via impersonate="chrome131", plus a residential proxy. This works for many targets where server-side scoring is light.

Step 5 — Patched browser. Run a real browser configured for a consistent fingerprint; options include Camoufox, CloakBrowser, and PatchRight. Each addresses a specific layer (canvas/WebGL, extension probes, or function-source inspection) that JS-level runtime patching cannot reach.

Step 6 — Managed API. Hand the problem to a paid service. This is common for sites with a custom JS VM such as F5 Shape, where a DIY approach is impractical. Once you are spending more than ~2 engineer-days/month on maintenance, the managed API is usually cheaper than the engineering time.
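Whether the ~2 engineer-days threshold holds for a given project depends on request volume. The sketch below computes the monthly volume at which a managed API bill equals the maintenance cost. The day rate and the API price are assumed figures for illustration, and `breakeven_requests` is a hypothetical helper.

```python
def breakeven_requests(engineer_days: float, usd_per_day: float, usd_per_1000: float) -> float:
    """Monthly request volume at which the managed API bill equals maintenance cost."""
    maintenance_usd = engineer_days * usd_per_day
    return maintenance_usd / usd_per_1000 * 1000

# Assumed figures, for illustration only.
volume = breakeven_requests(engineer_days=2, usd_per_day=600.0, usd_per_1000=1.0)
print(f"Break-even: {volume:,.0f} requests/month")
```

With these assumptions the managed API is the cheaper option below roughly 1.2 million requests per month; above that volume, the maintenance time may be the better deal.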

Cost progression — when to escalate

Step	Cost	Maintenance burden
1 — Mobile API	Free	Low (token refresh)
2 — XHR / GraphQL	Free	Low–medium
3 — JSON-in-HTML	Free	Low
4 — HTTP + curl_cffi	Proxy only (~$2–10/GB residential)	Medium (TLS profile rotation)
5 — Patched browser	Proxy + 200MB RAM/instance	Medium–high (per-target tuning)
6 — Managed API	$0.20–$3 per 1,000 requests	Zero

Code example

python

# Skeleton: try steps 3 and 4 before launching a browser.
import re

import chompjs
from curl_cffi import requests

URL = "https://target.com/product/123"
# Placeholder proxy URL: substitute your provider's real host, port, and credentials.
PROXY = "http://user:pass@residential:port"

# Step 4 transport: a Chrome-like TLS handshake through a residential proxy.
s = requests.Session(impersonate="chrome131")
r = s.get(URL, proxies={"https": PROXY}, timeout=30)
r.raise_for_status()  # stop early on an HTTP error status instead of parsing a block page

# Step 3: JSON-in-HTML. Often the entire dataset is here.
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', r.text, re.S)
if m:
    data = chompjs.parse_js_object(m.group(1))
    print("Step 3 succeeded.")
else:
    # Step 4 fallback: parse the HTML with selectolax/BeautifulSoup
    print("No embedded state; fall through to step 4 HTML scraping.")
# Only reach for steps 5–6 if 1–4 have all failed.

Frequently asked questions

When should I start at step 6 (managed API)?

Start there in three cases: the target uses a custom JS VM such as F5 Shape (which makes a DIY approach impractical); your team is small and scraping isn't your core product; or maintenance would cost more than ~2 engineer-days per month. For everything else, walking down the flow from step 1 is cheaper in the long run, even if your first scraper takes a day longer to build.

Is step 1 (mobile API) always available?

Most brands with a mobile app have a mobile API, but not all of them are softer than the web frontend. Some apps pin SSL certificates, which means you need Frida or objection to intercept their traffic. Others have aggressive root or emulator detection and may crash when run in an emulator. For the ~30% of targets where step 1 doesn't work, move on to step 2 or 3.

How do I know which step a target is on?

Step 0 recon tells you. Wappalyzer names the anti-bot vendor, and inspecting the cookies confirms it. Once you know the vendor, you know which steps are worth trying: a custom JS VM such as F5 Shape generally points to step 6, function-source inspection points to step 5 (PatchRight), DataDome is often handled at step 3 or 4, light Cloudflare at step 4, and a site with no anti-bot vendor at all works at step 4 with plain requests.
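The mapping in this answer can be written down as a lookup table. This is only a sketch of the rules of thumb stated above: the key names are hypothetical labels for recon findings, and defaulting to step 1 for anything unrecognised is an assumption, not a rule from the flow.

```python
# Recon finding -> step the answer above suggests is most likely to work.
LIKELY_STEP = {
    "custom_js_vm": 6,                # e.g. F5 Shape
    "function_source_inspection": 5,  # PatchRight
    "datadome": 3,                    # often handled at step 3 or 4
    "cloudflare_light": 4,
    "no_vendor": 4,                   # plain requests
}

def likely_step(finding: str) -> int:
    """Return the suggested step, or 1 (walk the whole flow) if the finding is unknown."""
    return LIKELY_STEP.get(finding, 1)

for finding in ("datadome", "custom_js_vm", "something_else"):
    print(finding, "->", likely_step(finding))
```

Steps 1 to 3 are still worth a quick look first when they are cheap to check, since the table only indicates where a target tends to end up.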

What about scraping legality across these steps?

Legality is a separate question from which technical step you use. Scraping public data through the mobile API carries the same legal posture as scraping it through the web. Anything behind a login is a different matter entirely — that's where you should be looking at Computer Use Agents with user consent, not scraping.

Last updated: 2026-05-31