Web Scraping APIs

The 6-Step Web Scraping Decision Flow


The web scraping decision flow is a six-step priority order that experienced practitioners follow on any new target. Walk the steps in order and stop at the first win, because each subsequent step adds engineering cost, infrastructure complexity, and per-request expense. Most production scraping is solved at steps 1–3 (mobile API, XHR endpoint, JSON embedded in HTML). The cardinal rule: never start at step 5. The mobile app frequently exposes the same data through a backend with little or no anti-bot protection.

Quick facts

Step 0: Identify the anti-bot vendor (Wappalyzer, wafw00f, Burp + MCP)
Step 1: Find the mobile API (HTTPToolkit on rooted AVD)
Step 2: Find the XHR / GraphQL endpoint (DevTools, Burp)
Step 3: Look for JSON embedded in HTML (__NEXT_DATA__, chompjs)
Step 4: HTTP scraping with curl_cffi + residential proxy
Step 5: Browser with C++ patches (Camoufox / CloakBrowser / PatchRight)
Step 6: Managed scraping API (Scrappey / Bright Data / Zyte)

Why the order matters

Each step is cheaper and less constrained than the one after it. Step 1 (mobile API) gives you JSON over a permissive HTTP endpoint at the cost of one afternoon learning HTTPToolkit. Step 2 (XHR) gives you JSON over an HTTP endpoint that may be protected. Step 3 (JSON-in-HTML) gives you the same data through a string parse of the page source, with no browser. Steps 4–6 cost progressively more infrastructure and budget.
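The "walk in order, stop at the first win" rule can be sketched as a small loop. This is an illustrative skeleton only: `walk_flow` and the lambda stand-ins are hypothetical names, and real attempts would call the mobile API, the XHR endpoint, and so on.

```python
from typing import Callable, Optional

# Each step is a (label, callable) pair; the callable returns data or None.
Step = tuple[str, Callable[[], Optional[dict]]]


def walk_flow(steps: list[Step]) -> tuple[Optional[str], Optional[dict]]:
    """Run steps in priority order and stop at the first one that yields data."""
    for label, attempt in steps:
        try:
            data = attempt()
        except Exception:
            # A failed step is not fatal; escalate to the next, costlier one.
            data = None
        if data:
            return label, data
    return None, None


# Hypothetical stand-ins for real step implementations.
steps = [
    ("1: mobile API", lambda: None),
    ("2: XHR / GraphQL", lambda: None),
    ("3: JSON in HTML", lambda: {"sku": "123"}),
    ("4: curl_cffi + proxy", lambda: {"sku": "123"}),
]
print(walk_flow(steps))
```

Because the loop returns as soon as a step yields data, the costlier steps later in the list are never invoked.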

The cost ladder is real. Step 4 needs residential proxies (~$2–10/GB). Step 5 needs a patched-browser binary, about 200 MB of RAM per instance, and proxies. Step 6 is billed per request on managed APIs ($0.20–$3 per 1,000 requests). Starting at step 5 when step 1 would have worked is a recurring waste of engineering time, but it is what scrapers do when they don't consciously walk the flow.

Step-by-step with confirmed bypasses

Step 0: Recon. Install Wappalyzer (Chrome extension) and visit the target; it identifies the anti-bot vendor in one click. Alternatively, run wafw00f https://target.com from the command line. With Burp Suite MCP attached to Claude Code, a single prompt can trace the cookie lifecycle and recommend the bypass step.
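Cookie names are a quick way to cross-check what Wappalyzer or wafw00f reports. The sketch below assumes the marker cookies commonly associated with each vendor; the list is not exhaustive, vendors can rename cookies, and `guess_vendors` is a hypothetical helper rather than part of any tool named above.

```python
# Marker cookies commonly associated with each vendor (assumption: not
# exhaustive, and subject to change by the vendor).
VENDOR_COOKIES = {
    "Akamai Bot Manager": ("_abck", "bm_sz", "ak_bmsc"),
    "DataDome": ("datadome",),
    "Cloudflare": ("cf_clearance", "__cf_bm"),
    "PerimeterX / HUMAN": ("_px3", "_pxvid", "_pxhd"),
    "Imperva": ("incap_ses_", "visid_incap_"),
}


def guess_vendors(cookie_names):
    """Return the sorted vendors whose marker cookies appear in cookie_names."""
    found = set()
    for name in cookie_names:
        for vendor, markers in VENDOR_COOKIES.items():
            # Prefix match: Imperva cookies carry a site-id suffix.
            if any(name.startswith(marker) for marker in markers):
                found.add(vendor)
    return sorted(found)


# Cookie names as copied from DevTools (Application → Cookies).
print(guess_vendors(["_abck", "bm_sz", "datadome", "session_id"]))
```

A target can match more than one vendor, as in the layered Akamai + DataDome stack mentioned under step 1.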

Step 1: Mobile API. Run the app on a rooted Android Studio AVD with HTTPToolkit intercepting its traffic. The mobile app often hits a separate backend with weaker bot protection. Confirmed in production: a direct GraphQL endpoint from a major retailer's mobile app bypassed the entire web-side Akamai + DataDome stack.

Step 2: XHR. Open Chrome DevTools → Network → Fetch/XHR and watch the requests the page makes. Many SPAs load all data from one undocumented JSON endpoint. Confirmed: a single GraphQL endpoint bypassed all of one retailer's HTML anti-bot.
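Once DevTools shows a GraphQL call, replaying it is mostly a matter of rebuilding the JSON body. The sketch below follows the usual GraphQL-over-HTTP shape (`operationName`, `query`, `variables`); the operation, its fields, and `ENDPOINT` are hypothetical, and real targets often require extra headers copied from the captured request.

```python
import json


def build_graphql_request(operation_name, query, variables):
    """Build headers and a JSON body for a GraphQL-over-HTTP POST."""
    body = {
        "operationName": operation_name,
        "query": query,
        "variables": variables,
    }
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return headers, json.dumps(body)


# Hypothetical operation; copy the real one from the captured request.
headers, payload = build_graphql_request(
    "ProductById",
    "query ProductById($id: ID!) { product(id: $id) { name price } }",
    {"id": "123"},
)
print(payload)

# To send it with the client used elsewhere on this page (ENDPOINT is a placeholder):
# from curl_cffi import requests
# r = requests.post(ENDPOINT, headers=headers, data=payload, impersonate="chrome131")
```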

Step 3: JSON in HTML. Next.js sites embed full state in __NEXT_DATA__. React SPAs often have window.__INITIAL_STATE__. Confirmed: Grainger.com ships 110KB of product data in __NEXT_DATA__, bypassing DataDome entirely because the data arrives in the initial HTML and no JavaScript has to execute.
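For the window.__INITIAL_STATE__ case, a minimal standard-library sketch is shown here. It assumes the assigned value is strict JSON; when the page uses a looser JavaScript object literal, a tolerant parser such as chompjs is the better fit. `extract_initial_state` and the sample HTML are illustrative, not taken from any real site.

```python
import json
import re


def extract_initial_state(html, marker="window.__INITIAL_STATE__"):
    """Return the object assigned to `marker` in an inline script, or None."""
    m = re.search(re.escape(marker) + r"\s*=\s*", html)
    if not m:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it,
        # so the trailing ";</script>" does not need to be trimmed first.
        obj, _end = json.JSONDecoder().raw_decode(html, m.end())
    except json.JSONDecodeError:
        return None  # not strict JSON; try a tolerant parser instead
    return obj


sample = '<script>window.__INITIAL_STATE__ = {"product": {"sku": "123", "price": 9.5}};</script>'
print(extract_initial_state(sample))
```

Using `raw_decode` avoids a fragile regex for the closing brace, which tends to break on nested objects.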

Step 4: HTTP + curl_cffi. Set impersonate="chrome131" and route requests through a residential proxy. This resolves roughly 60% of Akamai targets where sensor.js scoring is light, almost all medium-strength Cloudflare configurations, and most DataDome XHR endpoints.

Step 5: Patched browser. The options are Camoufox (reported 100% Cloudflare pass rate in March 2026 on Instagram, Reddit, X, LinkedIn), CloakBrowser (for Akamai's 60-extension probe), and PatchRight (for Kasada). Each addresses a specific layer that JS-level stealth cannot reach.

Step 6: Managed API. Reserve this for F5 Shape specifically, where the custom JS VM makes DIY impractical. Above ~2 engineer-days/month of bypass maintenance, the managed API is cheaper than the engineer.

Cost progression — when to escalate

Step                  | Cost                                | Maintenance burden
1: Mobile API         | Free                                | Low (token refresh)
2: XHR / GraphQL      | Free                                | Low–medium
3: JSON-in-HTML       | Free                                | Low
4: HTTP + curl_cffi   | Proxy only (~$2–10/GB residential)  | Medium (TLS profile rotation)
5: Patched browser    | Proxy + 200 MB RAM/instance         | Medium–high (per-target tuning)
6: Managed API        | $0.20–$3 per 1,000 requests         | Zero
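To compare step 4 against step 6 on the same footing, per-GB proxy pricing has to be converted into a per-1,000-requests figure. The sketch below does that arithmetic; the 200 KB average response size, the $600 engineer day rate, and the 500,000 requests per month are illustrative assumptions, not figures from this page.

```python
def proxy_cost_per_1000(price_per_gb, avg_response_kb):
    """Bandwidth cost of 1,000 requests through a per-GB metered proxy."""
    gb_per_1000 = avg_response_kb * 1000 / 1_000_000  # decimal GB
    return price_per_gb * gb_per_1000


def monthly_cost_diy(requests_per_month, price_per_gb, avg_response_kb,
                     engineer_days, day_rate):
    """Proxy bandwidth plus the engineer time spent maintaining the bypass."""
    proxy = proxy_cost_per_1000(price_per_gb, avg_response_kb) * requests_per_month / 1000
    return proxy + engineer_days * day_rate


def monthly_cost_managed(requests_per_month, price_per_1000):
    """Managed API: per-request pricing, no maintenance time."""
    return price_per_1000 * requests_per_month / 1000


# Assumed workload: 500k requests/month, 200 KB responses, $5/GB proxy,
# 2 engineer-days of maintenance at $600/day, managed API at $1.50 per 1,000.
diy = monthly_cost_diy(500_000, 5.0, 200, 2, 600)
managed = monthly_cost_managed(500_000, 1.5)
print(f"DIY: ${diy:,.0f}/month  managed: ${managed:,.0f}/month")
```

Under these assumptions the maintenance time, not the proxy bandwidth, dominates the DIY figure, which is why the escalation threshold is expressed in engineer-days.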

Code example

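```python
# Skeleton: try steps 3 and 4 before launching a browser.
import re

import chompjs
from curl_cffi import requests

URL = "https://target.com/product/123"
# Placeholder credentials and host: substitute your residential proxy details.
PROXY = "http://user:pass@residential:port"

# Impersonate Chrome's TLS and HTTP/2 fingerprint for every request in the session.
s = requests.Session(impersonate="chrome131")
r = s.get(URL, proxies={"http": PROXY, "https": PROXY}, timeout=30)
r.raise_for_status()  # an error status here usually means the request was blocked

# Step 3: JSON-in-HTML; often the entire dataset is here
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', r.text, re.S)
if m:
    # chompjs also tolerates non-strict JS object literals, not only JSON
    data = chompjs.parse_js_object(m.group(1))
    print("Step 3 succeeded.")
else:
    # Step 4 fallback: parse the HTML with selectolax/BeautifulSoup
    print("No embedded state; fall through to step 4 HTML scraping.")
# Only reach for steps 5–6 if 1–4 have all failed.
```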

Frequently asked questions

When should I start at step 6 (managed API)?

When the target uses F5 Shape (custom JS VM makes DIY impractical), when your team is small and the scraping is not your core product, or when bypass maintenance would exceed ~2 engineer-days per month. For everything else, walking from step 1 is cheaper in the long run even if the first scraper takes a day longer to build.

Is step 1 (mobile API) always available?

Most brands with a mobile app have a mobile API, but not all of them are softer than the web frontend. Apps that pin TLS certificates require Frida or objection to intercept traffic. Apps with heavy root or jailbreak detection may crash on emulators. For the ~30% of targets where step 1 does not work, move on to step 2 or 3.

How do I know which step a target is on?

Step 0 recon tells you. Wappalyzer identifies the vendor; cookie inspection confirms it. From the vendor you know which steps are viable: F5 Shape forces step 6, Kasada forces step 5 (PatchRight), DataDome usually yields to step 3 or 4, light Cloudflare yields to step 4, no anti-bot vendor at all yields to step 4 with plain requests.
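The vendor-to-step mapping in the answer above can be written down as a lookup table. This is a direct transcription of that sentence and nothing more: `viable_steps` is a hypothetical helper, vendors not named in the answer fall back to walking the whole flow, and `None` stands for "no anti-bot vendor detected".

```python
ALL_STEPS = (1, 2, 3, 4, 5, 6)

# Transcribed from the answer above; keys are lowercase vendor names.
VIABLE_STEPS = {
    "f5 shape": (6,),
    "kasada": (5,),       # PatchRight
    "datadome": (3, 4),
    "cloudflare": (4,),   # light configurations
    None: (4,),           # no vendor: plain requests are enough
}


def viable_steps(vendor):
    """Return the steps worth trying for a vendor; unknown vendors get the full flow."""
    key = vendor.strip().lower() if vendor else None
    return VIABLE_STEPS.get(key, ALL_STEPS)


print(viable_steps("Kasada"))
print(viable_steps("DataDome"))
print(viable_steps(None))
```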

What about scraping legality across these steps?

Legality is orthogonal to the technical step. Scraping public data via the mobile API has the same legal posture as scraping it via the web. Anything behind authentication is a different question entirely — that's where you should be looking at Computer Use Agents with user consent, not scraping.

Last updated: 2026-05-26