Web Scraping APIs

The 6-Step Web Scraping Decision Flow

By the Scrappey Research Team


The web scraping decision flow is a six-step checklist, preceded by a recon step (step 0) and ordered cheapest-first, that experienced engineers run through on every new target they are permitted to access. Try the steps in order and stop at the first one that works. Each further step costs more engineering effort, more infrastructure, and more money per request. In practice, most production scraping is handled by steps 1-3 (the mobile API, an XHR endpoint, or JSON already embedded in the HTML). The general principle is to start with the simplest option: the mobile app often exposes the same data through a more direct endpoint than the website does.
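The "try in order, stop at the first success" rule can be sketched as a small runner. This is a minimal illustration, not a reference implementation: the three step functions are hypothetical stand-ins that return canned values, and a real flow would plug in one callable per step.

```python
from typing import Any, Callable, Optional

# Each step is a (name, callable) pair; the callable returns data or None.
Step = tuple[str, Callable[[str], Optional[Any]]]


def run_decision_flow(url: str, steps: list[Step]) -> tuple[Optional[str], Optional[Any]]:
    """Try steps cheapest-first and stop at the first one that returns data."""
    for name, attempt in steps:
        try:
            data = attempt(url)
        except Exception as exc:  # a failed step simply means "escalate"
            print(f"{name} failed: {exc}")
            continue
        if data is not None:
            return name, data
    return None, None


# Hypothetical stand-ins for real step implementations.
def try_mobile_api(url: str) -> Optional[Any]:
    return None  # pretend no usable mobile endpoint was found


def try_xhr(url: str) -> Optional[Any]:
    raise RuntimeError("403 from endpoint")  # pretend the endpoint is guarded


def try_embedded_json(url: str) -> Optional[Any]:
    return {"title": "example"}  # pretend the page ships its state inline


STEPS: list[Step] = [
    ("step 1: mobile API", try_mobile_api),
    ("step 2: XHR", try_xhr),
    ("step 3: JSON in HTML", try_embedded_json),
]

winner, data = run_decision_flow("https://target.com/product/123", STEPS)
print(winner, data)
```

Keeping the steps in an ordered list makes the escalation policy explicit, so skipping straight to a browser becomes a visible decision rather than a habit.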

Quick facts

Step 0: Identify the anti-bot vendor (Wappalyzer, wafw00f, Burp + MCP)
Step 1: Find the mobile API (HTTPToolkit on rooted AVD)
Step 2: Find the XHR / GraphQL endpoint (DevTools, Burp)
Step 3: Look for JSON embedded in HTML (__NEXT_DATA__, chompjs)
Step 4: HTTP scraping with curl_cffi + residential proxy
Step 5: Browser with C++ patches (Camoufox / CloakBrowser / PatchRight)
Step 6: Managed scraping API (Scrappey / Bright Data / Zyte)

Why the order matters

The steps get harder as you go down, so the earlier you stop the easier your life is. Step 1 (mobile API) hands you clean JSON over a relaxed HTTP endpoint, and the only price is an afternoon learning HTTPToolkit (a tool that lets you watch an app's network traffic). Step 2 (XHR — the background requests a page makes to fetch data) also gives you JSON, but from an endpoint that might be guarded. Step 3 (JSON-in-HTML) is the same data as a plain string you parse, with no browser at all. Steps 4-6 each pile on more infrastructure and budget.

The cost ladder is real. Step 4 needs residential proxies (~$2–10/GB). Step 5 needs a patched-browser binary, roughly 200 MB of RAM per instance, and proxies. Step 6 is priced per request on managed APIs ($0.20–$3 per 1,000 requests). Starting at step 5 when step 1 would have worked is a recurring waste of engineering time, and it is common when teams don't consciously walk the flow.
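To compare per-GB proxy pricing with per-request managed pricing, it helps to convert both to dollars per 1,000 requests. The sketch below assumes an average response size of 150 KB, which is an illustrative figure and not from the text, and uses $3 and $10 per GB, both inside the proxy price range quoted above.

```python
def proxy_cost_per_1000(avg_response_kb: float, price_per_gb: float) -> float:
    """Residential proxy spend for 1,000 requests, in dollars."""
    gb_per_request = avg_response_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    return 1000 * gb_per_request * price_per_gb


AVG_RESPONSE_KB = 150  # assumption: measure this on the actual target

low = proxy_cost_per_1000(AVG_RESPONSE_KB, 3.0)
high = proxy_cost_per_1000(AVG_RESPONSE_KB, 10.0)
print(f"step 4 proxy cost: ${low:.2f} to ${high:.2f} per 1,000 requests")
print("step 6 managed API: $0.20 to $3.00 per 1,000 requests (from the text)")
```

Under these assumptions the proxy bill alone lands inside the managed-API price band, so the saving from step 4 depends heavily on how much data each request transfers.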

Step-by-step walkthrough

Step 0 — Recon. Before anything else, identify the stack. Install Wappalyzer (a browser extension) and visit the target; it typically names the anti-bot vendor in one click. Alternatively, run wafw00f https://target.com from the command line. With Burp Suite MCP attached to Claude Code, a single prompt can trace the cookie lifecycle and suggest which step to try first.
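Cookie names are a quick way to cross-check what a recon tool reports. The sketch below maps cookie-name prefixes to vendors. The prefix table is an assumption based on commonly observed cookie names rather than anything stated in the text, and vendors can rename cookies, so treat a match as a hint and not as proof.

```python
# Assumed mapping of commonly observed cookie-name prefixes to vendors.
VENDOR_COOKIE_PREFIXES = {
    "Cloudflare": ("cf_clearance", "__cf_bm"),
    "Akamai Bot Manager": ("_abck", "bm_sz"),
    "DataDome": ("datadome",),
    "PerimeterX / HUMAN": ("_px",),
    "Imperva (Incapsula)": ("incap_ses_", "visid_incap_"),
}


def guess_vendors(cookie_names: list[str]) -> list[str]:
    """Return the vendors whose known cookie prefixes appear in cookie_names."""
    found = set()
    for name in cookie_names:
        lowered = name.lower()
        for vendor, prefixes in VENDOR_COOKIE_PREFIXES.items():
            if lowered.startswith(prefixes):  # str.startswith accepts a tuple
                found.add(vendor)
    return sorted(found)


# Cookie names as they might appear in DevTools after one page load.
observed = ["_abck", "bm_sz", "datadome", "session_id"]
print(guess_vendors(observed))
```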

Step 1 — Mobile API. Run the app inside a rooted Android Studio emulator (AVD) and capture its traffic with HTTPToolkit. The mobile app often talks to a separate backend with a different configuration. For example, a retailer's mobile app may use a direct GraphQL endpoint that is served by a different backend than the web frontend's Akamai + DataDome stack.
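Once a request has been captured from the app, replaying it is mostly a matter of rebuilding the same POST. The sketch below uses only the standard library. The endpoint, query, and User-Agent are hypothetical placeholders that would be copied from the capture, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_graphql_request(endpoint: str, query: str, variables: dict,
                          headers: dict) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST mirroring a captured app request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    req = urllib.request.Request(endpoint, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    for key, value in headers.items():
        req.add_header(key, value)
    return req


req = build_graphql_request(
    "https://api.example.com/graphql",  # hypothetical endpoint from the capture
    "query Product($id: ID!) { product(id: $id) { name price } }",  # hypothetical query
    {"id": "123"},
    {"User-Agent": "ExampleApp/1.0 (Android 14)"},  # hypothetical app User-Agent
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch stays offline.
```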

Step 2 — XHR. Open Chrome DevTools → Network → Fetch/XHR. Many single-page apps load everything from one undocumented JSON endpoint you can request directly.
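On a busy page the Network tab can be noisy, so one option is to export the session as a HAR file and filter it offline. The sketch below assumes the standard HAR layout (log.entries with request.url and response.content.mimeType). The sample data is invented for illustration.

```python
def json_endpoints_from_har(har: dict) -> list[str]:
    """List unique request URLs whose responses were served as JSON."""
    urls: list[str] = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" in mime.lower():
            url = entry["request"]["url"]
            if url not in urls:  # keep first-seen order, drop repeats
                urls.append(url)
    return urls


# Invented sample; a real file would be loaded with json.load(open("capture.har")).
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://target.com/"},
             "response": {"content": {"mimeType": "text/html"}}},
            {"request": {"url": "https://target.com/api/products?page=1"},
             "response": {"content": {"mimeType": "application/json; charset=utf-8"}}},
            {"request": {"url": "https://target.com/api/products?page=1"},
             "response": {"content": {"mimeType": "application/json"}}},
        ]
    }
}
print(json_endpoints_from_har(sample_har))
```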

Step 3 — JSON in HTML. Many sites ship their data right inside the page source. Next.js sites embed full state in __NEXT_DATA__; React SPAs often expose window.__INITIAL_STATE__. For example, some product pages ship 100KB+ of product data in __NEXT_DATA__, which can be read directly from the response body without executing any JavaScript.
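For the window.__INITIAL_STATE__ case, the state can often be pulled out with the standard library alone. The sketch below assumes the assigned value is strict JSON. When it is a looser JavaScript object literal, a tolerant parser such as chompjs is the usual fallback. The sample HTML is invented.

```python
import json
import re
from typing import Any, Optional


def extract_initial_state(html: str,
                          var_name: str = "window.__INITIAL_STATE__") -> Optional[Any]:
    """Return the JSON value assigned to var_name in html, or None if absent."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if not match:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (here, the trailing ';</script>').
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None  # not strict JSON; try a tolerant parser instead
    return value


sample_html = (
    '<html><script>window.__INITIAL_STATE__ = '
    '{"product": {"id": 123, "price": 9.99}};</script></html>'
)
state = extract_initial_state(sample_html)
print(state)
```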

Step 4 — HTTP + curl_cffi. Send plain HTTP requests with a TLS handshake (the encryption setup behind https) that matches a real browser via impersonate="chrome131", plus a residential proxy. This works for many targets where server-side scoring is light.

Step 5 — Patched browser. A real browser patched to present a consistent fingerprint, such as Camoufox, CloakBrowser, or PatchRight. Each addresses a specific layer (canvas/WebGL, extension probes, or function-source inspection) that JS-level runtime patching cannot reach.

Step 6 — Managed API. Hand the problem to a paid service. This is common for sites with a custom JS VM such as F5 Shape, where a DIY approach is impractical. Once you are spending more than ~2 engineer-days/month on maintenance, the managed API is cheaper than the engineer.
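The ~2 engineer-days rule of thumb can be turned into a break-even request volume. The sketch below assumes a fully loaded cost of $600 per engineer-day, which is a hypothetical figure to replace with your own, and it ignores the proxy and infrastructure costs of the DIY route.

```python
def breakeven_requests(maintenance_days: float, day_rate: float,
                       price_per_1000: float) -> float:
    """Monthly request volume at which the managed-API bill equals maintenance cost."""
    monthly_maintenance = maintenance_days * day_rate
    return monthly_maintenance / price_per_1000 * 1000


DAY_RATE = 600.0  # assumption: dollars per engineer-day

at_high_price = breakeven_requests(2, DAY_RATE, 3.00)  # priciest managed tier in the text
at_low_price = breakeven_requests(2, DAY_RATE, 0.20)   # cheapest managed tier in the text
print(f"break-even at $3.00 per 1,000: {at_high_price:,.0f} requests/month")
print(f"break-even at $0.20 per 1,000: {at_low_price:,.0f} requests/month")
```

Below the break-even volume the managed API costs less than the maintenance time; above it, the DIY route starts to pay for itself under these assumptions.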

Cost progression — when to escalate

| Step | Cost | Maintenance burden |
|---|---|---|
| 1: Mobile API | Free | Low (token refresh) |
| 2: XHR / GraphQL | Free | Low–medium |
| 3: JSON-in-HTML | Free | Low |
| 4: HTTP + curl_cffi | Proxy only (~$2–10/GB residential) | Medium (TLS profile rotation) |
| 5: Patched browser | Proxy + 200 MB RAM/instance | Medium–high (per-target tuning) |
| 6: Managed API | $0.20–$3 per 1,000 requests | Zero |

Code example

```python
# Skeleton: try steps 3 and 4 before launching a browser.
import re

import chompjs
from curl_cffi import requests

URL = "https://target.com/product/123"
# Placeholder proxy URL: substitute real credentials, host, and a numeric port.
PROXY = "http://user:pass@residential:port"

# One session keeps the impersonated TLS profile and cookies across requests.
s = requests.Session(impersonate="chrome131")
r = s.get(URL, proxies={"https": PROXY}, timeout=30)

# Step 3: JSON-in-HTML; often the entire dataset is here.
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', r.text, re.S)
if m:
    data = chompjs.parse_js_object(m.group(1))
    print("Step 3 succeeded.")
else:
    # Step 4 fallback: parse the HTML with selectolax/BeautifulSoup.
    print("No embedded state; fall through to step 4 HTML scraping.")
# Only reach for steps 5–6 if 1–4 have all failed.
```


Frequently asked questions

When should I start at step 6 (managed API)?

Start there in three cases: the target uses a custom JS VM such as F5 Shape (which makes a DIY approach impractical); your team is small and scraping isn't your core product; or maintenance would cost more than ~2 engineer-days per month. For everything else, walking up from step 1 is cheaper in the long run, even if your first scraper takes a day longer to build.

Is step 1 (mobile API) always available?

Most brands with a mobile app have a mobile API, but not every mobile API is less protected than the web frontend. Some apps pin their TLS certificates, which means you need Frida or objection to intercept their traffic. Others have heavy root or jailbreak detection and may crash on emulators. For the ~30% of targets where step 1 doesn't work, move on to step 2 or 3.

How do I know which step a target needs?

Step 0 recon tells you. Wappalyzer names the anti-bot vendor, and inspecting the cookies confirms it. Once you know the vendor, you know which steps are worth trying: a custom JS VM such as F5 Shape generally points to step 6, function-source inspection points to step 5 (PatchRight), DataDome is often handled at step 3 or 4, light Cloudflare at step 4, and a site with no anti-bot vendor at all can usually be handled at step 4 with plain requests.
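The vendor-to-step guidance in this answer can be kept as a small lookup table so the recon result feeds directly into the flow. The mapping below only restates the answer above. The key names and the fallback of walking the whole flow for an unknown vendor are assumptions of this sketch.

```python
# Steps worth trying per recon finding, restating the guidance above.
STEPS_BY_FINDING = {
    "f5 shape": [6],                    # custom JS VM: DIY is impractical
    "function-source inspection": [5],  # handled by PatchRight
    "datadome": [3, 4],
    "cloudflare (light)": [4],
    "none": [4],                        # plain requests are enough
}

FULL_FLOW = [1, 2, 3, 4, 5, 6]


def suggest_steps(finding: str) -> list[int]:
    """Return the steps worth trying for a recon finding; unknown means walk the flow."""
    return STEPS_BY_FINDING.get(finding.strip().lower(), FULL_FLOW)


print(suggest_steps("DataDome"))
print(suggest_steps("some unknown vendor"))
```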

What about scraping legality across these steps?

Legality is a separate question from which technical step you use. Scraping public data through the mobile API carries the same legal posture as scraping it through the web. Anything behind a login is a different matter entirely: there you should be looking at Computer Use Agents acting with user consent, not scraping.

Last updated: 2026-05-31
