How to Reverse-Engineer API Requests for Scraping

Reverse-engineering API requests for scraping means watching the network traffic a website makes, spotting the JSON endpoints that feed its visible UI, and calling those endpoints directly instead of scraping the rendered HTML. An API (Application Programming Interface) is the set of data requests a site supports, and the JSON it returns is already structured, so there is no HTML to parse. When a site exposes such endpoints, this path is usually much faster, cheaper, and more reliable than running a browser: you skip the JavaScript, get structured data, and sidestep much of the fingerprint-based blocking (where a site identifies and blocks automated clients by their technical traits).

Quick facts

Workflow: Open DevTools → Network → reproduce the action → filter for XHR/fetch
Look for: JSON responses, GraphQL queries, structured pagination cursors
Always copy: Full URL, all headers, body; replicate exactly first, simplify after
Watch for: CSRF tokens, signed query params, dynamic auth headers
When it fails: Encrypted bodies, attestation tokens, mobile-only endpoints

The basic workflow

Open DevTools (your browser's built-in developer panel, usually F12) and switch to the Network tab, then filter to Fetch/XHR — these are the background data requests the page makes. Now do the action you want to scrape: load a page, scroll, run a search. Scan the requests for ones that return structured JSON containing the data you want. Right-click that request and choose "Copy as cURL" (cURL is a command-line tool for making HTTP requests) — you now have a known-good copy. Paste it into a script, confirm it works, then remove headers one by one to find the minimum set the server actually needs.
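The last step, removing headers one by one, can be automated. The following is a minimal sketch, not a definitive tool: `minimize_headers` and `fake_server_accepts` are hypothetical names, and the fake server stands in for a real request so the example runs without network access.

```python
from typing import Callable, Dict


def minimize_headers(headers: Dict[str, str],
                     still_works: Callable[[Dict[str, str]], bool]) -> Dict[str, str]:
    """Drop headers one at a time, keeping a removal only if the request still succeeds."""
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if still_works(trial):
            required = trial  # the server did not need this header
    return required


# Stand-in for a real request (assumption): this fake server insists on two headers.
def fake_server_accepts(h: Dict[str, str]) -> bool:
    return "Cookie" in h and "X-Requested-With" in h


# Headers as they might look after "Copy as cURL" (illustrative values).
copied = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    "Cookie": "session=abc123",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://example.com/",
}

minimal = minimize_headers(copied, fake_server_accepts)
print(sorted(minimal))  # ['Cookie', 'X-Requested-With']
```

Against a real site, `still_works` would send the request with the trial headers and check both the status code and that the expected JSON keys are present, since some servers answer 200 with an empty or degraded payload when a header is missing.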

Handling auth and CSRF

Most internal APIs want proof of who you are: usually a session cookie (a token tying requests to your browsing session or login), a CSRF token from the initial page (a value the server issues, per session or per request, to show the request came from its own pages rather than a forged one), or an auth header. Session cookies: load the public page first, keep the cookie the server sets, and send it on later calls. CSRF tokens: pull the token out of the initial HTML (usually a meta tag or a hidden form input) and include it in later API calls. Bearer tokens: log in once through the normal flow, capture the token, and refresh it when it expires.
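Pulling the CSRF token out of the initial HTML is easier to get right with a parser than with a regular expression. This is a minimal sketch using only the standard library; the candidate names (`csrf`, `csrf-token`, `csrf_token`) are assumptions, so check the actual page source for the name your target site uses.

```python
from html.parser import HTMLParser
from typing import Optional, Tuple


class CsrfTokenFinder(HTMLParser):
    """Record the first CSRF token found in a meta tag or a hidden form input."""

    def __init__(self, names: Tuple[str, ...] = ("csrf", "csrf-token", "csrf_token")):
        super().__init__()
        self.names = names  # assumed attribute names; sites vary
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.token is not None:
            return  # keep the first match only
        attributes = dict(attrs)
        if attributes.get("name") not in self.names:
            return
        if tag == "meta":
            self.token = attributes.get("content")
        elif tag == "input" and attributes.get("type") == "hidden":
            self.token = attributes.get("value")


def find_csrf_token(html: str) -> Optional[str]:
    finder = CsrfTokenFinder()
    finder.feed(html)
    return finder.token


print(find_csrf_token('<head><meta name="csrf-token" content="abc123"></head>'))
```

The returned value would then go into whichever header or form field the site expects, which you can read off the request you copied from DevTools.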

When reverse-engineering fails

Some endpoints fight back. They might sign each request with an HMAC (a keyed hash that lets the server reject forged or altered requests) computed in deliberately obfuscated JavaScript; attach device-attestation tokens that only exist if you actually run the page's JS; or only serve the mobile app, locked down with certificate pinning (where the app refuses any HTTPS connection whose certificate is not one it was built to trust). In those cases the effort of reverse-engineering often outweighs the cost of rendering the page in a real browser, so fall back to that. Mobile API endpoints are their own category and usually need MITM (man-in-the-middle) proxy work, sitting between the app and the server to inspect traffic, using a tool like mitmproxy or Charles on a real device.
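To see why signed requests are hard to replay, it helps to look at what a signing scheme might do. The sketch below is a hypothetical scheme, not any particular site's: the secret, the message layout (method, path, timestamp, body joined by newlines), and the function names are all assumptions. A real site hides exactly these details in its obfuscated JavaScript.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: str, timestamp: int) -> str:
    """Hypothetical scheme: HMAC-SHA256 over method, path, timestamp and body."""
    message = "\n".join([method.upper(), path, str(timestamp), body]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def server_accepts(secret: bytes, method: str, path: str, body: str,
                   timestamp: int, signature: str) -> bool:
    """What the server side would do: recompute the signature and compare."""
    expected = sign_request(secret, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)


secret = b"key-hidden-in-obfuscated-js"  # assumption: illustrative value only
sig = sign_request(secret, "POST", "/api/v1/search", '{"q":"shoes"}', 1700000000)

# The original request verifies; changing the body invalidates the signature.
print(server_accepts(secret, "POST", "/api/v1/search", '{"q":"shoes"}', 1700000000, sig))
print(server_accepts(secret, "POST", "/api/v1/search", '{"q":"boots"}', 1700000000, sig))
```

Because the body and timestamp feed into the signature, a copied cURL command stops working as soon as you change a parameter or the timestamp ages out, which is the usual symptom that an endpoint is signed.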

Code example

python
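import re
import requests

# Reuse one session so cookies set by the first page carry over to the API call.
s = requests.Session()
home = s.get('https://example.com/', timeout=10)
home.raise_for_status()

# Assumes the token sits in a <meta name="csrf" content="..."> tag; adjust per site.
match = re.search(r'name="csrf" content="([^"]+)"', home.text)
if match is None:
    raise RuntimeError('CSRF token not found in page HTML')
csrf = match.group(1)

api = s.get('https://example.com/api/v1/products',
            params={'page': 1, 'limit': 50},
            headers={'X-CSRF-Token': csrf}, timeout=10)
api.raise_for_status()
data = api.json()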

Frequently asked questions

Is reverse-engineering APIs legal?

Calling an internal API that the public site already exposes is, technically, the same as making the request a browser would already make. The legal questions tend to be about what you do with the data rather than the act of fetching it. Stay clear of authenticated endpoints you are not authorized to use.

How do I know if a site uses GraphQL?

GraphQL is a query language in which a single endpoint typically serves every data type. Look for requests to one URL (often /graphql) with POST bodies that contain query and variables fields; that same endpoint answers every kind of request.
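The check described above can be sketched in a few lines. The helper names and the sample query (a `products` field with a `first` argument) are hypothetical; a real site's schema will differ, and the query text is best copied from the request you see in DevTools.

```python
import json
from typing import Any, Dict, Optional


def looks_like_graphql(method: str, body: str) -> bool:
    """Heuristic: a POST whose JSON body is an object with a string 'query' field."""
    if method.upper() != "POST":
        return False
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return False
    return isinstance(payload, dict) and isinstance(payload.get("query"), str)


def build_graphql_payload(query: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a query and its variables into the JSON body a GraphQL endpoint expects."""
    return json.dumps({"query": query, "variables": variables or {}})


# Hypothetical query; the field and argument names are assumptions.
query = "query Products($first: Int!) { products(first: $first) { id name } }"
body = build_graphql_payload(query, {"first": 50})

print(looks_like_graphql("POST", body))
print(looks_like_graphql("POST", '{"page": 1, "limit": 50}'))
```

The heuristic only requires the query field because variables is optional. The resulting body would be POSTed to the endpoint with a `Content-Type: application/json` header, alongside whatever cookies or tokens the original request carried.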

What if the API request body is encrypted?

Some sites encrypt the request body with a key generated by their page-side JavaScript. You can either reverse-engineer how that key is built (hours to days of JS work) or fall back to browser rendering — usually the latter is cheaper.

Last updated: 2026-05-31