# Web Scraping APIs
Core concepts behind modern web scraping APIs — what they do, how they handle hard sites, and where they fit in a data pipeline.
## What Is a CAPTCHA Solver?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-captcha-solver
**A CAPTCHA solver is software that automatically completes CAPTCHA challenges for an automated client.** A CAPTCHA is the "prove you're human" test a site shows you — clicking pictures of traffic lights, or a hidden background check. The solver takes that challenge from a site, works it out using AI models, browser automation, or real people paid to solve them, and hands back a token — a pass-code the site accepts as proof of being human. That lets a scraper, bot, or test script complete the challenge without anyone clicking anything by hand.
### Quick facts
- **Also known as:** CAPTCHA automation, automated CAPTCHA handling, anti-CAPTCHA
- **Common types solved:** reCAPTCHA v2/v3, hCaptcha, Cloudflare Turnstile, FunCaptcha, image CAPTCHAs
- **Primary use case:** Keeping scrapers, automated tests, and account workflows running
- **Typical pricing:** $1–$3 per 1,000 solves (machine), $1–$2 per 1,000 (human)
- **Risk level:** Medium — must respect site terms; widely used for public-data scraping and QA
### How CAPTCHA solvers work
Most solvers work in three steps. First, the scraper spots a CAPTCHA on the page (or knows to expect one) and reads off the details the challenge needs: the site key (the public ID that ties the challenge to that website), the page URL, and which type of CAPTCHA it is. Second, it sends those details to a solving backend. This can be an in-house AI model trained on millions of past challenges, a network of low-cost human workers, or a hybrid that passes the hard ones to humans. Third, the backend returns a token (a long, meaningless-looking string) that the scraper pastes into the page's form or attaches to its next request. The target site checks that token with its CAPTCHA provider, sees a passing result, and lets the request through. For invisible CAPTCHAs like reCAPTCHA v3 or Turnstile, which judge you silently instead of asking you to click anything, the solver often runs the challenge inside a real browser with a realistic fingerprint (the unique profile of signals a browser gives off), so the token carries trusted behavioral and TLS signals. TLS is the encryption layer behind https, and its handshake leaves a fingerprint of its own.
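As a rough sketch of those three steps, the Python below separates detection, solving, and token submission. The names `CaptchaJob`, `detect_captcha`, `solve`, and `attach_token` are hypothetical, the widget table covers only three common widgets, and the backend is a caller-supplied function standing in for whichever solving service is used.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CaptchaJob:
    site_key: str  # public ID that ties the challenge to the website
    page_url: str  # page the challenge appeared on
    kind: str      # which type of CAPTCHA it is


# Illustrative subset: widget CSS class -> (kind, form field that carries the token).
WIDGETS = {
    "g-recaptcha": ("recaptcha_v2", "g-recaptcha-response"),
    "h-captcha": ("hcaptcha", "h-captcha-response"),
    "cf-turnstile": ("turnstile", "cf-turnstile-response"),
}


def detect_captcha(html: str, page_url: str) -> Optional[CaptchaJob]:
    """Step 1: find a widget tag and read its data-sitekey attribute."""
    for tag in re.findall(r"<[^>]+>", html):
        key = re.search(r'data-sitekey="([^"]+)"', tag)
        if key is None:
            continue
        for css_class, (kind, _field) in WIDGETS.items():
            if css_class in tag:
                return CaptchaJob(key.group(1), page_url, kind)
    return None


def solve(job: CaptchaJob, backend: Callable[[CaptchaJob], str]) -> str:
    """Step 2: hand the job to a solving backend (assumed: any callable returning a token)."""
    return backend(job)


def attach_token(form: dict, job: CaptchaJob, token: str) -> dict:
    """Step 3: add the token to the form data sent with the next request."""
    field = next(f for kind, f in WIDGETS.values() if kind == job.kind)
    return {**form, field: token}
```

Keeping the backend as a plain callable means the same flow works whether the token comes from an HTTP solving service, a browser plugin, or a test stub.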
### Why CAPTCHA solvers matter for web scraping
CAPTCHAs are the most visible layer of bot defense, and any non-trivial scraping project will run into them. Without a solver, one CAPTCHA-protected page can stall a job forever. With one, the scraper completes the challenge automatically and keeps going. Solvers also matter because they let you scale: solving 50,000 challenges by hand is not a workflow, but solving them at $2 per thousand is just a line item on a bill. The catch is that solvers are not a magic fix — they handle the challenge itself, but if your IP, headers, or TLS fingerprint still look automated, the site will simply throw another challenge at you a few requests later. A solver is one part of a working scraping setup, not the whole thing.
### Common implementations
Solvers come in three common shapes. Pure-API services (2Captcha, Anti-Captcha, CapSolver) take a job over HTTP and return a token; you wire them into your own code. Browser-automation libraries (Playwright/Puppeteer plugins — tools that drive a real browser from code) inject the solver into a live browser session and click through challenges for you. Full scraping APIs like Scrappey fold the solver into the same request that fetches the page — you send a URL, and the API handles proxies, JS rendering, fingerprinting, and CAPTCHAs in one call, returning the finished HTML or JSON. Most production scrapers end up using either the third option or a mix of the first two.
### Limitations and alternatives
Solvers cost real money per challenge, so a poorly-built scraper that trips a CAPTCHA on every request gets expensive fast. They also add delay — solving a Turnstile challenge can take 8–20 seconds. The best first move is to reduce how often a CAPTCHA appears at all: use quality residential proxies, a coherent browser fingerprint, a moderate request rate, and reused session cookies so repeated requests share one consistent session rather than appearing as many strangers. When you do hit a CAPTCHA, fall back to the solver. For sites that gate every single request behind one, switching to an official API (if the site offers one) or a managed scraping endpoint is almost always cheaper than solving thousands of challenges an hour.
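To make the cost and delay trade-off concrete, here is a small back-of-the-envelope sketch. The function name `captcha_overhead` and the input figures are illustrative assumptions; the $2 per 1,000 price and the 10-second solve time are simply values inside the ranges quoted above.

```python
def captcha_overhead(requests_count: int, captcha_rate: float,
                     price_per_thousand: float, seconds_per_solve: float) -> dict:
    """Estimate solver spend and added wait time for a scraping job.

    captcha_rate is the fraction of requests that trigger a challenge (0.0 to 1.0).
    """
    solves = requests_count * captcha_rate
    return {
        "solves": solves,
        "cost_usd": solves / 1000 * price_per_thousand,
        "solver_seconds": solves * seconds_per_solve,
    }


# A scraper that trips a CAPTCHA on every request vs. one that trips it on 2% of them.
careless = captcha_overhead(100_000, 1.0, 2.0, 10.0)
careful = captcha_overhead(100_000, 0.02, 2.0, 10.0)
print(careless["cost_usd"], careful["cost_usd"])  # roughly 200 vs. 4 dollars
```

The solver time is a total across all challenges; how much of it shows up as wall-clock delay depends on how many requests run in parallel.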
### Example
```python
import requests

resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={
        'cmd': 'request.get',
        'url': 'https://example.com/protected',
        'autoparse': True,
    },
)
# CAPTCHA + proxy + fingerprinting handled server-side
html = resp.json()['solution']['response']
```
### FAQ
**Q: Are CAPTCHA solvers legal?**
Using a solver on public data, your own accounts, or for QA testing is generally legal in most places. Using one against a login you don't own, to break a site's terms of service in a way that's contractually enforceable, or to commit fraud is not. The tool itself is neutral; what matters is what you do with it.
**Q: How accurate are CAPTCHA solvers?**
For image CAPTCHAs and reCAPTCHA v2, solve rates from quality providers sit in the 90–99% range. Turnstile and reCAPTCHA v3 are harder because they score your behavior, not just whether you got the puzzle right — so accuracy depends as much on the surrounding fingerprint as on the solver itself.
**Q: How much does CAPTCHA solving cost?**
Machine solvers typically charge $1–$3 per 1,000 solves. Human solvers cost about the same but are slower. Integrated scraping APIs bundle the cost into their per-request price, which is usually cheaper than solving at scale yourself.
**Q: Can sites detect that a CAPTCHA solver was used?**
Not directly. The token a solver returns looks identical to one a human would produce. But sites can spot the context around it: an IP with no browsing history, a TLS fingerprint that doesn't match the browser the request claims to be, or a suspiciously perfect 200ms response time are all stronger giveaways than the token itself.
---
## What Is Web Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-web-scraping
**Web scraping is the automated extraction of structured data from websites.** Instead of a person copying and pasting, a program (a "scraper") visits a web page, reads the page's code, and pulls out the specific pieces you want — prices, titles, ratings, addresses — then saves them somewhere useful like a database or spreadsheet. Under the hood, the scraper sends an HTTP request to a URL, parses the HTML or JSON that comes back, and extracts those fields into a downstream pipeline. It is how price monitors, search engines, and AI training datasets collect information from the open web at scale.
### Quick facts
- **Also known as:** Web harvesting, web data extraction, screen scraping
- **Common languages:** Python, JavaScript/Node, Go
- **Primary use cases:** Price monitoring, lead generation, SEO research, AI training data
- **Common blockers:** Rate limiting, CAPTCHAs, IP bans, JS-rendered content
### How web scraping works
Every scraper has three stages: fetch, parse, store. **Fetch** sends an HTTP request to a URL and receives the response — usually HTML, sometimes JSON returned by a site's internal API (a hidden data endpoint the page itself calls). **Parse** picks out the fields you care about using locators like CSS selectors, XPath, or regex (pattern-matching for text). When a page builds itself with JavaScript after loading, the parser first runs that code inside a headless browser — a real browser with no visible window. **Store** writes the cleaned data to a destination: a CSV file, a Postgres table, an S3 bucket, or directly into an application. Each stage has its own way of breaking — fetch fails when you get blocked, parse fails when the site changes its layout, store fails on duplicate records — so in practice most of a production scraper is the code that handles those failures gracefully.
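The sketch below shows one way the three stages and their failure modes might fit together, using only the standard library. It is a minimal illustration under assumptions: the `<li data-sku="...">` markup, the `products` table, and the `LayoutChanged` exception are hypothetical, and the fetch stage is a caller-supplied function so any HTTP client can be plugged in.

```python
import re
import sqlite3
from typing import Callable


class LayoutChanged(Exception):
    """Parse-stage failure: the expected markup was not found."""


def parse(html: str) -> list:
    # Hypothetical markup: <li data-sku="A1">Widget</li>
    rows = re.findall(r'<li data-sku="([^"]+)">([^<]+)</li>', html)
    if not rows:
        raise LayoutChanged("no product rows matched; the selectors may be stale")
    return rows


def store(conn: sqlite3.Connection, rows: list) -> int:
    """Insert rows, skipping duplicates; return how many were actually new."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (sku TEXT PRIMARY KEY, name TEXT)")
    before = conn.total_changes
    # INSERT OR IGNORE turns a duplicate-key error into a silent skip.
    conn.executemany("INSERT OR IGNORE INTO products (sku, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn.total_changes - before


def run_once(fetch: Callable[[str], str], url: str, conn: sqlite3.Connection) -> int:
    html = fetch(url)                 # fetch: may raise when blocked or timed out
    return store(conn, parse(html))   # parse: may raise LayoutChanged
```

Raising a dedicated exception from `parse` keeps "the site changed" distinguishable from "the site blocked us", which usually call for different responses.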
### What web scraping is used for
Most scraping is commercial. E-commerce sites track competitor prices, travel aggregators pull flight and hotel availability, recruiters build candidate lists from public profiles, and SEO teams audit search results (SERPs) and backlinks. Research and AI uses are growing fast: large language models are trained on scraped web crawls, and academics use scrapers to study everything from misinformation to housing markets. Companies also scrape their own public sites for QA, monitoring, and content audits. The common thread is that the data is already visible to anyone with a browser — but collecting it at scale takes automation rather than manual effort.
### Common tools and approaches
For small jobs, the defaults are still Python's requests + BeautifulSoup or Node's axios + cheerio — lightweight libraries that fetch a page and pick fields out of the HTML. For dynamic sites that need JavaScript to run, Playwright and Puppeteer drive real browsers. For large crawls, Scrapy adds queues, retries, and pipelines on top. The next step up — once you're fighting Cloudflare, rotating thousands of proxies, or solving CAPTCHAs — is a managed scraping API like Scrappey, which runs that infrastructure for you so you only write the parsing logic. The right choice depends on volume, how hard the site is to access, and how much of your time you want to spend on anti-bot defense rather than on the data itself.
### Legal and ethical considerations
Scraping public data is generally legal in the US and across most of Europe, but with caveats. In the US, the Ninth Circuit's rulings in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the CFAA (a US computer-access law), although other claims such as breach of contract can still apply. You should respect robots.txt (the file where a site states which paths bots may visit) where possible; avoid scraping personal data without a lawful basis under GDPR (the EU's data-protection law); don't bypass technical access controls in ways that could trigger the CFAA or its equivalents; and don't republish copyrighted content as your own. Rate-limit yourself (space out your requests) so you don't slow down the target site. When in doubt, especially for logged-in pages, paywalled content, or personal data, get a lawyer's opinion before shipping.
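For the robots.txt and rate-limiting points, Python's standard library already includes a parser. The sketch below is one possible shape: the helper names `build_robots`, `allowed_urls`, and `polite_delay` are made up for illustration, and the one-second default delay is an arbitrary assumption.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs this user agent may visit."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if set, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A crawl loop would then call `time.sleep(polite_delay(...))` between requests and skip anything `allowed_urls` filters out.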
### Example
```python
import requests
from bs4 import BeautifulSoup

# Fetch the page
resp = requests.get('https://example.com/products')
resp.raise_for_status()

# Parse the HTML and pull structured data out of it
soup = BeautifulSoup(resp.text, 'html.parser')
for card in soup.select('.product-card'):
    name = card.select_one('.title').get_text(strip=True)
    price = card.select_one('.price').get_text(strip=True)
    print(name, price)
```
### FAQ
**Q: Is web scraping legal?**
Scraping data that's publicly accessible — without logging in or defeating authentication — is legal in most jurisdictions. The real legal risk concentrates around scraping personal data, copyrighted content, or sites that forbid it in enforceable terms. Treat robots.txt as the floor of what to consider, not the ceiling.
**Q: What's the difference between web scraping and an API?**
An official API is a contract: the site deliberately exposes specific data endpoints in a documented format. Scraping instead reads the same data out of the HTML the site renders for human visitors. APIs are more stable and more polite to use, but most sites don't offer one — so scraping fills the gap.
**Q: Do I need to know how to code to scrape websites?**
For a one-off job, no — no-code tools like Octoparse or browser extensions can work. For anything that runs repeatedly, depends on JavaScript, or runs at scale, you'll need Python or JavaScript. Most production scraping is written in code.
**Q: What blocks most scrapers?**
In order: IP-based rate limiting (too many requests from one address), CAPTCHAs and bot challenges (especially Cloudflare and DataDome), browser fingerprinting (sites identifying you from subtle browser traits), and layout changes that break your parser. The first three are infrastructure problems; the last is a maintenance problem.
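For the first blocker, rate limiting, a common mitigation is to retry with exponential backoff rather than hammering the site. The sketch below assumes a caller-supplied `fetch` function that raises a hypothetical `Blocked` exception when the site answers with a rate-limit response; the attempt count and delays are arbitrary starting points.

```python
import random
import time
from typing import Callable


class Blocked(Exception):
    """Raised by the caller-supplied fetch when the site rate-limits or blocks the request."""


def fetch_with_backoff(fetch: Callable[[str], str], url: str, max_attempts: int = 4,
                       base_delay: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry a blocked fetch, doubling the wait each time and adding random jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Blocked:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # 1s, 2s, 4s, ... plus jitter so parallel workers do not retry in lockstep
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is only there to make the function easy to test; production code can leave the default.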
---
## What Is a Web Scraping API?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-web-scraping-api
**A web scraping API is a hosted HTTP service that visits a web page for you and hands back the result — rendered HTML, JSON, or already-parsed data.** Normally, scraping a protected site means running your own browsers, a pool of proxy IPs, and CAPTCHA solvers. A scraping API does all of that for you on its own servers: you send it a URL, it handles the JavaScript rendering, rotates IP addresses, fakes a realistic browser fingerprint, and gets past anti-bot defenses — then returns a clean response from a single request.
### Quick facts
- **Also known as:** Scraping API, scraper API, scraping-as-a-service
- **Typical features:** Proxy rotation, JS rendering, geo-targeting, session reuse
- **Pricing model:** Per request or per credit, often tiered by difficulty
- **Common examples:** Scrappey, ScrapingBee, Bright Data, ScraperAPI, ZenRows
### How a web scraping API works
On your side it is just one POST request: a JSON body holding the target URL, optionally the HTTP method, headers, and flags that say how you want the page rendered, plus an API key sent in an auth header or, with some providers, as a query parameter. On the server side, the API does the hard part. It picks a proxy IP from its pool to match the country and difficulty you asked for, starts (or reuses) a real browser with a fresh fingerprint (the mix of details a site uses to recognize repeat visitors), opens the URL, runs whatever JavaScript is needed to load the content, quietly solves any CAPTCHAs that pop up, and waits for the page to finish loading. Finally it packages up the result, usually the rendered HTML, or JSON if you asked it to auto-parse, and sends it back. The full round trip takes a few seconds for easy sites and 10–30 seconds for heavily protected ones.
### Why use a scraping API instead of building your own
Building your own scraping stack means running Playwright (a browser-automation tool) at scale, maintaining a pool of proxy IPs spread across dozens of networks, keeping your browser fingerprints up to date every time Chrome changes, wiring in CAPTCHA solvers, and writing all the retry logic that holds it together. That is a full-time platform team. A scraping API folds all of that into a simple per-request price. Below a few hundred thousand requests a month, the API is usually the cheaper option. Above that, building in-house can win, but only if you have the engineers and the patience to keep it running as anti-bot vendors keep shipping updates.
### What to look for in a scraping API
Three things matter more than a long feature list. First, success rate on hard sites: ask for it broken out per defense — Cloudflare, DataDome, PerimeterX — since an overall average hides the cases you care about. Second, geographic coverage: if you need residential IPs (home-user addresses, harder for sites to block) in Brazil or Vietnam, confirm they actually have them — many providers only have strong US and EU pools. Third, session and cookie support: if your workflow has to log in or carry state from one request to the next, the API must offer sticky sessions, not just one-off calls. Pricing transparency comes next — credit systems vary wildly, and "$0.001 per request" often means "per simple request, multiply by 25x for the hard ones you actually need."
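One way to compare providers on equal terms is to convert the headline price into a cost per successful request. The sketch below is illustrative only: `cost_per_success` is a made-up helper, and it assumes failed requests are billed, which varies by provider.

```python
def cost_per_success(base_price: float, difficulty_multiplier: float, success_rate: float) -> float:
    """Effective price of one successful request.

    base_price: advertised price per simple request, in dollars
    difficulty_multiplier: credit multiplier the provider applies to hard sites
    success_rate: fraction of requests that succeed on the target (0 < rate <= 1)
    Assumes failed requests are still charged.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return base_price * difficulty_multiplier / success_rate


# "$0.001 per request" with a 25x multiplier and an 80% success rate
print(round(cost_per_success(0.001, 25, 0.8), 5))  # about 0.03125 dollars per success
```

If the provider only bills successful requests, set `success_rate` to 1 and the formula reduces to price times multiplier.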
### When a scraping API is the wrong tool
If the target site has an official API for the data you want, use that instead — it is more stable, cheaper, and more polite. If you are scraping one small site at low volume, plain requests plus BeautifulSoup is fine. If your real bottleneck is parsing the data rather than getting to it, a scraping API will not help. And if you are handling logged-in personal data at scale, the legal questions matter more than the technical ones — an API does not change that.
### Example
```python
import requests

# One call: the API handles proxies, browser fingerprinting,
# JavaScript rendering and anti-bot challenges server-side.
resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={
        'cmd': 'request.get',
        'url': 'https://example.com/protected',
        'autoparse': True,
    },
)
html = resp.json()['solution']['response']
```
### FAQ
**Q: How much does a web scraping API cost?**
Entry plans start around $30–$50 a month for 50k–100k simple requests. Hard sites — Cloudflare, DataDome, or anything that needs residential IPs — cost 5–25x more per request. At high volume, expect $0.001–$0.01 per successful request, depending on how protected the target is.
**Q: Do scraping APIs run JavaScript?**
Yes. Every serious provider offers a JavaScript-rendering mode that loads the page in a real headless browser (a full browser running with no visible window). It is slower and more expensive than fetching the raw HTML, so most APIs let you turn it on per request only when you need it.
**Q: Can I use a scraping API with sessions and logins?**
With most providers, yes, with some caveats. Look for sticky sessions (the same IP reused across several requests for a set time) and cookie passthrough (the API carries your cookies between calls). You still have to script the login steps yourself; the API just keeps you on the same identity afterward.
**Q: How is a scraping API different from a proxy provider?**
A proxy provider just sells you IP addresses — you still build everything else. A scraping API sells you a finished request: the proxies are bundled in along with page rendering and session handling. You pay more per request but ship months sooner.
---
## What Is a Headless Browser?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-headless-browser
**A headless browser is a real web browser — Chrome, Firefox, or WebKit — that runs without a visible window, driven entirely by code instead of by a person clicking.** It still loads pages, runs JavaScript, applies CSS, and fires events exactly like the browser on your screen — it just doesn't draw anything for a human to look at. Instead, it exposes what it's doing through an automation API your program can call. Scrapers, automated tests, and screenshot services all rely on headless browsers to work with sites that need JavaScript to build their content.
### Quick facts
- **Common implementations:** Headless Chrome, Headless Firefox, WebKit (Safari engine)
- **Automation libraries:** Playwright, Puppeteer, Selenium
- **Primary use cases:** Scraping JS-heavy sites, automated testing, PDF/screenshot generation
- **Tradeoff vs. HTTP:** 5–50x slower and more memory-hungry, but renders pages a request library can't
### How headless browsers work
A headless browser is literally the same program as the normal one — you just start it differently. Chromium has a `--headless` flag and Firefox has `-headless`. When it starts, the browser opens a control channel (a debugging protocol — Chrome DevTools Protocol for Chromium, or WebDriver BiDi, the newer standard that works across browsers) and listens on a local port, like a phone line waiting for instructions. Your automation library dials that port and sends commands: go to this URL, wait until this element appears, click this button, run this JavaScript, hand back the HTML. The browser carries them out using its full rendering pipeline — the same network stack, the same JavaScript engine (V8 in Chrome, SpiderMonkey in Firefox), the same DOM (the in-memory tree of the page). The only thing missing is the window on screen.
### Why scrapers need headless browsers
Many modern sites no longer send you a finished HTML page. They send a near-empty shell plus a JavaScript bundle that then fetches the real data from internal APIs and builds the page in your browser. This is how client-rendered apps built with React, Vue, or Angular typically work (frameworks such as Next.js can also render on the server, in which case the HTML arrives complete). On a client-rendered site, a plain HTTP request hands you the empty shell, not the content you actually want. A headless browser runs that JavaScript, waits for the API calls to come back, and gives you the finished DOM. It also handles the other moving parts of a real visit, such as cookies, localStorage (small data the site saves in the browser), redirects, form submissions, and WebSocket connections (live two-way links), all things a simple request library either can't do or can't fake convincingly.
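A quick way to tell whether a plain HTTP response is one of these empty shells is to measure how much visible text it contains. The heuristic below is a rough sketch using the standard library; `looks_like_js_shell` and the 200-character threshold are assumptions to tune per site, not an established rule.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0   # > 0 while inside <script> or <style>
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(html: str, min_visible_chars: int = 200) -> bool:
    """True when the page ships scripts but almost no visible text."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

If this returns `True` for a raw response, that is a hint to look for an internal JSON endpoint or to render the page in a headless browser.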
### Headless browser detection
Anti-bot vendors actively hunt for headless browsers, because the default headless mode gives itself away. Chrome's original headless mode leaks several tells: the `navigator.webdriver` property reads `true` (a flag set whenever a browser is being automated), the User-Agent string contains "HeadlessChrome", the window reports a size of zero, browser plugins are missing, and the WebGL renderer (the graphics-card name the browser reports) comes back generic instead of real-looking. Some libraries (puppeteer-extra-plugin-stealth, playwright-stealth) adjust these default values so an automated browser presents a configuration closer to a normal one. Detection vendors respond by tracking new signals, so the set of tells keeps changing. Chrome's newer `--headless=new` mode closes most of the old differences and is commonly used today. For some sites, a real headful browser with display virtualization (a fake on-screen display so the browser behaves as if a monitor is attached) produces more consistent behavior, which is what scraping APIs run behind the scenes.
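From the site's point of view, those tells can be checked with a few comparisons once a script has collected them. The sketch below is a simplified illustration of that idea, not any vendor's actual logic: the dictionary keys are hypothetical field names, and the generic-renderer markers are an assumed, non-exhaustive list.

```python
# Assumed markers of a software (non-GPU) WebGL renderer; illustrative only.
GENERIC_RENDERERS = ("swiftshader", "llvmpipe")


def headless_tells(signals: dict) -> list:
    """Return the default-headless tells present in a set of collected browser signals."""
    tells = []
    if signals.get("webdriver") is True:
        tells.append("navigator.webdriver is true")
    if "HeadlessChrome" in signals.get("user_agent", ""):
        tells.append("User-Agent contains HeadlessChrome")
    if signals.get("outer_width", 0) == 0 or signals.get("outer_height", 0) == 0:
        tells.append("window reports no size")
    if signals.get("plugins_count", 0) == 0:
        tells.append("no browser plugins")
    renderer = signals.get("webgl_renderer", "").lower()
    if any(marker in renderer for marker in GENERIC_RENDERERS):
        tells.append("generic WebGL renderer")
    return tells
```

Real scoring engines weigh many more signals and rarely block on a single one; the point here is only that each default value is cheap to test.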
### When not to use a headless browser
Headless browsers are expensive to run — hundreds of MB of RAM each, and seconds per page load. If the target site has an obvious internal API endpoint (open DevTools, go to the Network tab, and look for XHR/fetch calls that return JSON), just call that directly with a plain HTTP request. If the site renders its pages on the server and ships complete HTML, a request library plus an HTML parser will be 10–50x faster. Save the headless browser for when the page truly needs JavaScript to render, when you need to interact with elements like buttons or forms, or when the site fingerprints you before it will hand over anything useful.
### Example
```python
from playwright.sync_api import sync_playwright

# Launch Chromium with no visible UI, render a JS-heavy page,
# then read the DOM after scripts have run.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/dashboard')
    page.wait_for_selector('.loaded')
    html = page.content()
    browser.close()
```
### FAQ
**Q: Is headless Chrome the same as regular Chrome?**
Yes — same program, same engine, just without the visible window. The newer `--headless=new` mode runs the actual Chrome UI process in a way that's nearly indistinguishable from regular (headful) Chrome on most fingerprinting checks.
**Q: Playwright vs. Puppeteer vs. Selenium — which one?**
Playwright is the modern default: it drives multiple browsers, works in several languages, and has the best developer experience. Puppeteer is Chrome-only but tightly integrated. Selenium is the older standard, still dominant in enterprise QA teams. For a new scraping project, pick Playwright.
**Q: Can headless browsers solve CAPTCHAs?**
Not by themselves — a headless browser can display the challenge, but it can't recognize images or read the invisible bot scores a CAPTCHA uses to judge you. On sites you are permitted to access, the durable approach is to reduce how often a challenge appears in the first place — a coherent fingerprint and quality residential IPs — rather than relying on the headless browser to clear it.
**Q: Do headless browsers respect robots.txt?**
No. robots.txt is a polite convention aimed at crawlers, not browsers, so a headless browser will fetch any URL you point it at. Honoring robots.txt is up to you to build into the code that drives the browser.
---
## What Is Browser Fingerprinting?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-browser-fingerprinting
**Browser fingerprinting is a technique that identifies and tracks a visitor by combining dozens of small, observable characteristics of their browser and device into a single distinctive signature.** Think of it like recognizing someone by their height, voice, and gait rather than their name tag. Unlike cookies, a fingerprint is built from data the browser exposes by default — User-Agent (the browser's self-description), installed fonts, canvas and WebGL rendering quirks (tiny differences in how your hardware draws graphics), audio context output, screen resolution, TLS handshake order (TLS is the encryption layer behind https) — and persists even when the user clears cookies or switches to incognito mode.
### Quick facts
- **Also known as:** Device fingerprinting, passive fingerprinting
- **Common signals:** Canvas, WebGL, AudioContext, fonts, TLS/JA4, HTTP/2 frames
- **Used by:** Cloudflare, DataDome, PerimeterX, Akamai, fraud-prevention vendors
- **Cookie-free:** Yes — fingerprints survive incognito mode and cookie clearing
### How browser fingerprinting works
Sites collect fingerprint data through two channels. **Active fingerprinting** runs JavaScript in the browser to read APIs that report details about your setup: `navigator.userAgent`, `screen.width`, `Intl.DateTimeFormat().resolvedOptions()` (your timezone and locale), a hashed result from drawing a hidden test image (canvas), the name of your graphics chip (WebGL renderer), AudioContext outputs, and the list of installed fonts. **Passive fingerprinting** reads what the browser sends automatically, without being asked: the order of HTTP/2 frames, the cipher and extension order in the TLS ClientHello — the first message your browser sends to set up encryption — which produces the JA3/JA4 fingerprint, plus the exact casing and order of HTTP headers. Each signal alone is weak — millions of users share the same User-Agent — but combine fifteen of them and you have a 30-bit identifier that's unique among hundreds of millions of visitors. (30 bits means it can tell apart roughly a billion possibilities.)
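The arithmetic behind the "30-bit identifier" can be sketched in a few lines. The per-signal bit values below are invented for illustration (real entropy depends on the visitor population), and simply adding them assumes the signals are independent, which overstates the total when signals are correlated.

```python
import math

# Illustrative per-signal entropy estimates, in bits. These values are made up.
SIGNAL_BITS = {
    "user_agent": 4, "canvas": 4, "webgl": 3, "fonts": 3, "audio": 2,
    "screen": 2, "timezone": 2, "languages": 2, "platform": 1,
    "hardware_concurrency": 1, "device_memory": 1, "tls": 2,
    "http2": 1, "header_order": 1, "plugins": 1,
}


def combined_bits(signal_bits: dict) -> int:
    """Total entropy if every signal were independent of the others."""
    return sum(signal_bits.values())


def distinguishable(bits: int) -> int:
    """How many different fingerprints a given number of bits can tell apart."""
    return 2 ** bits


def bits_needed(population: int) -> int:
    """Minimum whole bits required to give everyone in a population a distinct value."""
    return math.ceil(math.log2(population))


total = combined_bits(SIGNAL_BITS)
print(len(SIGNAL_BITS), total, distinguishable(total))  # 15 30 1073741824
print(bits_needed(300_000_000))                         # 29
```

So fifteen weak signals at one to four bits each are already enough, on paper, to separate a population in the hundreds of millions.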
### Why fingerprinting matters for scraping
Fingerprinting is how modern anti-bot systems tell a real Chrome user from Playwright pretending to be one. Even if your scraper rotates IPs, sets a real User-Agent, and uses a headless browser (a browser running with no visible window), mismatches between the layers leak the truth. The giveaway is almost always an inconsistency. A Chrome User-Agent paired with a TLS fingerprint that real Chrome never produces is a tell. A canvas hash that matches none of the millions seen from real Chrome installs is a tell. A `navigator.plugins` array of length zero in a browser that should have plugins is a tell. Anti-bot scoring engines add up these signals and decide whether to serve the page, challenge with a CAPTCHA, or block outright.
### Why fingerprint consistency is hard to achieve
A single signal like the User-Agent tells only part of the story. What fingerprinting systems actually evaluate is whether all the signals agree with each other: whether the TLS fingerprint matches the browser named in the User-Agent, whether the canvas hash matches what a real installation of that browser produces, whether the timezone matches the network's location, and whether the language headers line up. When automation tooling reports values that don't naturally occur together, the inconsistency is what stands out. Keeping every layer internally consistent across a real browser stack is genuinely difficult engineering — which is why fingerprinting-aware browser-automation services exist. For authorized workflows on sites you are permitted to access, they maintain consistent, real-browser configurations so the layers stay coherent rather than ad hoc.
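The cross-checks described above can be pictured as a handful of equality tests across layers. This is a simplified sketch with hypothetical field names; it assumes the TLS fingerprint has already been mapped to a client family and the IP address to a timezone, both of which require lookup data not shown here.

```python
def find_inconsistencies(profile: dict) -> list:
    """List the places where a visitor's layers disagree with each other."""
    issues = []

    # TLS layer vs. the browser named in the User-Agent
    if profile["tls_client"] != profile["ua_browser"]:
        issues.append("TLS fingerprint does not match the User-Agent browser")

    # Timezone reported by JavaScript vs. timezone implied by the IP address
    if profile["js_timezone"] != profile["ip_timezone"]:
        issues.append("browser timezone does not match network location")

    # First language in the Accept-Language header vs. navigator.languages
    header_lang = profile["accept_language"].split(",")[0].split(";")[0].strip().lower()
    js_lang = profile["navigator_languages"][0].lower()
    if header_lang != js_lang:
        issues.append("Accept-Language header does not match navigator.languages")

    return issues
```

A real browser passes all of these without effort because every value comes from the same installation; tooling that sets each value separately has to get every pair right.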
### Privacy and ethical context
Fingerprinting was originally developed for fraud prevention — banks use it to detect stolen credentials being replayed from a new device. It's also widely used for ad tracking, which has drawn regulatory pushback under GDPR and the ePrivacy Directive (EU privacy laws). Browsers are pushing back too: Safari's ITP, Firefox's resistFingerprinting mode, and Chrome's Privacy Sandbox all aim to flatten the most identifying signals — making everyone look more alike. For scraping, this is good news — as real users become harder to tell apart, fingerprints become harder for sites to rely on.
### Example
```javascript
// A few of the signals a fingerprinting script collects and hashes.
const signals = {
  userAgent: navigator.userAgent,
  platform: navigator.platform,
  languages: navigator.languages,
  hardwareConcurrency: navigator.hardwareConcurrency,
  deviceMemory: navigator.deviceMemory,
  timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  screen: [screen.width, screen.height, screen.colorDepth],
  // Canvas, WebGL, fonts and audio add many more entropy bits.
};
// Real scripts hash the combined signals into one stable ID that survives
// cookie clearing; this sketch only serializes them.
const fingerprint = JSON.stringify(signals);
```
### FAQ
**Q: How unique is a browser fingerprint?**
EFF's Cover Your Tracks finds that 80–90% of browsers have a fingerprint that's unique within their visitor set — meaning no one else in that group looks the same. The exact uniqueness depends on how many signals are collected; fifteen well-chosen signals are enough to identify most users.
**Q: Does using a VPN change my fingerprint?**
A VPN changes your IP address, not your fingerprint. The canvas hash, TLS signature, screen resolution, and fonts stay exactly the same. Sites can link the VPN IP to the unchanged fingerprint, and the mismatch between location and device often gets you flagged.
**Q: Why can't a single signal be changed to hide automation?**
Any one signal can be set to an arbitrary value, but anti-bot vendors check whether the signals agree with each other. Changing the User-Agent without TLS, canvas, WebGL, and Audio also lining up produces a combination that doesn't exist on any real device — which is itself a strong signal that the request is automated.
**Q: What's a TLS fingerprint?**
It's the JA3 or JA4 hash derived from the order and contents of the TLS ClientHello — the first packet a client sends to start an encrypted connection. Chrome, Firefox, Safari, curl, and Python's requests each send a recognizably different ClientHello, so sites use it to spot non-browser clients no matter what User-Agent they claim.
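To show how such a hash is derived, here is a minimal sketch of the JA3 calculation: five ClientHello fields written as decimal numbers, joined with dashes and commas, then hashed with MD5, with GREASE placeholder values skipped. The input values in the usage lines are illustrative and do not reproduce any particular browser's handshake; extracting them from a live connection is out of scope here.

```python
import hashlib

# GREASE placeholder values (0x0A0A, 0x1A1A, ... 0xFAFA) are excluded from JA3.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3(version: int, ciphers: list, extensions: list, curves: list, point_formats: list) -> tuple:
    """Return (ja3_string, md5_hex) for the given ClientHello fields."""
    def field(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    ja3_string = ",".join([
        str(version),
        field(ciphers),
        field(extensions),
        field(curves),
        field(point_formats),
    ])
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


text, digest = ja3(771, [0x0A0A, 4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
print(text)  # 771,4865-4866-4867,0-23-65281-10-11,29-23-24,0
```

Because order is part of the string, two clients that support the same ciphers but list them differently end up with different hashes, which is what makes the fingerprint useful for telling clients apart.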
---
## What Is curl_cffi?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-curl-cffi
**curl_cffi is a Python HTTP client whose TLS fingerprint looks exactly like real Chrome, Firefox, or Safari.** TLS is the encryption layer behind https, and the way a client negotiates it leaves a recognisable signature. curl_cffi wraps *curl-impersonate*, a modified build of curl that uses BoringSSL with the exact cipher list, extensions, and HTTP/2 SETTINGS frames a real browser sends. It works as a near drop-in replacement for the requests library: the only code changes are importing `requests` from `curl_cffi` and adding `impersonate="chrome131"` to your call.
### Quick facts
- **Language:** Python (CFFI bindings to curl-impersonate)
- **Drop-in for:** requests (near-identical API); a familiar alternative for httpx users
- **Spoofs:** TLS JA4, HTTP/2 SETTINGS, header order, GREASE values
- **Presents a browser-like handshake to:** Cloudflare TLS checks, DataDome XHR, Akamai, common WAFs
- **Cannot handle:** Cloudflare Turnstile, Akamai sensor.js scoring, JS challenges
### Why curl_cffi exists
Here is the problem it solves. Python's requests library is built on a stack of layers: urllib3 → Python's ssl module → OpenSSL, using Python's default cipher list. That list has barely changed in years, so it produces one predictable TLS signature. Anti-bot services catalogue that signature as a JA3/JA4 hash (a fingerprint computed from the TLS handshake) and recognise it instantly. You can fake every header, cookie, and URL, but the handshake happens first and gives you away.
curl-impersonate fixes this by patching curl to use BoringSSL — the TLS library inside Chrome — with Chrome's exact cipher list, extension order, GREASE values, and HTTP/2 SETTINGS. The resulting JA4 is indistinguishable from real Chrome. curl_cffi is the Python binding to that patched curl, wrapped in a requests-compatible API.
### What changes when you flip the switch
Adding a single impersonate argument changes everything the handshake reveals:
- The TLS cipher suite list and its order
- The TLS extensions (ALPN, SNI, supported_groups, signature_algorithms, and so on) and their order
- GREASE values — Chrome's deliberately randomised dummy extensions
- The HTTP/2 SETTINGS frame contents (HEADER_TABLE_SIZE, INITIAL_WINDOW_SIZE, MAX_CONCURRENT_STREAMS)
- Header order and casing
Modern anti-bots fingerprint every one of these. Forgetting just one — the classic mistake is spoofing the User-Agent string but leaving the TLS untouched — flags you *faster* than not spoofing at all, because the mismatch (a browser User-Agent paired with a Python handshake) is itself a giveaway.
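The mismatch described above can be pictured as a simple lookup-and-compare on the server side. This is a minimal sketch, not any vendor's actual logic: the fingerprint table, its placeholder hash strings, and the function names are all hypothetical.

```python
# Hypothetical lookup: which client family a TLS fingerprint belongs to.
# The keys are placeholders, not real JA3/JA4 values.
KNOWN_FINGERPRINTS = {
    "placeholder-chrome-hash": "chrome",
    "placeholder-firefox-hash": "firefox",
    "placeholder-python-requests-hash": "python-requests",
}


def family_from_user_agent(user_agent: str) -> str:
    """Very rough User-Agent classification, enough for the illustration."""
    ua = user_agent.lower()
    if "python-requests" in ua:
        return "python-requests"
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return "unknown"


def classify(user_agent: str, tls_fingerprint: str) -> str:
    """Compare what the client claims to be with what its handshake looks like."""
    claimed = family_from_user_agent(user_agent)
    observed = KNOWN_FINGERPRINTS.get(tls_fingerprint, "unknown")
    if "unknown" in (claimed, observed):
        return "unverified"
    return "consistent" if claimed == observed else "mismatch"
```

A Chrome User-Agent arriving with the Python handshake lands in the "mismatch" bucket, while an honest python-requests client is merely "consistent", which is the sense in which half-spoofing stands out more than not spoofing.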
### What curl_cffi is and is not for
**Use it when:** the target is guarded by a TLS-fingerprinting WAF — a web application firewall that filters traffic before it reaches the site (Cloudflare without Turnstile, DataDome XHR endpoints, medium-strength Akamai) — or you are simply tired of requests getting flagged everywhere. Roughly 60–80% of protected targets become reachable with curl_cffi plus a residential proxy.
**Do not use it when:** the page needs JavaScript to run (Cloudflare Turnstile, Akamai sensor.js, F5 Shape custom VM). curl_cffi only sends HTTP requests; it cannot execute JS. For those you need a real browser — Camoufox, CloakBrowser, or a managed API.
### Example
```python
from curl_cffi import requests

# Drop-in replacement for the requests library
r = requests.get(
    "https://protected-site.com/api/data",
    impersonate="chrome131",
    headers={"Accept-Language": "en-US,en;q=0.9"},
    proxies={"https": "http://user:pass@residential-isp:port"},
    timeout=20,
)

# Sessions preserve cookies and TLS state across requests
s = requests.Session(impersonate="chrome131")
s.get("https://protected-site.com/")  # warm up
data = s.get("https://protected-site.com/api").json()
```
### FAQ
**Q: Is curl_cffi the same as requests?**
It has a compatible API but is completely different underneath. requests uses OpenSSL through Python's ssl module; curl_cffi uses BoringSSL through patched curl, so its TLS fingerprint matches a real browser. The function signatures are nearly identical (.get, .post, .Session), so most scrapers can switch with a single import change.
**Q: Does curl_cffi handle HTTP/2 correctly?**
Yes. It sends HTTP/2 SETTINGS frames matching the browser it is impersonating. This matters because the contents of those frames are a second fingerprint, and they catch scrapers who fixed only their TLS.
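To make the "second fingerprint" concrete, here is a sketch that serialises a client's opening HTTP/2 frames into a single comparable string, in the style of the format popularised by Akamai's passive HTTP/2 fingerprinting research. The function name and the example values in the usage note are illustrative assumptions, not values read from any particular browser build.

```python
# SETTINGS identifiers as defined by the HTTP/2 specification (RFC 9113).
SETTING_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def h2_fingerprint(settings, window_update, pseudo_header_order, priority="0"):
    """Serialise the connection preface into 'settings|window|priority|headers'.

    settings is a list of (name, value) pairs in the order the client sent them;
    pseudo_header_order is e.g. [":method", ":authority", ":scheme", ":path"].
    """
    settings_part = ";".join(f"{SETTING_IDS[name]}:{value}" for name, value in settings)
    # Keep only the first letter after the colon: ":method" becomes "m".
    headers_part = ",".join(header[1] for header in pseudo_header_order)
    return f"{settings_part}|{window_update}|{priority}|{headers_part}"
```

For example, `h2_fingerprint([("HEADER_TABLE_SIZE", 65536), ("ENABLE_PUSH", 0)], 15663105, [":method", ":authority", ":scheme", ":path"])` yields `1:65536;2:0|15663105|0|m,a,s,p`. A client that sends different settings, or the same settings in a different order, produces a different string even if its TLS handshake is perfect.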
**Q: How often do I need to update the impersonation profile?**
Every couple of months. Real users keep upgrading, so a Chrome 120 fingerprint in 2026 looks suspicious on its own. Bump to the latest profile curl_cffi ships and re-check your success rate.
**Q: Does curl_cffi solve JavaScript challenges?**
No. It is purely an HTTP client and does not run JavaScript. For JS challenges (Cloudflare Turnstile, Akamai sensor.js, F5 Shape) you need a real browser like Camoufox or a managed scraping API.
---
## What Is Camoufox?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-camoufox
**Camoufox is a fork of Firefox with anti-fingerprinting patches applied at the C++ build level.** That phrase matters: most anti-fingerprinting tools, like playwright-stealth, change the browser by injecting JavaScript while it runs. Anti-bot systems can spot that injected code by inspecting it with Function.toString(). Camoufox instead bakes its changes into the browser binary when it is compiled, so there is nothing in the JavaScript runtime left to inspect. It is driven through Juggler, the Firefox automation protocol maintained by the Playwright project, rather than through CDP (the Chrome DevTools Protocol used to automate Chromium). Because of that, none of the CDP timing artifacts that give away Chromium-based stealth tools are present.
### Quick facts
- **Base browser:** Firefox (C++ build-time patches)
- **Control protocol:** Juggler (Playwright's Firefox automation protocol), no CDP exposure
- **Reported compatibility:** High on anti-bot-protected sites (major social & community sites — early 2026)
- **Key feature:** geoip=True aligns WebRTC + DNS + timezone + Accept-Language
- **Fingerprint source:** BrowserForge — samples real-world device distributions
- **Memory footprint:** <200 MB per instance (vs Chrome ~800 MB)
- **Positioning:** Debloated, headless-first browser built for AI agents
### Why Camoufox exists
Modern anti-bots have a simple trick: they run Function.prototype.toString.call(fn) on browser functions to read the source code behind them. A genuine built-in function returns [native code]. If it returns anything else, the function has been tampered with — and the tampering itself becomes the bot signal. This breaks playwright-stealth completely: every value it overrides (navigator.webdriver, WebGLRenderingContext.prototype.getParameter, etc.) leaves a visible JavaScript source signature that toString exposes.
Camoufox changes the same things — canvas hash, WebGL renderer, AudioContext output, navigator quirks — but does it in C++ when the browser is built. The functions still return [native code] because they genuinely *are* native code. With no JavaScript injected, there is nothing for runtime inspection to catch.
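The check itself runs in JavaScript on the page, but the comparison it performs on the returned string is simple enough to sketch in Python. This is an illustration of the idea only; the regular expression and function name are assumptions, and real detection scripts may compare more loosely or more strictly.

```python
import re

# Matches "function name() { [native code] }" regardless of whitespace layout,
# since browsers format the native placeholder slightly differently.
NATIVE_SOURCE = re.compile(r"^function\s*[\w$]*\s*\(\)\s*\{\s*\[native code\]\s*\}$")


def looks_native(to_string_output: str) -> bool:
    """True if the text returned by Function.prototype.toString looks built-in."""
    return NATIVE_SOURCE.match(to_string_output.strip()) is not None
```

A function replaced at runtime returns its real JavaScript source, so it fails this test, while a function compiled into the binary still prints the native placeholder.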
### The geoip flag
Anti-bots check whether five identity clues all agree: the IP's country, the WebRTC ICE candidate (the network address the browser reveals when setting up peer connections), the DNS resolver's location, the timezone, and the Accept-Language header. If a US proxy is paired with a Pakistani DNS resolver, the request fails this coherence check before the browser fingerprint even gets looked at.
Camoufox's geoip=True automatically lines up all five clues with the proxy's exit country. It looks up the proxy IP, finds its country, and configures the browser to match — no manual setup. This one feature covers the detection layer that scrapers most often forget.
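The five-way coherence check can be sketched as a comparison against per-country expectations. This is a toy model under stated assumptions: the two-entry table, the session field names, and the function name are hypothetical, and a real system would use full GeoIP and timezone databases.

```python
# Hypothetical per-country expectations, for illustration only.
EXPECTED = {
    "US": {
        "timezones": {"America/New_York", "America/Chicago",
                      "America/Denver", "America/Los_Angeles"},
        "languages": {"en"},
    },
    "DE": {"timezones": {"Europe/Berlin"}, "languages": {"de", "en"}},
}


def incoherent_signals(session: dict) -> list:
    """Return the names of the signals that disagree with the IP's country."""
    country = session["ip_country"]
    expected = EXPECTED[country]
    problems = []
    if session["webrtc_country"] != country:
        problems.append("webrtc")
    if session["dns_country"] != country:
        problems.append("dns")
    if session["timezone"] not in expected["timezones"]:
        problems.append("timezone")
    # "en-US,en;q=0.9" -> "en"
    primary_language = session["accept_language"].split(",")[0].split("-")[0].lower()
    if primary_language not in expected["languages"]:
        problems.append("accept_language")
    return problems
```

With a US exit IP and a resolver located in Pakistan, the function returns `["dns"]`: one stray signal is enough to fail the check, which is why aligning all five automatically is useful.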
### Trade-offs vs alternatives
**vs playwright-stealth:** Camoufox defends at a layer that JavaScript patches simply cannot reach. playwright-stealth is largely broken against modern anti-bots in 2026.
**vs CloakBrowser (Chromium):** Camoufox is built on Firefox; CloakBrowser is built on Chromium. Chromium has roughly 65% of the browser market versus Firefox's ~3%, so for sites that weigh browser popularity, CloakBrowser blends in better. CloakBrowser also passes a major anti-bot's 60-extension probe by loading real extensions — something Camoufox does not handle as natively.
**vs real Chrome:** a real Chrome on real hardware is the gold standard, but it cannot scale to thousands of parallel sessions. Camoufox is the best headless option that still looks like a real browser.
### What Camoufox patches at the C++ level
Camoufox goes far beyond hiding navigator.webdriver. Because everything is compiled into the binary, each of these returns a native value with no JavaScript source signature to inspect:
- **Navigator** — device, OS, hardware concurrency, platform, vendor, and the full UA string
- **Screen & viewport** — resolution, available dimensions, color depth, device pixel ratio
- **Canvas & WebGL** — canvas hash, WebGL renderer/vendor, supported extensions, context attributes, and shader precision values
- **Audio** — AudioContext output and speech playback-rate parameters
- **Fonts** — the installed-font list, spoofed to match the target OS profile
- **Geolocation & Intl** — latitude/longitude, timezone, locale, and Intl formatting
- **WebRTC** — local and public ICE candidate IPs, spoofed at the protocol level rather than blocked
### Fingerprint injection and rotation
Camoufox does not hand out a single hard-coded fingerprint to every session. It uses **BrowserForge**, which samples how device characteristics are actually distributed in the real world, so every generated profile holds together and looks plausible: the screen size fits the device class, the WebGL renderer fits the platform, the fonts fit the OS. You can let it auto-generate a profile, pin a specific one so a session keeps the same identity, or supply your own through the fingerprint argument. The PyPI package wires up the injection for you, so rotating identity across thousands of sessions is a config change rather than a rebuild.
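The idea of sampling a profile that "holds together" can be sketched as conditional sampling: pick the operating system first, then pick every other trait only from values that are plausible for that system. The sketch below is not BrowserForge's API; the weights, trait tables, and function name are hand-written assumptions for illustration.

```python
import random

# Hypothetical, hand-written distributions for illustration only.
OS_WEIGHTS = {"windows": 0.7, "macos": 0.2, "linux": 0.1}
BY_OS = {
    "windows": {
        "renderers": ["Direct3D11 (example Intel GPU)", "Direct3D11 (example NVIDIA GPU)"],
        "screens": [(1920, 1080), (2560, 1440)],
        "fonts": ["Segoe UI", "Calibri", "Consolas"],
    },
    "macos": {
        "renderers": ["Metal (example Apple GPU)"],
        "screens": [(1440, 900), (2560, 1600)],
        "fonts": ["Helvetica Neue", "Menlo", "San Francisco"],
    },
    "linux": {
        "renderers": ["Mesa (example Intel GPU)"],
        "screens": [(1920, 1080)],
        "fonts": ["DejaVu Sans", "Liberation Mono"],
    },
}


def sample_profile(rng: random.Random, allowed_os=None) -> dict:
    """Pick an OS by weight, then pick traits that are plausible for that OS."""
    pool = {name: weight for name, weight in OS_WEIGHTS.items()
            if allowed_os is None or name in allowed_os}
    os_name = rng.choices(list(pool), weights=list(pool.values()), k=1)[0]
    traits = BY_OS[os_name]
    return {
        "os": os_name,
        "renderer": rng.choice(traits["renderers"]),
        "screen": rng.choice(traits["screens"]),
        "fonts": list(traits["fonts"]),
    }
```

Passing a seeded `random.Random` makes a profile reproducible, which mirrors pinning an identity for a session, while a fresh generator per session mirrors rotation.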
### Built for AI agents
Camoufox bills itself as an open-source browser built for AI agents, and the build shows it. It is a **debloated Firefox** — telemetry and background services stripped out — that runs headless-first and is tuned for a cleaner DOM, with no CSS-animation or telemetry noise. A cleaner page is cheaper for an LLM (large language model) to read and reason over. Add the sub-200 MB footprint and you can run many parallel agent sessions on modest hardware. The project stays in sync with the latest Firefox releases, and its source has been public since the v146.0.1-beta.25 release in January 2026.
### Example
```python
from camoufox.sync_api import Camoufox

with Camoufox(
    headless=True,
    geoip=True,  # auto-align WebRTC, DNS, timezone, Accept-Language
    os=["windows", "macos"],  # BrowserForge samples a plausible profile from these
    locale="en-US",  # locale + Intl spoofing
    proxy={
        "server": "http://residential-proxy:port",
        "username": "user",
        "password": "pass",
    },
    humanize=True,  # realistic mouse and scroll cadence
    block_webrtc=False,  # spoofs ICE IPs instead of blocking (less detectable)
    block_images=False,  # block_images=True trips some detectors
) as browser:
    page = browser.new_page()
    page.goto("https://cloudflare-protected.com/")
    page.wait_for_load_state("networkidle")
    html = page.content()
```
### FAQ
**Q: Is Camoufox open source?**
Yes. Camoufox is open source, built as C++ patches layered on top of Firefox. The project ships pre-built binaries plus a Python API (with both sync and async styles) that works much like Playwright.
**Q: Why Firefox and not Chrome?**
Firefox is automated through Juggler, the Firefox automation protocol maintained by the Playwright project, not through CDP (the Chrome DevTools Protocol). CDP itself leaks tell-tale signals (Runtime.enable timing, execution-context artifacts, binding exposure) that give away Chromium-based stealth tools. Firefox plus Juggler sidesteps all of them, simply because those signals do not exist in the protocol.
**Q: Does Camoufox handle CAPTCHAs?**
Not directly — Camoufox does not deal with CAPTCHAs itself. But because Camoufox presents a real, consistent fingerprint, the suspicious context that usually triggers CAPTCHAs is reduced, so they show up less often. If one still appears, you would handle it separately.
**Q: How does this compare to playwright-stealth?**
playwright-stealth patches detection points in JavaScript while the browser runs, which can be spotted via Function.toString(). Camoufox patches the same points in C++ when the browser is built, leaving no runtime signatures. That architectural difference is the whole point: most JS-level stealth is broken in 2026, while build-level patches are the current state of the art.
**Q: Where do Camoufox's fingerprints come from?**
From BrowserForge, which samples real-world device-characteristic distributions to build internally consistent profiles — the screen size, WebGL renderer, fonts, and navigator properties all agree with each other and match real traffic. You can auto-generate a profile, pin one to keep a session's identity stable, or supply your own through the fingerprint argument.
**Q: Can Camoufox rotate fingerprints across many sessions?**
Yes. The PyPI package automatically injects a fresh, internally consistent fingerprint into each browser instance, so rotating identity across thousands of parallel sessions is a config change rather than a rebuild. You can narrow the pool with options like os=["windows", "macos"] and locale to keep profiles plausible for your targets.
**Q: How much memory does Camoufox use?**
Under 200 MB per instance, versus roughly 800 MB for Chrome. It is a debloated, headless-first Firefox with telemetry and background services stripped out, which is exactly what makes it practical to run many parallel agent or scraping sessions on modest hardware.
**Q: Does Camoufox block WebRTC to prevent IP leaks?**
It can, but the default and recommended approach is to spoof the WebRTC ICE candidates at the protocol level so they match the proxy's exit IP. Blocking WebRTC outright is itself a detectable signal; spoofing keeps things looking like a normal browser while still stopping the real local and public IP from leaking.
---
## What Is AI Web Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-ai-web-scraping
**AI web scraping is an approach that replaces CSS selectors with natural-language prompts, LLM-based extraction, and Markdown-first output.** Normally you tell a scraper exactly where data lives on the page, like .product-price > span.amount. With AI scraping you instead describe what you want in plain English, and an LLM (large language model — the AI that powers tools like ChatGPT) reads the page and pulls it out for you. The category took off in 2024–2025 with Firecrawl (111K GitHub stars) and Crawl4AI (60K stars) leading; the market is forecast to grow from $7.5B in 2025 to $38B by 2034.
### Quick facts
- **Leading tools:** Firecrawl (managed), Crawl4AI (open source), ScrapeGraphAI
- **Output format:** Clean Markdown — ~67% fewer tokens than raw HTML
- **Extraction accuracy:** F1 > 0.95 on structured tasks (NEXT-EVAL benchmark, 2025)
- **Native integrations:** LangChain, LlamaIndex, CrewAI, MCP servers
- **Production pattern:** LLM + Pydantic + Instructor for schema-validated extraction
### Why the shift happened
Three things came together in 2024–2025. First, **LLMs got good enough at structured extraction** — that is, reliably turning messy text into clean fields like price or title. The NEXT-EVAL benchmark showed F1 > 0.95 (F1 is an accuracy score from 0 to 1) when the input is properly formatted. Second, **token costs dropped** — you pay LLMs per token (a token is roughly a word-piece), and Markdown output uses about 67% fewer tokens than raw HTML, which adds up fast across thousands of pages. Third, **MCP (Model Context Protocol) shipped** — a standard way to hand tools to an AI, so Claude, Cursor, and Codex can scrape directly with no code on the LLM side. The result is a workflow where you describe the data once and the pipeline keeps working even when a site is redesigned.
### The leading tools
**Firecrawl** — a hosted service you can also run yourself. You give it a URL and it returns clean Markdown or JSON. Its FIRE-1 agent navigates JavaScript-heavy sites on its own, and an /interact endpoint can click buttons and fill forms. It plugs straight into LangChain and LlamaIndex (popular AI app frameworks). 500 free scrapes/month. Used by SAP, Zapier, Deloitte.
**Crawl4AI** — open source under the Apache 2.0 license, often called the "Scrapy of the LLM era". You run it on your own servers, and it supports Ollama so the AI model runs locally too. Its adaptive crawling learns a site's selectors over time. You keep full control of your data.
**ScrapeGraphAI** — you describe what you want, and an LLM builds and runs a graph-based extraction pipeline (a series of connected steps) to get it. It is self-healing: when a site's structure changes, you just re-describe what you need and it adapts.
### The production pattern
Unvalidated LLM output is too unreliable for production use. Ask a model for a price across 10,000 articles and you get $40, 40 dollars, "forty", and occasionally numbers it simply made up. The fix is **schema-validated extraction with Pydantic + Instructor**. A schema is just a definition of the exact shape you expect. You define that shape as a Pydantic model (Pydantic is a Python library that checks data against a defined type), then pass it to the LLM through Instructor, which makes the LLM return a typed object instead of free text. Instructor retries when the output does not match and throws away malformed results before they reach your pipeline. So if the LLM puts "competitive" in a salary field, validation fails, the call retries, and you end up with either a real number or None, never garbage.
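The validate-then-retry loop can be sketched with the standard library alone. This is a stand-in for the behaviour described, not Instructor's or Pydantic's API: the dataclass, `parse_posting`, `extract_with_retries`, and the `ask_llm` callable are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobPosting:
    title: str
    salary_min_usd: Optional[int]


def parse_posting(raw: dict) -> JobPosting:
    """Raise ValueError unless raw matches the expected shape."""
    title = raw.get("title")
    if not isinstance(title, str) or not title:
        raise ValueError("title must be a non-empty string")
    salary = raw.get("salary_min_usd")
    # bool is a subclass of int in Python, so reject it explicitly.
    if salary is not None and (isinstance(salary, bool) or not isinstance(salary, int)):
        raise ValueError("salary_min_usd must be an integer or None")
    return JobPosting(title=title, salary_min_usd=salary)


def extract_with_retries(ask_llm: Callable[[str], dict], page: str,
                         max_retries: int = 3) -> Optional[JobPosting]:
    """Call the model until its output validates, or give up and return None."""
    for _ in range(max_retries):
        try:
            return parse_posting(ask_llm(page))
        except ValueError:
            continue  # malformed output is discarded and the call is retried
    return None
```

If the first reply puts "competitive" in the salary field, `parse_posting` rejects it and the loop asks again, so the caller only ever sees a validated object or `None`.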
Sometimes the old-school approach still wins. At large scale on a fixed schema (10M+ documents, e-commerce / classifieds), classical NLP — spaCy NER (named entity recognition: spotting things like names, prices, dates) plus dependency parsing — costs effectively nothing after the model loads and runs in under a millisecond per item. The common production setup is a hybrid: use classical NLP to pre-filter and tag everything, and call the LLM only for the ambiguous cases.
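A minimal sketch of the hybrid routing, with a regular expression standing in for the spaCy NER stage so the example needs no third-party packages. The pattern, the function name, and the `ask_llm` callable are assumptions for illustration.

```python
import re
from typing import Callable, Optional

# Matches "$40", "$ 40.50", "40 dollars", "40 USD".
PRICE_PATTERN = re.compile(
    r"\$\s*(\d+(?:\.\d+)?)|(\d+(?:\.\d+)?)\s*(?:dollars|usd)", re.IGNORECASE
)


def extract_price(text: str, ask_llm: Callable[[str], Optional[float]]) -> Optional[float]:
    """Use the cheap pattern when it gives exactly one answer, else ask the LLM."""
    matches = PRICE_PATTERN.findall(text)
    if len(matches) == 1:
        symbol_form, word_form = matches[0]
        return float(symbol_form or word_form)
    # Zero candidates or several conflicting ones: treat as ambiguous.
    return ask_llm(text)
```

Only the ambiguous inputs ("about forty", or a page quoting two prices) reach the model, so the per-item LLM cost applies to a small fraction of the corpus.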
### Example
```python
# Production AI scraping: Firecrawl + Instructor for typed output
import anthropic
import instructor
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field


class JobPosting(BaseModel):
    title: str
    company: str
    salary_min_usd: int | None = Field(description="Floor of salary range in USD")
    location: str
    remote: bool


app = FirecrawlApp(api_key="fc-...")
markdown = app.scrape_url(
    "https://example.com/job/123",
    params={"formats": ["markdown"]},
)["markdown"]

client = instructor.from_anthropic(anthropic.Anthropic())
job = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=JobPosting,
    messages=[{"role": "user", "content": markdown}],
    max_retries=3,
)
# job is a validated JobPosting object, not a string
```
### FAQ
**Q: Is AI scraping more accurate than CSS selectors?**
It is a different trade-off, not strictly better. CSS selectors are deterministic (same input, same output) and free, but they break the instant a site is redesigned. LLM extraction survives redesigns because it reads meaning rather than page structure — but it costs money per request and can hallucinate (confidently return wrong answers). Schema-validated LLM extraction (Pydantic + Instructor) catches those hallucinations before they reach your pipeline.
**Q: Does AI scraping interact with anti-bot systems?**
No. AI handles the extraction layer (reading the page), not the access layer (getting the page). You still need a consistent browser configuration, proxies, and the same TLS handling (TLS is the encryption behind https, and sites profile how your client negotiates it) to fetch the page in the first place — for sites you are permitted to access. Firecrawl bundles these into one managed service; self-hosted Crawl4AI lets you bring your own stack.
**Q: What is MCP and why does it matter?**
Model Context Protocol is a standard way to expose tools to LLMs so an AI can call them. Both Firecrawl and Crawl4AI ship MCP servers, so Claude or Cursor can scrape just by making a tool call, with no code to write. For agentic workflows (where the AI decides its own steps) this turns the web into a first-class capability any LLM can use.
**Q: Should I use Firecrawl or Crawl4AI?**
Choose Firecrawl if you want a managed service with the FIRE-1 agent for hard sites and you do not mind your data leaving your own infrastructure. Choose Crawl4AI if you need full data sovereignty (your data never leaves your servers), want to run local LLMs with Ollama, or are cost-sensitive and willing to operate the stack yourself.
---
## What Is Mobile API Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-mobile-api-scraping
**Mobile API scraping means watching the traffic a vendor's phone app sends to its servers, then making those same requests yourself from Python or any HTTP client.** The trick: the data a mobile app receives often sits behind much weaker protection than the website does — no Cloudflare, no JA4 fingerprinting (a way to identify a client from the shape of its TLS handshake, the encryption layer behind https), often just a simple Bearer token. For data you own or are permitted to access, it is often the most direct starting point on the scraping decision flow.
### Quick facts
- **Why it works:** Mobile apps usually hit a separate backend with weaker anti-bot
- **Toolchain:** Rooted Android emulator + HTTP Toolkit (free) + curl_cffi
- **Why it differs:** Web-only anti-bot deployments (Akamai, Cloudflare, DataDome, F5 Shape) often do not cover the mobile backend
- **Constraints:** Apps with SSL pinning or jailbreak-detection limit traffic inspection
- **Decision flow position:** Step 1 — always try first
### The toolchain (free, ~30 minutes to set up)
Everything you need is free. Here is the full setup:
- **Android Studio + AVD.** An AVD is an Android Virtual Device — a phone emulator running on your computer. Create one with API 30+ (Android 11+). Avoid API 28 — the rooting scripts do not support it.
- **rootAVD** (github.com/newbit1/rootAVD). Rooting gives you admin control over the emulator; this is a one-command script that does it. Afterward, confirm Magisk (the root manager) shows up in the app drawer.
- **HTTP Toolkit** (free, httptoolkit.com). This is the tool that records the app's traffic. Open it → Intercept → "Android device via ADB". It auto-detects the running AVD and installs its own trusted certificate so it can read the encrypted traffic.
- **Install the target app** via Google Play on the AVD, or sideload the APK (the Android install file) from apk.support.
- **Use the app while HTTP Toolkit captures.** Filter by the target domain. The requests that actually return data are usually about a dozen out of hundreds — search, listing, detail.
- **Replicate in Python.** Right-click any captured request → copy as cURL → import into Postman → confirm it returns data → port to curl_cffi.
### Step-by-step: intercepting an Android app
Android is the easier platform to intercept because Google publishes it as open source, including emulator images that accept certificates you install yourself. (A certificate is what lets the proxy read encrypted traffic.) This walkthrough uses mitmproxy rather than HTTP Toolkit; the full workflow:
- **Install Android Studio** and create an emulator using an image *without* Google Play. Play images run Play Integrity attestation — a check that the device is untampered — and refuse to launch once you add a custom certificate. The plain AOSP system images (API 30+) work without that restriction.
- **Start mitmproxy** on your computer: mitmproxy --mode regular --listen-port 8080. mitmproxy is a proxy that sits between the app and the server so you can see the traffic. Note the host IP the emulator can reach (usually 10.0.2.2).
- **Point the emulator at the proxy**: Settings → Network → set proxy to 10.0.2.2:8080. Open mitm.it in the emulator browser, download the *Android* cert, and install it via Settings → Security → User certificates.
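- **Install the target app**. If its traffic does not appear or the app reports a network error, the usual causes are that the app ignores user-installed certificates (the default for apps targeting Android 7 / API 24 and later) or that it uses certificate pinning, where the app refuses to talk to a server whose certificate it doesn't already recognise. See the next section.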
- **Use the app normally**. The mitmproxy console shows every request and response, so the endpoint, headers, how requests are signed, and how pages are paged through all become visible right away. Common finds: GraphQL endpoints, signed JWT auth tokens (compact, self-contained login tokens) that expire after an hour, and unprotected list endpoints that only need a couple of mobile-specific headers.
For an app you are permitted to inspect, the result of this exercise is a clear picture of how the mobile backend is structured, even where the company's web stack uses a separate anti-bot product.
### How certificate pinning affects traffic inspection
Roughly half of mainstream apps pin their TLS certificate. Pinning means the app has the expected server certificate's fingerprint baked in and refuses to talk to anything else. So a proxy certificate is ignored, the app shows a network error, and traffic inspection does not work on a pinned app.
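What the app does when it pins can be sketched as a hash comparison. This is a simplified model: the byte strings below are placeholders rather than real certificates, the function names are made up, and real implementations such as OkHttp's CertificatePinner hash the certificate's public-key info rather than the whole certificate.

```python
import base64
import hashlib


def pin_for(der_bytes: bytes) -> str:
    """Base64 of the SHA-256 digest, the usual textual form of a pin."""
    return base64.b64encode(hashlib.sha256(der_bytes).digest()).decode("ascii")


def connection_allowed(presented_der: bytes, pinned: set) -> bool:
    """The app proceeds only if the presented certificate matches a baked-in pin."""
    return pin_for(presented_der) in pinned


# Placeholder bytes standing in for DER-encoded certificates.
SERVER_CERT = b"placeholder-bytes-for-the-real-server-certificate"
PROXY_CERT = b"placeholder-bytes-for-the-inspection-proxy-certificate"
PINNED = {pin_for(SERVER_CERT)}  # shipped inside the app at build time
```

Here `connection_allowed(SERVER_CERT, PINNED)` is true and `connection_allowed(PROXY_CERT, PINNED)` is false, even though the device itself trusts the proxy's certificate. That gap is why the app shows a network error instead of traffic.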
**Frida** is a well-known instrumentation tool sometimes used in mobile app testing to observe how pinning checks behave at runtime. On Android, pinning is commonly implemented through okhttp3.CertificatePinner and javax.net.ssl.TrustManagerFactory. Flutter apps put their pinning in the Dart layer rather than Java. iOS apps use a different stack again.
If pinning lives in native code (rare, but it happens in banking apps), inspection is much harder. At that point the effort often exceeds what a managed scraping API would cost, and the decision flow suggests moving back up the ladder. Note that bypassing pinning on apps you do not own or are not authorized to test may violate the app's terms and applicable law.
### What to record before disconnecting the proxy
Once you have a captured session, write down all of this before the session expires — you are documenting the API so you never have to touch the live app again:
- **The endpoint path** and HTTP method.
- **Authentication scheme** — Bearer token, signed request, or OAuth refresh flow. Note the TTL (time-to-live, i.e. how long the token stays valid).
- **Request signing** — many apps sign each request with an HMAC (a checksum keyed by a secret) of the body plus a shared secret. The secret is hidden in the app binary and usually survives across versions.
- **Required headers** — X-App-Version, X-Device-ID, X-Build-Number. They look optional, but the API often returns 403 (forbidden) without them.
- **Pagination model** — how it walks through pages: offset/limit, cursor, or token. Cursor-based pagination from a mobile API is almost always more reliable than offset-based on the web.
- **Rate limit** — fire 20 requests quickly and watch for a 429 (too many requests) or a rate-limit header. Mobile APIs often have looser limits than the web equivalent.
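The cursor model from the list above can be sketched as a small generator. The response field names `items` and `next_cursor` are assumptions, as is the `fetch_page` callable; substitute whatever the captured API actually returns.

```python
from typing import Callable, Iterator, Optional


def iterate_listings(fetch_page: Callable[[Optional[str]], dict],
                     max_pages: int = 1000) -> Iterator[dict]:
    """Yield every item, following the cursor until the API stops returning one.

    fetch_page takes the current cursor (None for the first page) and returns
    a dict shaped like {"items": [...], "next_cursor": "..." or None}.
    """
    cursor = None
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            return
```

Keeping the HTTP call behind `fetch_page` means the same loop works against a recorded capture during development and against the live endpoint later.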
Then write the scraper against this documentation, not against the live app. Rotating X-Device-ID per worker, refreshing the auth token before it expires, and honouring the request-signing scheme is enough for most production cases.
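As one possible shape for the request-signing step, the sketch below computes an HMAC-SHA256 over a canonical JSON body. The canonicalisation, the choice of SHA-256, and the idea of sending the result in a header are assumptions for illustration; the captured traffic is the authority on what a given API actually signs.

```python
import hashlib
import hmac
import json


def sign_body(body: dict, secret: bytes):
    """Return (payload_bytes, hex_signature) for a JSON request body."""
    # Sorted keys and compact separators give a stable byte sequence to sign.
    payload = json.dumps(body, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload, signature


def signature_matches(payload: bytes, secret: bytes, signature: str) -> bool:
    """Constant-time check, as a server would perform it."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The payload returned by `sign_body` is the exact byte string that must go on the wire. Re-serialising the dict elsewhere can change key order or whitespace and invalidate the signature.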
### Why mobile APIs are softer than the web
Three structural reasons:
- **Mobile apps already authenticate.** The app ships with an API key or signs requests with a per-user token, so the backend trusts those requests more than anonymous hits from a browser. More trust means lighter bot defences.
- **Anti-bot vendors target browsers.** Cloudflare, Akamai, and DataDome built their products to catch headless Chrome and Selenium. Traffic from a real device already looks like a real device — there is no equivalent product going after native HTTP clients at scale.
- **JS rendering is irrelevant.** Mobile APIs return JSON, not HTML. With no DOM there are no hidden honeypot fields and no client-side challenge to trip — the entire browser-fingerprinting category simply doesn't apply.
For example, a retailer's mobile app may hit a direct GraphQL endpoint served by a different backend than the web frontend, which is why the mobile and web paths can carry different anti-bot configurations even when they return the same data.
### When mobile API scraping does not work
**SSL pinning.** Some apps lock onto their own SSL certificate and refuse to talk to an inspection proxy's certificate. Banking apps and high-value retailers commonly pin, and for those apps traffic inspection is generally not possible without authorization from the app owner.
**Root detection.** Some apps crash or refuse to start on rooted devices (the Android counterpart of jailbreak detection on iOS). Google's device-integrity attestation, formerly SafetyNet Attestation and now the Play Integrity API, is the usual mechanism; Magisk Hide / DenyList can usually work around it.
**ARM-only apps.** The default AVD runs on x86 chips. Some apps refuse to run on x86 emulators. Either use an arm64 emulator (slower) or a physical device with frida-server installed.
**Tokens expire.** Most apps issue fresh tokens at login. Build a token-refresh step into your scraper rather than relying on a single captured token.
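A token-refresh step can be as small as a wrapper that re-runs the login shortly before expiry. This is a sketch under assumptions: the `login` callable, its `(token, lifetime_seconds)` return shape, and the class name are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple


class TokenManager:
    """Hands out a valid token, logging in again shortly before it expires."""

    def __init__(self, login: Callable[[], Tuple[str, float]],
                 margin: float = 60.0, clock: Callable[[], float] = time.time):
        self._login = login      # returns (token, lifetime_in_seconds)
        self._margin = margin    # refresh this many seconds before expiry
        self._clock = clock      # injectable so the logic can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            token, lifetime = self._login()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token
```

Each worker calls `manager.get()` when building its Authorization header, so a one-hour token is renewed transparently instead of failing mid-run.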
### Example
```python
# After capturing the mobile API request in HTTP Toolkit:
from curl_cffi import requests

resp = requests.get(
    "https://api.target.com/v2/listings",
    headers={
        # All copied directly from the HTTP Toolkit capture
        "Authorization": "Bearer <token_from_capture>",
        "X-App-Version": "4.2.1",
        "User-Agent": "TargetApp/4.2.1 (Android 11; SDK 30)",
        "Accept": "application/json",
    },
    impersonate="chrome131",  # most mobile APIs are TLS-permissive
    timeout=30,
)
data = resp.json()
# Often the same dataset the web frontend exposes via a JS-heavy SPA
# is returned here in clean JSON via the separate mobile backend.
```
### FAQ
**Q: Is mobile API scraping legal?**
The same legal rules apply as for any scraping — whether the data is public or private matters far more than which channel you used to get it. Scraping a public e-commerce catalogue through the mobile API is generally the same legal posture as scraping it through the web. Bypassing authentication or scraping logged-in user data is a different question and likely violates the Terms of Service at minimum.
**Q: Do I need a physical device?**
No — Android Studio's emulator is enough for most apps. You only need a physical phone for ARM-only apps or apps with very aggressive emulator detection.
**Q: What is SSL pinning?**
When an app pins its SSL certificate, it bakes the expected server certificate (or its hash) into the app and refuses any connection that presents a different one. That prevents traffic-inspection tools like HTTP Toolkit from reading the app's encrypted traffic, because they present their own certificate. Working around pinning is only appropriate on apps you own or are authorized to test, and may otherwise violate the app's terms and applicable law.
**Q: Can I scrape mobile APIs at scale without ever running the emulator?**
Yes — once you have captured the request format, the emulator is only needed for refreshing tokens and catching protocol changes. The actual scraping runs from curl_cffi (or any HTTP client) against the captured endpoints, scaled out across residential proxies as needed.
**Q: Why does Android without Google Play work but with Google Play does not?**
Apps installed from the Play Store can call the Play Integrity API, which vouches that the device isn't rooted and the app hasn't been tampered with. Installing your own certificate trips a Play Integrity failure on Google Play images. AOSP images without Google Play services skip that check entirely, so the app behaves as if everything is normal.
**Q: Is intercepting a mobile app legally distinct from scraping the web version?**
It depends on the jurisdiction and the app's Terms of Service. The intercept itself is local — you are reading traffic from a device you own. Reusing the resulting API is governed by the same ToS / CFAA / DMCA framework as web scraping, plus whatever app store agreements bind the operator. The technical novelty is on the intercept side, not in the legal exposure.
---
## What Is CloakBrowser?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-cloakbrowser
**CloakBrowser is a Chromium build with 49 C++ binary patches that give it a consistent browser configuration.** The goal is for it to present like an ordinary browser. Most anti-fingerprinting tools, like playwright-stealth, inject JavaScript at runtime to change browser values — but that injection is detectable, because a site can ask a function to show its own source code with Function.toString() and notice the difference. CloakBrowser takes a different route: it edits the Chromium source code itself and recompiles it. The patched browser features return [native code] when inspected because they genuinely *are* native code — there is no injected JavaScript to catch. It is tuned for Chromium-only sites and for Akamai's 60-extension probe, and it reports a reCAPTCHA v3 score of 0.9 (a score close to 1.0 means the site is very confident you are human).
### Quick facts
- **Base browser:** Chromium (vs Camoufox's Firefox)
- **Patches:** 49 C++ binary patches — Canvas, WebGL, AudioContext, Battery, CDP input
- **Distinct feature:** Loads real extensions (uBlock, 1Password) to pass Akamai's 60-extension probe
- **reCAPTCHA v3 score:** ~0.9 (consistent, low-risk score)
- **Memory footprint:** ~200+ MB per browser instance
### How CloakBrowser is different from playwright-stealth
Here is the key fact every stealth scraper has to understand in 2026: anti-bot scripts run Function.prototype.toString.call(fn) on the browser functions a stealth tool usually overrides. That call returns a function's source code. For a genuine built-in, the body is just the placeholder [native code]; if real JavaScript shows up instead, the site knows the function was tampered with, so the patch *itself* becomes the giveaway. This is why playwright-stealth fails against Kasada, recent Akamai versions, and DataDome: each JavaScript override leaves a visible source signature.
CloakBrowser changes the same things (canvas hash, WebGL renderer, AudioContext, navigator quirks, Battery API, CDP input handling) but does it in the C++ code before the browser is even built. The functions still return [native code] because they are still native code. Nothing was injected at runtime, so toString() finds nothing to flag.
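The check described above can be sketched from the detector's side. This is a minimal Python illustration, assuming the anti-bot script has already collected the strings that Function.prototype.toString returned inside the page. The function name `looks_native`, the regular expression, and the sample strings are hypothetical; real detectors are more elaborate.

```python
import re

# Shape a browser reports for a built-in function, for example
# "function getParameter() { [native code] }".
NATIVE_SOURCE = re.compile(r"^function\s*[\w$ ]*\(\)\s*\{\s*\[native code\]\s*\}$")


def looks_native(source: str) -> bool:
    """Return True if a toString() result has the built-in function shape."""
    return bool(NATIVE_SOURCE.match(source.strip()))


# Hypothetical strings as a detection script might collect them.
samples = {
    "untouched getParameter": "function getParameter() { [native code] }",
    "runtime override": "function () { return 'Intel Iris OpenGL Engine'; }",
}

for label, source in samples.items():
    print(label, "->", "native" if looks_native(source) else "tampered")
```

A source-level patch passes this kind of check because the function really is compiled code; a runtime override fails it because its JavaScript body is what gets printed.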
### The 60-extension probe and real extensions
Akamai's sensor.js tries to load 60 known chrome-extension://[id]/manifest.json URLs — in effect, checking which browser extensions you have installed. Real Chrome users almost always have a few (uBlock Origin, LastPass, Bitwarden, 1Password), so at least some of these requests should succeed. A headless browser (a browser running with no visible window, typical of bots) has none, so all 60 requests fail at once with net::ERR_FAILED — a result that is statistically impossible for a real user.
CloakBrowser ships with profiles that have real extensions installed, so some probes return real manifest data and the overall response pattern looks like a genuine Chrome user. This is the one Chromium-side feature that Camoufox cannot match as naturally, because Firefox has no chrome-extension protocol at all.
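To make the statistical argument concrete, here is a small sketch of how a scorer could treat the probe results. It assumes a deliberately simple rule (zero successful loads out of 60 is flagged); the function name and the rule are illustrative assumptions, not Akamai's actual scoring logic.

```python
from typing import Sequence

PROBE_COUNT = 60  # number of extension manifests probed, per the text above


def probe_verdict(results: Sequence[bool]) -> str:
    """Classify one probe run; results[i] is True if manifest i loaded."""
    if len(results) != PROBE_COUNT:
        raise ValueError(f"expected {PROBE_COUNT} results, got {len(results)}")
    loaded = sum(results)
    # Assumption for illustration: zero hits is the only signal scored here.
    return "suspicious" if loaded == 0 else "plausible"


headless_run = [False] * PROBE_COUNT                       # no extensions at all
profile_run = [True, True] + [False] * (PROBE_COUNT - 2)   # two extensions present

print(probe_verdict(headless_run))
print(probe_verdict(profile_run))
```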
### When to choose CloakBrowser vs Camoufox
**Pick CloakBrowser for:** Chromium-only sites; Akamai targets where sensor.js actively scores you and the extension probe matters; and sites that weight browser market share, since about 65% of users run a Chromium-family browser. The trade-off is higher memory use.
**Pick Camoufox for:** Cloudflare (reported high compatibility in Mar 2026), and sites where CDP detection is the main barrier — Camoufox is built on Firefox and drives the browser with Mozilla's Juggler protocol instead of CDP (the Chrome DevTools Protocol that automation tools use and anti-bots watch for). It also uses less memory.
**For both:** use real residential or ISP IP addresses; align your timezone and locale to the IP with geoip-style settings; run with a consistent, rate-limited input mode (humanize=True or equivalent); and never run more than about 10 instances on one machine.
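As a rough sketch of the timezone and locale alignment mentioned above, the helper below maps a proxy's country code to browser context settings. The option names `locale` and `timezone_id` are the ones Playwright's `new_context` accepts; the lookup table and the function name are hypothetical, and a real setup would resolve these values per IP (a country such as the US spans several time zones).

```python
# Hypothetical lookup table: proxy exit country -> browser context settings.
GEO_SETTINGS = {
    "DE": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
    "US": {"locale": "en-US", "timezone_id": "America/New_York"},
    "JP": {"locale": "ja-JP", "timezone_id": "Asia/Tokyo"},
}


def context_options_for(country_code: str) -> dict:
    """Return locale/timezone options that match a proxy's country."""
    try:
        return dict(GEO_SETTINGS[country_code.upper()])
    except KeyError:
        raise ValueError(f"no settings for country {country_code!r}") from None


print(context_options_for("de"))
```

With a Playwright-compatible browser object, the result can be unpacked into the context call, for example `browser.new_context(**context_options_for("DE"))`.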
### Is the closed-source binary safe? (security analysis)
CloakBrowser's control library is open source, but the part that actually does the stealth work is a **pre-built, closed-source Chromium binary** you download and run with full local privileges. That is a real supply-chain concern (the risk that software you install does something hidden): a patched browser binary could in principle read your .ssh keys, harvest environment secrets, or phone home, and because you cannot see the C++ source you cannot rule it out.
An independent behavioural audit — github.com/pim97/cloakbrowser-analyze — ran nine runtime tests against the binary (watching what it actually does while running) and reported **no malicious behaviour observed**: 2.9M extracted strings contained no suspicious URLs, hardcoded credentials, or exfiltration keywords; packet capture showed only the expected PyPI/GitHub traffic from the wrapper; the process never touched .ssh, .aws, or planted decoy secrets; and every process it spawned was a standard Chromium component.
The same write-up is clear about the limit of that evidence: *passing behavioural tests is not the same as being provably safe.* A closed binary can still hide behaviour that only triggers after a delay or under specific conditions, which runtime observation will not catch. If your threat model cannot tolerate an unauditable binary — corporate machines, or anything near credentials or production infrastructure — prefer a fully open-source stack like Camoufox or PatchRight, or run CloakBrowser inside a disposable container with no access to host secrets.
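One way to picture the disposable-container option is a `docker run` invocation that mounts nothing from the host and passes no environment variables. The sketch below only builds the command line. The image name, script path, and network name are hypothetical, and a real browser image will probably need extra settings (for example a writable profile directory) before Chromium starts cleanly.

```python
import shlex


def sandbox_command(image: str, script: str) -> list[str]:
    """Build a `docker run` argv that exposes no host secrets to the container."""
    return [
        "docker", "run",
        "--rm",                                 # discard the container afterwards
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--network", "scraper-net",             # hypothetical pre-created, egress-filtered network
        # Deliberately absent: volume mounts and environment passthrough,
        # so SSH keys, cloud credentials, and env secrets stay on the host.
        image,
        "python", script,
    ]


print(shlex.join(sandbox_command("cloakbrowser-sandbox:latest", "/app/scrape.py")))
```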
### Example
```python
# CloakBrowser is configured similarly to Playwright but with C++ patches baked in.
# Note: real extension profiles are configured at install time, not at runtime.
from cloakbrowser import CloakBrowser

with CloakBrowser(
    profile="default_with_ublock_1password",  # real extensions for Akamai probe
    proxy={
        "server": "http://residential:port",
        "username": "user",
        "password": "pass",
    },
    headless=True,
    humanize=True,
) as browser:
    page = browser.new_page()
    page.goto("https://akamai-protected.com/")
    page.wait_for_load_state("networkidle")
    # _abck flips to ~0~ after sensor.js POST; accumulated trust visible
    listings = page.eval_on_selector_all(".listing", "els => els.map(el => el.innerText)")
```
### FAQ
**Q: Why does CloakBrowser need 49 patches?**
Each patch hides one specific way a browser can be fingerprinted (identified) — Canvas pixel noise injection, WebGL renderer string spoofing, AudioContext output randomisation, navigator quirks, Battery API, CDP input timing. Together they cover the checks in the modern Akamai sensor.js script. New patches are added as new detection methods are documented.
**Q: Can I install my own extensions in CloakBrowser?**
Yes — the build accepts standard Chrome extension packages. The default profiles come with uBlock Origin and 1Password installed, because those are among the most common real-user extensions and they pass Akamai's probe well.
**Q: How is CloakBrowser distributed?**
It comes as a pre-built Chromium binary plus a Python control library. You do not need to build it from source unless you want to add your own custom patches — the standard binary covers most stealth use cases.
**Q: Will my fingerprint be unique enough?**
Yes. CloakBrowser adds per-profile random noise to canvas and audio output, so two CloakBrowser instances produce different fingerprints (the values sites use to recognise a browser). Combine that with different proxy IPs and lightly randomized navigator properties, and you can run many sessions in parallel without them looking like the same client.
**Q: Is CloakBrowser safe to run? It's a closed-source binary.**
The wrapper library is open source, but the stealth comes from a pre-built, closed-source Chromium binary that runs with full local privileges — a genuine supply-chain risk (the chance that downloaded software does something hidden). An independent behavioural audit at github.com/pim97/cloakbrowser-analyze ran nine runtime tests (string analysis over 2.9M strings, network capture, file-system and process monitoring, planted decoy secrets) and found no malicious behaviour. That is reassuring but not proof: a closed binary can hide triggers that only fire after a delay or under certain conditions, which runtime observation misses. Read 'no malicious behaviour observed' as exactly that, not 'provably safe'.
**Q: How do I reduce the risk of running the binary?**
Run it in a disposable container or virtual machine that has no access to your real secrets — no mounted SSH or AWS credentials, no real environment variables, a network allowlist limiting where it can connect, and a throwaway proxy account. That way, if the binary ever does something unexpected, the damage is contained. If you cannot accept running an unauditable binary at all, use a fully open-source alternative like Camoufox or PatchRight, where you can read every line.
**Q: Is CloakBrowser open source like Camoufox?**
Only partly. The control/wrapper layer is open source, but the patched Chromium binary itself ships pre-built and closed — you cannot inspect or rebuild the C++ patches from source. Camoufox, by contrast, is fully open source: its Firefox patch set is public and you can compile it yourself. If full transparency is a hard requirement, that difference is the deciding factor.
---
## What Is PatchRight?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-patchright
**PatchRight is a browser-automation library that edits Playwright's own Python code before Chrome launches, instead of injecting JavaScript into the page after it loads.** Why does that matter? Anti-bot systems like Kasada can ask the browser to show the source of its built-in functions using Function.prototype.toString() — and if they see hand-written JavaScript there, they can tell the function was modified at runtime. PatchRight leaves nothing in the JavaScript to read, because its changes happen at the Python-to-Chrome bridge, below any code the page can inspect. This makes it a common choice for browser automation on JavaScript-heavy sites you are permitted to access, where runtime-injection tools like playwright-stealth leave a detectable signature.
### Quick facts
- **Patch level:** Python source (Playwright bindings), pre-Chrome
- **Why it works:** No JS runtime modifications — Function.toString() finds nothing
- **Main use case:** Sites that inspect function source (where runtime-patch signatures are visible)
- **Drop-in replacement:** Mostly; Playwright API-compatible, with minor adjustments
- **Install:** pip install patchright
### The architectural difference
**playwright-stealth** hides a bot by injecting JavaScript into the page (via Page.addInitScript) that rewrites built-in browser properties — navigator.webdriver, WebGLRenderingContext.prototype.getParameter, HTMLCanvasElement.toDataURL, and dozens more. The trick works, until an anti-bot script asks to see the source of one of those rewritten functions: Function.prototype.toString.call(navigator.__lookupGetter__("webdriver")). A genuine browser function answers [native code]; a patched one shows the injected JavaScript instead. At that point the patch itself becomes the giveaway.
**PatchRight** takes a different route. It edits Playwright's Python source so the underlying CDP commands (CDP is the Chrome DevTools Protocol, the channel Playwright uses to steer Chrome) are sent in a way that achieves the same result without ever touching JavaScript. The browser's built-in functions are left exactly as they were, so Function.toString() still returns [native code]; there was no JavaScript patch to expose.
### Where PatchRight fits
**JavaScript-heavy sites that inspect function source.** Because PatchRight avoids runtime JavaScript injection, it does not leave a Function.toString() signature the way runtime patches do. This is the main reason it exists.
**Cloudflare with active Turnstile/JS challenges.** On sites you are authorized to automate, a consistent browser configuration matters, and PatchRight's source-level approach is one way to keep the function-source surface consistent.
**Any case where you have confirmed playwright-stealth leaves a detectable runtime signature.** Switching is easy — PatchRight is a near drop-in replacement with very few API changes.
It is not the right tool for everything. It addresses only the function-source surface and does not change the canvas/WebGL/audio fingerprint layer; for that layer, use CloakBrowser (Chromium) or Camoufox (Firefox).
### What PatchRight does not address
PatchRight covers the toString-detectable surface in Playwright Python. It does not handle:
- **Canvas / WebGL / AudioContext fingerprinting:** these read your GPU and hardware. Use CloakBrowser or Camoufox.
- **TLS fingerprinting:** TLS is the encryption layer behind https, and it sits below the browser. Whatever TLS signature your Chromium ships with is what sites see.
- **Behavioural detection:** input timing and interaction patterns. This is handled by other layers such as Botasaurus.
- **IP reputation:** that is the proxy's job. Use residential or static ISP proxies.
Think of PatchRight as one layer in a stack, addressing only the function-source surface.
### Example
```python
# PatchRight is a near-drop-in replacement for Playwright Python
# Note the import path change: patchright.sync_api instead of playwright.sync_api
from patchright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context(
        # Credentials go in their own fields rather than inside the server URL
        proxy={
            "server": "http://residential:port",
            "username": "user",
            "password": "pass",
        },
        # ... otherwise identical to Playwright
    )
    page = context.new_page()
    page.goto("https://kasada-protected.com/")
    # Proof-of-work challenge is solved naturally by real browser execution
    html = page.content()
    browser.close()
```
### FAQ
**Q: Is PatchRight a fork of Playwright?**
Not really a fork — it is better described as a patched build of Playwright. It uses the same upstream codebase, with targeted edits to the Python source in the spots where Playwright would otherwise leak a toString-detectable signature. Updates from the official Playwright team still flow through.
**Q: How is this different from undetected-chromedriver?**
undetected-chromedriver does one focused thing: it removes the navigator.webdriver flag from Selenium-driven Chrome, which is a small patch surface. PatchRight applies the same idea across the much larger Playwright API, where Function.toString() checks are more common. Both have their place; PatchRight covers a wider surface.
**Q: Does PatchRight handle CAPTCHAs?**
No — it only deals with the function-source layer. CAPTCHAs are handled separately, for example by a solver service or by a real-browser setup that scores well on reCAPTCHA v3.
**Q: Should I use PatchRight or Camoufox?**
Use PatchRight when you specifically need Chromium and the obstacle is Kasada or playwright-stealth detection. Use Camoufox when you need Firefox (for Cloudflare, or multi-vector coherence with geoip=True) or when the obstacle is canvas/WebGL fingerprinting. They are not mutually exclusive — large scraping operations often run both.
---
## What Is Firecrawl?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-firecrawl
**Firecrawl is a web-scraping API built for AI: you hand it a URL and it hands back clean Markdown or JSON — no CSS selectors, no XPath, no HTML parsing on your end.** It also ships an MCP (Model Context Protocol — a standard way for AI tools to call external services) server, so assistants like Claude, Cursor, and Codex can scrape the web through plain-language requests with zero code. Its FIRE-1 agent navigates JavaScript-heavy sites on its own, and an /interact endpoint clicks buttons and fills forms. The project has 111K+ GitHub stars and is used in production by SAP, Zapier, and Deloitte.
### Quick facts
- **Output formats:** Markdown (default), HTML, JSON, screenshots
- **Native integrations:** LangChain, LlamaIndex, CrewAI, MCP servers
- **FIRE-1 agent:** Autonomously navigates multi-page workflows (login, search, paginate)
- **Free tier:** 500 scrapes/month
- **Self-hostable:** Yes — MIT-licensed open-source version
### Three core endpoints
**/scrape** — send a URL, get structured content back. By default it returns Markdown, which uses about 67% fewer tokens than raw HTML — a big saving that adds up across a RAG pipeline (RAG = feeding scraped text to an LLM so it can answer from real sources). You can also get JSON by supplying a schema describing the fields you want. It handles JavaScript-heavy sites because a real browser runs behind it.
**/crawl** — give it a starting URL and it returns every page on the site, with limits on how deep it goes and glob patterns (wildcard URL filters like /docs/*) to narrow the set. Handy for ingesting documentation or building a knowledge base.
**/search** — runs a web search and fetches the page content in a single call. Built for AI agents that need to ground answers in up-to-date information without stitching together several APIs.
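The glob patterns mentioned for /crawl can be illustrated locally. The sketch below uses Python's standard `fnmatch` module as a stand-in; it is not Firecrawl's own matcher, whose rules may differ, and the helper name and URLs are made up for the example. Note that `fnmatch` lets `*` match `/` as well, so nested paths under /docs/ are kept.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def filter_by_path(urls: list[str], pattern: str) -> list[str]:
    """Keep URLs whose path matches a glob pattern such as '/docs/*'."""
    return [u for u in urls if fnmatchcase(urlparse(u).path, pattern)]


discovered = [
    "https://example.com/docs/quickstart",
    "https://example.com/docs/api/scrape",
    "https://example.com/blog/launch",
    "https://example.com/pricing",
]

# Only the two /docs/ pages survive the filter.
print(filter_by_path(discovered, "/docs/*"))
```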
### The MCP server
This is Firecrawl's most-used feature in 2026. The MCP server is something you point Claude Code, Cursor, or any MCP client at. The LLM then gets ready-made tools — firecrawl.scrape(url), firecrawl.search(query), firecrawl.crawl(url) — that it can call in plain language. The key shift: your code does not call Firecrawl; the LLM does, on its own, when the user says "scrape this page" or "find me current pricing for X".
For agent-style workflows, this effectively gives any LLM the ability to use the web as a built-in skill. Combine it with Pydantic + Instructor (Python tools that force the model's output to match a defined schema) and you get a production-grade extraction pipeline in just a few lines.
### Trade-offs vs alternatives
**vs Crawl4AI (open source):** Firecrawl is a managed service; Crawl4AI is something you run and maintain yourself. Firecrawl deals with anti-bot defenses and proxies for you; with Crawl4AI you set all of that up. Pick Firecrawl for speed, Crawl4AI when you need full control over your own data.
**vs Scrappey or Bright Data:** Firecrawl has an opinion about output — it returns Markdown tuned for LLMs. Scrappey returns raw HTML and lets you parse it however you like. For RAG pipelines, Firecrawl saves you the HTML-to-Markdown step. For traditional scraping (pulling specific fields with selectors), Scrappey is more flexible.
**vs ScrapeGraphAI:** Firecrawl gives you the building blocks; ScrapeGraphAI builds an extraction pipeline for you from a plain-language prompt. They sit at different levels of abstraction.
### Example
```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel
import instructor, anthropic

class Product(BaseModel):
    name: str
    price_usd: float
    in_stock: bool

# Step 1: scrape the page into clean Markdown
# (call shape varies between firecrawl-py releases; check your installed version)
app = FirecrawlApp(api_key="fc-...")
md = app.scrape_url(
    "https://store.example.com/product/123",
    params={"formats": ["markdown"]},
)["markdown"]

# Step 2: extract typed data with Instructor + Claude
client = instructor.from_anthropic(anthropic.Anthropic())
product = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=Product,
    messages=[{"role": "user", "content": md}],
    max_retries=3,
)
print(product)  # typed object, validated, ready for database
```
### FAQ
**Q: How does Firecrawl work with anti-bot systems?**
Firecrawl runs its own fleet of managed browsers. A call to /scrape on a URL fronted by Cloudflare or Akamai (services that filter automated traffic) returns clean Markdown for sites you are permitted to access. You do not have to manage proxies or browser configuration yourself.
**Q: What is FIRE-1?**
FIRE-1 is the Firecrawl agent that works through multi-step tasks on its own: logging in, searching, paging through results, clicking "load more". When the data you want only shows up after some interaction, FIRE-1 does that interaction for you. You describe the goal; the agent figures out the clicks.
**Q: Can I self-host Firecrawl?**
Yes. The project is open-source under the MIT license, so you can run it on your own servers. The self-hosted version leaves out a few of the managed-cloud features but is fully capable for the core scraping work. Use the hosted version if you want the FIRE-1 agent and the managed anti-bot fleet; self-host when you need to keep data fully in-house.
**Q: What is the cost?**
The free tier covers 500 scrapes per month. Paid plans start at $19/month for higher volumes. If you are running a RAG pipeline or AI agents at large scale, compare the per-page cost against managed alternatives — Firecrawl is competitive, but not always the cheapest at very high volumes.
---
## What Is Schema-Validated LLM Extraction?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-pydantic-extraction
**Schema-validated LLM extraction is the standard production pattern for AI scraping: you describe the data you want as a Pydantic schema (a Python class that defines field names and types), hand it to the LLM through the Instructor library, and get back a checked, typed Python object instead of a loose string.** Large language models (LLMs) are great at reading messy HTML but unreliable at returning data in a consistent shape. The schema check catches made-up values, normalises currencies and units, rejects malformed output, and retries automatically — so you do not have to write that retry code yourself.
### Quick facts
- **Stack:** Pydantic (schemas) + Instructor (LLM-output validation) + any LLM
- **Providers supported:** Anthropic, OpenAI, Mistral, Cohere, Gemini, Ollama (local)
- **What it catches:** Type mismatches, hallucinated dates, malformed JSON, missing required fields
- **Failure mode handled:** LLM returns "competitive" for an int field → Instructor retries automatically
- **Cost overhead:** ~0–2× base LLM cost depending on retry rate
### Why raw LLM extraction fails in production
Ask GPT-4 or Claude for a salary across 10,000 job-board pages and the same field comes back in many shapes: $40,000, 40,000 dollars, 40k USD, "forty thousand", and sometimes null or a number it simply made up. A database cannot store that mess. Add a date field and you hit a worse problem: when a scraped article has no publication date, the LLM may invent one that fits the article's tone. That fabrication then flows into your pipeline as if it were a real fact.
The fix is not "better prompting" — it is structural. Keep the two jobs separate: semantic understanding (reading the page, which the LLM is good at) and structural guarantees (enforcing the exact shape, which schema validation is good at).
### How Instructor adds value over raw API
Anthropic and OpenAI both let you request structured output directly using a JSON schema (a description of the expected fields and types). Instructor wraps those built-in features and adds three things that matter in production:
- **Automatic retries on validation failure.** If the LLM returns a string where you asked for an int, Instructor re-asks the model — sending the validation error back as a hint — until it gets valid output or hits max_retries. You do not write retry logic.
- **Multi-provider abstraction.** The same Pydantic schema works with OpenAI, Anthropic, Mistral, Cohere, Gemini, and local Ollama. Switch providers without rewriting your extraction code.
- **Streaming + partial validation.** For large schemas, Instructor streams partial Pydantic objects as the LLM produces them — handy for low-latency UIs that show results as they arrive.
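The retry behaviour in the first bullet can be sketched with the standard library alone. This is not Instructor's implementation: the validator is hand-written instead of Pydantic, the "model" is a fake function, and `max_attempts` is a hypothetical parameter playing the role of the retry cap. It only shows the loop of validating, feeding the error back, and asking again.

```python
from typing import Callable

FakeModel = Callable[[str], dict]


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    if not isinstance(record.get("title"), str):
        errors.append("title must be a string")
    salary = record.get("salary_min_usd")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if salary is not None and (isinstance(salary, bool) or not isinstance(salary, int)):
        errors.append("salary_min_usd must be an integer or null")
    return errors


def extract_with_retries(model: FakeModel, prompt: str, max_attempts: int = 3) -> dict:
    """Ask the model, validate the answer, and re-ask with the errors appended."""
    attempt_prompt = prompt
    errors: list[str] = []
    for _ in range(max_attempts):
        record = model(attempt_prompt)
        errors = validate(record)
        if not errors:
            return record
        attempt_prompt = f"{prompt}\n\nPrevious answer was invalid: {'; '.join(errors)}"
    raise ValueError(f"no valid output after {max_attempts} attempts: {errors}")


def flaky_model(prompt: str) -> dict:
    """Stand-in for an LLM: wrong type first, correct once it sees the hint."""
    if "invalid" in prompt:
        return {"title": "Data Engineer", "salary_min_usd": 95000}
    return {"title": "Data Engineer", "salary_min_usd": "competitive"}


print(extract_with_retries(flaky_model, "Extract the job posting."))
```

The first call returns "competitive" for an integer field, the validator rejects it, and the second call succeeds because the error message was included in the prompt.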
### When classical NLP still wins
LLM extraction costs roughly $0.001–$0.01 per article on modern Claude or GPT-class models. Classical NLP — older, rule- and statistics-based text tools like spaCy NER (named-entity recognition) and dependency parsing (working out grammatical structure) — costs effectively zero once the model is loaded. Use classical NLP when you are scraping millions of consistent documents with a fixed schema, your latency budget is under 5 ms, or cost matters more than handling edge cases. Use LLM + Instructor when sources vary, when meaning depends on context ("Apple" the company vs. the fruit), when the schema may change, or when you need to resolve equivalent phrasings ("FTE" = "full-time" = "permanent" = "direct hire").
The pattern Bloomberg, Reuters Refinitiv, and FactSet actually use is a hybrid: cheap classical NLP as a fast pre-filter that tags 95% of documents, with the LLM reserved for the ambiguous 5%. On a million-document corpus at $0.001 per document, that hybrid is the difference between roughly $50 and $1,000 in extraction cost; at $0.01 per document it is roughly $500 versus $10,000.
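The cost comparison is simple arithmetic, sketched below under the assumptions stated in this section: LLM extraction at $0.001 to $0.01 per document, classical NLP treated as free, and 5% of documents routed to the LLM. The function names are illustrative.

```python
def llm_only_cost(documents: int, price_per_doc: float) -> float:
    """Cost of sending every document to the LLM."""
    return documents * price_per_doc


def hybrid_cost(documents: int, price_per_doc: float, llm_share: float = 0.05) -> float:
    """Cost when only llm_share of documents reach the LLM (classical NLP is free)."""
    return documents * llm_share * price_per_doc


DOCS = 1_000_000
for price in (0.001, 0.01):
    print(
        f"${price}/doc: LLM-only ${llm_only_cost(DOCS, price):,.0f}, "
        f"hybrid ${hybrid_cost(DOCS, price):,.0f}"
    )
```

Whatever the per-document price, routing 5% of traffic to the LLM cuts the LLM bill by a factor of 20.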
### Example
```python
# pip install instructor pydantic anthropic
from pydantic import BaseModel, Field
import instructor, anthropic

class JobPosting(BaseModel):
    title: str
    company: str
    salary_min_usd: int | None = Field(
        description="Floor of salary range in USD. Convert from other currencies if needed."
    )
    salary_max_usd: int | None
    years_experience_min: int
    location: str
    remote: bool

scraped_html = "<html>...</html>"  # placeholder: the page content you fetched earlier

client = instructor.from_anthropic(anthropic.Anthropic())
result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=JobPosting,
    messages=[{"role": "user", "content": f"Extract from:\n\n{scraped_html}"}],
    max_retries=3,
)
# result is a validated JobPosting object: Instructor caught any type
# mismatches and retried until output was valid.
print(result.salary_min_usd, type(result.salary_min_usd))  # e.g. 95000 <class 'int'>
```
### FAQ
**Q: Is Instructor a wrapper around the LLM API?**
Yes. It patches the official SDK (Anthropic, OpenAI, and others) so you can pass a response_model argument that points at a Pydantic class. The library then handles building the prompt, injecting the JSON schema, parsing the response, validating it, and retrying. Your code still looks almost exactly like a direct SDK call.
**Q: What if the LLM keeps failing validation?**
Instructor retries up to max_retries (default 3, configurable). After that it raises an exception (InstructorRetryException in recent versions, which carries the final validation failure), so you can fall back to manual handling, queue the item for human review, or skip the record. In practice, a retry rate above about 2% usually means your schema is too strict for the source data or your prompt is unclear.
**Q: Does this work with local LLMs?**
Yes. Instructor supports Ollama and other tools that run models on your own hardware. Smaller local models fail validation more often than Claude or GPT-4, so the extra cost of those retries can cancel out the savings from not paying for API calls. Benchmark it for your specific use case.
**Q: Why not just use the OpenAI structured outputs feature directly?**
You can — but Instructor layers the automatic retries, multi-provider support, and streaming on top. For one-off extractions the direct API is fine; for production pipelines the retry logic and the freedom to swap providers pay off quickly.
---
## What Is Botasaurus?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-botasaurus
**Botasaurus is a free, open-source (MIT-licensed) Python framework for building web scrapers. You wrap your scraping functions with one of three decorators (@browser, @request, @task), and it can generate curved, human-like mouse movements (Bezier curves) instead of the mechanical straight lines default automation produces.** It is maintained by Omkar Cloud, which lists compatibility with sites protected by Cloudflare WAF, BrowserScan, Fingerprint, DataDome, and Cloudflare Turnstile. The cursor paths come from its mouse-physics module (Humancursor, contributed by Flori Batusha and Ambri), which also slightly randomises cursor speed. As with any automation framework, it should only be pointed at sites and data you own, control, or are permitted to access.
### Quick facts
- **License:** MIT
- **Maintainer:** Omkar Cloud (github.com/omkarcloud)
- **GitHub stars:** 4.7k+ (May 2026)
- **Decorators:** @browser, @request, @task
- **Listed compatibility:** Cloudflare WAF, Cloudflare Turnstile, DataDome, BrowserScan, Fingerprint Bot Detection
### The three decorators
A decorator is just a tag you put above a Python function to change how it runs. Botasaurus gives you three, each matching a different way of fetching data:
- **@browser** — runs your function inside a real automated Chromium browser, with consistent browser configuration and human-like behaviour already applied. Use it when the page needs a full browser to load.
- **@request** — runs your function as a plain HTTP request (faster, no browser) with browser-consistent headers. Use it when the page works without JavaScript.
- **@task** — a general wrapper for work that is not scraping at all (parsing, API calls, machine-learning steps), so that work can still use Botasaurus's parallelism and caching.
You can mix all three in one project: @request grabs a listing page from an XHR endpoint, @browser opens detail pages that need JavaScript, and @task runs the LLM extraction step.
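For readers new to decorators, the toy example below shows the mechanism in plain Python. `cached_task` is a made-up decorator written for this illustration, not part of Botasaurus; it adds result caching to whatever function it is placed above, which is the same idea as a framework decorator adding caching or parallelism.

```python
import functools


def cached_task(func):
    """Toy decorator: remember results so repeated inputs are not recomputed."""
    cache = {}

    @functools.wraps(func)
    def wrapper(data):
        if data not in cache:
            cache[data] = func(data)
        return cache[data]

    return wrapper


calls = []  # records how often the wrapped function really ran


@cached_task
def parse_price(text: str) -> float:
    calls.append(text)
    return float(text.replace("$", "").replace(",", ""))


# The second call is served from the cache, so the body runs only once.
print(parse_price("$1,299.00"), parse_price("$1,299.00"), len(calls))
```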
### How the mouse physics work
Behavioural anti-bot systems such as DataDome observe more than 35 signals about interaction — the path a mouse takes, scroll speed, typing rhythm, click location — and score them in real time. Default automation that moves the cursor with page.mouse.move(x, y) produces a perfectly straight path, a pattern unlike anything a human hand creates. This is one of the simplest behavioural signals such systems use.
Botasaurus ships **Humancursor**, a separate library by Flori Batusha and Ambri, that draws Bezier-curve paths (smooth curves rather than straight lines) with randomised speed and Fitts's Law deceleration: the cursor slows as it nears the target, slightly overshoots, then corrects, just like a person. The click_human(selector) helper exposes this through a single call. The library exists because realistic input simulation is a recurring requirement in legitimate UI testing and automation, not only in scraping.
### What Botasaurus does not replace
Botasaurus is a framework, not a complete stack on its own. A production setup typically also involves:
- **Residential or mobile proxies** for a stable IP — datacenter IPs are often blocked at the network layer regardless of behaviour.
- **A patched browser binary (CloakBrowser / Camoufox)** on sites that inspect things like Function.toString() output or run Akamai's 60-extension probe. Botasaurus operates at the JavaScript layer on top of a stock Chromium.
- **curl_cffi or tls-client** for HTTP-only work where a browser-consistent TLS handshake matters. Botasaurus's @request uses browser-like defaults but does not reproduce a curl_cffi-grade JA4 TLS fingerprint.
In a production setup, Botasaurus is the orchestration and behaviour layer; you pair it with the right network and browser-binary layers underneath for content you are permitted to access.
### Example
```python
# Botasaurus with humanized mouse movement (use only on sites you are
# permitted to access). Illustrates the Humancursor behaviour layer.
from botasaurus.browser import browser, Driver

@browser(
    proxy="http://user:pass@residential:port",
    humanize=True,  # enables Humancursor Bezier-curve movements
)
def scrape(driver: Driver, data):
    driver.get("https://your-authorized-site.example.com/")
    driver.click_human(".product-link")  # Bezier path, not teleport
    driver.scroll_human(amount=600)      # variable-velocity scroll
    return driver.page_html

result = scrape("ignored")
```
### FAQ
**Q: Is Botasaurus a replacement for Playwright?**
No — it builds on top of Chromium automation (using primitives similar to Playwright) and adds the decorator API, the humanization layer, parallel execution, caching, and the ability to package scrapers as desktop apps or web UIs. For the lowest-level browser control you still talk to Chromium APIs directly; Botasaurus is the convenience layer above that.
**Q: Does Botasaurus work with Cloudflare Turnstile?**
The project lists compatibility with sites using Cloudflare Turnstile, plus Cloudflare WAF, DataDome, BrowserScan, and Fingerprint Bot Detection. How well it works in practice depends on the specific site, your proxy quality, and how recently the site updated its configuration. Always test on a target you are authorized to access before scaling up.
**Q: How is Botasaurus different from undetected-chromedriver?**
undetected-chromedriver does one narrow thing: it adjusts the navigator.webdriver flag that distinguishes Selenium-driven Chrome. Botasaurus is a full framework — decorators, humanization, caching, parallelism, packaging — built on the idea that browser automation involves many surfaces beyond that single flag. The two solve problems of very different size.
**Q: Can I run Botasaurus headless?**
Yes — the @browser decorator accepts a headless=True argument. Be aware that headless Chromium renders its canvas in software (SwiftShader, device ID 0x0000C0DE), which Akamai blocklists. For Akamai-protected targets, run Botasaurus with headless=False inside Xvfb (a virtual display), or switch the browser layer to CloakBrowser.
---
## What Is Crawl4AI?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-crawl4ai
**Crawl4AI is the most-starred open-source LLM-friendly web crawler on GitHub: 66.3k stars, released under the Apache 2.0 license, and maintained by UncleCode.** A web crawler is a program that visits pages and pulls out their content; "LLM-friendly" means the output is shaped for feeding into AI models. By default it returns clean Markdown instead of raw HTML, which uses fewer tokens (the chunks of text an LLM bills and reasons over). It can call any LLM through LiteLLM, a library that talks to many providers behind one interface, including Ollama for running a model on your own machine. It also ships adaptive crawling, which uses information-foraging algorithms to judge when it has gathered enough and stop. The current 0.8.x line adds anti-bot detection with proxy escalation and Shadow DOM flattening.
### Quick facts
- **License:** Apache 2.0 (not MIT — common misconception)
- **GitHub stars:** 66.3k (May 2026)
- **Maintainer:** UncleCode (github.com/unclecode)
- **Python:** 3.10+ — installs Playwright as a dependency
- **LLM support:** Any provider via LiteLLM — OpenAI, Anthropic, Gemini, Ollama (local)
### What it gives you out of the box
- **URL → Markdown.** Give it a URL and the default output is clean Markdown that keeps the page's structure. It first strips navigation, ads, and boilerplate, so you ingest the substance, not the page chrome around it.
- **Adaptive crawling.** A built-in rule decides when it has enough information to answer a query and stops, instead of mechanically visiting every link down to a fixed depth (a full breadth-first search, or BFS, to depth-N). Handy for RAG ingestion — loading content into an AI knowledge base — where "good enough" beats exhaustive.
- **Schema-based extraction.** You describe the data you want, either with a CSS schema or a Pydantic class (a Python way to define a data shape) plus a prompt and an LLM provider, and Crawl4AI makes the call for you. Works with OpenAI, Anthropic, Gemini, and any local model reachable via LiteLLM (Ollama, vLLM).
- **Browser primitives.** It is built on Playwright, the browser-automation engine, so you get JavaScript execution, session management, custom JS injection, lazy-load handling, and proxy support. Full-page scanning reproduces scrolling to trigger content that only loads as you go down the page.
- **Anti-bot detection (0.8.x).** Recent versions add proxy escalation — switching to a fresh proxy when a site flags the crawler as a bot — plus Shadow DOM flattening, which reads components built with isolated DOM trees that other crawlers cannot see into.
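The proxy-escalation idea in the last bullet can be sketched independently of Crawl4AI. The snippet below is a minimal illustration and not the library's implementation: the tier URLs, the `looks_blocked` heuristic, and the injected `fetch` callable are all hypothetical names chosen for this example.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical proxy tiers, cheapest first; None means a direct connection.
PROXY_TIERS: Sequence[Optional[str]] = (
    None,
    "http://datacenter.proxy.example:8000",
    "http://residential.proxy.example:8000",
)


def looks_blocked(status: int, body: str) -> bool:
    """Crude block heuristic: a 403 status or a CAPTCHA marker in the body."""
    return status == 403 or "captcha" in body.lower()


def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, Optional[str]], Tuple[int, str]],
    tiers: Sequence[Optional[str]] = PROXY_TIERS,
) -> Tuple[Optional[str], int, str]:
    """Try each proxy tier in order and return the first unblocked response.

    If every tier is blocked, the last (blocked) response is returned.
    """
    proxy, status, body = None, 0, ""
    for proxy in tiers:
        status, body = fetch(url, proxy)
        if not looks_blocked(status, body):
            break
    return proxy, status, body


def fake_fetch(url: str, proxy: Optional[str]) -> Tuple[int, str]:
    """Stand-in for a real HTTP client so the sketch runs offline."""
    if proxy and "residential" in proxy:
        return 200, "<html>ok</html>"
    return 403, "Access denied"


used_proxy, status, body = fetch_with_escalation("https://store.example.com/", fake_fetch)
print(used_proxy, status)
```

Injecting `fetch` keeps the escalation policy separate from the HTTP client, so the same loop can wrap whichever client or browser layer a crawler already uses.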
### Crawl4AI vs Firecrawl
The two are often compared head-to-head; they solve overlapping problems in different ways. **Crawl4AI** is self-hosted by default — you run the Python library or Docker image yourself, so your data never leaves your infrastructure, you bring your own LLM (including a local Ollama model), and there is no per-scrape charge. **Firecrawl** is managed-cloud-first — you call an API and Firecrawl runs the browser fleet, the anti-bot handling, and its FIRE-1 agent for hard sites, charging per scrape after a 500/month free tier.
Choose Crawl4AI when you need to keep data in-house, watch costs, or run local-LLM pipelines. Choose Firecrawl for the fastest time-to-results when the target has real anti-bot defenses and you do not want to maintain proxies yourself.
### Where Crawl4AI does not help
Crawl4AI is a crawler and extraction framework, not a full anti-bot solution. On heavily protected targets — Akamai sensor.js, F5 Shape, top-tier DataDome — its Playwright-based browser hits the same fingerprinting walls as any other CDP-driven automation (CDP is the Chrome DevTools Protocol that tools like Playwright use to control the browser, and which defenses can spot). The 0.8.x proxy escalation reduces this problem but does not remove it. For those targets you either swap in CloakBrowser or PatchRight as the browser layer (Crawl4AI is pluggable, so the browser can be replaced), or route those URLs to a managed API.
### Example
```python
# Crawl4AI with local Ollama: extraction never leaves your machine.
# pip install crawl4ai && crawl4ai-setup
# Argument names follow this example as published; they may differ between
# Crawl4AI releases, so check the docs for your installed version.
import asyncio

from crawl4ai import AsyncWebCrawler, LLMExtractionStrategy
from pydantic import BaseModel


class Product(BaseModel):
    name: str
    price_usd: float
    in_stock: bool


async def main():
    strategy = LLMExtractionStrategy(
        provider="ollama/llama3.3",  # local model: no token, no external API call
        schema=Product.model_json_schema(),
        instruction="Extract the product details from this page.",
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://store.example.com/product/123",
            extraction_strategy=strategy,
        )
        print(result.extracted_content)


asyncio.run(main())
```
### FAQ
**Q: Is Crawl4AI MIT licensed?**
No — it is Apache 2.0. The MIT claim is a common misconception that shows up in many secondary write-ups. For most use cases Apache 2.0 behaves like MIT (both are permissive, meaning you can use, modify, and redistribute freely), but Apache 2.0 adds explicit patent-grant language. Attribution is required when you redistribute.
**Q: Does Crawl4AI work with local LLMs?**
Yes. Under the hood it uses LiteLLM, which routes to any LLM you name with a provider string. ollama/llama3.3, ollama/mistral, vLLM-hosted models, and self-hosted OpenAI-compatible endpoints all work. Running the model locally like this is the standard approach when cost or privacy matters.
**Q: How does adaptive crawling decide when to stop?**
It scores each page it visits for how much new information it adds beyond what it already has. When that gain from new pages falls below a set threshold, the crawl stops itself. This is useful for RAG ingestion — building an AI knowledge base — where you want broad coverage but do not want to keep recursing through near-duplicate pages forever.
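A stopping rule of this kind can be sketched in a few lines. This is a deliberately simplified stand-in, not Crawl4AI's scoring: it assumes "new information" can be approximated by the share of a page's words that have not been seen before, and `min_gain` is a hypothetical threshold.

```python
from typing import Iterable, List, Set


def crawl_until_saturated(pages: Iterable[str], min_gain: float = 0.2) -> List[str]:
    """Keep pages until one adds less than `min_gain` new vocabulary."""
    seen: Set[str] = set()
    kept: List[str] = []
    for text in pages:
        terms = set(text.lower().split())
        if not terms:
            continue
        # Share of this page's terms that were not seen on earlier pages.
        gain = len(terms - seen) / len(terms)
        if kept and gain < min_gain:
            break  # information gain fell below the threshold: stop crawling
        kept.append(text)
        seen |= terms
    return kept


pages = [
    "crawl4ai returns clean markdown",
    "adaptive crawling stops early",
    "crawl4ai returns clean markdown early",  # near-duplicate of earlier pages
    "this page is never visited",
]
kept = crawl_until_saturated(pages)
print(len(kept))
```

The crawl halts at the near-duplicate third page, so the fourth is never fetched; a real implementation would score relevance to the query as well as novelty.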
**Q: Can Crawl4AI handle heavily protected sites like Akamai?**
It runs on Playwright, so it inherits Playwright's fingerprint problems against Akamai sensor.js and similar deep-fingerprinting targets — defenses that profile the browser in detail to tell bots from people. The 0.8.x line added proxy escalation, which helps with medium-difficulty targets. For the hardest sites, either swap the browser layer to CloakBrowser/Camoufox or route via a managed API.
---
## What Is Burp Suite MCP for Scraping Recon?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-burp-mcp-recon
**The Burp Suite MCP Server is an official PortSwigger extension (released 3 April 2025) that exposes Burp's HTTP history, Repeater, Intruder, Collaborator, and proxy controls as Model Context Protocol tools.** Burp Suite is a tool that records and replays the web traffic flowing through a site; MCP (Model Context Protocol) is a standard way for an AI assistant to call external tools. Connect Claude Code, Cursor, or any MCP client to this extension and you can analyse a captured Burp session with a single prompt. Work that used to take hours of clicking through requests by hand - tracing how a cookie changes over time, finding the endpoints that send anti-bot data, and choosing which approach fits - becomes a 2-minute interaction. It is the recon tool that makes Step 0 of the scraping decision flow practical at scale.
### Quick facts
- **Released:** 3 April 2025 by PortSwigger
- **Language:** Kotlin — runs as a Burp extension
- **Architecture:** SSE server inside Burp + stdio proxy bridge for MCP clients
- **MCP clients supported:** Claude Desktop, Claude Code, Cursor, any MCP-compatible client
- **Free?:** Community Edition works; Collaborator requires Burp Professional
### What the MCP tools expose
The extension turns Burp's main features into tools the AI can call. From a Claude Code prompt you can:
- Send HTTP/1.1 and HTTP/2 requests directly through Burp's own HTTP stack. Burp negotiates the TLS connection itself (TLS is the encryption layer behind https), so by default the handshake is Burp's rather than a browser's.
- Search and filter proxy history (HTTP + WebSocket) with regex - pattern matching to find specific requests.
- Generate and poll Burp Collaborator payloads for out-of-band testing, meaning checks that happen over a separate channel (Professional only).
- Create Repeater tabs and send requests to Intruder for fuzzing - automatically resending a request with many varied inputs.
- Export and modify project + user configuration via JSON.
- Control proxy intercept and the task execution engine.
- Use built-in encoders (URL, Base64) and random string generation.
An automatic Claude Desktop installer is packaged with the extension, so the typical setup is "install Burp extension → restart Claude Desktop → MCP tools appear" with no manual configuration.
### Why this matters for scraping recon
Before this extension, figuring out which cookie unlocks which route, when the anti-bot's sensor data is sent, and what gets re-checked on a POST took a 1–4 hour manual walk through HTTP history. Most of that work is spotting patterns in a human-readable timeline - exactly what LLMs are good at. With the MCP server you can prompt: *"I have a Burp session captured against retailer.com. Trace the cookie lifecycle for _abck. When does it flip from ~-1~ to ~0~? Which endpoint fires the sensor POST? Which subsequent endpoints check the cookie state?"* Here _abck is the session cookie an anti-bot sets; the value flipping from -1 to 0 signals you have passed the check. The LLM reads through the history and answers in minutes.
The practical effect: Step 0 of the scraping decision flow — "identify the anti-bot and the approach that will work" — collapses from a half-day to a single conversation.
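The cookie-lifecycle question in that prompt is mechanical enough to sketch by hand, which also makes it easy to spot-check the LLM's answer. The example below assumes a simplified, hypothetical export of proxy history; the `/sensor` path and the cookie values are invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified proxy history: one dict per request, with the
# _abck value the server set in its response (None if it set nothing).
history: List[Dict[str, Optional[str]]] = [
    {"method": "GET", "url": "/", "abck": "A1B2~-1~YAAQ"},
    {"method": "GET", "url": "/static/app.js", "abck": None},
    {"method": "POST", "url": "/sensor", "abck": "A1B2~0~YAAQ"},
    {"method": "GET", "url": "/api/products", "abck": None},
]


def find_flip(
    entries: List[Dict[str, Optional[str]]],
    before: str = "~-1~",
    after: str = "~0~",
) -> Optional[int]:
    """Return the index of the first response that moves the cookie from `before` to `after`."""
    current: Optional[str] = None
    for i, entry in enumerate(entries):
        value = entry["abck"]
        if value is None:
            continue  # this response did not touch the cookie
        if current is not None and before in current and after in value:
            return i
        current = value
    return None


flip_index = find_flip(history)
if flip_index is not None:
    print(history[flip_index]["method"], history[flip_index]["url"])
```

In this sample the flip happens on the POST, which marks that request as the sensor endpoint; every request after that index is a candidate for "routes that check the cookie state".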
### Build a reusable recon skill
The biggest payoff is to write a single burp-antibot-recon.md skill file holding the prompts you keep rerunning against new targets. Typical contents:
- Identify the anti-bot vendor from cookies and response headers.
- Map the cookie lifecycle for the vendor's primary session token.
- Find the sensor / challenge POST endpoint.
- Identify routes that enforce vs. ignore the cookie state.
- Recommend a step from the scraping decision flow.
Run the same skill against every new target. The recon output feeds directly into your scraper architecture decisions - which TLS library, which proxy type, whether to invest in a patched browser, or whether to skip straight to a managed API.
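The first item on that list, identifying the vendor from cookies, is also simple to sanity-check in code. The sketch below uses cookie-name prefixes that are commonly associated with each vendor; the table is illustrative and incomplete, and the function name is a hypothetical helper.

```python
from typing import Dict, Iterable, Set, Tuple

# Cookie-name prefixes commonly associated with each anti-bot vendor.
# Illustrative only: vendors add and rename cookies over time.
VENDOR_COOKIE_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "Akamai Bot Manager": ("_abck", "bm_sz", "ak_bmsc"),
    "Cloudflare": ("__cf_bm", "cf_clearance"),
    "DataDome": ("datadome",),
    "Imperva": ("incap_ses_", "visid_incap_"),
    "HUMAN (PerimeterX)": ("_px",),
}


def identify_vendors(cookie_names: Iterable[str]) -> Set[str]:
    """Return every vendor whose known cookie prefixes appear in `cookie_names`."""
    found: Set[str] = set()
    for name in cookie_names:
        lowered = name.lower()
        for vendor, prefixes in VENDOR_COOKIE_PREFIXES.items():
            if lowered.startswith(prefixes):  # str.startswith accepts a tuple
                found.add(vendor)
    return found


vendors = identify_vendors(["_abck", "bm_sz", "session_id", "datadome"])
print(sorted(vendors))
```

A target can sit behind more than one vendor, which is why the function returns a set rather than the first match.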
### Example
```bash
# 1. Install the extension into Burp (BApp Store → search "MCP Server")
# 2. Point Claude Code at the MCP server (the bundled installer covers Claude Desktop;
#    other clients connect through the stdio proxy bridge)
# 3. Example prompts you can run from Claude Code with the MCP attached:
# "Show me every Set-Cookie header from the last 30 requests to retailer.com,
# grouped by domain and TTL."
# "Trace the _abck cookie lifecycle. Identify the request where it transitions
# from ~-1~ to ~0~ and show that request's body."
# "Identify all POST endpoints that include the parameter sensor_data and
# return their response status codes."
# "Given the anti-bot signals you've found, recommend a step from the
# scraping decision flow (Mobile API / XHR / JSON-in-HTML / curl_cffi /
# patched browser / managed API)."
```
### FAQ
**Q: Do I need Burp Professional?**
No. The core MCP tools work with Burp Community Edition, which is free. Burp Collaborator features (out-of-band testing) require Professional, but for scraping recon you rarely need Collaborator - the proxy history, Repeater, and search tools do the heavy lifting.
**Q: How is this different from just opening DevTools?**
DevTools (the browser's built-in inspector) shows one tab's requests in real time. Burp captures every request across every tab, keeps them for searchable analysis, lets you replay them through Repeater with changed parameters, and exposes the whole history to MCP. For recon on a session you have already captured, Burp + MCP is an order of magnitude faster than re-running the session in DevTools and clicking through by hand.
**Q: Can I use this with Cursor or other MCP clients?**
Yes. The extension exposes a standard MCP server, so any MCP-compatible client can connect. The included installer is just for Claude Desktop, but Cursor, Codex, and others work too. Under the hood, Burp runs the server over SSE (server-sent events, a one-way streaming channel) and a small stdio proxy bridges it to the MCP clients.
**Q: Is this for offensive security or for scraping?**
PortSwigger built it as a security-testing tool, but the recon workflow is identical to what serious scraping engineers do before writing a single line of code - identify the anti-bot, trace cookies, classify endpoints. The same prompts that find security vulnerabilities also map how a site's traffic and session flow work.
---
## What Is the Web Scraping Decision Flow?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-the-scraping-decision-flow
**The web scraping decision flow is a six-step checklist, ordered cheapest-first and preceded by a recon step 0, that experienced engineers run through on every new target they are permitted to access.** You try the steps in order and stop at the first one that works. Each step further down the list costs more engineering effort, more infrastructure, and more money per request. In practice, most production scraping is handled by steps 1-3 (the mobile API, an XHR endpoint, or JSON already sitting inside the HTML). The general principle: start with the simplest option, because the mobile app usually reaches the same data through a more direct, often less heavily guarded endpoint.
### Quick facts
- **Step 0:** Identify the anti-bot vendor (Wappalyzer, wafw00f, Burp + MCP)
- **Step 1:** Find the mobile API (HTTPToolkit on rooted AVD)
- **Step 2:** Find the XHR / GraphQL endpoint (DevTools, Burp)
- **Step 3:** Look for JSON embedded in HTML (__NEXT_DATA__, chompjs)
- **Step 4:** HTTP scraping with curl_cffi + residential proxy
- **Step 5:** Browser with C++ patches (Camoufox / CloakBrowser / PatchRight)
- **Step 6:** Managed scraping API (Scrappey / Bright Data / Zyte)
### Why the order matters
The steps get harder as you go down, so the earlier you stop the easier your life is. Step 1 (mobile API) hands you clean JSON over a relaxed HTTP endpoint, and the only price is an afternoon learning HTTPToolkit (a tool that lets you watch an app's network traffic). Step 2 (XHR — the background requests a page makes to fetch data) also gives you JSON, but from an endpoint that might be guarded. Step 3 (JSON-in-HTML) is the same data as a plain string you parse, with no browser at all. Steps 4-6 each pile on more infrastructure and budget.
The cost ladder is real. Step 4 needs residential proxies (~$2–10/GB). Step 5 needs a patched-browser binary, roughly 200 MB of RAM per instance, and proxies. Step 6 is per-request pricing on managed APIs ($0.20–$3 per 1,000 requests). Starting at step 5 when step 1 would have worked is a recurring waste of engineering time, yet it is common when teams don't consciously walk the flow.
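To compare step 4 with step 6 on the same footing, convert the per-GB proxy price into a cost per 1,000 requests. The sketch below is back-of-envelope arithmetic only: the 300 KB average response size and the $5/GB price are assumed inputs, not measurements.

```python
def proxy_cost_per_1000(avg_response_kb: float, usd_per_gb: float) -> float:
    """Bandwidth cost in USD for 1,000 requests (decimal units: 1 GB = 1,000,000 KB)."""
    gb_per_request = avg_response_kb / 1_000_000
    return round(1000 * gb_per_request * usd_per_gb, 4)


MANAGED_API_PER_1000 = (0.20, 3.00)  # USD range quoted for step 6

# Assumed inputs: 300 KB per response, $5/GB residential bandwidth.
step4 = proxy_cost_per_1000(avg_response_kb=300, usd_per_gb=5.0)
cheaper_than_every_managed_plan = step4 < MANAGED_API_PER_1000[0]
print(step4, cheaper_than_every_managed_plan)
```

Under these assumptions the bandwidth bill alone lands inside the managed-API price range, so the saving from step 4 depends heavily on response size. Plug in your own measured page weight before deciding.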
### Step-by-step walkthrough
**Step 0 — Recon.** Before anything, identify the stack. Install Wappalyzer (a Chrome extension) and visit the target; it names the anti-bot vendor in one click. Or run wafw00f https://target.com from the command line. With Burp Suite MCP attached to Claude Code, one prompt traces the cookie lifecycle and recommends which step to use.
**Step 1 — Mobile API.** Run the app inside a rooted Android Studio emulator (AVD) and capture its traffic with HTTPToolkit. The mobile app often talks to a separate backend with a different configuration. For example, a retailer's mobile app may use a direct GraphQL endpoint that is served by a different backend than the web frontend's Akamai + DataDome stack.
**Step 2 — XHR.** Open Chrome DevTools → Network → Fetch/XHR. Many single-page apps load everything from one undocumented JSON endpoint you can request directly.
**Step 3 — JSON in HTML.** Many sites ship their data right inside the page source. Next.js sites embed full state in __NEXT_DATA__; React SPAs often expose window.__INITIAL_STATE__. For example, some product pages ship 100KB+ of product data in __NEXT_DATA__, which can be read directly because no JS executes.
**Step 4 — HTTP + curl_cffi.** Send plain HTTP requests with a TLS handshake (the encryption setup behind https) that matches a real browser via impersonate="chrome131", plus a residential proxy. This works for many targets where server-side scoring is light.
**Step 5 — Patched browser.** A real browser configured for a consistent fingerprint: Camoufox, CloakBrowser, and PatchRight. Each addresses a specific layer (canvas/WebGL, extension probes, or function-source inspection) that JS-level runtime patching cannot reach.
**Step 6 — Managed API.** Hand the problem to a paid service. This is common for sites with a custom JS VM such as F5 Shape, where a DIY approach is impractical. Once you are spending more than ~2 engineer-days/month on maintenance, the managed API is cheaper than the engineer.
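The walkthrough above amounts to "try each step in order and stop at the first one that returns data". A minimal sketch of that control flow follows; the step handlers are hypothetical placeholders that a real scraper would replace with actual mobile-API, XHR, and HTML-parsing code.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[str], Optional[dict]]]


def run_decision_flow(url: str, steps: List[Step]) -> Tuple[Optional[str], Optional[dict]]:
    """Try each step in order; return the name and data of the first that yields data."""
    for name, attempt in steps:
        try:
            data = attempt(url)
        except Exception:
            data = None  # a failed step simply means "escalate to the next one"
        if data:
            return name, data
    return None, None


# Hypothetical handlers that simulate one target's behaviour.
def mobile_api(url: str) -> Optional[dict]:
    return None  # no mobile app for this target


def xhr_endpoint(url: str) -> Optional[dict]:
    raise ConnectionError("403 from guarded endpoint")


def json_in_html(url: str) -> Optional[dict]:
    return {"name": "Widget", "price": 9.99}


def managed_api(url: str) -> Optional[dict]:
    return {"name": "Widget", "price": 9.99}


steps: List[Step] = [
    ("1 mobile API", mobile_api),
    ("2 XHR", xhr_endpoint),
    ("3 JSON-in-HTML", json_in_html),
    ("6 managed API", managed_api),
]
winner, data = run_decision_flow("https://target.example.com/product/123", steps)
print(winner)
```

Because the list is ordered cheapest-first, the expensive managed-API handler is never called once an earlier step succeeds.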
### Cost progression — when to escalate
| Step | Cost | Maintenance burden |
| --- | --- | --- |
| 1: Mobile API | Free | Low (token refresh) |
| 2: XHR / GraphQL | Free | Low–medium |
| 3: JSON-in-HTML | Free | Low |
| 4: HTTP + curl_cffi | Proxy only (~$2–10/GB residential) | Medium (TLS profile rotation) |
| 5: Patched browser | Proxy + 200 MB RAM/instance | Medium–high (per-target tuning) |
| 6: Managed API | $0.20–$3 per 1,000 requests | Zero |
### Example
```python
# Skeleton: try steps 3 and 4 before launching a browser.
import re

import chompjs
from curl_cffi import requests

URL = "https://target.com/product/123"

s = requests.Session(impersonate="chrome131")
r = s.get(URL, proxies={"https": "http://user:pass@residential:port"})

# Step 3: JSON-in-HTML. Often the entire dataset is here.
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', r.text, re.S)
if m:
    data = chompjs.parse_js_object(m.group(1))
    print("Step 3 succeeded.")
else:
    # Step 4 fallback: parse the HTML with selectolax/BeautifulSoup
    print("No embedded state; fall through to step 4 HTML scraping.")

# Only reach for steps 5–6 if 1–4 have all failed.
```
### FAQ
**Q: When should I start at step 6 (managed API)?**
Start there in three cases: the target uses a custom JS VM such as F5 Shape (which makes a DIY approach impractical); your team is small and scraping isn't your core product; or maintenance would cost more than ~2 engineer-days per month. For everything else, walking up from step 1 is cheaper in the long run, even if your first scraper takes a day longer to build.
**Q: Is step 1 (mobile API) always available?**
Most brands with a mobile app have a mobile API, but not all of them are softer than the web frontend. Some apps pin SSL certificates, which means you need Frida or objection to intercept their traffic. Others have heavy jailbreak detection and may crash on emulators. For the ~30% of targets where step 1 doesn't work, walk to step 2 or 3.
**Q: How do I know which step a target needs?**
Step 0 recon tells you. Wappalyzer names the anti-bot vendor, and inspecting the cookies confirms it. Once you know the vendor, you know which steps are worth trying: a custom JS VM such as F5 Shape generally points to step 6, function-source inspection points to step 5 (PatchRight), DataDome is often handled at step 3 or 4, light Cloudflare at step 4, and a site with no anti-bot vendor at all works at step 4 with plain requests.
**Q: What about scraping legality across these steps?**
Legality is a separate question from which technical step you use. Scraping public data through the mobile API carries the same legal posture as scraping it through the web. Anything behind a login is a different matter entirely — that's where you should be looking at Computer Use Agents with user consent, not scraping.
---
## What Is a Computer Use Agent?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-computer-use-agent
**A Computer Use Agent (CUA) is an AI agent that acts like a person at a keyboard: it logs into a portal as the user, clicks through the screens, deals with MFA (multi-factor login codes) and CAPTCHAs, and hands back clean, structured data.** It differs from web scraping in three ways: the user gives permission first (so there's no terms-of-service conflict), it only touches data the user already owns, and it works on sites that have no public API to call. Anthropic's Computer Use, OpenAI's Operator, Skyvern (85.8% WebVoyager), and Browser Use (89% WebVoyager, the leading open-source option at 78k+ GitHub stars) are the current production-grade implementations.
### Quick facts
- **WebVoyager — Browser Use:** 89% (open-source, leading)
- **WebVoyager — OpenAI Operator/CUA:** 87% (controlled VM environments)
- **WebVoyager — Skyvern:** 85.8%
- **WebVoyager — Anthropic Computer Use:** 56% (real desktop environments)
- **Browser Use stars:** 78k+ (May 2026)
### CUA vs web scraping — different categories
**Classic web scraping** sends anonymous HTTP or browser requests to public pages. You only get what a logged-out visitor would see. You pay per request, you can run many at once, and responses come back fast.
**A Computer Use Agent** works the opposite way: the user grants it permission, and it logs in as that user. You pay per task, you can run only a few at a time (each task needs its own VM — a throwaway virtual machine), and each task is slower (30 seconds to several minutes). The legal picture is clean because the data belongs to the user.
A useful mental model: CUAs are **"Plaid for any website"** — they bring the open-banking pattern (the user gives permission, then data is pulled out in a structured form) to portals that offer no public API. Think utility bills, bank statements, payroll exports, insurance claims, tax filings, or e-commerce backend orders.
### When each one wins
**Use a CUA when:** the data sits behind a login the user owns; the portal has no API; the job needs MFA, step-up authentication (an extra identity check mid-session), or human-style clicking around; or you need occasional, small-scale retrievals (say 5 documents each for 200 users).
**Use traditional scraping when:** the data is public (e-commerce listings, search results, social media, news, real estate); you need fast responses (under a second); you need to run many requests in parallel (100+ at once); or cost per request matters (scraping is 10–100× cheaper for the same data when both approaches work).
For 100k items, scraping might cost €20–€100 on Scrappey. Running 100k CUA tasks could cost $5,000–$100,000 depending on the platform. That cost gap is exactly why these are two different tools, not rivals.
### The market in May 2026
**Anthropic Computer Use** — a direct API that drives the real host machine with raw mouse and keyboard actions. Best for building custom agent pipelines. It scores 56% on WebVoyager (a benchmark of real web tasks) because it operates full desktops with all their mess, not stripped-down browser-only VMs.
**OpenAI Operator (CUA)** — a hosted product with browser control built in; it scores 87% on WebVoyager in controlled environments.
**Skyvern** — open-source (YC-backed) and driven by a Vision-LLM (a model that reads the screen as an image). It scores 85.8% on WebVoyager and is strong at invoice retrieval, job applications, government forms, and insurance quotes. Available both cloud-hosted and self-hostable.
**Browser Use** — the leading *open-source* browser-only agent at 89% WebVoyager, with 78k+ GitHub stars. Plug in any LLM and run it locally or self-hosted. It supports OpenAI, Anthropic, Gemini, and Ollama for local models.
**Deck** — managed VMs with a credential vault and SOC 2 compliance, positioned as "Plaid for any website" with 100k+ utility provider integrations.
### Example
```python
# Browser Use (open source, 89% WebVoyager): the standard open-source CUA
# pip install browser-use
import asyncio

from browser_use import Agent, ChatOpenAI

agent = Agent(
    task="Log into example-utility.com using the credentials in the env, "
         "navigate to billing history, download the last 12 months of "
         "statements as PDFs, and return the file paths.",
    llm=ChatOpenAI(model="gpt-4o"),  # or ChatAnthropic, Gemini, local Ollama
)

# Agent.run() is a coroutine, so it has to be awaited or driven by asyncio.run().
result = asyncio.run(agent.run())
# The result carries the task output: file paths, total billed, dates.
```
### FAQ
**Q: Is a CUA the same as a headless browser?**
No — they sit at different layers. A headless browser is just Chrome or Firefox running with no visible window; it's the engine that loads pages. A CUA is an AI agent layered on top of that browser (or sometimes a real desktop): it reads the page (visually or through the DOM, the page's element structure), decides the next move, and acts. The headless browser is the body; the CUA is the brain.
**Q: Why is Anthropic's Computer Use score lower than OpenAI's?**
Because they're tested on different difficulty levels. WebVoyager measures browser-only tasks in controlled environments. OpenAI Operator runs in optimised browser-only VMs and scores 87%. Anthropic's Computer Use is more general — it can drive any desktop application, not just a browser — and was benchmarked in the harder real-desktop setting, scoring 56%. They solve overlapping but not identical problems, so the numbers aren't apples-to-apples.
**Q: Should I use Browser Use or Skyvern?**
Pick Browser Use if you want the highest open-source WebVoyager score and the most active community (89%, 78k+ stars). Pick Skyvern if you specifically want a Vision-LLM driven agent that works from screenshots instead of the DOM (85.8%) — handy when the DOM is dynamically obfuscated (deliberately scrambled to block scrapers). For invoice retrieval and form-filling in particular, Skyvern has more documented production deployments.
**Q: When is a CUA cheaper than a managed scraping API?**
Almost never for public data. CUAs bill per task ($0.05–$1 each); managed scraping APIs bill per request ($0.0002–$0.003 each). That CUA premium pays for the user-consent flow, MFA handling, and access to login-required data — none of which you need for public data. Use CUAs for portals behind a user login, and scraping APIs for everything else.
---
## What Is a Self-Healing Scraper?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-self-healing-scraper
**A self-healing scraper is a scraper that notices, while it is running, that the rules it uses to find data on a page have stopped working — and then fixes those rules on its own.** When it breaks, it sends the page's HTML to an LLM (a large language model like Claude Haiku or GPT-4o-mini, both cheap to run) and asks for corrected selectors — the small instructions, like "the price is inside this tag", that tell the scraper where each value lives. The new selectors are written automatically, with *no code deployment*. This matters because *changes to selectors are the single largest failure mode* for long-running production spiders: sites get redesigned and the old rules silently miss everything. Pair this with a Pydantic + Instructor extraction layer downstream (a step that checks the extracted data matches the shape you expect) and you have a pipeline that survives most site redesigns on its own.
### Quick facts
- **What it fixes:** CSS / XPath selector changes (~80% of spider breakages)
- **What it does not fix:** Anti-bot escalation (site upgraded to Cloudflare), schema-level changes
- **Cost per heal:** ~$0.0003 with Claude Haiku, ~$0.01 with Sonnet
- **Detection trigger:** Item count drops to zero (or below threshold) after a run
- **Required guardrail:** Pydantic schema validation on healed output before persisting
### The architecture
Five parts work together:
- **Scrapy extension hook.** It listens for the item_scraped and spider_closed signals (events Scrapy fires when it grabs an item and when a run ends). It counts the items found and keeps one full page of the broken HTML in memory, in case healing is needed.
- **Failure detector.** When a run ends with zero items (or fewer than a threshold you set), it triggers the heal flow.
- **LLM call.** Send the old selectors plus a trimmed copy of the page HTML (the first 8K characters is usually enough) to the LLM. The prompt asks it to return corrected selectors as JSON.
- **Selector updater.** Read the LLM's answer and write the new selectors straight into the spider's YAML or JSON config. The config lives in Git, so every heal is auditable and easy to revert.
- **Validation.** Re-run the spider. If items come back, it healed — notify Slack with the diff. If the heal returns items that look plausible but are wrong (Pydantic flags the wrong data type), escalate to a human instead of blindly trusting the fix.
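The validation step is the part that decides whether a heal is trusted, so it is worth sketching on its own. To keep the example dependency-free it uses a plain dataclass with manual type checks in place of Pydantic; in the pipeline described here a Pydantic model would play the same role. The `Product` fields and the `accept_heal` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Product:
    title: str
    price: float


def validate_items(raw_items: List[Dict[str, Any]]) -> List[Product]:
    """Return typed items, or raise ValueError if any healed item fails the schema."""
    validated: List[Product] = []
    for raw in raw_items:
        title, price = raw.get("title"), raw.get("price")
        if not isinstance(title, str) or not title.strip():
            raise ValueError(f"bad title: {title!r}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise ValueError(f"bad price: {price!r}")
        validated.append(Product(title=title.strip(), price=float(price)))
    return validated


def accept_heal(raw_items: List[Dict[str, Any]]) -> bool:
    """Accept a heal only if it yields items and every item validates."""
    try:
        return len(validate_items(raw_items)) > 0
    except ValueError:
        return False


good = accept_heal([{"title": "Blue mug", "price": 12.5}])
bad = accept_heal([{"title": "Blue mug", "price": "Add to cart"}])
print(good, bad)
```

The second call shows the failure the text warns about: the healed selector found an element, but it was the wrong one, and only the type check catches that.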
### Why it works for selector changes
Selector changes are the textbook case where LLMs beat plain regex (pattern matching on raw text). Real-world HTML is messy: inconsistent, minified, sometimes deliberately scrambled. An LLM reads it the way a person would: "the title is the text inside the first <h1> whose class contains 'product'". The model hands back h1[class*='product']::text and you keep scraping.
The cost math is favourable. A Claude Haiku heal costs roughly $0.0003 per call. Even if a fleet of 50 spiders each break once per quarter, you spend less than a dollar a year on healing — and nobody gets paged at 3am. Set that against one engineer-hour of manual selector debugging and the ROI is obvious.
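The fleet arithmetic can be written out explicitly. The figures are the ones quoted above (50 spiders, one break per spider per quarter, roughly $0.0003 per Haiku heal); the variable names are just for illustration.

```python
SPIDERS = 50
BREAKS_PER_SPIDER_PER_YEAR = 4  # once per quarter
COST_PER_HEAL_USD = 0.0003  # Claude Haiku figure quoted above

heals_per_year = SPIDERS * BREAKS_PER_SPIDER_PER_YEAR
annual_cost_usd = round(heals_per_year * COST_PER_HEAL_USD, 4)
print(heals_per_year, annual_cost_usd)
```

That is 200 heals and about six cents a year, so even a heal that costs ten times the quoted figure would stay under a dollar.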
### What this pattern does not do
- **Anti-bot upgrades.** If the spider broke because the site added Cloudflare protection where there was none, no selector change helps — the spider needs a new TLS or browser layer (a way to look like a real browser, TLS being the encryption behind https). The heal flow should spot this case (an HTTP 403 instead of HTML, or a Cloudflare challenge page in the response) and send it to a different alert rather than asking the LLM to write selectors.
- **Schema-level changes.** If the site renamed price to current_price in its JSON-LD (structured product data embedded in the page), the selector may still find an element, but the field itself has changed. Selector healing plus Pydantic-validated extraction catch this together: the selector finds the element, the LLM call extracts what looks like a price, the schema checks the type, and a normalisation step renames the field. Three layers.
- **LLM hallucinations.** Without schema validation on healed output, you can ingest made-up data. Always validate. If the heal returns strings where you expected integers, fail the heal and escalate.
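The first guardrail above reduces to a small routing decision that runs before any LLM call. A minimal sketch, assuming the status code, body, and item count of the failed run are available; the marker strings and return labels are illustrative, not an exhaustive detection list.

```python
# Substrings that often appear on Cloudflare challenge or block pages.
# Illustrative only; real detection should be tuned per target.
CHALLENGE_MARKERS = ("just a moment", "attention required", "cf-chl")


def classify_failure(status: int, body: str, item_count: int) -> str:
    """Decide which playbook a failed run belongs to."""
    text = body.lower()
    if status == 403 or any(marker in text for marker in CHALLENGE_MARKERS):
        return "anti_bot"  # needs a new TLS or browser layer, not new selectors
    if item_count == 0:
        return "selectors"  # HTML arrived but nothing matched: run the heal flow
    return "ok"


print(classify_failure(403, "", 0))
print(classify_failure(200, "<title>Just a moment...</title>", 0))
print(classify_failure(200, "<h1 class='product-title'>Mug</h1>", 0))
```

Only the `"selectors"` outcome should reach the LLM; the `"anti_bot"` outcome goes to a separate alert.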
### Example
```python
# Scrapy extension that heals selectors when items drop to zero
import json

import anthropic
import yaml
from scrapy import signals

HEAL_PROMPT = """You are a web scraping expert. A Scrapy spider broke because the site
changed its HTML.

Old selectors (no longer working):
title: {title}
price: {price}
image: {image}

New page HTML (truncated):
{html}

Return ONLY a JSON object with corrected CSS selectors:
{{"title": "...", "price": "...", "image": "..."}}
"""


class SelfHeal:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def __init__(self):
        self.item_count = 0
        self.broken_page = None

    def response_received(self, response, request, spider):
        # Keep the first text page seen. item_scraped never fires on a broken
        # run, so the HTML has to be captured here instead.
        if self.broken_page is None and not request.url.endswith("/robots.txt"):
            self.broken_page = getattr(response, "text", None)

    def item_scraped(self, item, response, spider):
        self.item_count += 1

    def spider_closed(self, spider, reason):
        if self.item_count > 0 or not self.broken_page:
            return  # healthy run, or nothing captured to heal from
        path = f"selectors/{spider.name}.yml"
        with open(path) as f:
            old = yaml.safe_load(f)
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": HEAL_PROMPT.format(
                **old, html=self.broken_page[:8000]
            )}],
        )
        new = json.loads(resp.content[0].text)
        with open(path, "w") as f:
            yaml.safe_dump(new, f)
        # Slack-notify, then trigger re-run
```
### FAQ
**Q: What if the LLM writes wrong selectors?**
The Pydantic validation layer downstream catches it. If the healed selectors return strings where you expected integers, null where a value is required, or anything that fails your schema, the validation step rejects the heal, the spider is marked broken, and a human reviews it. Without validation, the heal is dangerous. With validation, it is safe.
**Q: Can I do this without Scrapy?**
Yes — the same pattern works with any scraping framework. You need four building blocks: a way to count items per run, a way to hold one broken page in memory, an LLM client, and a selector config file you can rewrite. Crawlee (Node), Crawl4AI, even a hand-rolled requests + BeautifulSoup spider can do it.
**Q: How often does this trigger?**
It depends on the target. E-commerce and news sites redesign often (every few months); enterprise SaaS portals change slowly (years). Across a 50-spider fleet, expect a handful of heals per quarter. Most are fixed in minutes by the LLM, with no human ever paged.
**Q: What about anti-bot escalations?**
A site that goes from "no anti-bot" to "Cloudflare" returns an HTTP 403 or a challenge page rather than HTML with missing selectors. The heal flow should detect this (by the response status and body pattern) and route to a different alert: "spider needs anti-bot upgrade", not "spider needs new selectors". Two failure modes, two playbooks.
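One way to split the two failure modes is a small classifier run before any heal is attempted. This is a sketch under assumptions: the extra status codes (429, 503) and the body patterns are illustrative guesses, not a definitive list, and `classify_failure` is a hypothetical helper.

```python
import re

# Body patterns are illustrative assumptions; tune them to the challenge pages you actually see.
CHALLENGE_PATTERNS = re.compile(r"just a moment|captcha|cf-chl|access denied", re.IGNORECASE)


def classify_failure(status: int, body: str, item_count: int) -> str:
    """Route a run to the right playbook based on status, body, and item count."""
    if status in (403, 429, 503) or CHALLENGE_PATTERNS.search(body):
        return "anti_bot"      # alert: spider needs anti-bot upgrade
    if status == 200 and item_count == 0:
        return "selectors"     # alert: spider needs new selectors
    return "ok" if item_count > 0 else "unknown"
```

Only the `"selectors"` result should trigger the LLM heal; sending a challenge page to the model would produce selectors for the wrong document.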
---
## Best Web Scraping API for JavaScript-Rendered Sites
URL: https://scrappey.com/qa/web-scraping-apis/best-scraping-api-for-javascript-rendered-sites
**The best web scraping API for JavaScript-rendered sites runs a real headless browser per request, executes the page's JavaScript, waits for dynamic content to load, and returns the final rendered HTML or a structured extraction.** Plenty of modern sites build their pages in the browser with JavaScript instead of sending finished HTML, so a plain download gets you nothing useful. The right API loads the page in an actual browser for you. The choice between APIs comes down to which can render the hardest SPAs - single-page apps built with React, Vue, Angular, Next.js - without you maintaining a browser fleet, while still handling anti-bot defenses, proxies, and CAPTCHAs in the same call.
### Quick facts
- **Core requirement:** Real browser rendering (not just HTML fetch)
- **Must-have features:** Wait strategies, scroll, click, JS injection, network capture
- **Common targets:** Single-page apps, infinite-scroll feeds, React/Vue/Next sites
- **Typical wait pattern:** Wait for specific selector or network idle, not fixed timer
- **Pricing model:** Per-request, often with a rendering premium over HTML-only
### Why HTML-only scraping fails on SPAs
A single-page app (SPA) sends a near-empty HTML shell - the real content is built in the browser by JavaScript that fetches data from an API and writes it into the page (the DOM, the browser's live model of the page). A plain HTTP fetch only downloads that first shell; it never runs the JavaScript, so it sees the empty placeholder, not the content. To scrape these sites you have to run the JavaScript, wait for the page updates to finish, and only then grab the HTML. That is exactly what a JS-rendering scraping API does for you.
### What to look for
Pick an API that uses a real browser engine (Chromium, Firefox), not a lightweight JS shim that only fakes parts of a browser. Look for configurable wait strategies - wait for a CSS selector to appear, wait for network activity to go quiet ("network idle"), or wait for a custom JavaScript check of your own to pass. You also want support for scrolling and clicking to trigger lazy-loaded content (content that only loads as you reach it), plus per-request proxy and fingerprint control. Network capture is a bonus: it lets you grab the underlying XHR data (the background API calls the page makes) directly, which is often cleaner than re-extracting it from rendered HTML. Finally, watch cost transparency - JS rendering costs more than a plain HTML fetch, so you want to render only when needed.
### When NOT to use a rendering API
If the SPA pulls its data from a JSON endpoint you can spot, calling that endpoint directly is faster, cheaper, and more reliable than rendering the whole page. Open the browser's network tab, find the XHR (the background request) that returns the data, and replicate it yourself. Rendering is the fallback for when that endpoint is encrypted, signed, or otherwise too awkward to call directly.
### Example
```python
import requests

resp = requests.post('https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY', json={
    'cmd': 'request.get',
    'url': 'https://spa.example.com/products',
    'session': 'js-render-session',
    'browserActions': [
        {'type': 'wait_for_selector', 'cssSelector': '.product-card'},
        {'type': 'scroll'}
    ]
})
html = resp.json()['solution']['response']
```
### FAQ
**Q: Do I need rendering for React/Next.js sites?**
Often no. Next.js apps that use SSR (server-side rendering - the server builds the HTML before sending it) deliver fully-rendered content on the first load. Check the page source (View Source, not DevTools) for the content. If it is already there, a plain HTTP fetch works. If the source is just an empty shell, rendering is required.
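The View Source check can be automated with a one-line test: fetch the raw HTML with a plain HTTP client and look for a piece of text you can see in the browser. The helper name `needs_rendering` and the sample strings are assumptions for illustration.

```python
def needs_rendering(raw_html: str, expected_text: str) -> bool:
    """True if text that is visible in the browser is absent from the raw page source."""
    return expected_text.lower() not in raw_html.lower()


# Hypothetical page sources: an empty SPA shell and a server-rendered page
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
ssr_page = '<html><body><h1>Blue Widget</h1></body></html>'
```

Here `needs_rendering(spa_shell, "Blue Widget")` is true, so that page needs a browser, while `needs_rendering(ssr_page, "Blue Widget")` is false and a plain fetch is enough.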
**Q: What is the fastest wait strategy?**
Wait for a specific selector - a CSS target that only appears once the content is ready. A selector wait returns the moment the data is on the page, whereas network idle (waiting for traffic to stop) and fixed timers both keep waiting after the content has already arrived.
**Q: Can I render JS without a browser?**
Tools like jsdom can run simple JavaScript without a full browser, but they break on anything that relies on modern browser APIs, fetch streams, or fingerprinting checks. Real browsers are the safe default in 2026.
---
## Best Web Scraping API for Price Scraping & E-commerce Price Monitoring
URL: https://scrappey.com/qa/web-scraping-apis/best-scraping-api-for-ecommerce-price-monitoring
**The best web scraping API for e-commerce price monitoring is one that reliably pulls accurate, location-correct product data from major retailers (large marketplaces and hosted-platform stores) as often as your pricing decisions require — without getting IP-banned, returning stale data, or forcing you to build custom code for every site.** A scraping API is a service you hand a web address to; it fetches the page and returns clean data. For price tracking, three things matter most: residential proxy coverage (IP addresses from real home internet connections) in the countries you monitor, enough capacity to keep up during traffic spikes (Black Friday, product drops), and clean structured output that drops straight into your price feed.
### Quick facts
- **Critical features:** Residential proxies, geo-targeting, parallel throughput, structured output
- **Top targets:** Major marketplaces, big-box retailers, hosted-platform stores
- **Common cadence:** Hourly for hot SKUs, daily for the long tail
- **Block risk:** High — retailers actively block scrapers; residential rotation is mandatory
- **Cost driver:** SKU count × cadence × retailer difficulty multiplier
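
The cost driver above can be turned into a rough monthly estimate. This is a back-of-the-envelope sketch: the SKU counts, the difficulty multipliers, and the 30-day month are hypothetical numbers chosen for illustration, not vendor pricing.

```python
def monthly_requests(sku_count: int, runs_per_day: float,
                     difficulty: float = 1.0, days: int = 30) -> int:
    """Estimate billable request units: SKU count x cadence x difficulty multiplier."""
    return round(sku_count * runs_per_day * difficulty * days)


# Hypothetical catalogue: 500 hot SKUs checked hourly on a hard retailer,
# plus 20,000 long-tail SKUs checked once a day on easy sites.
hot = monthly_requests(500, runs_per_day=24, difficulty=2.0)
long_tail = monthly_requests(20_000, runs_per_day=1)
total = hot + long_tail
```

With these inputs the 500 hot SKUs account for 720,000 request units and the 20,000 long-tail SKUs for 600,000, which shows how quickly cadence and difficulty outweigh raw SKU count.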
### Why generic scrapers fail at scale
Retailers run sophisticated anti-bot systems that watch for behavior typical of price monitoring: hitting the same high-margin product pages over and over, never adding to cart, never checking out. A basic scraper falls into this pattern and gets blocked within hours. A good price-monitoring API avoids it by rotating residential IPs by country, varying how requests look, and capping how many requests hit a single site at once — so no single identity stands out.
### Geo-targeting matters more than you think
Prices are not the same everywhere. A major marketplace's prices, stock, and shipping eligibility change by country and even by ZIP code; big-box retailers show different inventory per store. If you scrape from the wrong location, you collect the wrong numbers and your pricing decisions are off from the start. Geo-targeting means choosing where your request appears to come from. A good API lets you pick the proxy's country and city per request, so you track the exact markets you sell in.
### Structured output vs raw HTML
Some scraping APIs hand back raw HTML and leave the parsing (pulling the price and other fields out of the page) to you. For price monitoring, prefer APIs that ship ready-made extractors for the big retailers — they absorb layout changes for you, so a retailer redesign does not break your pipeline at 3am. For smaller, long-tail hosted-platform stores, plain HTML plus a small extraction layer (CSS selectors, which point at elements on the page, or LLM-based extraction) is usually the simpler choice.
### Example
```python
import requests

resp = requests.post('https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY', json={
    'cmd': 'request.get',
    'url': 'https://www.example-retailer.com/dp/B0XXXXXXXX',
    'proxyCountry': 'UnitedStates',
    'session': 'retailer-us-pool-1'
})
html = resp.json()['solution']['response']
```
### FAQ
**Q: How often can I scrape a major retailer for prices?**
Hourly per product (per SKU) is realistic with rotating residential proxies. Going faster than that tends to trigger blocks. For real-time pricing on hot items, spread the load across multiple sessions and stay within each IP's safe request rate.
**Q: Do I need a different proxy per country?**
Yes. Most large retailers serve different prices, currency, and inventory based on the country they detect from your IP. Scraping the US site from a German IP gives you misleading data.
**Q: Should I parse HTML or use a structured product API?**
For top-10 retailers, structured product endpoints (where offered) save engineering time because the parsing is done for you. For long-tail DTC and hosted-platform stores, raw HTML plus a generic extractor is more flexible.
---
## Best Web Scraping API for SEO Audits
URL: https://scrappey.com/qa/web-scraping-apis/best-scraping-api-for-seo-audits
**The best web scraping API for SEO audits combines reliable SERP scraping (Google, Bing, regional engines) with on-page extraction — title, meta, headings, schema, internal links, render-blocking resources, and Core Web Vitals.** An SEO audit is a health check that measures how well a site ranks and why. A scraping API (a hosted service you send a URL to, which fetches the page for you) does the heavy data-gathering. The work happens in two phases: first pull SERP positions — where your keywords rank on the search results page — for your target keywords, then crawl those ranking pages to extract the signals SEO software needs to score them.
### Quick facts
- **SERP requirements:** Country/language localization, mobile vs desktop SERP, AI overview capture
- **On-page extraction:** Title, meta, h1-h6, schema.org, hreflang, canonical, robots
- **Performance metrics:** LCP, INP, CLS — needs real-browser rendering
- **Blocking risk:** Google aggressively blocks SERP scraping — residential rotation mandatory
- **Cadence:** Weekly for SERPs, monthly for full site audit
### SERP scraping in 2026
A Google search results page (SERP) is no longer just ten blue links. In 2026 it mixes organic results, AI overviews, knowledge panels, product carousels, and ad blocks — all drawn by JavaScript and personalized to the user. A good SEO scraping API untangles this into clean, structured output: ranked organic positions, featured snippets, AI overview text, paid placements, and competitor citations. Country and device targeting are not optional — the same query returns a very different SERP on desktop in the US versus mobile in Germany, so you must tell the API where and on what device to search.
### On-page extraction
Once you know which URLs rank, the on-page pass visits each one and pulls out the SEO signals: title, meta description, canonical (which page is the "official" version), robots directive (whether search engines may index the page), hreflang alternates (language/region versions), all h1-h6 headings in the order they appear, structured data (JSON-LD, microdata — machine-readable tags that describe the page's content), Open Graph and Twitter cards, image alt counts, internal vs external link counts, and word count. For technical SEO, also grab render-blocking JS, the CSS file count, and the diff between the rendered DOM (the page after JavaScript runs) and the raw source HTML.
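A few of these signals can be pulled with the standard library alone. The sketch below covers only the title, meta description, canonical, and headings; structured data, hreflang, and link counts would need more handlers. The class and function names are assumptions for illustration.

```python
from html.parser import HTMLParser


class OnPageSignals(HTMLParser):
    """Collects a handful of on-page SEO signals from raw or rendered HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = None
        self.canonical = None
        self.headings = []      # (tag, text) pairs in document order
        self._capture = None    # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.meta_description = a.get("content")
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")
        elif tag == "title" or tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._capture, self._buffer = tag, []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._capture = None


def extract_signals(html: str) -> dict:
    parser = OnPageSignals()
    parser.feed(html)
    return {"title": parser.title, "meta_description": parser.meta_description,
            "canonical": parser.canonical, "headings": parser.headings}
```

Running `extract_signals` on both the raw source and the rendered DOM, then comparing the two dicts, is a cheap way to spot signals that only exist after JavaScript runs.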
### Core Web Vitals require real browsers
Core Web Vitals are Google's scores for how fast and stable a page feels: LCP (how long the main content takes to appear), INP (how quickly the page responds to taps and clicks), and CLS (how much the layout jumps around as it loads). You cannot measure these from a plain HTTP fetch — they only emerge when a real browser actually renders and runs the page. So you need a real browser on a network profile that matches Google's field data, usually a simulated slow 4G connection. Most scraping APIs offer this as a premium feature, so budget for it on the pages that matter (homepage, top landing pages) rather than crawling the whole site this way.
### Example
```python
import requests

resp = requests.post('https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY', json={
    'cmd': 'request.get',
    'url': 'https://www.google.com/search?q=best+web+scraping+api&gl=us&hl=en',
    'proxyCountry': 'UnitedStates'
})
html = resp.json()['solution']['response']  # SERP HTML, ready for parsing
```
### FAQ
**Q: Can I scrape Google SERPs legally?**
Google's terms of service prohibit automated SERP scraping, but the result data itself is public. SEO tools have done this for years with little legal exposure. For production, the real challenge is technical, not legal: use a managed SERP API so you don't run afoul of Google's technical defenses (rate limits, blocks, and CAPTCHAs).
**Q: How fresh do SERPs need to be?**
It depends on how fast the rankings move. Weekly is enough for general tracking; daily for fast-moving SERPs like news or trending products; hourly only when you're watching SERP volatility around a Google algorithm update.
**Q: Should I scrape competitor pages or use a SEO tool?**
Both, because they cover different gaps. SEO tools (Ahrefs, Semrush) give you long-running history and the link graph (who links to whom). A scraping API gives you fresh, raw page data for the specific competitors and queries you care about, without the platform's built-in assumptions about what matters.
---
## Best Web Scraping API for LLM Training Data
URL: https://scrappey.com/qa/web-scraping-apis/best-scraping-api-for-llm-training-data
**The best web scraping API for LLM training data delivers clean, deduplicated, license-aware text at the scale training pipelines need — boilerplate stripped, main content extracted, code blocks preserved, and metadata captured for filtering downstream.** In plain terms: an LLM learns from billions of words of text, so the scraper's job is to gather that text and hand it back already tidy. The output should drop into a vector store (a database that holds text as numbers a model can search) or a fine-tuning pipeline with minimal cleanup. Raw HTML is not training data; clean markdown is.
### Quick facts
- **Output format:** Clean markdown with code fences, lists, and tables preserved
- **Boilerplate removal:** Nav, footer, comments, ads stripped; main content kept
- **Dedupe support:** Stable URL canonicalization + content hashing
- **Metadata:** Author, date, language, license, robots respect
- **Scale:** Millions of URLs/day with retry, dead-link handling, idempotency
### Why raw HTML is not training data
If you train on raw HTML, model quality suffers. Boilerplate — the repeated parts of every page like the nav bar, footer, and related-articles widgets — leaks into a fine-tuned model's answers as off-topic noise. Worse, because that same boilerplate appears on thousands of pages, the model sees it again and again and learns to overweight it. A training-grade scraper fixes this with main-content extraction: algorithms (readability-style, named after the reader-view tools that pull just the article, or LLM-based) that find the real article, strip the boilerplate, keep code and tables intact, and output markdown that reads as cleanly as the original article.
### Dedupe and quality filtering
Crawling the web turns up the same text over and over — the same article on the original site, its AMP version (a stripped-down mobile copy), syndicated mirrors, and archive.org snapshots. To handle this, a good API gives each page a stable content hash (a short fingerprint of the text; identical text always produces the same fingerprint) so your pipeline can drop duplicates before training. Licensing matters too: respect robots.txt and ai.txt directives (the files where a site says which bots, including AI crawlers, may visit), capture canonical URLs (the one official address for a page), and surface whether content is Creative Commons or all-rights-reserved so legal can audit the dataset later.
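If your API does not return a content hash, the dedupe step can be approximated client-side. This is a sketch under assumptions: the normalization rules (lowercasing, whitespace collapsing, trailing-slash stripping) are one reasonable choice rather than a standard, and exact hashing will not catch near-duplicates with small edits.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, and strip a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_hash(text: str) -> str:
    """Hash whitespace-normalized, lowercased text so trivial differences collapse."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(pages):
    """Yield (url, text) pairs, skipping any text whose hash was already seen."""
    seen = set()
    for url, text in pages:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            yield canonicalize_url(url), text
```

Storing the digest next to each record also lets a later pipeline stage drop duplicates across crawl batches, not just within one.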
### Scale and idempotency
Training datasets are millions of URLs, so the API has to cope with scale. Key needs: idempotent retries — meaning a retry produces the same result, so the same URL always maps to the same hash; dead-link tracking, so you do not keep re-scraping 410s (the HTTP code for a page that is gone for good); proxy rotation (swapping IP addresses) at scale; and backpressure, the ability to slow down when downstream pipelines stall instead of flooding them. Throughput in the 1,000-10,000 URLs/minute range is achievable with a managed API; building this in-house is months of engineering before the first useful dataset lands.
### Example
```python
import requests

resp = requests.post('https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY', json={
    'cmd': 'request.get',
    'url': 'https://example.com/article',
    'markdown': True
})
markdown = resp.json()['solution']['markdown']
```
### FAQ
**Q: Should I scrape my own training data or buy it?**
For domain-specific fine-tuning (medical, legal, your own product docs), scrape it yourself — third-party datasets do not cover specialized corpora. For general pretraining, buy a dataset or use Common Crawl (a free, public archive of billions of web pages); building a clean web crawl from scratch is an enormous engineering cost.
**Q: How do I respect robots.txt for AI training?**
Respect both robots.txt and the emerging ai.txt convention (a newer file specifically for AI crawler rules). A good scraping API checks these per domain before it issues the request. Ignoring them is a legal and reputational risk you do not want to take in 2026.
**Q: What about copyright?**
Scraping for training is in active legal flux, meaning the rules are still being settled in courts. The defensible position is: respect robots/ai.txt, store source URLs and licenses, and have a legal sign-off process on the dataset before training. The scraping tool does not decide; your legal team does.
---
## Best Web Scraping API for Competitor Research
URL: https://scrappey.com/qa/web-scraping-apis/best-scraping-api-for-competitor-research
**The best web scraping API for competitor research covers the full surface a strategy team needs to monitor — pricing pages, product detail, content marketing, ad copy, review platforms, hiring pages — without per-source engineering.** In plain terms: a scraping API is a service you call over the web to fetch pages for you, handling the messy anti-bot work in the background. For competitor research, the thing that matters is breadth and reliability, not depth on any one site. You need consistent monthly snapshots across dozens of competitors, not a heavy custom scraper built for a single target.
### Quick facts
- **Sources covered:** Pricing, product, blog, ads, reviews (G2/Capterra), careers, social
- **Cadence:** Daily for ads/pricing, weekly for content, monthly for full snapshot
- **Output expectations:** Diff-friendly markdown; track changes over time
- **Key features:** Reliable rendering, structured output, archive/diff support
- **Common workflow:** Scrape → diff against last snapshot → alert on meaningful changes
### Breadth over depth
Competitor research is wide and shallow. You are watching 20-50 companies, and for each one a handful of page types — pricing, top product pages, blog index, ad library, key review sites — so 5-10 pages per company. Hand-building a separate scraper for every source is engineering time you do not have. A general-purpose scraping API that can reliably render pages (run the JavaScript that builds them) and return structured output (clean, ready-to-use data like JSON) covers all of this in days, not months.
### Diff and alerting
The whole point of competitor monitoring is spotting change. A good workflow saves each snapshot, then compares it against the previous one (a content diff) and alerts you when something meaningful shifts — a new pricing tier, a product launch, a removed feature, or a hiring spike for a specific role. Markdown output is far easier to diff than HTML, because it strips out layout noise so only the real content differences stand out. Layering an LLM (a language model) on top to summarize the diff catches meaning-level changes that a plain character-by-character diff would miss.
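The snapshot comparison can be sketched with Python's `difflib`. The function name and the sample pricing text in the usage note are assumptions for illustration; the output is the list of added and removed lines you would feed to an alert or an LLM summary.

```python
import difflib


def diff_snapshots(old_md: str, new_md: str) -> list[str]:
    """Return only the added (+) and removed (-) lines between two markdown snapshots."""
    lines = list(difflib.unified_diff(
        old_md.splitlines(), new_md.splitlines(), lineterm="", n=0
    ))
    # The first two lines are the ---/+++ file headers; @@ lines are hunk markers
    return [line for line in lines[2:] if not line.startswith("@@")]
```

For a snapshot where "Pro: $49/mo" became "Pro: $59/mo" and an "Enterprise: contact us" line was added, the result contains the removed old price, the added new price, and the added tier, and nothing else. An empty list means the page did not change.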
### Ad and review platforms
Public ad libraries, Google Ads Transparency, G2 reviews, Capterra reviews, App Store reviews — these are all gated behind anti-bot defenses (systems that block automated visitors) or rate limits (caps on how many requests you can send). A general scraping API handles them. Specialized APIs sometimes do them better. The trade-off is engineering cost versus the number of vendors you juggle — for portfolios under 50 competitors, bundling everything into one general API is usually the right call.
### Example
```python
import hashlib

import requests


def snapshot(url):
    r = requests.post('https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY', json={
        'cmd': 'request.get', 'url': url, 'markdown': True
    })
    md = r.json()['solution']['markdown']
    return md, hashlib.sha256(md.encode()).hexdigest()


md, h = snapshot('https://competitor.com/pricing')
```
### FAQ
**Q: How often should I scrape competitors?**
Scrape pricing and ads daily, content and product pages weekly, and hiring or full-site sweeps monthly. Going more frequent mostly adds noise; going less frequent risks missing launches.
**Q: Should I use a scraping API or a competitive-intelligence platform?**
Competitive-intelligence (CI) platforms like Crayon and Kompyte hand you curated insights and cost less to set up. A scraping API gives you more flexibility and full ownership of the raw data. For technical teams, the API path scales further; for non-technical teams, the platform is usually the better fit.
**Q: What about JavaScript-rendered competitor sites?**
Most modern marketing sites use SSR (server-side rendering — the server sends finished HTML), so a simple HTML fetch works. The exceptions are heavy app dashboards and gated demos, which build their content in the browser and need JS rendering — something any decent scraping API offers as a per-request option.
---
## How to Get All Links From a Webpage
URL: https://scrappey.com/qa/web-scraping-apis/how-to-get-all-links-from-a-webpage
**Getting all links from a webpage means downloading the page, reading every `<a href>` attribute (the URL inside each link tag), turning relative URLs into full ones, cleaning them up (fragments and query-string order), and removing duplicates.** On a static page this is a one-liner; on a JavaScript-rendered page you must run the page's scripts first; and in a crawling pipeline you also want to filter by domain, by URL pattern, or by rel attributes such as nofollow and ugc.
### Quick facts
- **Static pages:** BeautifulSoup or cheerio + urljoin for relatives
- **JS-rendered:** Playwright/headless browser, then `querySelectorAll('a[href]')`
- **Always do:** Resolve relatives, strip fragments, lowercase host, dedupe
- **Filter on:** Same domain, URL pattern, rel attribute, link text
- **Watch for:** Links in onclick handlers, data-href, JS-only navigation
### The basic pattern (static HTML)
The plan is straightforward: download the page, parse the HTML, walk through every `<a>` tag that has an `href`, and turn each `href` into a full URL by resolving it against the document's base URL (a relative link like `/page` is just shorthand for the complete address). Strip the fragment - everything after the `#` - unless you specifically care about anchored links. Normalize the host to lowercase and put the path in a canonical (consistent) form. Drop the junk: empty hrefs, `javascript:` pseudo-links, and `mailto:` addresses. Finally, dedupe so the same URL is not listed twice.
### When you need a real browser
Modern SPAs (single-page apps - sites that build the page in your browser with JavaScript) and infinite-scroll feeds add links to the DOM (the live, in-memory version of the page) only after the initial HTML loads. A plain static fetch never runs that JavaScript, so it misses those links. Use Playwright (or a JS-rendering scraping API), wait for the page to settle, then run document.querySelectorAll('a[href]') in the browser context to read the finished page. For infinite scroll, scroll to the bottom in steps and collect links after each scroll until no new ones appear.
### Filtering for crawl pipelines
For a focused crawler you usually want fewer links, not more, so filter aggressively: same-domain only (or a domain allowlist), URL patterns that match real content paths (skip /login, /cart, and asset paths), and respect rel="nofollow" if you care about the crawled site's signal. rel="nofollow" is a hint a site adds to a link to say "do not pass ranking credit through here." For SEO link extraction, keep the rel attributes as metadata rather than filtering on them.
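A filter along these lines can be sketched as a single predicate. The skip patterns and the `keep_link` name are assumptions for illustration; a real crawler would load its own allowlist and path rules.

```python
import re
from urllib.parse import urlsplit

# Assumed non-content paths; adjust to the site being crawled
SKIP_PATHS = re.compile(r"^/(login|cart|static|assets)(/|$)")


def keep_link(url: str, allowed_hosts: set[str], rel: str = "",
              honor_nofollow: bool = True) -> bool:
    """Decide whether a resolved, absolute URL belongs in the crawl queue."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if (parts.hostname or "") not in allowed_hosts:   # hostname is already lowercased
        return False
    if SKIP_PATHS.match(parts.path):
        return False
    if honor_nofollow and "nofollow" in rel.lower().split():
        return False
    return True
```

For SEO link extraction, call it with `honor_nofollow=False` and store the `rel` value alongside the URL instead of dropping the link.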
### Example
```python
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def get_links(url):
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = set()
    for a in soup.select('a[href]'):
        href = a['href'].strip()
        if not href or href.startswith(('javascript:', 'mailto:', 'tel:')):
            continue
        # Resolve against the final URL (after redirects) and drop the fragment
        links.add(urldefrag(urljoin(r.url, href)).url)
    return sorted(links)
```
### FAQ
**Q: Why are some links missing from my extraction?**
Most likely those links are added by JavaScript after the page first loads, so a plain static fetch never sees them. Switch to a headless browser (a real browser engine running with no window) or a JS-rendering API, and wait for the DOM to settle before reading the links.
**Q: Should I follow rel="nofollow" links?**
For crawling, yes - nofollow is a signal to search engines about PageRank (how ranking credit flows between pages), not a rule that blocks access. For SEO analysis, surface the attribute as metadata rather than filtering those links out.
**Q: How do I handle infinite scroll?**
Scroll the page programmatically in a loop, collecting links after each scroll, until two scrolls in a row return the same set of links (meaning nothing new loaded). Cap the number of iterations so you do not get stuck on a feed that scrolls forever.
---
## How to Scrape Infinite-Scroll Pages
URL: https://scrappey.com/qa/web-scraping-apis/how-to-scrape-infinite-scroll-pages
**Infinite scroll is the page design where new content keeps loading on its own as you scroll down (like a social feed that never ends). To scrape one, your code has to trigger those same scroll events, wait for the new content to render, collect it, and figure out when the feed has actually stopped.** The naive "scroll to the bottom once" fails, because the bottom keeps moving as more content loads. The reliable pattern is a loop: scroll, wait, collect, check whether anything new appeared, and repeat — with a cap so a truly endless feed doesn't trap you.
### Quick facts
- **Required:** A real browser (Playwright, Puppeteer, or rendering API)
- **Loop pattern:** Scroll → wait for new items → collect → check delta → repeat
- **End-of-feed signal:** Two consecutive iterations with no new items, or scroll height stable
- **Cap:** Hard max on iterations or items to avoid runaway
- **Alternative:** Find the XHR endpoint behind the scroll and call it directly
### The XHR shortcut
Before you bother scrolling, open your browser's network tab (the DevTools panel that lists every request the page makes). Infinite scroll is almost always powered by a paginated JSON endpoint that the page calls in the background (an XHR — a JavaScript request that fetches data without reloading the page) as you scroll. That endpoint takes a cursor or page parameter and returns the next batch of items. Calling it directly is far faster than driving a browser — no rendering, no scroll loops, just JSON you page through. If the endpoint is open, or only needs a CSRF token (a small anti-forgery value) grabbed from the first page, this is the best route.
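Paging through such an endpoint usually reduces to a short cursor loop. In this sketch the response field names (`items`, `next_cursor`) and the `fetch_page` callable are assumptions; real endpoints name them differently, so read them off the response you see in the network tab.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict],
             max_pages: int = 100) -> Iterator[dict]:
    """Yield items from a cursor-paginated endpoint until the cursor runs out."""
    cursor = None
    for _ in range(max_pages):      # hard cap so a looping cursor cannot run forever
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

In practice `fetch_page` would wrap an HTTP call, for example a `requests.Session().get(...)` that passes the cursor as a query parameter and returns `.json()`.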
### When you have to scroll
If that endpoint is signed, encrypted, or hands back ready-made HTML fragments instead of clean data, you need a real browser. The loop goes: read the current scroll height, scroll to the bottom (or down by one screen at a time), wait until either a new item appears or the network goes quiet, collect the new items, then compare against the previous round. If the item count hasn't changed for two rounds in a row, the feed is done. Cap it at a sensible maximum (for example 200 rounds) so Twitter-style feeds that never truly end don't run forever.
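The loop itself is independent of the browser library, so it can be sketched with two callables. `scroll_once` and `read_items` are hypothetical hooks you would implement with your browser driver; the defaults of 200 rounds and two stable rounds follow the text above.

```python
from typing import Callable, Hashable, Iterable


def scroll_until_stable(scroll_once: Callable[[], None],
                        read_items: Callable[[], Iterable[Hashable]],
                        max_rounds: int = 200,
                        stable_rounds: int = 2) -> list:
    """Scroll, collect, and stop after `stable_rounds` rounds with no new items."""
    seen = {}                        # dict keeps first-seen order
    unchanged = 0
    for _ in range(max_rounds):
        before = len(seen)
        for item in read_items():    # collect every round, not only at the end
            seen.setdefault(item, None)
        unchanged = unchanged + 1 if len(seen) == before else 0
        if unchanged >= stable_rounds:
            break
        scroll_once()
    return list(seen)
```

With Playwright, `scroll_once` could wrap `page.mouse.wheel(0, distance)` followed by a short wait, and `read_items` could return the item URLs or IDs currently in the DOM. Collecting on every round is what keeps the loop correct on virtualized lists.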
### Pitfalls
Virtualized lists (libraries like react-window or react-virtual) drop off-screen items out of the page's HTML as you scroll — so by the time you reach the bottom, the top items are already gone. The fix is to collect after each scroll step, not just at the end. Some pages also wait until the user has paused for a moment before loading more, so add a 500ms-2s pause after each scroll. Finally, anti-bot systems flag mechanical scrolling (identical screen-sized jumps with no variation), so randomize how far you scroll and how long you pause.
### Example
```python
import requests

resp = requests.post('https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY', json={
    'cmd': 'request.get',
    'url': 'https://feed.example.com',
    'browserActions': [
        {'type': 'scroll', 'cssSelector': '.end-of-feed', 'repeat': 50, 'delayMs': 1000}
    ]
})
html = resp.json()['solution']['response']
```
### FAQ
**Q: Is the XHR shortcut always faster?**
Almost always. A signed or encrypted endpoint occasionally blocks it, but checking the network tab for that hidden JSON endpoint is the right first move on any infinite-scroll page.
**Q: How do I detect the end of the feed?**
Watch for an end-of-feed marker element, or compare the item count across rounds — two rounds in a row with the same count means nothing new is loading, so you've reached the end.
**Q: What about virtualized lists?**
Collect items after each scroll step, not at the end. Once an item scrolls off-screen it gets removed from the page's HTML, so you can't read it back later.
---
## How to Reverse-Engineer API Requests for Scraping
URL: https://scrappey.com/qa/web-scraping-apis/how-to-reverse-engineer-api-requests
**Reverse-engineering API requests for scraping means watching the network traffic a website makes, spotting the JSON endpoints that feed its visible UI, and calling those endpoints directly instead of scraping the rendered HTML.** An API (Application Programming Interface) is the set of data requests a site supports; the JSON it returns is clean, structured data. For most modern sites this API path is dramatically faster, cheaper, and more reliable than running a browser — you skip the JavaScript, get structured data, and avoid most fingerprint-based blocking (where a site identifies and blocks automated clients by their technical traits).
### Quick facts
- **Workflow:** Open DevTools → Network → reproduce the action → filter for XHR/fetch
- **Look for:** JSON responses, GraphQL queries, structured pagination cursors
- **Always copy:** Full URL, all headers, body — replicate exactly first, simplify after
- **Watch for:** CSRF tokens, signed query params, dynamic auth headers
- **When it fails:** Encrypted bodies, attestation tokens, mobile-only endpoints
### The basic workflow
Open DevTools (your browser's built-in developer panel, usually F12) and switch to the Network tab, then filter to Fetch/XHR — these are the background data requests the page makes. Now do the action you want to scrape: load a page, scroll, run a search. Scan the requests for ones that return structured JSON containing the data you want. Right-click that request and choose "Copy as cURL" (cURL is a command-line tool for making HTTP requests) — you now have a known-good copy. Paste it into a script, confirm it works, then remove headers one by one to find the minimum set the server actually needs.
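The "remove headers one by one" step can be automated. In this sketch `still_works` is a hypothetical callable you supply: it replays the request with the given headers and returns whether the response still contains the data you expect.

```python
from typing import Callable


def minimal_headers(headers: dict, still_works: Callable[[dict], bool]) -> dict:
    """Drop headers one at a time, keeping only those the server actually needs."""
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if still_works(trial):
            needed = trial      # the request succeeded without it, so drop it
    return needed
```

Each header costs one test request, so run this once against a single known-good call and hard-code the result, rather than probing on every scrape.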
### Handling auth and CSRF
Most internal APIs want proof of who you are: usually a session cookie (a token tying requests to your login), a CSRF token from the initial page (a one-time value that proves the request came from the real site, not a forgery), or an auth header. Session cookies: load the public page first, grab the cookie, reuse it. CSRF tokens: pull the token out of the initial HTML (usually a meta tag or a hidden form input) and include it in later API calls. Bearer tokens: log in once through the normal flow, capture the token, and refresh it as needed.
### When reverse-engineering fails
Some endpoints fight back. They might sign each request with an HMAC (a tamper-proof checksum) computed in deliberately scrambled, or obfuscated, JavaScript; attach device-attestation tokens that only exist if you actually run the page's JS; or only serve the mobile app, locked down with TLS pinning (where the app refuses any https connection it does not specifically trust). In those cases the effort of reverse-engineering outweighs just rendering the page in a real browser — so fall back to that. Mobile API endpoints are their own category and usually need MITM proxy work — sitting between the app and the server to inspect traffic — using a tool like Mitmproxy or Charles on a real device.
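To make the HMAC case concrete, here is a minimal sketch of what request signing looks like in general. The secret, the canonical string layout (method, path, timestamp), and the function name are all assumptions for illustration; a real site chooses its own layout and hides the key inside its obfuscated JavaScript, which is exactly what makes reproducing it expensive.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: int) -> str:
    """Return a hex HMAC-SHA256 over a canonical string (hypothetical layout)."""
    message = f"{method.upper()}\n{path}\n{timestamp}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


sig = sign_request(b"demo-secret", "get", "/api/v1/products?page=1", 1700000000)
# Changing any signed field (here the page number) yields a different signature,
# so a replayed signature cannot be reused for a modified request.
tampered = sign_request(b"demo-secret", "get", "/api/v1/products?page=2", 1700000000)
print(sig != tampered)
```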
### Example
```python
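import re
import requests

s = requests.Session()
home = s.get('https://example.com/')  # the session keeps any cookies set here
# Assumes the token sits in <meta name="csrf" content="...">; adjust per site
match = re.search(r'name="csrf" content="([^"]+)"', home.text)
if match is None:
    raise RuntimeError('CSRF token not found in the initial HTML')
api = s.get('https://example.com/api/v1/products',
            params={'page': 1, 'limit': 50},
            headers={'X-CSRF-Token': match.group(1)})
data = api.json()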
```
### FAQ
**Q: Is reverse-engineering APIs legal?**
Calling a public-facing internal API is the same as making the request a browser would already make. The legal questions are about what you do with the data, not the act of fetching it. Stay clear of authenticated endpoints you do not have access to.
**Q: How do I know if a site uses GraphQL?**
GraphQL is a query style where every data type is served from one endpoint. Look for requests to a single URL (often /graphql) with POST bodies that contain query and variables fields — that same endpoint answers every kind of request.
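That check can be roughed out in code when you are sifting through captured requests. The function name is hypothetical, and the sketch only tests for a `query` field, since `variables` is optional in practice.

```python
import json
from typing import Optional


def looks_like_graphql(method: str, body: Optional[str]) -> bool:
    """Heuristic: a POST whose JSON body carries a 'query' field."""
    if method.upper() != "POST" or not body:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    # Some clients batch several operations into one JSON list.
    ops = payload if isinstance(payload, list) else [payload]
    return bool(ops) and all(isinstance(op, dict) and "query" in op for op in ops)


sample = json.dumps({
    "query": "query Products($page: Int) { products(page: $page) { id name } }",
    "variables": {"page": 1},
})
print(looks_like_graphql("POST", sample))
```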
**Q: What if the API request body is encrypted?**
Some sites encrypt the request body with a key generated by their page-side JavaScript. You can either reverse-engineer how that key is built (hours to days of JS work) or fall back to browser rendering — usually the latter is cheaper.
---
## Synchronous vs Asynchronous Web Scraping
URL: https://scrappey.com/qa/web-scraping-apis/synchronous-vs-asynchronous-web-scraping
**Synchronous web scraping sends one request at a time and waits ("blocks") until each one finishes before starting the next; asynchronous scraping fires off many requests at once and handles each response as it arrives, using an event loop or a pool of workers.** Which one is better depends on what is slowing you down. If your requests are slow (page rendering, anti-bot challenges), async wins because it does useful work during the wait. But if you are capped by per-host rate limits (how often a site lets one source hit it) or by proxy throughput, async stops helping - the bottleneck is then an external limit, not time your own code spends waiting.
### Quick facts
- **Sync pattern:** requests.get(url) for url in urls — simple, slow
- **Async pattern:** asyncio.gather() with aiohttp/httpx, or a thread pool
- **When sync is fine:** Small jobs, low-latency targets, simple debugging
- **When async helps:** Many slow requests (rendering, CAPTCHAs), wide URL fan-out
- **When neither matters:** When rate-limited or proxy-capped — scale proxies first, not concurrency
### The shape of the bottleneck
Web scraping is almost always I/O-bound, meaning the time is spent waiting on the network, not crunching numbers. A single request takes anywhere from 200ms to 30s of real elapsed time, but the actual CPU work on your machine is just milliseconds. Synchronous code wastes all the rest of that time sitting idle; asynchronous code starts another request during the wait. The math is stark: for 1,000 URLs at 1 second each, sync takes 1,000 seconds, while async with 50 concurrent workers (50 requests in flight at once) finishes in about 20 seconds.
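The arithmetic above can be written out as a rough model. It assumes every request takes the same time and that nothing external throttles you, which real crawls rarely honor, so treat the result as a best case. The function names are illustrative.

```python
import math


def sync_seconds(n_urls: int, latency_s: float) -> float:
    """One request at a time: total time is the sum of all latencies."""
    return n_urls * latency_s


def async_seconds(n_urls: int, latency_s: float, concurrency: int) -> float:
    """Requests run in waves of `concurrency`; each wave takes one latency."""
    waves = math.ceil(n_urls / concurrency)
    return waves * latency_s


sync_total = sync_seconds(1000, 1.0)
async_total = async_seconds(1000, 1.0, 50)
print(sync_total, async_total)
```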
### Where async stops helping
Concurrency always runs into a ceiling somewhere - your proxy pool, the target site's per-IP rate limit, or your scraping API's per-account throughput cap. Once you hit any of these, adding more concurrency just makes requests pile up in a queue without finishing any faster. The number to watch is throughput (URLs actually completed per minute), not how many requests you launch at once. Measure it directly. If 50 workers produce the same throughput as 200, the bottleneck has moved off your machine and onto one of those external limits.
### Practical recommendations
For under 100 URLs, just write synchronous code - it is easier to debug and easier to retry the odd failure by hand. For 100 to 10,000 URLs, use async with a modest concurrency cap (10-50). Above 10,000 URLs, switch to a managed scraping API that handles concurrency, retries, and dead-letter queues (a holding spot for requests that keep failing) for you. Building that layer yourself is more work than it looks.
### Example
```python
import asyncio
import aiohttp

async def fetch(session, url):
    # Cap the whole request (connect + read) at 30 seconds
    timeout = aiohttp.ClientTimeout(total=30)
    async with session.get(url, timeout=timeout) as r:
        return await r.text()

async def main(urls):
    sem = asyncio.Semaphore(50)  # at most 50 requests in flight at once
    async with aiohttp.ClientSession() as session:
        async def bounded(u):
            async with sem:
                return await fetch(session, u)
        return await asyncio.gather(*[bounded(u) for u in urls])
```
### FAQ
**Q: What concurrency should I start with?**
Start at 10 and keep doubling until throughput stops improving. The common sweet spot is 20-50 for general scraping; go lower (5-10) for tough anti-bot targets, where high concurrency tends to trigger blocks.
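The doubling search can be sketched as follows. `measure` is a hypothetical callable you would supply: it runs a sample of URLs at the given concurrency and returns completed URLs per minute. The 10% minimum gain and the upper bound are arbitrary assumptions, and `fake_measure` simulates a target that saturates so the sketch runs offline.

```python
from typing import Callable


def find_concurrency(measure: Callable[[int], float], start: int = 10,
                     max_workers: int = 640, min_gain: float = 0.10) -> int:
    """Double concurrency until throughput improves by less than `min_gain`."""
    best = start
    best_throughput = measure(start)
    workers = start * 2
    while workers <= max_workers:
        throughput = measure(workers)
        if throughput < best_throughput * (1 + min_gain):
            break  # more workers no longer buy meaningful throughput
        best, best_throughput = workers, throughput
        workers *= 2
    return best


# Simulated target: 30 URLs/min per worker, capped at 1,200 URLs/min overall.
def fake_measure(workers: int) -> float:
    return min(workers * 30, 1200)


chosen = find_concurrency(fake_measure)
print(chosen)
```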
**Q: Is async faster than threading?**
For pure I/O work like HTTP scraping, async carries slightly less overhead than threads at very high concurrency (1,000+ requests at once). Below that, threading is simpler and the difference is negligible.
**Q: Does async help with CPU-bound parsing?**
No. Parsing is CPU work, and async does not run CPU work in parallel - it only overlaps waiting. If parsing becomes a bottleneck, use a multiprocessing pool instead. This is rare, since parsing is usually fast compared to fetching the page.
---
## What Is Batch Web Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-batch-web-scraping
**Batch web scraping means handing a whole list of URLs to a service as one job, letting it work through them in the background, and collecting the results once they are ready — instead of firing each request one at a time and waiting for each reply.** The batch service handles the hard plumbing for you: running many requests at once (concurrency), retrying failures, setting aside URLs that keep failing (a "dead-letter queue"), and making sure a URL is not processed twice (idempotency). The trade-off is that any single result takes longer to come back. Batch is the right choice when you have thousands or millions of URLs and do not need any one result instantly.
### Quick facts
- **Job size:** 100 to 1,000,000+ URLs per batch
- **Latency:** Minutes to hours; not for real-time pipelines
- **Use cases:** Crawl ingestion, dataset building, bulk monitoring
- **Tradeoff:** Throughput and reliability up; per-request latency up
- **Idempotency:** Same job ID + same URL list → same result; safe to retry
### When to batch
Reach for batch when three things are true: (a) you have a large list of URLs, (b) you do not need the results in real time, and (c) you would rather the service handle retries and concurrency than write that code yourself. Building your own batch processor that does the job well is months of effort — controlling how many requests run at once, retrying failures, parking the URLs that never succeed, avoiding duplicate work, and tracking progress. A managed batch endpoint (a ready-made API that runs the job for you) spreads that work across all its customers, so you do not pay for it alone.
### How to size batches
Most batch APIs let you submit up to about 1 million URLs in a single job, but the smart size is smaller: 1,000–10,000 URLs per batch. Smaller batches pay off in three ways. You get faster feedback: a broken configuration shows up in minutes instead of hours. You can run several batches at the same time under different job IDs to balance the load. And if one batch goes wrong, you only re-run that batch, not the whole crawl. So split a 1-million-URL crawl into 100–200 batches of 5,000–10,000 URLs each.
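Splitting a URL list into fixed-size batches is a few lines of code. This sketch uses a hypothetical helper name and a small demo list; at 5,000 URLs per batch, a 1-million-URL crawl works out to 200 batches.

```python
import math
from typing import Iterator, List, Sequence


def split_into_batches(urls: Sequence[str], batch_size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` URLs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])


demo_urls = [f"https://example.com/p/{i}" for i in range(12_000)]
batches = list(split_into_batches(demo_urls, 5000))
print([len(b) for b in batches])

# Batch count for the 1-million-URL crawl discussed above
batches_for_full_crawl = math.ceil(1_000_000 / 5000)
```

Each batch can then be submitted under its own job ID, so a failure only costs a re-run of that one slice.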
### Synchronous fallback
Sometimes a job is batch-sized overall but a few results need to come back right away — for example, a content-monitoring pipeline that pulls 99% of its data from a nightly batch but needs to check a breaking-news URL the moment it appears. Most scraping APIs offer both endpoints: a batch one and a per-request (synchronous) one that replies immediately. Send the urgent work to the sync endpoint and everything else to batch. Just do not call the sync endpoint in a tight loop hoping to match batch throughput — you will get rate-limited (temporarily blocked for sending too many requests too fast).
### Example
```python
import requests
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = 'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY'
urls = ['https://example.com/p/1', 'https://example.com/p/2']

def fetch(url):
    return requests.post(ENDPOINT, json={
        'cmd': 'request.get',
        'url': url,
        'markdown': True
    }).json()

# Fire requests in parallel; each one uses one concurrent thread
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
```
### FAQ
**Q: How is batch different from running async requests in parallel?**
Batch APIs handle concurrency, retries, dead-letter queues (a holding spot for URLs that keep failing), and idempotency (not processing the same URL twice) for you. With plain parallel async, all of that is your code's job. For under 10k URLs, doing it yourself is fine; above that, batch is dramatically less work.
**Q: How long does a batch job take?**
Figure a few seconds of real work per URL, divided by how many the API runs at once (its parallelism). A 10k-URL batch on a typical API finishes in 10–30 minutes. Sites with tough anti-bot defenses take longer, since each request needs more effort to get through.
**Q: What happens if the batch contains bad URLs?**
Good batch APIs skip URLs they cannot reach (a 404 "not found", or a DNS failure where the domain name will not resolve), retry temporary glitches, and give you a per-URL status in the results. That way you can re-queue just the specific URLs that failed instead of re-running the whole batch.
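Acting on per-URL statuses might look like the sketch below. The result shape (`url` and `status` keys) and the status labels are assumptions made for illustration; check your API's actual response format before relying on them.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical labels for failures that will not succeed on retry
PERMANENT_FAILURES = {"not_found", "dns_failure"}


def partition_results(results: Iterable[Dict[str, str]]) -> Tuple[List[str], List[str], List[str]]:
    """Split per-URL results into (done, retry, dropped) URL lists."""
    done, retry, dropped = [], [], []
    for item in results:
        status = item["status"]
        if status == "ok":
            done.append(item["url"])
        elif status in PERMANENT_FAILURES:
            dropped.append(item["url"])  # retrying a 404 or dead domain is wasted work
        else:
            retry.append(item["url"])    # treat anything else as temporary
    return done, retry, dropped


sample_results = [
    {"url": "https://example.com/p/1", "status": "ok"},
    {"url": "https://example.com/p/2", "status": "timeout"},
    {"url": "https://example.com/gone", "status": "not_found"},
]
done, retry, dropped = partition_results(sample_results)
print(retry)
```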
---
## What Is Stateful Web Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-stateful-web-scraping
**Stateful web scraping means keeping the same identity across many requests - the same cookies, session tokens, browser fingerprint, and proxy IP - so the site sees one consistent visitor for the whole session, not a crowd of strangers.** The opposite, stateless scraping, starts fresh on every request. That is fine for public pages but breaks anything needing a login, multi-step navigation, or tokens earned earlier in the session. Most real scraping projects need state for at least some flows.
### Quick facts
- **Stateful needs:** Same cookie jar, same IP, same fingerprint, same JA3 across requests
- **Required for:** Login flows, cart/checkout, multi-page forms, CSRF-gated pages
- **Session lifetime:** Minutes to hours; longer than typical anti-bot session windows
- **Storage:** Cookie jar + session ID returned by API + sticky proxy session
- **Anti-pattern:** Rotating IP/fingerprint mid-session — looks fake to the target
### Why state matters
A real person has a consistent session - same browser, same IP, same cookies - from the moment they log in until they leave. Sites treat that consistency as a sign of trust. So a session that suddenly swaps its IP or fingerprint halfway through looks like a bot. Stateless scraping also breaks any step that depends on the one before it: logged-in pages, shopping cart flows, forms protected by CSRF tokens (one-time anti-forgery codes), and paginated lists that track your place using cookies.
### How stateful APIs implement it
A stateful scraping API gives you a session ID - a handle that ties your requests together. Your first request creates the session: it assigns a sticky IP (one that stays put), a consistent fingerprint, and an empty cookie jar. Every later request that sends the same session ID reuses all of that. Sessions have a TTL (time to live - how long before they expire), usually 10 minutes to a few hours; after that the session is gone and you start over. Some APIs let you keep a session alive indefinitely for a per-session fee.
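On the client side it helps to know when a session has probably expired, so you can redo the login instead of sending requests into a dead session. The sketch below is an assumption-laden illustration: it measures the TTL from session creation (some APIs may instead measure from last use), the class name is hypothetical, and the clock is injectable so the example runs without waiting.

```python
import time
from typing import Callable, Dict


class SessionTracker:
    """Client-side record of when each session ID was first used."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._created: Dict[str, float] = {}

    def needs_setup(self, session_id: str) -> bool:
        """Return True if the session is new or expired and must be set up again."""
        now = self._clock()
        created = self._created.get(session_id)
        if created is None or now - created >= self._ttl:
            self._created[session_id] = now  # a fresh session starts now
            return True
        return False


# Simulated clock so the example does not have to sleep for ten minutes
fake_now = [0.0]
tracker = SessionTracker(ttl_seconds=600, clock=lambda: fake_now[0])
first = tracker.needs_setup("login-flow-user-1234")
fake_now[0] = 599.0
second = tracker.needs_setup("login-flow-user-1234")
fake_now[0] = 600.0
third = tracker.needs_setup("login-flow-user-1234")
print(first, second, third)
```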
### When you do NOT need state
Public listing pages, product detail pages, blog posts, and most SEO crawl targets are stateless - each request works fine on its own. Stateless requests are cheaper (no session fee, and the IP can rotate freely) and easier to run in parallel. Use state only where the flow actually requires it, and default to stateless for everything else.
### Example
```python
import requests
API = 'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY'
sess = 'login-flow-user-1234'  # sending the same session ID ties both calls together

login = requests.post(API, json={
    'cmd': 'request.post',
    'url': 'https://target.com/login',
    'postData': 'user=...&pass=...',
    'session': sess
})
profile = requests.post(API, json={
    'cmd': 'request.get',
    'url': 'https://target.com/account',
    'session': sess
})
```
### FAQ
**Q: How long should a session live?**
Match it to how long the real user flow would take. A login plus 10 pages is about a 5-minute session. Long-running monitoring, like refreshing an account dashboard, can keep a session alive for hours. Past a few hours the target site will often invalidate it on its own anyway.
**Q: Can I share one session across many parallel requests?**
A few at a time is fine. A real browser fires 5-10 parallel requests just to load one page, so a handful of concurrent calls per session looks normal. Firing many parallel requests through a single session is unrealistic and gets that session flagged.
**Q: Do I need a different session per user?**
Yes, if you are scraping logged-in views or per-account data. Treat one session as one identity. Mixing multiple accounts into a single session is both a security risk and an easy way to get detected.
---
## What Is the Chrome DevTools Protocol (CDP)?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-the-chrome-devtools-protocol
**The Chrome DevTools Protocol (CDP) is the low-level interface for instrumenting and controlling Chromium-based browsers.** Low-level means it speaks directly to the browser's internals rather than through a convenience layer, so it is powerful but wordy. It is the same machinery your browser's built-in DevTools panel (F12) uses to inspect a page. Puppeteer, Playwright, and many stealth tools sit on top of CDP. For scraping it gives you fine-grained control: intercept network requests, override headers, run JavaScript inside the page, capture screenshots, and dump the DOM (the live structure of the page). Using CDP directly is verbose, but it reaches capabilities the higher-level libraries do not expose.
### Quick facts
- **What it controls:** Any Chromium browser (Chrome, Edge, Brave, Opera)
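- **Connection:** WebSocket to the browser's remote-debugging endpoint (listed at http://localhost:9222/json when Chrome runs with --remote-debugging-port=9222)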
- **Built on by:** Puppeteer, Playwright (for Chromium), undetected-chromedriver
- **Direct use cases:** Custom interception, attach-to-existing-Chrome, browser-internal probes
- **Detectability:** Launching Chrome with --remote-debugging-port enables CDP; some sites try to detect this
### Where CDP fits
Every library that controls a Chromium browser ultimately speaks CDP under the hood. Puppeteer and Playwright wrap it in friendlier APIs and add their own conveniences, such as auto-waiting (pausing until an element is ready) and selector engines (helpers for finding elements on the page). For about 95% of scraping you want the wrapper, not raw CDP. The main exception is when you need to attach to a real user's Chrome - a profile that already has cookies, history, and extensions installed - instead of launching a fresh headless instance (a browser with no visible window). In that case, talking to CDP directly through its WebSocket endpoint (the live two-way connection the browser opens for debugging) is the cleanest path.
### Detection considerations
Chrome only opens the CDP port when you launch it with the --remote-debugging-port flag. Some defensive scripts probe for this and flag the session - but it is a weak signal, because the port is visible only to the host machine, not to the page itself. The stronger CDP-related giveaway is the Runtime domain being enabled in the page context, which Puppeteer and Playwright do by default by sending Runtime.enable. (A domain is one feature area of CDP; the Runtime.enable command turns on JavaScript-execution hooks, and that leaves traces a page can notice.) Some automation tools leave such domains disabled when they are not needed.
### When to use CDP directly
Three real cases call for raw CDP: (1) attaching to an existing Chrome process that uses a real profile, (2) building custom request interception that Playwright's API does not expose, and (3) building a tool that needs precise control over which CDP domains are enabled. For everything else, Playwright or Puppeteer is the better default.
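If you do go raw, most of the work is bookkeeping: every command carries an integer `id`, the browser echoes that `id` in its reply, and events arrive on the same socket without an `id`. A minimal sketch of that bookkeeping, with hypothetical class and function names and no live browser needed:

```python
import itertools
import json


class CDPCommands:
    """Builds CDP command messages with unique, increasing ids."""

    def __init__(self):
        self._ids = itertools.count(1)

    def build(self, method: str, **params) -> str:
        return json.dumps({"id": next(self._ids), "method": method, "params": params})


def is_reply_to(raw_message: str, command_id: int) -> bool:
    """Replies echo the command id; events carry a method but no id."""
    return json.loads(raw_message).get("id") == command_id


commands = CDPCommands()
enable_network = commands.build("Network.enable")
navigate = commands.build("Page.navigate", url="https://example.com")

reply = '{"id": 2, "result": {"frameId": "ABC"}}'
event = '{"method": "Page.loadEventFired", "params": {"timestamp": 1.0}}'
print(is_reply_to(reply, 2), is_reply_to(event, 2))
```

Controlling exactly which `*.enable` commands get sent, as in case (3), comes down to choosing which messages you build and send.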
### Example
```python
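import asyncio, json, websockets, requests

async def cdp_navigate(url):
    # Chrome must already be running with --remote-debugging-port=9222
    targets = requests.get('http://localhost:9222/json').json()
    # Pick the first page target (the list can also include workers and extensions)
    page = next(t for t in targets if t.get('type') == 'page')
    async with websockets.connect(page['webSocketDebuggerUrl']) as ws:
        await ws.send(json.dumps({
            'id': 1, 'method': 'Page.navigate', 'params': {'url': url}
        }))
        # Read messages until the reply carrying our command id arrives
        while json.loads(await ws.recv()).get('id') != 1:
            pass

asyncio.run(cdp_navigate('https://example.com'))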
```
### FAQ
**Q: Should I use CDP directly or Playwright?**
Use Playwright unless you have a specific reason not to. Direct CDP is verbose, undocumented for many edge cases, and tends to break between Chrome versions. Playwright keeps that compatibility working for you.
**Q: Can sites detect CDP usage?**
They cannot see the protocol itself, but they can detect its symptoms: side effects of Runtime.enable, a missing chrome.runtime object, and certain navigator probes (checks scripts run against the browser's navigator object). Well-configured automation tools account for most of those signals.
**Q: Does CDP work in Firefox?**
Only partly. Firefox shipped an experimental subset of CDP that lacks many domains (feature areas), and Playwright drives Firefox through its own protocol rather than CDP. For Firefox scraping (Camoufox is Firefox-based), the Playwright API is the cleaner interface.
---
## What Is an MCP Server for Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-mcp-server-for-scraping
**An MCP server for scraping is a Model Context Protocol endpoint that exposes scraping tools (fetch, screenshot, parse, search) as callable functions an AI agent can invoke.** MCP - the Model Context Protocol, a standard way for AI assistants to plug into outside tools, a bit like a USB port for AI - was introduced by Anthropic in late 2024 and adopted across the AI tooling ecosystem in 2025. It replaced the one-off, hand-written wiring each AI assistant used to need with a single shared protocol. For scraping, this means an AI agent like Claude or a custom OpenAI assistant can call scrape(url, schema), search(query), or browser.click(selector) against a managed scraping backend without the agent author writing any HTTP glue code.
### Quick facts
- **Standard:** Model Context Protocol (Anthropic, 2024) — JSON-RPC over stdio or SSE
- **Common scraping MCPs:** Firecrawl, Browserbase, Apify, Burp Suite, Steel, webclaw
- **Typical tools exposed:** scrape(url), search(query), browser.click/type/screenshot, crawl(url, depth)
- **Auth model:** API key passed at connection setup, sometimes per-tool scopes
- **Where it wins:** Agent-driven workflows where the AI decides what to scrape next
### Why MCP changed scraping for AI agents
Before MCP, every AI-agent integration was built from scratch. Claude's tool-use format and OpenAI's function-calling format were different; LangChain, LlamaIndex, and CrewAI each had their own wiring. To give an agent the ability to scrape, you had to write the scraping client, a JSON schema describing each function (its inputs and outputs), the error-handling, and the rate-limit logic - then copy all of it into every agent framework you wanted to support.
MCP collapsed this down to one server per tool, which any MCP-capable client can use. Firecrawl, Browserbase, and Apify shipped MCP servers in early 2025; by late 2025 most managed scraping APIs offer one. On the agent side, the code is now just a single MCP connection string in a config file. The scraping vendor handles the hard parts - fingerprinting, proxies, and CAPTCHA - and presents a clean set of tools.
### What tools an MCP scraping server typically exposes
The conventional surface across major scraping MCPs:
| Tool | What it does | Used when the agent… |
| --- | --- | --- |
| scrape(url) | Fetches and returns clean markdown or text | …knows the URL and just needs content |
| search(query) | SERP scrape: returns ranked URLs + snippets | …needs to find a page first |
| crawl(url, depth) | Recursive scrape with budget | …wants the whole site or section |
| extract(url, schema) | LLM-extraction against a Pydantic-style schema | …needs structured data, not text |
| browser.{click, type, ...} | Stateful browser session for interactive flows | …needs login, multi-step forms, infinite scroll |
| screenshot(url) | Returns PNG for vision-model inspection | …needs to verify visual state |
A quick read of the table: most tools are single-shot (give a URL, get content back), while browser.{click, type, ...} keeps a live, stateful session open so the agent can interact step by step - useful for logins or multi-page forms. The extract tool is the one that returns structured data shaped to a schema rather than raw text.
Burp Suite's MCP server is the outlier - it exposes the security-research surface (intercept, modify, replay) rather than scraping primitives. It is included here because the recon workflows it enables overlap with mobile API discovery.
### When MCP wins and when it doesn't
**MCP wins when** the work is driven by an agent and the timing is unpredictable: research assistants, customer-support bots that look things up, code agents that read documentation, content-generation pipelines that need fresh source material. The agent decides which URLs to scrape; the MCP server handles the how.
**MCP does not win when** the work is a known, repeating batch job: scrape this 10k-product list every 12 hours, or monitor these 500 SKUs every minute. For those, a traditional REST scraping API - a plain HTTP endpoint you call on a fixed schedule - is cheaper, more predictable, and easier to monitor. MCP's value is the agent-orchestration glue, not the scraping itself.
The other catch is cost. MCP servers from managed vendors charge per tool call. An agent that scrapes 1000 URLs per task at $0.005 each costs $5 per task - fine for occasional research, expensive for production. Self-hosting your own MCP server (Firecrawl's open-source variant, Crawl4AI's MCP wrapper, the webclaw Rust server) avoids that per-call fee, but you take on the work of running the infrastructure yourself.
### Example
```json
{
  "mcpServers": {
    "firecrawl": {
      "command": "npx",
      "args": ["-y", "@firecrawl/mcp-server"],
      "env": { "FIRECRAWL_API_KEY": "fc-..." }
    },
    "browserbase": {
      "command": "npx",
      "args": ["-y", "@browserbasehq/mcp-server"],
      "env": {
        "BROWSERBASE_API_KEY": "bb_...",
        "BROWSERBASE_PROJECT_ID": "proj_..."
      }
    }
  }
}
```
### FAQ
**Q: Do I need MCP if I am already using LangChain or LlamaIndex?**
No - those frameworks already have their own way of calling tools and can talk to an HTTP scraping API directly. MCP is most useful when the thing calling the tools is Claude Desktop, Cursor, Cline, or another AI client built around the MCP standard. For pure Python agent frameworks, calling the vendor's HTTP API is actually one step shorter than going through MCP.
**Q: Is MCP scraping fundamentally different from REST scraping or just a wrapper?**
It is a wrapper, but a useful one. The scraping itself is identical to the vendor's HTTP API. What MCP adds is a discovery protocol (the agent asks "what tools do you have?" and gets a schema back) and a standard, consistent way of reporting errors. The same Firecrawl backend serves both endpoints.
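Under the hood, that discovery step and a tool call are both JSON-RPC 2.0 messages. The sketch below builds the two requests an MCP client sends, `tools/list` and `tools/call`; the tool name `scrape` and its `url` argument are assumptions that vary by server, and the initialization handshake that precedes these messages is omitted.

```python
import json
from typing import Any, Dict, Optional


def jsonrpc_request(request_id: int, method: str,
                    params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one JSON-RPC 2.0 request."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discovery: ask the server which tools it offers and what their schemas are.
list_tools = jsonrpc_request(1, "tools/list")

# Invocation: call one tool by name ("scrape" is a hypothetical tool name).
call_scrape = jsonrpc_request(2, "tools/call", {
    "name": "scrape",
    "arguments": {"url": "https://example.com"},
})
print(list_tools)
print(call_scrape)
```

An MCP client library normally produces these messages for you; seeing them spelled out mainly helps when debugging a server over stdio.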
**Q: Can I host my own MCP scraping server?**
Yes. Firecrawl and Crawl4AI are open-source and come with MCP servers. The webclaw Rust server is purpose-built for low-latency MCP scraping. The catch is the same as with any self-hosted scraping setup - you are responsible for the proxies, fingerprinting, and JavaScript rendering.
---
## What Is the Scrapy + Go TLS Sidecar Architecture?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-scrapy-go-tls-sidecar-architecture
**The Scrapy + Go TLS sidecar architecture is the most common production pattern for scraping Akamai- and Cloudflare-protected sites at scale.** The idea is a division of labor: Scrapy (the Python scraping framework) handles orchestration, queueing, retries, and pipelines (its chain of post-processing steps), but its underlying HTTP stack, Twisted, cannot impersonate Chrome's TLS handshake. TLS is the encryption layer behind https, and the handshake is the opening negotiation that gives away whether you are a real browser. So a small Go HTTP service runs alongside Scrapy as a *sidecar* (a helper process that does one job), exposes a POST /fetch endpoint, and uses a Chrome-exact TLS library (utls) to make the actual request. Scrapy talks to the sidecar over local HTTP through a custom download handler. The result is Chrome-perfect JA4 + HTTP/2 fingerprints with all the productivity of Scrapy's framework.
### Quick facts
- **Components:** Scrapy (Python) + Go HTTP sidecar with utls / azuretls
- **Bridge:** Custom download handler (registered in DOWNLOAD_HANDLERS) that POSTs to the sidecar
- **Session model:** Pool of N sidecar connections, sticky session ID → connection map
- **Typical throughput:** 20–50 req/min per session against Akamai; 200+ req/min against unprotected sites
- **When to use:** Akamai, Cloudflare BM, PerimeterX at scale. Skip for unprotected sites — pure Scrapy is simpler.
### Why pure Scrapy fails on protected sites
Scrapy is built on Twisted, which uses OpenSSL with default cipher suites and a stock HTTP/2 implementation. The problem is the fingerprints this produces. The JA4 fingerprint — a short signature derived from the TLS handshake — is not Chrome's, the HTTP/2 SETTINGS frame is not Chrome's, and the pseudo-header order is not Chrome's. Anti-bot vendors score the handshake before any HTML is served, so even a perfect User-Agent and a freshly rotated proxy can't help — the request is already classified as bot.
Adding curl_cffi to a Scrapy spider via a custom downloader works for medium-strength deployments. For Akamai's harder customers it fails because curl_cffi's impersonation profiles lag the latest Chrome by a few versions, and Akamai's sensor detects the gap. The Go path uses utls — a TLS library actively maintained against Chrome master — and stays current within days of a Chrome release.
### The architecture
Three processes share one network. Here is who does what:
| Component | Role | Lives where |
| --- | --- | --- |
| **Scrapy spider** | URL queue, retries, item pipeline, deduplication | Worker container (Python) |
| **Go TLS sidecar** | Issues actual HTTPS requests with Chrome TLS via utls | Same pod / container, localhost:8080 |
| **Proxy pool** | ISP/residential IPs; sticky per Scrapy session ID | External (Bright Data, Decodo, custom) |
The flow is straightforward. Scrapy's spider issues a request as normal. A custom DOWNLOAD_HANDLERS entry replaces the default HTTPS handler with one that POSTs {url, method, headers, body, session_id, proxy} to localhost:8080/fetch (a service on the same machine). The Go process keeps a pool of sessions keyed by session_id — each session has its own utls connection, cookie jar, and proxy binding. Scrapy receives the response body and headers as if it had fetched directly.
The session pool matters because of how Akamai builds trust. Its _abck cookie accumulates trust across requests sent over the same TLS connection. Opening a new TLS connection for every request resets that trust score. So each Scrapy spider instance is mapped to one Go session, which holds its TLS connection and cookies for the entire crawl.
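The lookup the sidecar performs can be modeled in a few lines. The real sidecar is written in Go and each session would hold a live utls connection; this Python stand-in, with hypothetical class names, only illustrates the keying logic: the first request for a `session_id` creates the session, and every later request gets the same object back.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    """Stand-in for one sidecar session (connection, cookies, proxy binding)."""
    session_id: str
    proxy: Optional[str] = None
    cookies: Dict[str, str] = field(default_factory=dict)


class SessionPool:
    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def get(self, session_id: str, proxy: Optional[str] = None) -> Session:
        """Return the existing session for this id, creating it on first use."""
        session = self._sessions.get(session_id)
        if session is None:
            session = Session(session_id, proxy)
            self._sessions[session_id] = session
        return session


pool = SessionPool()
first = pool.get("spider-1", proxy="http://proxy.example:8000")
first.cookies["_abck"] = "value-from-first-response"
again = pool.get("spider-1")
other = pool.get("spider-2")
print(again is first, other is first)
```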
### Why Go specifically
Three reasons Go is the default choice for the sidecar:
- **utls is the gold standard for Chrome TLS impersonation.** The Go ecosystem maintains it, the Chinese scraping community contributes Chrome-version-specific profiles upstream weekly, and the library tracks Chrome master more closely than any Python equivalent (curl_cffi wraps a forked build of curl whose impersonation profiles lag Chrome by a few versions).
- **Concurrency for free.** A Scrapy worker fires bursts of requests; goroutines (Go's lightweight threads) absorb the load with no thread-pool tuning. A Python sidecar would need asyncio or threading — both work but neither is as cheap as goroutines.
- **HTTP/2 framing is exposed.** Libraries like azuretls let you set the SETTINGS frame values, the WINDOW_UPDATE delta, and the pseudo-header order directly — all low-level details Chrome sends a specific way. Python's httpx hides these behind its own HTTP/2 implementation.
The TypeScript ecosystem has caught up with cycle-tls (Node.js, bundles Go under the hood). Rust's webclaw is the newer alternative. Both work; Go remains the default because the production case studies that prove the pattern were written in Go.
### Tradeoffs and when to skip this pattern
The sidecar is operational overhead. You run two processes per worker, ship one more deployment artifact, and add one more thing that can crash. For most scraping problems this is unnecessary. Skip it when:
- The target uses no anti-bot or only Cloudflare Bot Fight Mode — pure Scrapy with a residential proxy works.
- The target uses DataDome — per-request scoring means session continuity matters less than IP quality. curl_cffi + Scrapy is enough.
- The volume is below ~10k requests/day — a managed scraping API is cheaper than running this infrastructure.
Use it when the target is Akamai, Cloudflare Bot Management Enterprise, or PerimeterX, and the volume justifies running your own infrastructure (~50k+ requests/day). Below that volume, a managed API at $0.50–$3 per 1000 requests is cheaper than the engineering time to build and operate this pattern.
### Example
```python
# Scrapy custom download handler: bridges Scrapy to the Go sidecar
import requests
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.http import HtmlResponse
from twisted.internet.threads import deferToThread

SIDECAR = "http://localhost:8080/fetch"

class GoTLSDownloadHandler(HTTPDownloadHandler):
    def download_request(self, request, spider):
        payload = {
            "url": request.url,
            "method": request.method,
            "headers": dict(request.headers.to_unicode_dict()),
            "body": request.body.decode("utf-8", "replace"),
            "session_id": request.meta.get("session_id", "default"),
            "proxy": request.meta.get("proxy"),
        }

        def _go():
            r = requests.post(SIDECAR, json=payload, timeout=60)
            r.raise_for_status()
            out = r.json()
            return HtmlResponse(
                url=out["final_url"],
                status=out["status"],
                headers=out["headers"],
                body=out["body"].encode(),
                request=request,
            )

        return deferToThread(_go)

# settings.py
DOWNLOAD_HANDLERS = {
    "https": "myproject.handlers.GoTLSDownloadHandler",
    "http": "myproject.handlers.GoTLSDownloadHandler",
}
```
### FAQ
**Q: Can I do this with curl_cffi in Python instead of Go?**
For medium-strength targets, yes — using curl_cffi as a Scrapy downloader handler is a valid, simpler architecture. The reason production teams pick Go is that utls tracks Chrome master closer than curl_cffi tracks BoringSSL (the TLS engine curl_cffi imitates), and Akamai's detection model rewards the freshest TLS profile. If you can accept being a few Chrome versions behind, the Python-only path is fine.
**Q: Why not just use a real headless browser like Camoufox?**
Throughput. A Camoufox instance uses 200-400MB of RAM and handles ~5 requests/min with the kind of warm-up and pacing Akamai expects. A Go TLS sidecar handles 50+ requests/min per session in 20MB of RAM. For high-volume scraping where the protection is TLS-and-HTTP/2-centric (not JS-heavy), the sidecar is 10× cheaper to operate.
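As a rough check on those figures, the arithmetic below compares RAM per unit of throughput. The inputs are the numbers quoted in this answer (with 50 requests/min taken as the sidecar's floor); the function name `ram_per_rpm` is an assumption for illustration.

```python
def ram_per_rpm(ram_mb: float, requests_per_min: float) -> float:
    """RAM in MB needed to sustain one request per minute."""
    return ram_mb / requests_per_min


camoufox_low = ram_per_rpm(200, 5)   # 40.0 MB per request/min
camoufox_high = ram_per_rpm(400, 5)  # 80.0 MB per request/min
sidecar = ram_per_rpm(20, 50)        # 0.4 MB per request/min

throughput_ratio = 50 / 5                 # 10.0: per-session throughput gap
ram_ratio_low = camoufox_low / sidecar    # about 100
ram_ratio_high = camoufox_high / sidecar  # about 200
```

On these inputs the sidecar is 10× ahead on per-session throughput and roughly 100 to 200× ahead on RAM per request/min, so the "10× cheaper" figure reads as conservative if memory is the dominant cost.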
**Q: How do I rotate sessions without losing _abck trust?**
Pre-warm them. Before retiring a session, spin up the next one with the same proxy IP and warm it by visiting the homepage, waiting a few seconds, then visiting one product page. That way the new session's _abck cookie is already trusted before you need it. Keep a small ring of pre-warmed sessions ahead of the queue and throughput stays smooth.
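The ring of pre-warmed sessions described above could be sketched as follows. Everything here is an assumption for illustration: the class name `WarmSessionRing`, the `session-N` ID scheme, and the `warm` callable, which stands in for whatever routine visits the homepage, pauses, and loads a product page through the same proxy IP.

```python
from collections import deque
from typing import Callable, Deque


class WarmSessionRing:
    """Keep `size` already-warmed session IDs ready ahead of demand."""

    def __init__(self, warm: Callable[[str], None], size: int = 3) -> None:
        self._warm = warm  # hypothetical warm-up routine (homepage, pause, product page)
        self._size = size
        self._ready: Deque[str] = deque()
        self._counter = 0
        self._refill()

    def _refill(self) -> None:
        # Warm new sessions until the ring is full again
        while len(self._ready) < self._size:
            self._counter += 1
            session_id = f"session-{self._counter}"
            self._warm(session_id)
            self._ready.append(session_id)

    def rotate(self) -> str:
        """Hand out the oldest warmed session and warm a replacement."""
        session_id = self._ready.popleft()
        self._refill()
        return session_id
```

The returned ID would be passed as `session_id` in the request meta. This sketch warms replacements synchronously for clarity; a production version would more likely warm them in the background so rotation never blocks the crawl.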
**Q: Where does the proxy go — Scrapy side or Go side?**
Go side. Scrapy passes the proxy URL through the request meta, and the Go sidecar uses it for both the TLS handshake and the upstream connection. Putting the proxy on the Scrapy side defeats the whole architecture, because the TLS handshake would then originate from Scrapy, not from utls.
---
## Web Scraping Tools 2026 — A Comparison
URL: https://scrappey.com/qa/web-scraping-apis/web-scraping-tools-2026
**"Web scraping tools" is the whole family of software you use to pull data off websites — and in 2026 that family is big but neatly sorted into roles.** Each tool does one main job: HTTP/TLS impersonation (mimicking a real browser's network signature), browser automation, framework/orchestration, AI scraping, HTML parsing, reverse engineering, or managed APIs. The right pick depends entirely on which job you need done. This page is one place to compare the major options, grouped by role, with a one-line strength for each. For help deciding which role you need first, see the scraping decision flow.
### Quick facts
- **Roles covered:** HTTP/TLS, browser automation, frameworks, AI scraping, parsing, reverse engineering, CAPTCHA solving, managed APIs
- **Tools listed:** ~50 across all roles
- **Languages:** Python, Node.js, Go, Rust, .NET, Java
- **Selection principle:** Pick the role first (decision flow), then the tool within it
- **What this page is not:** A "best of" ranking — each tool has a valid niche
### The comparison table
| Tool | Lang | Role | Strength |
| --- | --- | --- | --- |
| *HTTP / TLS impersonation* | | | |
| **curl_cffi** | Python | HTTP client with Chrome TLS | Default for most scraping today; wraps a forked curl |
| **tls-client** | Go / Python wrapper | JA3/JA4 fingerprint matching | Used inside Python via Go shim; flexible profile config |
| **utls / azuretls** | Go | Low-level Chrome TLS | Tracks Chrome master closer than anything else; sidecar-of-choice |
| **cycle-tls** | Node.js | Browser TLS in JS | Bundles Go under the hood; only solid Node option |
| **noble-tls** | Python | Pure-Python JA3/JA4 | No native deps, so easier to deploy; slightly behind on profile freshness |
| **hrequests** | Python | requests-compatible stealth client | Drop-in for legacy requests-based code |
| **Scrapling** | Python | High-level scraping client | Built-in Turnstile solve, auto-retry, content fingerprinting |
| **webclaw** | Rust | MCP-native scraping | 10 MCP tools, sub-second cold start, AI-extraction first |
| *Browser automation* | | | |
| **Playwright** | Python / Node / .NET / Java | CDP-based browser driver | Multi-language, auto-wait, parallel contexts; default browser tool |
| **Puppeteer** | Node.js | CDP browser driver (Chrome only) | Google's original; smaller surface, mature ecosystem |
| **Selenium** | Python / Java / many | Legacy WebDriver browser driver | Widest browser support; oldest detection surface (navigator.webdriver) |
| **SeleniumBase UC** | Python | Selenium + undetected-chromedriver | Quick on/off CDP stealth, pytest integration |
| **undetected-chromedriver** | Python | Patched Chrome driver | Patches CDP fingerprint at runtime; handles simple checks |
| **nodriver** | Python | Raw CDP async, no WebDriver | Asyncio-native; no WebDriver fingerprint at all |
| **pydoll** | Python | Pure-Python CDP | No native deps; lightweight CDP wrapper |
| **Camoufox** | Python | Stealth Firefox fork (Juggler protocol) | No CDP leaks; passes most Cloudflare deployments by default |
| **CloakBrowser** | Python / Node | Patched Chromium with C++ stealth | 49 documented C++ patches; high reCAPTCHA v3 scores |
| **PatchRight** | Python | Playwright source-patching | Patches Playwright source so toString() inspection passes; holds up against Kasada |
| **Botasaurus** | Python | High-level scraping framework | Gaussian mouse curves, profile management, deployable as API |
| **Botright** | Python | CAPTCHA-focused browser automation | Built-in solvers for hCaptcha, FunCaptcha, GeeTest |
| *Frameworks & orchestration* | | | |
| **Scrapy** | Python | Crawler framework + pipelines | Industry default for large crawls; built-in queue, retries, deduplication |
| **Crawlee** | Node / Python | Apify's unified scraping framework | Switches between HTTP, Cheerio, Playwright behind one API |
| **Colly** | Go | Go crawler framework | Fastest framework option; ideal for pure-HTTP heavy-volume jobs |
| **Katana** | Go | Security-oriented crawler | Recon tool that doubles as a crawler; headless-mode flag |
| **Scrapyd / scrapy-redis** | Python | Scrapy deployment & distribution | Daemon (Scrapyd) and Redis-backed queue (scrapy-redis) for scaling Scrapy |
| *AI / LLM scraping* | | | |
| **Firecrawl** | Hosted + open-source | Managed AI-scraping API | Markdown output, MCP server, FIRE-1 extraction agent |
| **Crawl4AI** | Python | Self-hosted LLM scraping | MIT licensed; Ollama-compatible local extraction |
| **ScrapeGraphAI** | Python | NL-to-graph scraping pipelines | Self-healing extraction when target schema drifts |
| **Jina Reader** | Hosted API | One-endpoint AI scrape | r.jina.ai/{url}: simplest possible interface, generous free tier |
| **Browserbase** | Hosted | Managed cloud browsers for agents | Stagehand integration, MCP server, persistent sessions for AI agents |
| **Steel** | Self-hosted | Open-source cloud browser | Self-hosted alternative to Browserbase; MCP server included |
| *HTML / data parsing* | | | |
| **BeautifulSoup4** | Python | Beginner-friendly HTML parser | Easiest API; slow on large documents |
| **lxml** | Python | Fast XML/HTML parser | C-backed; the engine behind BeautifulSoup and Parsel |
| **selectolax** | Python | Ultra-fast HTML parsing | C-based; 10–100× faster than BeautifulSoup, CSS selectors only |
| **Parsel** | Python | Scrapy's selector library | XPath + CSS, drop-in outside Scrapy too |
| **chompjs** | Python | JavaScript object literal parser | Extracts JS-embedded data without running a JS runtime |
| *Reverse engineering* | | | |
| **mitmproxy** | Python | HTTPS intercepting proxy | CLI / web UI / scriptable; mobile API discovery default |
| **HTTP Toolkit** | App | GUI HTTPS interception | One-click iOS / Android device intercept, friendly UI |
| **Frida** | Multi-lang | Runtime instrumentation | Certificate-pinning handling, function hooking on mobile |
| **Burp Suite** | App / Pro | Commercial intercepting proxy + MCP | PortSwigger's pen-test workbench; MCP server for AI-driven recon |
| *CAPTCHA solving* | | | |
| **CapSolver** | API | AI-powered solver | Sub-10s solves on most CAPTCHA types; Turnstile, reCAPTCHA, hCaptcha |
| **2Captcha** | API | Human + AI hybrid | Oldest service in the category; falls back to humans on novel CAPTCHAs |
| **Anti-Captcha** | API | Human + AI hybrid | Similar to 2Captcha; some teams prefer for hCaptcha accuracy |
| *Managed scraping APIs* | | | |
| **Scrappey** | API | Full-stack managed scraping | Handles authorized verification workflows, residential proxies, and rendering in one call |
| **Bright Data** | API + proxy | Largest proxy network + scraping APIs | 100M+ residential IPs; covers F5 Shape targets others can't |
| **Oxylabs** | API + proxy | Enterprise scraping APIs | SERP, e-commerce, real-estate verticals + OxyCopilot AI assistant |
| **Zyte** | API | Smart Proxy Manager + Scrapy Cloud | Built by the Scrapy team; deepest Scrapy integration |
| **Apify** | Platform | Pre-built scraper marketplace | 10k+ ready-made Actors; built-in scheduling and storage |
| **ScrapingBee** | API | Simple managed scraping | Generous free tier; easy onboarding for one-off jobs |
| **Decodo (Smartproxy)** | API + proxy | Mid-market proxy + scraping | Renamed from Smartproxy in 2024; balanced price/performance |
### How to read this table
Focus on the role groupings, not the individual tool names. Most scraping failures come from picking the wrong role — reaching for a browser automation tool when a plain HTTP client would have worked, or paying for a managed API when a 30-line script would have done the job. The safe approach is to work top-down and stop at the first role that works:
- **Try HTTP/TLS first.** If curl_cffi impersonating Chrome gets you the page, stop there. Every role below it in the table costs more compute or money.
- **Escalate to browser automation only when the page needs JavaScript to run.** Most product pages, search results, and API endpoints don't. Infinite scroll, OAuth login flows, and single-page apps (sites that build their content in the browser) do.
- **Add a framework once the crawl grows past ~1000 URLs.** Below that, a script is fine. Above it, Scrapy or Crawlee earn their keep by handling retries and data pipelines for you.
- **Reach for AI scraping when the data layout is fuzzy or keeps changing.** Firecrawl, Crawl4AI, and ScrapeGraphAI let an LLM (large language model) pull out the fields, so you stop hand-maintaining a parser for every site.
- **Use managed APIs for the hard, low-volume cases.** When the protection is Akamai, F5 Shape, or Bot Management Enterprise and your volume doesn't justify running your own infrastructure, a managed API costs less than the engineering time to handle it yourself.
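The "work top-down and stop at the first role that works" rule can be expressed as a short escalation loop. This is a minimal sketch under stated assumptions: `fetch_with_escalation` and the `Fetcher` type are hypothetical names, and each fetcher is assumed to return the page HTML or `None` when it was blocked or got an empty page.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher returns HTML on success, or None when blocked / nothing usable came back.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_escalation(
    url: str, ladder: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each role in order; return (role_name, html) from the first that works."""
    for role, fetch in ladder:
        html = fetch(url)
        if html:
            return role, html
    raise RuntimeError(f"no role in the ladder could fetch {url}")


# Stub fetchers stand in for a curl_cffi call, a browser run, and a managed API call.
demo_ladder = [
    ("http_tls", lambda url: None),                # blocked at the cheap layer
    ("browser", lambda url: "<html>ok</html>"),    # succeeds, so the loop stops here
    ("managed_api", lambda url: "<html>api</html>"),
]
role_used, page_html = fetch_with_escalation("https://example.com", demo_ladder)
```

Ordering the ladder from cheapest to most expensive is the whole design: the managed API is only ever paid for when the cheaper roles have already failed.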
### What is and isn't in this list
This list sticks to tools that are actively maintained and used in production today. A few categories exist but were left out on purpose:
- **Legacy HTTP clients** (plain requests, aiohttp) — fine for unprotected sites but beaten by any modern anti-bot system, so they're folded into the curl_cffi entry rather than listed on their own.
- **Browser fingerprint databases** (Multilogin, GoLogin) — handy related tooling, but they aren't scraping tools themselves; they're covered in the browser-fingerprint entries.
- **Proxy aggregators** (SwiftShadow, Scrapoxy) — covered in the proxies category instead of the tools category.
- **Generic JavaScript runtimes** (Node, Bun) — not scraping tools, but every Node-based tool above runs on one.
If a tool you rely on is missing, it's most likely because it doesn't add anything new to the comparison — it usually fills the same niche as a listed one, and you'll learn it fastest through the listed equivalent.
### FAQ
**Q: Which tool should I start with as a beginner?**
Start with Python + requests for your first scrape; it works on anything unprotected. When you hit a 403 (blocked) response, switch to curl_cffi; it offers a requests-compatible API, so most code drops straight in. When you hit a JavaScript-heavy site that won't load without a browser, switch to Playwright. When the crawl grows past ~1000 URLs, wrap it in Scrapy. Each step solves a specific problem the previous one couldn't.
**Q: Why are managed APIs at the bottom of the table?**
It's not a ranking — bottom here means "last resort". Managed APIs are the right answer for hard targets (Akamai, F5 Shape) when your volume is below the level that would justify building your own setup, and for teams where engineering time costs more than per-request fees. The top of the table is "cheap and do-it-yourself"; the bottom is "more expensive but zero maintenance".
**Q: Where do Camoufox, CloakBrowser, and PatchRight fit if Playwright is the default?**
They're hardened alternatives to a stock Playwright setup: Camoufox is a stealth Firefox fork, CloakBrowser is a patched Chromium, and PatchRight is a source-patched Playwright. You switch to them when Playwright's default fingerprint (the signature anti-bot systems read) gets caught, for example by Cloudflare Bot Management, Kasada, or recent Akamai. The API is mostly the same; Camoufox even works with async_playwright. The cost of moving to them is operational, not a new thing to learn.
**Q: Why is curl_cffi listed under HTTP/TLS but Playwright under browser automation if both fetch URLs?**
Because they work very differently. curl_cffi sends one HTTP request and parses the response text — that's it. Playwright launches a full browser, runs the page's JavaScript, renders the page, and then runs your code against the resulting DOM (the live page structure). curl_cffi uses about 5MB of memory per request; Playwright uses around 200MB and is roughly ten times slower. Different problems, different costs.
---
## What Is Playwright?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-playwright
**Playwright is a cross-browser automation framework from Microsoft that drives Chromium, Firefox, and WebKit through a single API.** An automation framework means your code can control a real browser - opening pages, clicking, typing - instead of just downloading raw HTML. Released in 2020 as a Puppeteer successor, it added auto-waiting (it waits for elements to be ready so you don't have to guess), parallel browser contexts (multiple isolated sessions at once), and first-class support for Python, .NET, and Java alongside Node.js. In scraping it is the default browser-automation choice when JavaScript execution is required - but it ships with default fingerprints (identifying traits a site can read) that anti-bot vendors detect immediately, so production scrapers run a patched variant (Camoufox, PatchRight, CloakBrowser) rather than vanilla Playwright.
### Quick facts
- **Vendor:** Microsoft (open-source, Apache 2.0)
- **Languages:** Python, Node.js / TypeScript, .NET, Java
- **Browsers:** Chromium, Firefox, WebKit (via patched binaries it ships)
- **Protocol:** Chrome DevTools Protocol (CDP) for Chromium; bidirectional WebSocket
- **Default detection:** Block-grade on Akamai, Kasada, Cloudflare BM out of the box
### Where Playwright fits in scraping
Playwright is the right tool when the data is rendered client-side - built by JavaScript in the browser - so a plain HTTP client can't reach it: single-page apps that fetch via XHR (background requests) after the first paint, infinite-scroll lists, OAuth login flows, anything that requires real DOM events like clicks. The trade-off is weight: it runs ~200MB of RAM per browser context - far heavier than a lightweight HTTP client like curl_cffi - so use it only when the lighter approach doesn't work.
The Python API is the most common in scraping. async_playwright integrates cleanly with asyncio (Python's async system), and scrapy-playwright plugs it into Scrapy as a download handler, so a crawl uses a real browser only on the specific pages that need one. The Node.js version is the original and slightly ahead on features, but the Python one is stable and close to feature parity.
### Why default Playwright gets blocked
Vanilla (unmodified) Playwright is detected on multiple surfaces at once - each one a separate giveaway:
- **navigator.webdriver === true** - the most-checked flag; it openly announces "a browser is being automated" and is set by Playwright and Selenium alike.
- **CDP connection signal** - the channel Playwright uses to control Chrome leaves traces; anti-bot scripts probe for side effects of an attached DevTools session and for Runtime.evaluate timing artifacts. (The window.cdc_ properties often cited alongside these come from chromedriver, so they apply to Selenium rather than Playwright.)
- **Headless mode tells** — running without a visible window leaves gaps a real browser wouldn't have: missing chrome.runtime, missing plugins, a languages array of length 1, no permissions API.
- **Function.toString() inspection** — a site can ask a browser function to print its own source; any stealth plugin that patches methods at the JS level fails this check (see the toString inspection entry).
- **Default Playwright User-Agent** includes "HeadlessChrome" unless explicitly overridden, which flags the request instantly.
Setting headless: false and overriding the User-Agent removes the cheapest signals, but the CDP signal and toString inspection still fire. Presenting a consistent fingerprint in production generally requires a patched fork rather than runtime configuration.
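To see which of the cheaper signals your own setup exposes, a self-check along these lines can help. It is a sketch, not a model of how any vendor scores traffic: `PROBE_JS`, `collect_signals`, and `flag_tells` are hypothetical names, and the pass/fail rules simply restate the tells listed above. It does not cover the CDP or toString() checks.

```python
# JavaScript run inside the page; it reads the surfaces listed above.
PROBE_JS = """() => ({
  webdriver: navigator.webdriver === true,
  userAgent: navigator.userAgent,
  pluginCount: navigator.plugins.length,
  languageCount: navigator.languages.length,
  hasChromeRuntime: !!(window.chrome && window.chrome.runtime),
})"""


async def collect_signals(page) -> dict:
    """Run the probe on a Playwright page object (async API)."""
    return await page.evaluate(PROBE_JS)


def flag_tells(signals: dict) -> list:
    """Return a human-readable list of the tells present in `signals`."""
    tells = []
    if signals.get("webdriver"):
        tells.append("navigator.webdriver is true")
    if "HeadlessChrome" in signals.get("userAgent", ""):
        tells.append("User-Agent contains HeadlessChrome")
    if signals.get("pluginCount", 0) == 0:
        tells.append("no plugins")
    if signals.get("languageCount", 0) <= 1:
        tells.append("languages array has length <= 1")
    if not signals.get("hasChromeRuntime", False):
        tells.append("chrome.runtime missing")
    return tells
```

Calling `flag_tells(await collect_signals(page))` on a vanilla headless context and again on a patched build gives a quick before/after comparison of these particular signals.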
### Playwright vs Puppeteer vs Selenium
Picking between the three:
- **Playwright** — multi-browser, multi-language, modern auto-wait API. Default choice for new scrapers in Python or Node. Fastest learning curve.
- **Puppeteer** — Node-only, Chromium-only. Smaller API surface, mature ecosystem, slightly faster startup. Pick if you're Node-only and don't need Firefox/WebKit.
- **Selenium** — widest browser support (Safari, Edge, even mobile WebDriver), oldest API. Pick if you need Safari testing or have an existing Selenium codebase. Most detectable of the three.
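All three are detected on a default install, with Selenium exposing the most signals. Patched variants exist for each: Camoufox and PatchRight for Playwright, puppeteer-extra-plugin-stealth and puppeteer-real-browser for Puppeteer, and undetected-chromedriver and SeleniumBase UC for Selenium. The stealth ecosystem is the practical tiebreaker.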
### Example
```python
# Async Playwright with a residential proxy and User-Agent override
from playwright.async_api import async_playwright


async def scrape(url, proxy_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # run with a visible browser process
            proxy={"server": proxy_url},
        )
        try:
            ctx = await browser.new_context(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36",
                locale="en-US",
                viewport={"width": 1920, "height": 1080},
            )
            page = await ctx.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            await page.wait_for_timeout(2000)  # let XHRs settle
            return await page.content()
        finally:
            # Close the browser even if navigation fails
            await browser.close()


# Run with: asyncio.run(scrape(url, proxy_url))
# Works on simple sites; recognized by stronger systems like Akamai/Kasada.
# Camoufox or PatchRight present more consistent fingerprints.
```
### FAQ
**Q: Is Playwright better than Puppeteer for scraping?**
For Node-only Chromium scraping they're interchangeable - pick by team familiarity. Playwright wins if you need Python or Firefox/WebKit. Both lose to anti-bot systems on default settings, and both have patched variants that fix it.
**Q: How does Playwright interact with Cloudflare-protected sites?**
On free-tier Cloudflare and Bot Fight Mode it often works with a residential proxy (a real-looking home IP address). Against Cloudflare Bot Management Enterprise it is typically flagged: the JA4 (a TLS handshake fingerprint) plus CDP signals are recognized. Production setups for authorized access tend to use a patched fork such as Camoufox or a managed API.
**Q: Why use scrapy-playwright instead of just Playwright?**
When the crawl is bigger than ~1000 URLs and you want Scrapy's queue, retries, deduplication, and item pipelines, but only some pages need a browser. scrapy-playwright lets you mark specific requests as needing a browser; the rest go through the cheap, fast HTTP path.
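A sketch of that per-request routing is below. The `DOWNLOAD_HANDLERS` and `TWISTED_REACTOR` values and the `playwright` meta key are scrapy-playwright's documented configuration; the routing rule (`BROWSER_PATH_PREFIXES`, `needs_browser`, `request_meta`) is a hypothetical example of how a project might decide which URLs get a browser.

```python
from urllib.parse import urlparse

# settings.py: route downloads through scrapy-playwright's handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Hypothetical routing rule: only these URL paths need a real browser.
BROWSER_PATH_PREFIXES = ("/app/", "/search")


def needs_browser(url: str) -> bool:
    """True when the URL path matches one of the JavaScript-heavy sections."""
    return urlparse(url).path.startswith(BROWSER_PATH_PREFIXES)


def request_meta(url: str) -> dict:
    """Meta for scrapy.Request: only playwright=True requests use the browser."""
    return {"playwright": True} if needs_browser(url) else {}


# In a spider: yield scrapy.Request(url, meta=request_meta(url))
```

Requests without the `playwright` key go through Scrapy's regular HTTP path, which is what keeps the browser cost limited to the pages that need it.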
---
## What Is Puppeteer?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-puppeteer
**Puppeteer is Google's Node.js library for driving a Chromium browser from code, over the Chrome DevTools Protocol (CDP) - the same channel Chrome's own DevTools use to talk to the browser.** Released in 2017, it predates Playwright by three years and was the de facto standard for Chrome automation until Playwright's multi-browser support reframed the category. It is still the right pick when the project is Node-only, Chromium-only, and benefits from the larger Puppeteer-specific stealth-plugin ecosystem (puppeteer-extra-plugin-stealth, puppeteer-real-browser).
### Quick facts
- **Vendor:** Google (open-source, Apache 2.0)
- **Language:** Node.js / TypeScript only
- **Browser:** Chromium / Chrome (Firefox support is experimental)
- **Protocol:** Chrome DevTools Protocol (CDP)
- **Stealth ecosystem:** puppeteer-extra-plugin-stealth, puppeteer-real-browser
### Puppeteer vs Playwright in practice
The two APIs are about 85% the same - both give you the same building blocks (page, frame, request, response) under similar method names. The differences that matter for scraping:
- **Auto-waiting** — Playwright waits for an element to be ready to act on before clicking or typing; Puppeteer only waits when you tell it to. So Puppeteer scripts end up with more explicit waitForSelector calls.
- **Parallel contexts** — A context is an isolated session (its own cookies and storage) inside one browser. Playwright's browserContext is cleaner for running several of these at once. Puppeteer supports it too, but the API is older.
- **Languages** - Puppeteer is Node-only. If your stack is Python, Playwright is the only maintained option of the two (the community pyppeteer port is no longer maintained).
- **Stealth plugins** — Puppeteer's stealth ecosystem is older and more mature. puppeteer-extra-plugin-stealth has more patches than its Playwright equivalent, though both lose to Function.toString() inspection equally.
For a brand-new scraping project in Node, default to Playwright unless your team already has Puppeteer code. Puppeteer is not deprecated, but the active feature investment has shifted to Playwright.
### puppeteer-extra and the stealth plugin
The puppeteer-extra plugin system, paired with puppeteer-extra-plugin-stealth, is the most-cited anti-detection plugin ecosystem for Puppeteer. The stealth plugin runs ~17 separate patches: it hides navigator.webdriver (the flag that openly says "a script is driving this browser"), fixes the plugin array, patches WebGL parameters, normalises the User-Agent, masks the chrome.runtime object, and so on.
It addresses every detection a 2019 anti-bot system used. It does not hold up against 2024+ vendors that check via Function.toString() (see that entry) - a trick that reads back a function's source code - or that look for CDP runtime artifacts (traces left by that DevTools connection). Each of the 17 patches is a JS function whose source is visible to toString(); Kasada, recent Akamai, and PerimeterX flag this stack on first request.
For Puppeteer in production against hard targets, the modern approach is puppeteer-real-browser (driving a real Chrome rather than headless Chromium) or switching to a C++-patched variant like CloakBrowser.
### When to actually use Puppeteer
Three situations where Puppeteer is the right pick over Playwright:
- The codebase is already on Puppeteer and switching would be busywork.
- The project depends on a Puppeteer-only library (puppeteer-cluster, puppeteer-screen-recorder) that has no Playwright equivalent.
- The team specifically wants the older, smaller API - Puppeteer does less than Playwright, which some teams find easier to reason about.
For everything else, the answer is Playwright. On Chromium, the protocol (CDP), the browser binary, and the detection surface are essentially the same - the real difference is how the API feels to use and which languages it reaches.
### Example
```javascript
// Puppeteer with stealth plugin and a residential proxy
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // don't advertise headless
    args: ['--proxy-server=http://residential:port'],
  });
  try {
    const page = await browser.newPage();
    // Proxy credentials are supplied here, not in the --proxy-server flag
    await page.authenticate({ username: 'user', password: 'pass' });
    await page.setUserAgent(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
        '(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
    );
    await page.goto('https://target.com', { waitUntil: 'domcontentloaded' });
    const html = await page.content();
    console.log(html.length);
  } finally {
    // Close the browser even if navigation throws
    await browser.close();
  }
})();
// Stealth plugin handles simple checks; it is still exposed by Function.toString() inspection (e.g. Kasada).
```
### FAQ
**Q: Is Puppeteer dead?**
No - it's still actively maintained by Google and ships with each Chrome major version. The momentum has shifted to Playwright (more languages, multi-browser), but Puppeteer remains a reasonable Node-only choice, and its stealth-plugin ecosystem is larger.
**Q: Can I use Puppeteer with Python?**
There's pyppeteer (a community Python port), but it has been unmaintained for years. For Python, use Playwright instead.
**Q: Why does the stealth plugin not work against Kasada?**
Kasada calls Function.prototype.toString() on the methods the stealth plugin patches. A real, built-in browser method returns "[native code]"; the plugin's JavaScript replacements return their own patch source code instead - a dead giveaway. The plugin patches ~17 methods, and every one fails this check. PatchRight (which patches Playwright's source rather than the runtime) is the equivalent fix on the Playwright side.
---
## What Is Selenium?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-selenium
**Selenium is the original cross-browser automation framework; its WebDriver API, since standardised by the W3C, predates Puppeteer by roughly a decade.** In plain terms, it lets your code remotely drive a real web browser. It works across Chrome, Firefox, Safari, Edge, and even mobile browsers (using Appium under the hood), all through one API, and you can write that code in Python, Java, Ruby, C#, JavaScript, Kotlin, and more. In 2026 it remains the right pick for scraping when you need Safari or mobile-browser support, or when you have existing Selenium tests to repurpose. For new Python or Node scrapers, Playwright has overtaken it.
### Quick facts
- **Standard:** W3C WebDriver protocol (oldest of the browser-automation standards)
- **Languages:** Python, Java, C#, JavaScript, Ruby, Kotlin, and others
- **Browsers:** Chrome, Firefox, Safari, Edge, mobile via Appium
- **Default detection:** navigator.webdriver === true on every browser (W3C-mandated)
- **Stealth variants:** undetected-chromedriver, SeleniumBase UC mode, selenium-stealth
### Why Selenium is still relevant in 2026
Three durable reasons to pick Selenium:
- **Safari support.** Playwright's WebKit isn't Safari — it's the WebKit rendering engine without Safari's actual app around it. Testing or scraping real Safari requires Selenium plus safaridriver (Apple's WebDriver helper for Safari).
- **Mobile browsers.** Appium (the mobile sibling of Selenium) drives mobile Chrome, mobile Safari, and native phone apps through the same WebDriver API. No other framework reaches all of that.
- **Existing test code.** A large share of enterprise QA automation still runs on Selenium. If your team already maintains a test suite, reusing the framework for scraping is faster than rewriting it.
For a brand-new ("greenfield") Python or Node Chromium scraper, Playwright is the better pick — a more modern API, faster startup, and better parallelism (running many browsers at once). Selenium's WebDriver wire protocol — the back-and-forth messaging used to send each command to the browser — adds about a millisecond of overhead per command, which adds up over a long script.
### Selenium's detection surface
Of the three big browser-automation frameworks, Selenium is the easiest for anti-bot systems to spot, because it leaves several telltale signs (fingerprints):
- **WebDriver is W3C-standardised to set navigator.webdriver = true**. Every Selenium browser exposes this flag by default, and any website can read it in one line of JavaScript. Anti-bot scripts test it as their first check.
- **Selenium injects identifying properties into window** — keys like window.cdc_*, window.$cdc_*, and others (window is the global object every web page can inspect) that anti-bot scripts scan for.
- **The WebDriver wire protocol leaves timing artifacts** — the delay between a command and its response differs from real human input by a measurable amount.
- **The chromedriver binary itself** (the helper program that controls Chrome) has shipped with the substring "$cdc_" in its source for years — only recently patched in the mainline version.
Plain, unmodified ("vanilla") Selenium gets blocked on any modern protected site. The fixes are the stealth variants in the next section.
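A quick way to see these markers on your own driver is to list the property names a page can see and look for the chromedriver prefixes. This is a sketch: `find_cdc_keys` and `audit_driver` are hypothetical helper names, `driver` is assumed to be a Selenium WebDriver instance, and the pattern only covers the `cdc_` / `$cdc_` names mentioned above.

```python
import re

# Matches the chromedriver marker names mentioned above (cdc_... and $cdc_...).
CDC_KEY = re.compile(r"^\$?cdc_")

# JavaScript for driver.execute_script(): property names on window and document.
LIST_KEYS_JS = (
    "return Object.getOwnPropertyNames(window)"
    ".concat(Object.getOwnPropertyNames(document));"
)


def find_cdc_keys(keys):
    """Filter a list of property names down to chromedriver-style markers."""
    return [key for key in keys if CDC_KEY.match(key)]


def audit_driver(driver):
    """Report the two cheapest Selenium tells for a live WebDriver session."""
    webdriver_flag = driver.execute_script("return navigator.webdriver === true;")
    keys = driver.execute_script(LIST_KEYS_JS)
    return {"navigator.webdriver": webdriver_flag, "cdc_keys": find_cdc_keys(keys)}
```

Running `audit_driver` against vanilla Selenium and then against a stealth variant shows whether the flag and the marker names were actually removed; it says nothing about the timing artifacts, which cannot be read from inside the page this way.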
### undetected-chromedriver and SeleniumBase UC mode
Three production-ready ways to make Selenium stealthier:
- **undetected-chromedriver (UC)** — patches the chromedriver binary as it downloads to strip out the $cdc_ strings and reset navigator.webdriver. This satisfies most basic (Layer-1 and Layer-2) checks. It is still visible to Function.toString() inspection — a trick where a site reads the source code of the override function and sees it was tampered with.
- **SeleniumBase UC mode** — wraps undetected-chromedriver in a pytest-friendly API (pytest is the standard Python test framework), adds automatic clicking of the Cloudflare Turnstile challenge, and gives you a clean set of sb.uc_* methods. This is the default choice when you want Selenium plus stealth plus a test framework.
- **selenium-driverless** — drops the WebDriver layer entirely and drives Chrome straight through raw CDP (Chrome DevTools Protocol, the browser's native control channel). This removes the WebDriver fingerprint, but you also lose the cross-browser support that made you choose Selenium in the first place.
Even with UC, Selenium still loses to Kasada, F5 Shape, and recent Akamai. For those, switch to Camoufox, CloakBrowser, or a managed API — at that point the WebDriver protocol itself is the bottleneck.
### Example
```python
# undetected-chromedriver with a residential proxy
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--proxy-server=http://residential:port")  # placeholder proxy address

# version_main should match the major version of the locally installed Chrome
driver = uc.Chrome(options=options, version_main=131)
try:
    driver.implicitly_wait(5)  # applies to element lookups, so set it before navigating
    driver.get("https://target.com")
    print(driver.page_source[:500])
finally:
    driver.quit()
# Handles simple webdriver checks; Cloudflare Bot Management Enterprise and Kasada generally still detect it.
```
### FAQ
**Q: Should I learn Selenium in 2026?**
For brand-new scraping projects, learn Playwright instead — a newer API, fewer ways to be detected, and multi-browser support without WebDriver. Learn Selenium if you need Safari or mobile testing, or if your team already maintains Selenium tests you can extend.
**Q: What's the difference between undetected-chromedriver and SeleniumBase UC mode?**
undetected-chromedriver is the underlying library that patches the chromedriver binary and runtime to remove the default automation markers. SeleniumBase UC mode wraps that library in a friendlier API, adds pytest integration, and includes built-in helpers (Cloudflare auto-click, session reuse). Want quick stealth in a pytest project? Use SeleniumBase UC. Want minimal dependencies? Use undetected-chromedriver directly.
**Q: How does Selenium with stealth variants behave against Cloudflare?**
Against Bot Fight Mode and Turnstile, undetected-chromedriver plus a residential proxy (an IP address from a real home internet connection) typically completes Turnstile on most sites you are authorized to access. Against Cloudflare Bot Management Enterprise it generally does not. For those workflows teams move to Camoufox or a managed API.
---
## What Is Scrapy?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-scrapy
**Scrapy is the industry-default crawler framework for Python.** It does everything *around* the actual HTTP request so you don't have to: it keeps a queue of URLs to visit, retries failures, skips duplicate URLs, runs the scraped data through processing steps (item pipelines), paces requests (throttling), runs many requests at once (concurrency), and offers a middleware system where you plug in proxies, fingerprinting, and stealth tools. The bare HTTP layer underneath (built on Twisted, a Python networking library) is too easy to detect for protected sites in 2026 - but the framework wrapped around it is genuinely irreplaceable once a crawl grows past a few thousand URLs.
### Quick facts
- **Vendor:** Scrapy project (maintained by Zyte, formerly Scrapinghub); BSD-3 license
- **Language:** Python (>= 3.9)
- **Built-in:** Queue, retries, dedup, pipelines, throttling, concurrency, settings layering
- **Ecosystem:** 200+ middleware packages — scrapy-playwright, scrapy-camoufox, scrapy-redis, scrapy-stealth
- **Where it loses:** Twisted-based default HTTP layer fails against modern anti-bot
### What Scrapy gives you that a script can't
For a 100-URL scrape, a single Python script with curl_cffi and a loop is fine. Past ~1000 URLs the problems pile up: what to retry, how to avoid scraping the same URL twice (dedupe), where to write results, how to pace requests per site, and how to pick up again after a crash. Scrapy handles all of this out of the box:
- **Built-in queue** with priority, depth tracking, and disk-backed persistence (so you can resume a crawl after killing it).
- **Per-domain throttling** via AUTOTHROTTLE — automatically slows down or speeds up based on how fast the site responds.
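- **Request deduplication** — Scrapy fingerprints each request (method, canonical URL, and body) and skips any it has already seen. With a JOBDIR configured, the seen-set is saved to disk, so duplicates are skipped across restarts too.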
- **Item pipelines** — chain steps like validators, deduplicators, and database writers together with a single declaration.
- **Settings layering** — project defaults can be overridden per spider, which can be overridden again by command-line flags.
- **The downloader-middleware abstraction** — the hook where every modern stealth tool plugs in, including the Go TLS sidecar pattern.
Rebuilding all this for any non-trivial crawl is weeks of work. Scrapy is mature, BSD-licensed, and one pip install away.
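To make the deduplication idea concrete, here is a minimal sketch of a request fingerprint and a seen-set. It is a simplified illustration, not Scrapy's actual implementation: the `request_fingerprint` function and the `SeenFilter` class are hypothetical names, and the canonicalization rules (sorted query parameters, dropped fragment) are assumptions chosen for the example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that make it 'the same' request."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collapse to one key.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    canonical = urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))
    digest = hashlib.sha1()
    digest.update(method.upper().encode())
    digest.update(canonical.encode())
    digest.update(body)
    return digest.hexdigest()


class SeenFilter:
    """In-memory 'already fetched' set; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, method: str, url: str, body: bytes = b"") -> bool:
        fp = request_fingerprint(method, url, body)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

Writing the set of fingerprints to disk and reloading it on startup is what turns this into dedup that survives a restart.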
### Why bare Scrapy fails on protected sites
Scrapy's built-in downloader (Twisted-based, supporting HTTP/1.1 and HTTP/2) has never looked like Chrome, and that is exactly what gives it away. Its JA4 TLS fingerprint isn't Chrome's (TLS is the encryption behind https, and JA4 is a label derived from how a client opens that connection, so it acts like a signature), its HTTP/2 SETTINGS frame isn't Chrome's, and its default User-Agent literally says "Scrapy/X.Y". An anti-bot vendor can block all of this at Layer 1 (see the four-layer model) before a single line of HTML is served.
The fix lives in the downloader-middleware system. Two production patterns:
- **scrapy-impersonate / scrapy-curl-cffi** — swaps Scrapy's downloader for curl_cffi, which reproduces a real browser's TLS handshake. Works with medium-strength anti-bot configurations and is easy to set up.
- **Scrapy + Go TLS sidecar** — full Chrome impersonation via utls in a separate Go service. Produces a Chrome-consistent handshake at the network layer. More moving parts to run, but worth it for high-volume authorized scraping of protected sites you are permitted to access. See the dedicated entry.
For sites that need JavaScript to run, scrapy-playwright or scrapy-camoufox swap the downloader for a real browser on a per-request basis. Browsers are expensive, so apply browser middleware only to the specific requests that need it via meta={"playwright": True}.
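A rough sketch of that per-request routing, assuming scrapy-playwright is installed. The two settings are the ones scrapy-playwright documents for enabling its download handler; the `JS_ONLY_PATHS` rule and the `request_meta` helper are hypothetical names invented for this example.

```python
import re

# Settings that scrapy-playwright documents for enabling its download handler.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# Hypothetical rule: only these paths build their content with JavaScript.
JS_ONLY_PATHS = re.compile(r"^/(search|reviews)(/|$)")


def request_meta(path: str) -> dict:
    """Return the Request.meta dict for a URL path."""
    if JS_ONLY_PATHS.match(path):
        return {"playwright": True}  # rendered by a real browser
    return {}  # cheap plain-HTTP path


# Inside a spider callback this would be used roughly as:
#   yield scrapy.Request(url, meta=request_meta(urlsplit(url).path))
```

Keeping the decision in one small function makes it easy to see, and to measure, how much of the crawl actually pays the browser cost.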
### Scaling Scrapy beyond one machine
By default Scrapy runs in a single process. Three ways to scale out:
- **scrapy-redis** — pulls URLs from a shared Redis queue. Multiple workers across machines draw from the same queue and write to the same dedup set. The simplest way to distribute Scrapy.
- **Scrapyd** — a daemon that deploys packaged spiders (eggs) and runs them through an HTTP API. Handy for cron-driven crawls and as a stepping stone toward Kubernetes.
- **Zyte (Scrapy Cloud)** — managed Scrapy hosting from the original Scrapy team. You deploy a spider with one command and the platform handles queueing, retries, and monitoring.
At enterprise scale, the more common choices are estela (a Kubernetes-native Scrapy orchestrator) or a self-hosted scrapy-cluster (backed by Kafka). The framework itself scales fine — the real work is wiring up the surrounding queue and storage infrastructure to match.
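For the scrapy-redis route, the wiring is mostly configuration. The fragment below is a sketch of a shared settings.py using the setting names scrapy-redis documents; the Redis address is an assumption and should point at your own instance.

```python
# settings.py fragment for a scrapy-redis worker (every worker shares these).
REDIS_URL = "redis://localhost:6379/0"  # assumed address of the shared Redis

# Replace Scrapy's in-process scheduler and dupefilter with Redis-backed ones.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the seen-set in Redis when a worker stops,
# so a restarted worker resumes instead of starting over.
SCHEDULER_PERSIST = True
```

Every worker started with these settings draws from the same queue, which is why adding capacity is just a matter of starting another process.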
### Example
```python
# A minimal Scrapy spider: crawl category pages, follow product links, emit items.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://target.com/category/widgets"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
        "ITEM_PIPELINES": {"myproject.pipelines.DedupePipeline": 300},
    }

    def parse(self, response):
        # Follow every product tile on the listing page.
        for link in response.css("a.product-tile::attr(href)").getall():
            yield response.follow(link, callback=self.parse_product)
        # Follow pagination until there is no "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }


# This spider uses Scrapy's default downloader.
# For protected sites: add a curl_cffi or Go-sidecar downloader middleware.
```
### FAQ
**Q: When should I use Scrapy vs just a Python script?**
Use a plain script for one-off scrapes under ~1000 URLs. Reach for Scrapy when the crawl recurs, spans many thousands of URLs, needs retries and dedupe, or will outlive your patience for maintaining the queue logic yourself. There's more boilerplate up front, but the operational payoff is huge.
**Q: Can Scrapy use a headless browser?**
Yes, via scrapy-playwright or scrapy-camoufox. These wrap a browser as a downloader middleware, so you can flag the specific requests that need browser rendering and let everything else take the cheap HTTP path. Mixing browser and non-browser requests in one spider is the typical production setup.
**Q: Is Scrapy still maintained?**
Yes. Zyte (founded by the original Scrapy team) sponsors active development, Python 3.13 support landed in 2024, and major releases keep coming on a roughly annual cadence. The Twisted dependency raises eyebrows, but it's stable and well-tested.
---
## What Is mitmproxy?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-mitmproxy
**mitmproxy is a free tool that sits between an app and the internet so you can read and change the HTTPS traffic passing through it.** The name comes from "man-in-the-middle": it acts as a proxy in the middle of the connection and decrypts the traffic, which is normally encrypted (HTTPS). Because it's scriptable in Python, you can also rewrite, log, or replay any request automatically. In scraping it's the go-to tool for figuring out what a site or app actually sends. You run it as a CLI (mitmproxy), a browser UI (mitmweb), or a headless engine (mitmdump), and it accepts inline Python scripts that can change any request while it's in flight. The first step of the scraping decision flow is "intercept the mobile app first" — and mitmproxy is how you do that.
### Quick facts
- **Vendor:** mitmproxy project (open-source, MIT)
- **Language:** Python (server); scripts in Python
- **Modes:** CLI (mitmproxy), web UI (mitmweb), headless replay (mitmdump)
- **Use case in scraping:** Mobile API discovery, request inspection, replay & rewrite
- **Limitation:** Certificate pinning — many apps refuse the mitmproxy CA on devices that enforce pinning
### What mitmproxy is for
There are two main jobs it does in scraping:
- **Mobile API discovery.** Install mitmproxy's certificate (the credential a device trusts to verify HTTPS) on an Android emulator or jailbroken iPhone, point the device's proxy setting at mitmproxy, and use the target app normally. Every request becomes readable — the endpoints it calls, the auth tokens it sends, how it signs requests, how it pages through results. This is how scrapers find the unprotected mobile backends sitting behind sites that pay Akamai to protect their websites.
- **Web request inspection and replay.** When a scraper is misbehaving, route it through mitmproxy and re-send individual requests with tweaked headers (in the terminal UI, the e key edits a flow and the r key replays it). Using the inline Python scripting, you can rewrite requests on the fly without editing the scraper itself.
mitmweb (the browser UI) is the easiest for one-off use; mitmproxy (the keyboard-driven terminal UI) is faster once you learn it; mitmdump runs without a UI, which is handy in CI or scripted captures.
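As a sketch of the on-the-fly rewrite described above, the script below uses mitmproxy's module-level `request` hook to override headers for one host. The host name and header value are placeholders chosen for illustration, not values from any real target.

```python
# rewrite.py: run with `mitmdump -s rewrite.py`
# Hypothetical values: the host and header below are placeholders.
TARGET_HOST = "api.example.com"
OVERRIDE_HEADERS = {"Accept-Language": "en-US,en;q=0.9"}


def request(flow) -> None:
    """mitmproxy calls this hook for every client request before forwarding it."""
    if flow.request.pretty_host != TARGET_HOST:
        return  # leave traffic to other hosts untouched
    for name, value in OVERRIDE_HEADERS.items():
        flow.request.headers[name] = value
```

Because the change lives in the proxy, the scraper's own code stays untouched while you experiment with which headers matter.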
### mitmproxy vs HTTP Toolkit vs Charles Proxy vs Burp Suite
Four tools cover the intercepting-proxy category, with overlapping use cases:
| Tool | Best for | Cost |
| --- | --- | --- |
| **mitmproxy** | CLI/scripting, automation, repeatable captures | Free |
| **HTTP Toolkit** | GUI-driven mobile intercept; one-click device setup | Free + Pro ($10/mo) |
| **Charles Proxy** | Veteran GUI, polished macOS experience | $50 one-time |
| **Burp Suite** | Security recon, intruder/repeater, MCP server | Free / Pro $475/yr |
For scraping reconnaissance specifically, mitmproxy is the default — it's free, scriptable, and built squarely around the intercept-and-replay loop. Burp Suite can do the same things, but it's really a penetration-testing tool, and the price reflects that.
### The certificate-pinning wall
Roughly half of mainstream mobile apps *pin* their TLS certificates — the app ships with the expected server certificate's fingerprint baked in and refuses to talk to anything else. That means mitmproxy's certificate, which you installed on the device, is rejected, and the app just shows a network error.
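The check an app performs can be sketched in a few lines. This is a simplified model under stated assumptions: the certificate bytes are placeholders, the function name is hypothetical, and real apps often pin a hash of the public key rather than of the whole certificate.

```python
import hashlib
import hmac

# Placeholder bytes standing in for DER-encoded certificates.
REAL_SERVER_CERT = b"real-server-certificate-der"
PROXY_CERT = b"intercepting-proxy-certificate-der"

# The fingerprint the app ships with, computed at build time.
PINNED_SHA256 = hashlib.sha256(REAL_SERVER_CERT).hexdigest()


def connection_allowed(presented_cert: bytes) -> bool:
    """Accept the connection only if the presented certificate matches the pin."""
    presented = hashlib.sha256(presented_cert).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented, PINNED_SHA256)
```

The device trusting the proxy's certificate authority makes no difference here: the comparison is against a value baked into the app, which is why the connection fails even after the CA is installed.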
Three escalation steps when pinning blocks you:
- **Try a different app version.** Older versions of the same app often skip pinning. Sideload an APK (the Android install file) from a few releases back via apkpure or similar.
- **Frida + certificate unpinning (for apps you are authorized to test).** Frida is a tool that injects code into a running app. Running frida-server on the device plus fridantiroot.js on your machine switches off both okhttp3.CertificatePinner and the Java TrustManagerFactory — the two common pinning mechanisms. This works against most apps. See the mobile API scraping playbook for the full workflow.
- **objection / static reverse engineering.** When pinning is built into native code (banking apps, some games), Frida's default scripts aren't enough. objection handles more cases; truly custom pinning means disassembling the app by hand. By this point you're spending more effort on the intercept than the scraping is worth.
### Example
```python
# inline mitmproxy script: extract auth tokens and pagination cursors
# save as tokens.py, run with: mitmproxy -s tokens.py
import json

from mitmproxy import http


class TokenExtractor:
    def __init__(self):
        self.tokens = {}  # host -> most recent access token

    def response(self, flow: http.HTTPFlow) -> None:
        # Capture bearer tokens from any login endpoint
        if "/login" in flow.request.path and flow.response.status_code == 200:
            try:
                body = json.loads(flow.response.text)
            except (json.JSONDecodeError, TypeError):
                body = None  # not JSON, or the body could not be decoded
            if isinstance(body, dict) and "access_token" in body:
                self.tokens[flow.request.host] = body["access_token"]
                print(f"captured token for {flow.request.host}")
        # Log cursor-based pagination for later reuse
        if "X-Next-Cursor" in flow.response.headers:
            print(f"{flow.request.path} cursor: {flow.response.headers['X-Next-Cursor']}")


addons = [TokenExtractor()]
```
### FAQ
**Q: Why use mitmproxy instead of Wireshark?**
Wireshark only sniffs raw network traffic, so for HTTPS you just see encrypted bytes you can't read. mitmproxy terminates the TLS connection using its own certificate, so you see the actual plaintext request and response bodies. In short: Wireshark is for low-level network debugging; mitmproxy is for the HTTPS application traffic that scrapers actually care about.
**Q: Can mitmproxy intercept HTTP/3 / QUIC?**
Not yet at production quality. There's an experimental HTTP/3 mode, but it lags behind the official spec. For QUIC-only services (some Google properties), you currently force the client to fall back to HTTP/2 using an upstream rule, then proxy that instead.
**Q: Is mitmproxy detectable by the server?**
Not through headers, but potentially through TLS. mitmproxy adds no tell-tale headers of its own, and the traffic still arrives from your machine's IP address. However, mitmproxy terminates the client's TLS connection and opens a separate one to the server, so the handshake the server sees is mitmproxy's rather than Chrome's or the mobile app's. A server that fingerprints TLS (JA3/JA4) can notice that mismatch. The other giveaway is the app itself: some apps phone home with proxy-status flags, in which case you'd disable that telemetry.
---
## What Is SeleniumBase?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-seleniumbase
**SeleniumBase is a Python framework for automating and testing browsers, built on top of Selenium 4. Its two notable features, UC Mode and CDP Mode, are designed to make automated Chrome behave more like an ordinary browser session.** UC Mode (Undetected-Chromedriver Mode) adjusts the chromedriver program, starts Chrome on its own first, and briefly drops the automation connection during page loads and clicks — so the running browser presents a consistent configuration. It is one of the few tools of its kind that also includes built-in CAPTCHA handling. As with any automation framework, use it only on sites and data you own, control, or are permitted to access.
### Quick facts
- **Type:** Selenium 4 framework + UC Mode / CDP Mode
- **Language:** Python (pytest/unittest integration)
- **Standout feature:** Built-in Turnstile / reCAPTCHA handling via PyAutoGUI
- **Key technique:** Disconnect/reconnect driver during page loads and clicks
- **Main limitation:** UC Mode detectable in true headless; slower than vanilla Selenium
### How UC Mode and CDP Mode work
UC Mode combines three tricks. First, it **patches the chromedriver binary** (chromedriver is the program Selenium uses to control Chrome) to randomise the window.cdc_* values — telltale variables that automation tools leave behind and that websites scan for. Second, it uses a **browser-first launch**: Chrome starts as a normal, clean process and chromedriver connects to it afterwards, instead of Chrome being opened by the driver and carrying automation signatures from the very first moment. Third — the clever part — it **disconnects the driver** (driver.service.stop()) during page loads and clicks. It schedules the navigation or click through JavaScript while disconnected, then reconnects. This keeps the browser's runtime state consistent during the moments a page is most actively inspecting its environment.
CDP Mode goes a step further: it controls the page directly through the Chrome DevTools Protocol (the same channel Chrome's own dev tools use), so the usual WebDriver fingerprints are not present. It is slower than plain Selenium but presents a cleaner environment, and you can mix it with WebDriver when you need to.
### Built-in CAPTCHA handling
SeleniumBase is one of the few tools in this comparison that handles CAPTCHAs out of the box. uc_gui_click_captcha() and uc_gui_handle_captcha() use **PyAutoGUI**, a library that controls the real mouse and keyboard at the operating-system level, to move the cursor along natural curves and click the Turnstile or reCAPTCHA checkbox at a small random offset. Because the input is generated outside the browser, the interaction looks like ordinary mouse input. This is intended for verification on services you are authorized to access. The catch: it needs an actual display, so on Linux you must run it under xvfb (a virtual screen) rather than true headless (no screen at all).
### Strengths, costs, and when to use it
The repo ships 200+ working examples that run against real protected sites (Cloudflare, Imperva, DataDome, Kasada, PerimeterX, reCAPTCHA). **Use it when:** you want built-in CAPTCHA handling, a full pytest/unittest testing framework, or are working with protected sites you are authorized to access in Python. **Costs:** a steep learning curve, a heavy set of dependencies, and speeds 2–5× slower than plain Selenium because of the disconnect/reconnect overhead. UC Mode is also detectable in true headless, so use xvfb instead. It is commonly paired with residential proxies (IP addresses tied to home internet connections) and incognito=True.
### Example
```python
from seleniumbase import SB

# UC Mode + built-in CAPTCHA handling + residential proxy
# Use only on sites and data you are permitted to access.
with SB(uc=True, proxy="user:pass@host:port", incognito=True) as sb:
    sb.uc_open_with_reconnect("https://your-authorized-site.com", 4)
    sb.uc_gui_click_captcha()  # handles Turnstile / reCAPTCHA via PyAutoGUI
    sb.assert_text("Success")
```
### FAQ
**Q: What is UC Mode in SeleniumBase?**
UC Mode (Undetected-Chromedriver Mode) is SeleniumBase's consistency layer. It adjusts the chromedriver program to randomise the cdc_ markers that automation tools normally leave behind, starts Chrome before connecting the driver, and disconnects the driver during page loads and clicks, keeping the browser's runtime state consistent during sensitive operations.
**Q: Can SeleniumBase solve CAPTCHAs automatically?**
Yes — it is one of the few tools in this comparison that includes CAPTCHA handling out of the box. uc_gui_click_captcha() uses PyAutoGUI to click Turnstile and reCAPTCHA challenges at the operating-system level, like ordinary mouse input, on services you are authorized to access. It needs a display (xvfb on Linux), so it does not work in true headless mode.
**Q: Why is UC Mode slower than normal Selenium?**
The disconnect/reconnect step adds 0.1–1 second of delay each time it touches a sensitive operation. That makes UC Mode roughly 2–5× slower and CDP Mode 1.5–3× slower than plain Selenium. It is the cost of keeping a consistent browser session during sensitive operations.
**Q: Does UC Mode work in headless mode?**
Not reliably — UC Mode can be detected in true headless (running with no screen). On Linux the recommended fix is a virtual display with xvfb=True, which gives Chrome a real place to render while still running on a server.
---
## What Is XDriver?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-xdriver
**XDriver is an in-place stealth patcher for Playwright (a browser-automation library): one command swaps Playwright's internal driver files for versions that reduce common automation tells.** You run x_driver activate; it backs up the original driver, drops in patched copies of crConnection.js / crPage.js / browserContext.js / frames.js, and from then on your normal Playwright code runs with CDP leaks closed. (CDP is the Chrome DevTools Protocol, the channel Playwright uses to control the browser, and a common automation tell.) Your scripts need no changes, and x_driver deactivate puts the originals back. The catch: it only works with one exact Playwright version.
### Quick facts
- **Type:** In-place Playwright driver patcher (Chromium)
- **Language:** Python
- **Activation:** One command: x_driver activate / deactivate
- **Covers:** Runtime.enable removal, binding obfuscation, WebRTC, service workers
- **Main limitation:** Version-locked to Playwright 1.52.0; single-author beta
### How XDriver patches Playwright
Patchright is a separate forked package you install instead of Playwright; XDriver takes a different route and edits the Playwright you already have installed. x_driver activate backs up the driver/package folder to driver/package_1, copies its patched files over the originals, and tweaks playwright/__init__.py. Your scripts keep importing the normal playwright module but now run on the hardened driver — zero code changes.
The patched files close several automation tells. They avoid Runtime.enable (a CDP command anti-bot scripts watch for) and use Page.createIsolatedWorld instead; they strip and randomise Playwright's own markers in injected scripts; they hide binding names inside isolated worlds; they filter WebRTC ICE candidates so your real IP cannot leak (WebRTC is the browser's peer-to-peer feature that can expose local network addresses); and they block service-worker registration, which can otherwise be used to fingerprint automation.
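The backup-then-overlay step that activation performs can be sketched with the standard library. This is not XDriver's code: the `activate` and `deactivate` functions and the throwaway directories are hypothetical, and only the `_1` backup suffix is taken from the description above.

```python
import shutil
import tempfile
from pathlib import Path


def activate(package_dir: Path, patched_dir: Path) -> Path:
    """Back up package_dir next to itself, then overlay the patched files."""
    backup_dir = package_dir.with_name(package_dir.name + "_1")
    if not backup_dir.exists():  # never overwrite an existing backup
        shutil.copytree(package_dir, backup_dir)
    shutil.copytree(patched_dir, package_dir, dirs_exist_ok=True)
    return backup_dir


def deactivate(package_dir: Path, backup_dir: Path) -> None:
    """Restore the originals by moving the backup back into place."""
    shutil.rmtree(package_dir)
    shutil.move(str(backup_dir), str(package_dir))


# Demo on throwaway directories (hypothetical file contents).
root = Path(tempfile.mkdtemp())
package = root / "package"
patched = root / "patched"
package.mkdir()
patched.mkdir()
(package / "frames.js").write_text("original")
(patched / "frames.js").write_text("patched")

backup = activate(package, patched)
after_activate = (package / "frames.js").read_text()
deactivate(package, backup)
after_deactivate = (package / "frames.js").read_text()
shutil.rmtree(root)
```

The sketch also shows where the fragility comes from: everything depends on the backup directory staying intact between the two calls.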
### Coverage and the version-lock trade-off
The project reports passing Cloudflare, Turnstile, Kasada, DataDome, PerimeterX, Imperva and Fingerprint.com, plus 100% anonymous on CreepJS and strong scores on BrowserScan, Rebrowser, and Whoer (these are public fingerprinting and bot-detection test sites). The trade-off is maintenance risk. Because it patches exact file internals, XDriver is **locked to Playwright 1.52.0**; it is a one-person beta (v1.0.1), works only with Chromium, and if the backup gets corrupted you have to restore by hand. It is invasive by design — it rewrites files inside your library folder.
### XDriver vs Patchright — which to pick
Both deliver similar Chromium CDP stealth, just in different ways. **XDriver**: replaces files in place, needs no import changes, and toggles on and off easily — but it is pinned to one Playwright version and maintained by a single author. **Patchright**: a separately maintained package that keeps pace with new Playwright releases, supports Python, Node.js and .NET, and has an active community — but you import from patchright instead of playwright. Pick XDriver for quick testing on a codebase you cannot modify; pick Patchright for production and version flexibility.
### Example
```python
# Terminal: pip install x_driver && pip install playwright==1.52.0
#           playwright install chromium && x_driver activate
# Then your EXISTING Playwright code runs patched, with no import change:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://bot.sannysoft.com")  # uses patched driver
    browser.close()
# Terminal: x_driver deactivate  # restores original Playwright
```
### FAQ
**Q: How is XDriver different from Patchright?**
Both add CDP stealth to Playwright on Chromium. XDriver patches your installed Playwright files in place, so you change nothing in your code — but it only works with Playwright 1.52.0. Patchright is a separate package you import from, supports more languages, and keeps up with new Playwright releases. Use XDriver for quick on/off testing; use Patchright for production.
**Q: Why is XDriver locked to one Playwright version?**
It edits the exact internals of Playwright's driver files, and those files change from release to release. A patch built for 1.52.0 would break a different version, so it is pinned to that one. That is the main reason it suits short-term testing rather than long-lived production systems.
**Q: Do I need to change my code to use XDriver?**
No. After running x_driver activate, you import the normal playwright module and your existing scripts run on the hardened driver. That drop-in behaviour is its main appeal for existing Playwright codebases. Run x_driver deactivate to undo it.
**Q: Is XDriver production-ready?**
Be cautious. It is a single-author beta (v1.0.1), locked to one Playwright version, and rewrites files in your library folder — and if the backup is corrupted you must recover by hand. For production, Patchright is the safer choice.
---
## What Is Scrapling?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-scrapling
**Scrapling is an all-in-one Python scraping framework that bundles fetching, parsing, anti-detection, and crawling behind one API — it is a layer above the other tools, not a competitor.** In other words, instead of wiring several libraries together yourself, you import one package. It offers three ways to fetch a page, from light to heavy: curl_cffi for TLS-impersonated HTTP — TLS being the encryption layer behind https, and "impersonated" meaning it mimics a real browser's TLS signature (fastest); standard Playwright for pages that need JavaScript to render; and a StealthyFetcher that wraps **Patchright or Camoufox** for the most browser-consistent configuration, including automatic Cloudflare challenge handling. Its standout feature is adaptive element tracking: when a site changes its page layout, Scrapling re-finds the elements your selectors used to match.
### Quick facts
- **Type:** All-in-one scraping framework (fetch + parse + crawl + stealth)
- **Language:** Python
- **Three tiers:** Fetcher (curl_cffi) / DynamicFetcher (Playwright) / StealthyFetcher (Patchright or Camoufox)
- **Unique feature:** Adaptive selectors auto-relocate when the DOM changes
- **Also includes:** Scrapy-like spider, fast parser, MCP server for AI workflows
### The three-tier fetching system
Scrapling gives you three fetchers, and you pick one based on how hard the target site is to scrape. **Tier 1 — Fetcher** uses curl_cffi to send plain HTTP requests that carry a real browser's TLS fingerprint (the JA3/JA4 signature a browser shows during the https handshake) for Chrome/Firefox/Safari/Edge, plus realistic headers generated by browserforge. No JavaScript runs, so there is no browser to fingerprint at all — it is the fastest option (~10MB footprint) and handles roughly 90% of pages. **Tier 2 — DynamicFetcher** is plain Playwright, used when a page builds its content with JavaScript; it adds helpers like wait_selector and network-idle waits but no special stealth. **Tier 3 — StealthyFetcher** wraps Patchright (the default) or Camoufox (set use_camoufox=True) and exposes flags like solve_cloudflare=True, block_webrtc=True, hide_canvas=True, and disable_webgl=True to reduce the signals anti-bot systems look for.
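One way to picture the tier choice is as a small decision function that always prefers the lightest fetcher that can do the job. The `Target` fields and the `pick_fetcher` helper are hypothetical names for this sketch; Scrapling itself leaves the choice to you.

```python
from dataclasses import dataclass


@dataclass
class Target:
    needs_javascript: bool  # content is built client-side
    shows_challenge: bool   # e.g. a Cloudflare challenge page sits in front


def pick_fetcher(target: Target) -> str:
    """Return the name of the lightest tier that should handle the target."""
    if target.shows_challenge:
        return "StealthyFetcher"  # Tier 3: hardened browser
    if target.needs_javascript:
        return "DynamicFetcher"   # Tier 2: plain Playwright
    return "Fetcher"              # Tier 1: TLS-impersonated HTTP
```

In practice a crawl would start every URL at Tier 1 and escalate only the ones that come back blocked or empty.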
### Cloudflare handling and adaptive selectors
Set solve_cloudflare=True and Scrapling spots the Cloudflare Turnstile or interstitial challenge page, waits for the challenge iframe to load, interacts with it, and waits for the redirect to the real page, all without paying for an external CAPTCHA-solving service. Note that it is *not* cracking the CAPTCHA: Patchright/Camoufox present a standards-compliant browser environment, so the page loads without an extra challenge step.
The unique feature is **adaptive element tracking**. Normally a CSS selector breaks the moment a site is redesigned and your scraper stops working. Scrapling can save the structural context of an element it matched — roughly, where it sat in the page and what surrounded it — and later re-find that same element by fuzzy structural matching even after the layout shifts. No other tool in this comparison offers this. It pairs with a fast lxml-based parser, a Scrapy-like spider with pause/resume checkpoints, and an MCP server for Claude/Cursor workflows.
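The idea behind fuzzy structural matching can be sketched with the standard library. This is an illustration of the general technique, not Scrapling's algorithm: the `Element` record, the `signature` format, and the 0.6 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Element:
    tag: str
    classes: tuple  # class names on the element
    path: tuple     # ancestor tags from the root down to the parent
    text: str


def signature(el: Element) -> str:
    """Flatten the structural context of an element into one comparable string."""
    return " ".join([el.tag, *sorted(el.classes), "/".join(el.path), el.text])


def relocate(saved: Element, candidates: list, threshold: float = 0.6):
    """Return the candidate most similar to the saved element, or None."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = SequenceMatcher(None, signature(saved), signature(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best


# Saved before the redesign: the price element matched by ".price".
saved = Element("span", ("price",), ("html", "body", "div", "div"), "$19.99")

# After the redesign the class name and one wrapper tag changed.
page = [
    Element("a", ("nav-link",), ("html", "body", "nav"), "Home"),
    Element("span", ("product-price",), ("html", "body", "main", "div"), "$19.99"),
    Element("p", ("description",), ("html", "body", "main", "div"), "A sturdy widget"),
]
match = relocate(saved, page)
```

The old selector `.price` no longer matches anything on the redesigned page, yet the renamed price element is still the closest structural match to what was saved.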
### When to use Scrapling
**Use it when:** you want one framework for an entire pipeline rather than stitching separate tools together; you have a mix of protection levels (cheap HTTP for most pages, browser stealth only for the hard ones); you scrape sites that frequently change their page layout; or you want AI-integrated scraping. **Note:** it does not add stealth of its own — Tier 3 inherits whatever Patchright or Camoufox provides — so against the hardest enterprise targets you are still limited by those engines. Scrapling is a Python orchestration layer that ties the pieces together, not a new detection-handling technique.
### Example
```python
from scrapling import Fetcher, StealthyFetcher

# Tier 1: fast TLS-impersonated HTTP for easy pages
html = Fetcher().get("https://example.com", stealthy_headers=True)

# Tier 3: Patchright/Camoufox under the hood for protected pages
page = StealthyFetcher().fetch(
    "https://cloudflare-protected.com",
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
print(page.css_first("h1::text"))
```
### FAQ
**Q: Is Scrapling a stealth tool or a framework?**
A framework. It does not introduce any new detection-handling method of its own — it coordinates other tools (curl_cffi for TLS impersonation, Patchright/Camoufox for browser configuration) behind one API, then adds parsing, crawling, and adaptive selectors on top. Think of it as a layer that sits above Camoufox and Patchright, not a competitor to them.
**Q: What are the three fetching tiers?**
Fetcher (curl_cffi HTTP with TLS impersonation — fastest, no browser involved), DynamicFetcher (plain Playwright for pages that need JavaScript to render, no special stealth), and StealthyFetcher (Patchright or Camoufox for the most browser-consistent configuration with automatic Cloudflare challenge handling). You choose the tier per target depending on how heavily it is protected.
**Q: How does solve_cloudflare work?**
It detects the Cloudflare Turnstile or interstitial page, waits for the challenge iframe, interacts with it, and waits for the redirect to the real content. It does not solve the CAPTCHA cryptographically — the underlying Patchright/Camoufox browser presents a standards-compliant environment, so the page loads without an extra challenge step.
**Q: What is adaptive element tracking?**
Scrapling can remember the structural fingerprint of an element it once matched — where it lives in the page and what is around it. When a site redesign breaks your CSS selector, it re-finds the same element by fuzzy structural matching instead of just failing. It is Scrapling's most distinctive feature among these tools.
---
## What Is Obscura?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-obscura
**Obscura is an open-source headless browser engine written from scratch in Rust — not a fork or patch of Chrome or Firefox.** A headless browser is one with no visible window, driven by code. Obscura runs a page's JavaScript using V8 (Google's JS engine, embedded here via the deno_core crate) against a page structure built by html5ever, and it speaks the Chrome DevTools Protocol — the same control channel real Chrome exposes — so Puppeteer/Playwright can drive it. At ~30MB and ~85ms per page it is by far the lightest tool here. But it has *no layout engine, no CSS cascade, and no real canvas/WebGL* — it never actually draws the page — and that is exactly what limits it against fingerprint-aware anti-bots.
### Quick facts
- **Type:** From-scratch headless engine (V8 + html5ever)
- **Language:** Rust (CLI/engine); any language via CDP clients
- **Footprint:** ~30MB binary, ~85ms page load — high concurrency
- **Stealth:** JavaScript shim (navigator overrides) + optional TLS via --features stealth
- **Hard limit:** No layout/rendering — getBoundingClientRect returns all zeros
### How Obscura works
Obscura is a Cargo workspace (a multi-part Rust project) of six crates: a CLI (fetch/scrape/serve + load balancer), a CDP WebSocket server that automation tools connect to, a Page abstraction, an HTTP client (reqwest by default, or wreq for TLS impersonation — copying a real browser's encryption handshake — under --features stealth), the V8 JavaScript runtime, and an html5ever DOM with the selectors crate for CSS queries. When you navigate to a page, it fetches the HTML, parses it, fetches the CSS in parallel (but only keeps it as a string — it never applies it), starts V8 from a precompiled snapshot for fast boot, and runs the page's scripts.
All anti-detection lives in a single 3,035-line bootstrap.js shim that runs before the page's own scripts. It fakes a browser environment in JavaScript: it defines navigator with webdriver=undefined (the flag that gives automation away), a Chrome 145 user-agent, userAgentData/UA-CH payloads (the structured browser-identity hints sites read), a 5-plugin list, and stubs for mediaDevices/battery/permissions. There are no C++ or binary patches — every override is plain JavaScript.
### Why it is weak against real anti-bots
Obscura is not a full browser, and detectors notice. Because it has no layout engine — nothing that computes where elements sit on screen — getBoundingClientRect() returns {0,0,0,0} for every element and getComputedStyle returns placeholder values. Real browsers never do that, so Layer-5 (rendering/layout) probes catch it at once. Its canvas, WebGL, and audio are not real implementations either, so any fingerprinting service that hashes the actual pixels a browser draws will flag it instantly. The user-agent is hardcoded Linux. In practice it satisfies only basic navigator.webdriver-style checks; against DataDome, Kasada, Akamai, PerimeterX, or even Cloudflare beyond the free tier, it fails.
### When to use Obscura
**Use it when:** you need to run JavaScript on *unprotected* pages at high volume and a real Chrome instance (~200MB each) is too heavy — Obscura's ~30MB workers let you run many at once on the same machine. The repo suggests a hybrid setup: Obscura handles the easy bulk, while Patchright/Camoufox handle the few protected targets. **Avoid it when:** the target uses any fingerprint-aware or layout-probing detection — Obscura has no answer to those. Think of it as a lightweight JS-rendering engine, not an anti-detect browser.
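A back-of-the-envelope calculation shows what the footprint difference means for concurrency. The per-worker figures are the approximate ones quoted above; the 8 GB budget is an assumption, and the estimate ignores per-page memory growth and operating-system overhead.

```python
def max_workers(ram_budget_mb: int, per_worker_mb: int) -> int:
    """How many workers fit in a memory budget, ignoring OS and page overhead."""
    return ram_budget_mb // per_worker_mb


RAM_BUDGET_MB = 8 * 1024  # assumed: 8 GB set aside for workers

obscura_workers = max_workers(RAM_BUDGET_MB, 30)   # ~30MB per Obscura worker
chrome_workers = max_workers(RAM_BUDGET_MB, 200)   # ~200MB per Chrome instance
```

Under these assumptions the same machine holds several times as many Obscura workers as Chrome instances, which is the whole case for using it on the unprotected bulk of a crawl.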
### Example
```bash
# Lightweight JS rendering for unprotected pages, optional TLS stealth
obscura fetch https://example.com --dump-text
# Or run as a CDP server and drive it from Puppeteer/Playwright clients
obscura serve --port 9222
# Build with TLS impersonation: cargo build --release --features stealth
```
### FAQ
**Q: Is Obscura a real browser?**
No. It is a from-scratch engine that runs JavaScript (V8) against an html5ever DOM, but it has no layout engine, no CSS cascade, no compositor, and no real canvas/WebGL/audio — so it never actually renders the page. It reimplements just enough of the browser surface to execute page scripts, which is why it is so lightweight.
**Q: Why does Obscura fail against anti-bot systems?**
Its missing layout engine is a dead giveaway: getBoundingClientRect returns all zeros and getComputedStyle returns stub values, which Layer-5 rendering probes catch immediately because a real browser never does that. Its canvas/WebGL are not real, so fingerprinting services that hash the actual drawn output flag it. It only satisfies basic navigator.webdriver checks.
**Q: When would I choose Obscura over Chrome?**
When you need to render JavaScript on unprotected pages at high concurrency. At ~30MB and ~85ms per load you can run far more workers in parallel than with ~200MB Chrome instances. The recommended pattern is hybrid — Obscura for the easy bulk, Patchright/Camoufox for the protected pages.
**Q: Does Obscura impersonate TLS?**
Optionally. TLS is the encryption layer behind https, and its handshake leaves a recognizable fingerprint. Built with --features stealth, Obscura swaps reqwest for the wreq client to present a Chrome 145 TLS fingerprint. That is a single profile, not the broad impersonation library that Scrapling's curl_cffi tier offers.
---
## Anti-Detect Browser Tools Compared
URL: https://scrappey.com/qa/web-scraping-apis/anti-detect-browser-tools-comparison
**Anti-detect browser tools aim to present a consistent, real-looking browser configuration, so that automated sessions show the same fingerprint signals a normal browser would. They work in very different ways, and none produces a perfectly indistinguishable result.** The eight most-discussed open-source tools fall into a few groups: custom browser builds (Camoufox, built on Firefox, and CloakBrowser, built on Chromium), patched versions of Playwright (Patchright, XDriver), wrappers around Selenium (SeleniumBase, Botasaurus), a tool that wires several of them together (Scrapling), and a browser engine written from scratch in Rust (Obscura). Which one is right depends on how the target detects bots and how tough it is.
### Quick facts
- **C++ engine stealth:** Camoufox (Firefox), CloakBrowser (Chromium)
- **Playwright CDP stealth:** Patchright (package), XDriver (in-place patch)
- **Selenium-based:** SeleniumBase (UC/CDP + CAPTCHA), Botasaurus (human mouse)
- **Framework / engine:** Scrapling (all-in-one), Obscura (lightweight Rust)
- **Universal truth:** IP reputation and TLS matter more than fingerprint polish
### The five detection layers
Websites check for bots in five separate layers, and each tool only covers some of them:
- **Layer 1 - Protocol:** giveaways from the way automation talks to the browser, like the timing of the Runtime.enable command in CDP (Chrome DevTools Protocol, the channel a tool uses to remote-control Chrome). Patchright, XDriver, and CloakBrowser hide this; Camoufox avoids it entirely by using Juggler, Firefox's own control channel.
- **Layer 2 - Fingerprinting:** tiny differences in how your machine draws graphics or plays audio (canvas, WebGL, audio, screen). Only the C++ tools (Camoufox, CloakBrowser) set these values from inside the browser engine; tools that inject JavaScript leave traces.
- **Layer 3 - Behavioural:** how human your mouse movement and timing look. Botasaurus and CloakBrowser lead here.
- **Layer 4 - Network:** the TLS handshake fingerprint (TLS is the encryption behind https; its handshake forms a signature called JA3/JA4). Only Scrapling's curl_cffi tier and Obscura's stealth build reproduce a browser-like handshake here, and WebRTC/DNS consistency still needs handling.
- **Layer 5 - Layout/rendering:** whether the page actually renders like a real browser, e.g. getBoundingClientRect values and genuine canvas output. Only real-browser tools pass this, which is why Obscura, having no real layout engine, fails here.
### How the eight tools compare
| Tool | Engine | Stealth approach | Best for |
| --- | --- | --- | --- |
| **Camoufox** | Firefox | C++ fingerprint + Juggler | Fingerprint rotation |
| **CloakBrowser** | Chromium | 33 C++ patches + humanize | Chromium C++ stealth |
| **Patchright** | Chromium | CDP patch (no Runtime.enable) | Playwright stealth |
| **XDriver** | Chromium | In-place driver patch | Quick Playwright stealth |
| **SeleniumBase** | Chrome | UC/CDP + PyAutoGUI | CAPTCHA solving |
| **Botasaurus** | Chrome | Bézier mouse + CDP events | Human behaviour |
| **Scrapling** | Mixed | Orchestrates the above + TLS | Full pipeline |
| **Obscura** | Rust/V8 | JS shim + optional TLS | Lightweight bulk |

Realistic success rates from the analysis, by how strong the target's defences are: basic protection (Cloudflare Free) hits 90%+ with the tool alone; medium protection (CF Pro, PerimeterX) lands at 60–80%; enterprise protection (Akamai, DataDome) manages only 20–40% — though that climbs to 70–85% once you add residential proxies (proxies that route through real home internet connections). Custom machine-learning defences stay under 20% even with good tooling.
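One way to read these percentages is as a cost multiplier. The sketch below assumes each attempt succeeds independently with probability p, so the expected number of attempts per successful page is 1/p. That is a simplification: real blocks are correlated by IP and session. The ranges are the ones quoted above (with "90%+" read as 90-100%), and the function name is illustrative.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean attempts per success under independent retries (geometric model)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


# (low, high) success-rate ranges quoted in the text above.
TIERS = {
    "basic (Cloudflare Free)": (0.90, 1.00),
    "medium (CF Pro, PerimeterX)": (0.60, 0.80),
    "enterprise, tool alone": (0.20, 0.40),
    "enterprise + residential proxies": (0.70, 0.85),
}

for name, (low, high) in TIERS.items():
    # Best case uses the high end of the range, worst case the low end.
    best, worst = expected_attempts(high), expected_attempts(low)
    print(f"{name}: {best:.2f} to {worst:.2f} attempts per page")
```

On these figures an enterprise target costs between 2.5 and 5 attempts per page with the tool alone, and fewer than 1.5 once residential proxies are added.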
### The hard truth — and where a managed API fits
No tool is truly undetectable, and detection is a constant arms race. The point the analysis keeps coming back to: **your IP address's reputation matters more than how clever your stealth is** — even a perfect fingerprint fails when it comes from a datacenter IP, and the TLS handshake signature is nearly impossible to fully fake from inside a real browser. Behaviour adds up too: scraping at the same rhythm eventually gets flagged no matter how lifelike your mouse movements are.
That is why teams running at high volume tend to push the hard parts onto a server instead. A managed API like Scrappey takes care of fingerprinting, residential proxies, and TLS impersonation behind a single request — you give up the control of running your own browser setup in exchange for not having to keep maintaining it as detection changes. For learning, testing, and keeping everything self-hosted, the open tools above are the right pick; for production at scale against hard targets, a managed layer takes the ongoing maintenance off your plate.
### Example
```python
# The recommended combination for enterprise targets:
# protocol stealth (Patchright) + fingerprint rotation (Camoufox),
# orchestrated by Scrapling and always behind residential proxies.
from scrapling.fetchers import StealthyFetcher

# StealthyFetcher exposes fetch(); keyword names vary between Scrapling
# releases, so check them against the version you have installed.
page = StealthyFetcher.fetch(
    "https://enterprise-protected.com",
    use_camoufox=True,  # Firefox C++ fingerprint engine (flag as given in the source)
    solve_cloudflare=True,
    proxy="http://user:pass@residential:port",
)
print(page.status, page.css_first("title::text"))
```
### FAQ
**Q: Which anti-detect browser tool is the best?**
There is no single best — it depends on the target. For fingerprint rotation use Camoufox; for Chromium C++ stealth use CloakBrowser; for Playwright code use Patchright; for UC-mode stealth use SeleniumBase; for human-like behaviour use Botasaurus; for a full pipeline that ties several tools together use Scrapling. For heavily-protected public pages, teams combine Patchright and Camoufox with residential proxies.
**Q: Are any of these tools truly undetectable?**
No. Every analysis in this comparison stresses that detection is an arms race and no tool is undetectable. The C++ tools (Camoufox, CloakBrowser) come closest at the fingerprint layer, but the TLS handshake signature, your IP's reputation, and behaviour patterns that build up over time still catch even a perfect fingerprint.
**Q: What matters more — the tool or the proxy?**
Usually the proxy. Your IP address's reputation is the single biggest factor: the best stealth tool fails from a datacenter IP, while a modest tool on a clean residential IP (one routed through a real home connection) often succeeds. Success rates against enterprise protection roughly double when you add residential proxies.
**Q: When should I use a managed API instead of these tools?**
When you scrape hard targets at scale and do not want to maintain a browser stack as detection keeps changing. Open tools give you control and are ideal for learning and self-hosting; a managed API like Scrappey handles fingerprinting, verification workflows, proxies, and TLS on its own servers so you do not have to keep up with the arms race yourself.
---
## What Is jsoup?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-jsoup
**jsoup is a free Java library that reads HTML and lets you pull data out of it.** You give it a web page, and it turns the raw HTML into a DOM (the tree of elements that makes up a page). From there you can find elements with CSS selectors - the same div.price style patterns you use in stylesheets - and grab their text or attributes. It is the go-to HTML parser for Java web scraping, just as Beautiful Soup is for Python.
### Quick facts
- **Language:** Java
- **Purpose:** HTML parsing + extraction via CSS selectors
- **Analogy:** Java's Beautiful Soup
- **Renders JavaScript?:** No - static HTML only
- **Best for:** Server-rendered pages on JVM stacks
### What jsoup does
jsoup takes HTML from a string, file, or URL and builds a clean DOM (the element tree of the page). You then query it with jQuery-style selectors: doc.select("div.price") returns every matching element, and you read each one's text or attributes. It also cleans up broken markup and can sanitize untrusted HTML - stripping out tags that could be unsafe. For pulling data from static pages, it is short to write and fast to run.
### When to use jsoup
Use jsoup when you are working in Java (or another JVM language) and scraping pages whose content is already in the HTML the server sends - product listings, articles, tables. It is a great fit for small-to-medium jobs where the data is right there in the initial HTML. It is not the right tool for JavaScript-heavy single-page apps, because jsoup does not run JavaScript - it only reads the HTML as delivered.
### jsoup's limits for modern scraping
jsoup only parses HTML. It does not run JavaScript, rotate proxies (swap the IP address each request comes from), or handle anti-bot defenses. On a protected site you will receive a 403 or a Cloudflare challenge page instead of the real content, and on a single-page app you will get an empty shell with no data. The usual fix is to first fetch fully rendered HTML through a web scraping API, then hand that HTML to jsoup to parse.
### FAQ
**Q: Is jsoup like Beautiful Soup?**
Yes - jsoup is basically the Java version of Python's Beautiful Soup: an HTML parser that lets you extract data using CSS selectors.
**Q: Can jsoup scrape JavaScript-rendered pages?**
No. jsoup reads static HTML and does not run JavaScript, so content that a page builds with JavaScript (like single-page apps) never shows up. Pair it with a headless browser or a scraping API that returns the fully rendered HTML.
**Q: Is jsoup free?**
Yes, it is open source under the MIT license.
**Q: Can jsoup handle Cloudflare or anti-bot systems?**
No - it has no proxy or anti-bot handling of its own. Send your requests through proxies or a scraping API first, then parse the returned HTML with jsoup.
---
## What Is Data Parsing?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-data-parsing
**Data parsing is the process of taking raw, messy data and turning it into a clean, structured format your program can use.** In web scraping, that means converting the tangled HTML a server sends back into neat fields - titles, prices, dates - that your application can store and search. Think of it as unpacking a shipping box and sorting the contents onto labeled shelves. Parsing is the step between fetching a page and actually having data you can work with.
### Quick facts
- **What it is:** Raw data into structured output
- **In scraping:** HTML into fields (JSON/CSV/DB)
- **Common tools:** Beautiful Soup, jsoup, lxml, regex, CSS/XPath
- **Inputs:** HTML, JSON, XML, plain text
- **Goal:** Clean, consistent, queryable data
### How data parsing works
A parser reads raw input and gives it structure in a few steps. First it tokenizes the text - splits it into meaningful chunks like tags, words, and symbols. Then it builds a model of the document; for HTML that model is a DOM tree (a nested map of every element on the page). From there you point at the pieces you want - usually with CSS selectors or XPath, two query languages for picking elements out of that tree - and convert them into typed values like numbers or dates. The output is a predictable shape, such as one JSON object per product, instead of a wall of markup.
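The steps above can be sketched with Python's standard library. This is a minimal illustration that assumes a small, well-formed snippet, so the stdlib XML parser can stand in for an HTML parser. Real pages are rarely well-formed and usually need lxml, Beautiful Soup, or similar. The markup, class names, and field names are invented for the example.

```python
import xml.etree.ElementTree as ET

RAW = """
<div id="catalog">
  <div class="product"><h2>Blue Mug</h2><span class="price">$12.50</span></div>
  <div class="product"><h2>Red Mug</h2><span class="price">$9.00</span></div>
</div>
"""

# Tokenizing and tree-building happen inside the parser; we get the tree root.
root = ET.fromstring(RAW)

products = []
# ElementTree supports a limited XPath subset, enough for attribute matches.
for node in root.findall(".//div[@class='product']"):
    title = node.findtext("h2")
    price_text = node.findtext("span[@class='price']")
    products.append({
        "title": title.strip(),
        # Convert the string "$12.50" into a typed number.
        "price": float(price_text.lstrip("$")),
    })

print(products)
```

The result is one dictionary per product with a string title and a numeric price, which is the "predictable shape" the parsing step is meant to produce.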
### Data parsing in web scraping
After you fetch a page, parsing is where the value gets created. You select the elements that hold each field, pull out their text or attributes, and normalize them - that means cleaning them up so every record looks the same: stripping currency symbols off prices, putting dates in one format, filling in or flagging missing fields. Done well, a single parser can turn thousands of slightly different pages into one clean, uniform dataset.
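Normalization is mostly small functions applied uniformly to every record. The sketch below assumes prices arrive as strings with a currency symbol, a dot as the decimal separator, and optional thousands commas, and that dates arrive in one of a few known formats. The formats, field names, and the choice to return `None` for missing or unparseable values are assumptions for illustration.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical list of date formats seen across the scraped pages.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_price(raw: Optional[str]) -> Optional[Decimal]:
    """Strip currency symbols and thousands commas; None if nothing numeric is left."""
    if not raw:
        return None
    cleaned = re.sub(r"[^\d.]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def normalize_date(raw: Optional[str]) -> Optional[date]:
    """Try each known format; return None (flagged as missing) if none match."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_record(raw: dict) -> dict:
    """Produce the same three typed fields for every record, with None for gaps."""
    return {
        "title": (raw.get("title") or "").strip() or None,
        "price": normalize_price(raw.get("price")),
        "date": normalize_date(raw.get("date")),
    }
```

Returning `None` rather than guessing keeps missing values visible, so a later validation step can count and flag them.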
### Getting clean structured output reliably
Parsers are fragile. When a site changes its markup, your selectors stop matching and data quietly disappears - no error, just empty fields. So resilient selectors (ones that don't depend on tiny layout details), validation, and monitoring all matter. To avoid the constant parser-maintenance treadmill, some scraping APIs return **already-structured** data for common targets, or at least fully rendered HTML that is clean enough to parse with a tool like jsoup or Beautiful Soup.
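Because broken selectors fail silently, a cheap guard is to measure how often each required field comes back empty and alert when that share jumps. This is a minimal sketch. The field names and the 5% threshold are assumptions, and a real pipeline would route the alert to whatever monitoring it already uses.

```python
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("title", "price")  # hypothetical schema
MAX_MISSING_SHARE = 0.05              # hypothetical alert threshold


def missing_share(records: Iterable[dict], fields=REQUIRED_FIELDS) -> Dict[str, float]:
    """Share of records in which each required field is absent or empty."""
    records = list(records)
    if not records:
        # An empty batch is itself a red flag: treat every field as fully missing.
        return {f: 1.0 for f in fields}
    return {
        f: sum(1 for r in records if r.get(f) in (None, "")) / len(records)
        for f in fields
    }


def broken_fields(records: Iterable[dict], threshold: float = MAX_MISSING_SHARE) -> List[str]:
    """Fields whose missing share exceeds the threshold, i.e. likely selector breakage."""
    return [f for f, share in missing_share(records).items() if share > threshold]
```

Running `broken_fields` on every batch turns a silent selector failure into an explicit list of field names to investigate.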
### FAQ
**Q: What's the difference between data parsing and data extraction?**
Extraction is getting the data out of the source; parsing is structuring that raw data into usable fields. In scraping the two overlap - you parse the fetched HTML in order to extract the data.
**Q: What tools parse HTML?**
Common ones are Beautiful Soup and lxml/PyQuery (Python), jsoup (Java), and Cheerio (Node). You can also use CSS selectors, XPath, or regex (pattern matching on text) for targeted cases.
**Q: Why does my parser keep breaking?**
Because sites change their markup, which breaks the selectors your parser relies on. Use resilient selectors, validate your output, and set up alerts on missing fields so you catch breakage early.
**Q: Can I get already-parsed structured data?**
Yes. Some scraping APIs return structured JSON for popular sites, so you don't have to write or maintain parsers yourself.
---
## What Is Web Scraping as a Service?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-web-scraping-as-a-service
**Web scraping as a service (WSaaS) is a managed, cloud-based offering that handles web data extraction for you through an API or dashboard - including the proxies, browsers, and anti-bot handling - so you don't build and maintain scraping infrastructure yourself.** In plain terms: for data you own or are permitted to access, you send the service a URL, and it sends back the page's data. The provider deals with all the parts that usually break - swapping IP addresses, handling bot-detection challenges, running a real browser - so you just make a request and get a result.
### Quick facts
- **What it is:** Managed scraping via API or dashboard
- **You skip:** Proxies, headless browsers, anti-bot handling, scaling
- **Delivery:** API request into HTML/JSON, or scheduled datasets
- **Buyers:** Devs & data teams without scraping infra
- **Vs DIY:** Faster, more reliable, pay-per-use
### What web scraping as a service handles for you
The hard, maintenance-heavy parts of scraping aren't reading the data off a page - they're the infrastructure around it. WSaaS takes care of: rotating residential proxies (sending each request from a different real-home IP address so you don't get blocked), solving CAPTCHAs, handling anti-bot challenges, rendering JavaScript (running a page's scripts so dynamic content actually appears), retrying failures, and scaling up. All of that lives behind a single endpoint, so for authorized targets your code goes from 'maintain a fleet of scrapers' to 'call an API.'
### When to use it
WSaaS makes sense in three situations. First, when your targets are protected by anti-bot services like Cloudflare, DataDome, or PerimeterX that block plain scripts. Second, when you need scale and uptime you don't want to babysit. Third, when running scrapers simply isn't your core business. The clearest signal: if you're spending more time fighting bans and proxies than actually using the data, it's time to hand that work off.
### DIY scrapers vs web scraping as a service
Building your own scraper gives you maximum control - but you also own the proxy bills, the bans, the fingerprint cat-and-mouse (the constant tweaking to keep your requests looking like a real browser), and the upkeep every time a site changes its defenses. WSaaS trades that ongoing, unpredictable cost for a flat per-request fee. Scrappey is a web scraping API that handles authorized browser and verification workflows on JavaScript-heavy sites you are permitted to access and returns the data, so you skip the infrastructure entirely (see also the Web Access API).
### FAQ
**Q: What's the difference between WSaaS and a scraping API?**
They're closely related: a scraping API is the most common form of web scraping as a service. WSaaS is the broad idea (someone else runs the scraping for you); the API is simply the way you tap into it - the request you send and the data you get back.
**Q: Is web scraping as a service legal?**
The service itself is just a tool - legality depends on what you scrape and how you use it. Sticking to public data, respecting each site's terms and the laws that apply to you, and avoiding personal or copyrighted data keep you on safer ground. This isn't legal advice.
**Q: Why not just build my own scraper?**
You can, but proxies, anti-bot handling, and constant maintenance cost real time and money - and they grow as your targets add more defenses. WSaaS offloads all of that so you can focus on using the data instead of keeping the pipeline alive.
**Q: What does web scraping as a service cost?**
Pricing is usually per-request or credit-based, so you pay only for what you use rather than paying to keep idle infrastructure running.
---
## What Is PyQuery?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-pyquery
**PyQuery is a Python library for parsing and manipulating HTML and XML using a jQuery-like syntax.** If you have used jQuery in the browser to grab elements, PyQuery gives you that same feel in Python. It is built on lxml (a fast, C-based HTML/XML parser), so you select elements with CSS selectors and chain operations the same way you would in the browser - a familiar, concise alternative to Beautiful Soup for pulling data out of markup.
### Quick facts
- **Language:** Python (built on lxml)
- **Syntax:** jQuery-style CSS selection
- **Purpose:** Parse/extract/manipulate HTML & XML
- **Renders JavaScript?:** No - static HTML only
- **Alternative to:** Beautiful Soup, lxml
### What PyQuery does
PyQuery loads HTML from a string, URL, or file and hands it to you through a jQuery-like API. You give it a CSS selector - for example pq('div.price') picks out matching elements - and then chain methods to read text, read attributes, or move around the DOM (the page's tree of nested tags). Because it sits on lxml, parsing is fast, and the short syntax feels natural to anyone comfortable with jQuery.
### PyQuery vs Beautiful Soup
Both parse static HTML in Python; the difference is mostly style. PyQuery wins on familiarity and brevity if you think in jQuery selectors; Beautiful Soup is more widely used, more forgiving of messy markup, and has a larger community. For raw speed, both can lean on lxml. The choice is mostly about ergonomics - they solve the same problem.
### PyQuery's scraping limits
Like other parsers, PyQuery only reads the HTML you give it - it doesn't run JavaScript and has no proxy or anti-bot handling. So on its own it can't render a single-page app (a site that builds its content with JavaScript in the browser) or handle a Cloudflare challenge. The reliable pattern is to fetch fully rendered HTML through a web scraping API, then parse it with PyQuery (or Beautiful Soup).
### FAQ
**Q: PyQuery vs Beautiful Soup - which should I use?**
Use PyQuery if you like jQuery-style selectors and concise chaining; use Beautiful Soup for its larger community and more forgiving parsing of messy HTML. Both handle static HTML well, so it largely comes down to which style you prefer.
**Q: Can PyQuery scrape JavaScript pages?**
No - it only parses static HTML, the raw markup as downloaded. For content built by JavaScript in the browser (an SPA), render the page first with a headless browser or scraping API, then parse the result with PyQuery.
**Q: Is PyQuery still maintained?**
It's a mature, stable library built on lxml. It changes less often than Beautiful Soup, but it remains reliable and usable for HTML/XML parsing.
**Q: Does PyQuery handle anti-bot protection?**
No. It has no proxy or anti-bot features of its own - pair it with rotating proxies or a scraping API to reach sites that block plain requests.
---
## Browser Automation Engine Benchmarks
URL: https://scrappey.com/qa/web-scraping-apis/browser-automation-engine-benchmark
**A browser-automation-engine benchmark drives several automation stacks through the same set of targets and records, side by side, how often each one reaches real page content, how much memory and CPU it burns, and how its fingerprint scores.** Instead of arguing which engine is "most human", a benchmark runs every engine - real Playwright, patched builds like Patchright, Firefox-based Camoufox, Selenium/SeleniumBase, CDP drivers, and anti-detect browsers - against the same checklist and reports the numbers. The open-source techinz/browsers-benchmark suite (Python 3.8+, MIT) is a good reference implementation, and the tables below are taken from its published example run.
### Quick facts
- **What it scores:** Detection-pass rate, memory + CPU, reCAPTCHA v3 score, CreepJS trust, IP/WebRTC leak
- **Engines covered:** 23 configs: Playwright, Patchright, Camoufox, Selenium/SeleniumBase, NoDriver/ZenDriver, anti-detect browsers
- **Targets probed:** A spread of modern protection stacks plus major retail, search, and social sites
- **Biggest confound:** IP reputation - a flagged IP sinks even a perfect fingerprint, so a clean proxy per engine is required
- **Source:** github.com/techinz/browsers-benchmark (Python, MIT) - pluggable targets + engines
### The four families of metrics
A meaningful benchmark separates capability from cost and measures both on the same run. The metrics group into four families:
- **Detection-pass rate** - the share of targets where the engine reached the real page instead of an interstitial or verification screen. Targets span a range of modern protection stacks, so the rate is a coarse measure of how "real" the session looks end to end.
- **Resource cost** - peak memory (MB) and CPU (%) per engine. This is where the differences are largest: a lightweight anti-detect profile can run at ~120 MB while a full headful Chromium or Firefox session uses 1,000-1,400 MB. At fleet scale this decides how many sessions fit on a box.
- **Fingerprint scores** - a reCAPTCHA v3 score (0 = bot, 1 = human) read from a public scorer, and a CreepJS trust/bot reading. These probe how coherent the browser fingerprint looks to a real detector rather than to a single boolean check.
- **Network hygiene** - whether the IP seen by the site is the proxy IP (good) or the real IP (bad), and whether WebRTC leaks a different address. A stealthy fingerprint over a leaking connection is still caught.
### Sample results from one benchmark run
These tables are from the suite's published example run (full data, charts, and screenshots are in the repo under results/example). Read them as broad tiers, not a precise leaderboard: a single run is noisy, and each proxied engine used a different clean proxy (the two Selenium configs ran without one). The repo labels its first column "bypass rate"; it is really an access rate, the share of target sites where the engine reached real content.
**Detection-pass rate** (higher is better):

| Engine | Pass rate (%) |
| --- | --- |
| **patchright** | 100.0 |
| **cloakbrowser** | 90.0 |
| **camoufox_headless** | 90.0 |
| **nodriver-chrome** | 80.0 |
| **adspower** | 80.0 |
| **seleniumbase-cdp-chrome** | 80.0 |
| **adspower_headless** | 70.0 |
| **tf-playwright-stealth-firefox** | 70.0 |
| **tf-playwright-stealth-firefox_headless** | 70.0 |
| **zendriver-chrome** | 70.0 |
| **cloakbrowser_headless** | 60.0 |
| **tf-playwright-stealth-chromium** | 60.0 |
| **playwright-chrome** | 60.0 |
| **playwright-firefox_headless** | 60.0 |
| **playwright-firefox** | 60.0 |
| **zendriver-chrome_headless** | 60.0 |
| **tf-playwright-stealth-chromium_headless** | 50.0 |
| **selenium-chrome (no proxy)** | 50.0 |
| **playwright-chrome_headless** | 40.0 |
| **nodriver-chrome_headless** | 40.0 |
| **patchright_headless** | 40.0 |
| **camoufox** | 30.0 |
| **selenium-chrome_headless (no proxy)** | 30.0 |

**Resource cost** - peak memory and CPU per engine (lower is better). Note how the lightest options (the anti-detect profiles) and the best-scoring options sit at opposite ends, so the "winner" depends on whether you optimise for pass rate or for sessions-per-server:

| Engine | Memory (MB) | CPU (%) |
| --- | --- | --- |
| **adspower_headless** | 123 | 4.9 |
| **adspower** | 130 | 7.0 |
| **playwright-chrome_headless** | 517 | 7.4 |
| **tf-playwright-stealth-chromium_headless** | 522 | 8.2 |
| **zendriver-chrome_headless** | 818 | 21.5 |
| **tf-playwright-stealth-chromium** | 913 | 28.6 |
| **cloakbrowser_headless** | 936 | 27.2 |
| **selenium-chrome_headless (no proxy)** | 948 | 10.1 |
| **zendriver-chrome** | 1009 | 38.8 |
| **selenium-chrome (no proxy)** | 1034 | 29.8 |
| **tf-playwright-stealth-firefox_headless** | 1034 | 44.7 |
| **playwright-firefox_headless** | 1068 | 70.3 |
| **cloakbrowser** | 1082 | 54.4 |
| **playwright-chrome** | 1103 | 28.5 |
| **seleniumbase-cdp-chrome** | 1157 | 51.9 |
| **playwright-firefox** | 1161 | 75.5 |
| **camoufox** | 1181 | 53.3 |
| **tf-playwright-stealth-firefox** | 1261 | 67.7 |
| **nodriver-chrome_headless** | 1275 | 24.4 |
| **patchright_headless** | 1277 | 16.6 |
| **patchright** | 1314 | 53.3 |
| **camoufox_headless** | 1318 | 88.6 |
| **nodriver-chrome** | 1389 | 47.4 |

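
To see what these memory figures mean at fleet scale, divide usable RAM by peak memory per session. The sketch below assumes memory is the only bottleneck (CPU often binds first for the heavier engines) and reserves a fixed share of RAM for the OS. The 64 GB server and 20% headroom are illustrative assumptions; the per-engine figures come from the sample run above.

```python
def sessions_per_server(ram_gb: float, peak_mb: float, headroom: float = 0.20) -> int:
    """Whole sessions that fit in RAM after reserving `headroom` for the OS."""
    usable_mb = ram_gb * 1024 * (1 - headroom)
    return int(usable_mb // peak_mb)


# Peak memory (MB) for three engines from the sample run.
PEAK_MB = {
    "adspower_headless": 123,
    "playwright-chrome_headless": 517,
    "patchright": 1314,
}

for engine, mb in PEAK_MB.items():
    print(f"{engine}: {sessions_per_server(64, mb)} sessions on a 64 GB server")
```

Under these assumptions the lightest profile fits roughly ten times as many sessions as the top-scoring engine, which is the trade-off the resource table is pointing at.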
**Fingerprint scores.** In this run almost every working engine scored ~0.90 on reCAPTCHA v3 - a reminder that one score separates the obviously-broken from the rest, not the good from the great. A few configs returned no score because the scorer page stopped responding mid-test. CreepJS trust/bot percentages all read 0.00 here because CreepJS upstream temporarily disabled those scores, so in practice the suite now leans on CreepJS mainly for the WebRTC-leak check: if the WebRTC IP differs from the real IP, the proxy is not leaking.
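The network-hygiene check reduces to comparing three addresses. The sketch below shows that logic only and is not the suite's actual code: the function and its labels are hypothetical, and the IPs used with it should be treated as placeholders.

```python
import ipaddress
from typing import Optional


def classify_leak(real_ip: str, site_seen_ip: str, webrtc_ip: Optional[str]) -> str:
    """Classify a session by which channel, if any, exposes the machine's real IP."""
    real = ipaddress.ip_address(real_ip)
    if ipaddress.ip_address(site_seen_ip) == real:
        return "ip-leak"      # the site saw the real IP: the proxy was not applied
    if webrtc_ip is not None and ipaddress.ip_address(webrtc_ip) == real:
        return "webrtc-leak"  # HTTP went through the proxy, but WebRTC exposed the real IP
    return "clean"            # neither channel revealed the real IP
```

Parsing with `ipaddress` rather than comparing strings means equivalent spellings of the same address (for example, differently compressed IPv6 forms) still match.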
### What the numbers tend to show
Across published runs a few patterns repeat. **Headful beats headless** on the same engine almost every time - the headless variant of an engine routinely lands 20-40 points lower on detection-pass rate, because headless-specific tells leak through (note in the table how camoufox headful scores 30 but camoufox_headless scores 90, a reminder that per-engine tuning, not just the mode, drives the result). **Patched and engine-level stealth lead**: Patchright (a CDP-patched Playwright that avoids the Runtime.enable tell) and Camoufox (which sets fingerprint values from inside the Firefox engine rather than via injected JavaScript) tend to top the table, while plain Playwright and plain Selenium sit lower. **Stealth costs resources**: the engines near the top on detection are often the heaviest on CPU, so optimise for the axis you care about. **Anti-detect browsers trade off differently** - a managed anti-detect profile can be by far the lightest option (~120-130 MB) while still scoring mid-pack on detection.
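The headful-versus-headless gap can be read straight off the sample pass-rate table. This small sketch pairs each engine with its `_headless` variant and prints the difference. The numbers are copied from the sample run above, and only engines that appear in both modes are included.

```python
# (headful, headless) pass rates in percent, from the sample run.
PASS_RATE = {
    "patchright": (100, 40),
    "cloakbrowser": (90, 60),
    "camoufox": (30, 90),
    "nodriver-chrome": (80, 40),
    "adspower": (80, 70),
    "zendriver-chrome": (70, 60),
    "tf-playwright-stealth-firefox": (70, 70),
    "tf-playwright-stealth-chromium": (60, 50),
    "playwright-chrome": (60, 40),
    "playwright-firefox": (60, 60),
    "selenium-chrome (no proxy)": (50, 30),
}

# Positive gap = headful scored higher than headless.
gaps = {name: headful - headless for name, (headful, headless) in PASS_RATE.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:35s} {gap:+d}")

headful_wins = sum(1 for g in gaps.values() if g > 0)
print(f"headful ahead in {headful_wins} of {len(gaps)} pairs")
```

In this run headful is ahead in 8 of 11 pairs, tied in 2, and behind in 1 (camoufox), with positive gaps ranging from 10 to 60 points.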
### Running and extending the suite
The suite is a Python project with a modular layout (config/, engines/, utils/targets/, utils/report/) so both *what* it tests and *which* engines it runs are pluggable. The essentials:
- **Setup** - create a venv, pip install -r requirements.txt, then install the browser engines you want: playwright install, camoufox fetch, patchright install chromium. Anti-detect browsers that need a local desktop app and API key are optional.
- **Proxies are required, one per engine** - list them in documents/proxies.txt, at least as many as the engines you test. Protocols matter: Playwright takes HTTP/HTTPS only, NoDriver takes SOCKS5 only, and Selenium runs without a proxy - so you need a mix. The benchmark reports which protocols are missing.
- **Run** - python main.py produces summary.md, benchmark_results.json, and a media/ folder of dashboard charts and per-target screenshots.
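
The protocol constraint from the proxies step can be checked before a run. The sketch below assumes proxies are written as URLs with a scheme (the actual format of documents/proxies.txt may differ) and applies the protocol rules stated above. The engine-family keys and the function are hypothetical, not the suite's own check.

```python
from urllib.parse import urlsplit

# Accepted proxy schemes per engine family, per the constraints described above.
ACCEPTED = {
    "playwright": {"http", "https"},
    "nodriver": {"socks5"},
}


def missing_protocols(proxy_urls, engine_families):
    """Return {family: accepted schemes} for families with no usable proxy."""
    available = {urlsplit(u.strip()).scheme.lower() for u in proxy_urls if u.strip()}
    return {
        family: sorted(ACCEPTED[family])
        for family in engine_families
        if family in ACCEPTED and not (ACCEPTED[family] & available)
    }
```

An empty result means every listed engine family has at least one proxy it can use; this sketch checks protocols only, not whether there are as many proxies as engines.
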
Adding a **target** is a Target(...) definition plus a check function that returns True when real content rendered (see the code example below). Adding an **engine** means subclassing the right base and registering it:
```python
# engines/ - subclass the base that matches your stack (keep only one)
class CustomEngine(BrowserEngine): ...    # from scratch
class CustomEngine(PlaywrightBase): ...   # Playwright-based
class CustomEngine(SeleniumBase): ...     # Selenium-based

# then register it in config/engines.py
base_engines = [
    {
        "class": CustomEngine,
        "params": {"headless": True, "name": "custom_engine", "browser_type": "chromium"},
    },
]
```

That extensibility is the practical value of the project: it is less a fixed leaderboard than a harness you point at your own targets and engines.
### Reading the results without fooling yourself
The single most important caveat in any benchmark like this is that **IP reputation usually matters more than the engine**. The suite requires a clean proxy precisely because a home or datacenter IP that has been flagged by prior automation will fail targets regardless of how good the fingerprint is - so you would be measuring your IP, not the engine. Results are also noisy per run (a target can be down, rate-limiting can kick in, a fingerprint scorer can change), which is why these tables are best read as broad tiers, not leaderboards to two decimal places.
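A quick interval calculation shows why tiers are the right granularity. The sketch below computes a 95% Wilson score interval for a pass rate. It assumes 10 targets per engine (inferred from the sample pass rates all being multiples of 10, not stated by the suite) and treats targets as independent trials, which is only approximately true.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


for passes in (9, 6):
    low, high = wilson_interval(passes, 10)
    print(f"{passes}/10 passed: {low:.0%} to {high:.0%}")
```

With only 10 targets, the intervals for a 90% engine and a 60% engine overlap, so a 30-point gap in a single run is weak evidence of a real difference between the two.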
The deeper lesson is the same one that runs through fingerprinting in general: passing depends on coherence, not on any one trick. The engines that win are the ones whose TLS handshake, headers, fingerprint surfaces, and exit IP all tell one consistent story - not the ones that spoof a single field hardest. That coherence is also why teams running at scale often move the hard parts server-side: a managed web-data API such as Scrappey handles fingerprinting, residential routing, and TLS matching behind one request, so the coherence is maintained for you as detection evolves - while a self-hosted engine from a benchmark like this remains the right pick for learning, testing, and full control.
### Example
```python
# A browser-engine benchmark drives every engine through the SAME targets
# and records pass/fail, resource use, and fingerprint scores per engine.
# Targets are pluggable - a check function returns True when real content rendered.
from engines.base import BrowserEngine


async def check_render(engine: BrowserEngine) -> bool:
    # True  -> the engine reached the real page
    # False -> it was held at a verification / interstitial screen
    blocked, _html = await engine.locator('//div[@id="challenge"]')
    return not blocked


# Each engine is run headless AND headful, because the headless variant
# usually scores lower on the same detection and fingerprint checks.
# Every engine gets its own clean proxy so you measure the ENGINE,
# not the reputation of one shared IP.
```
### FAQ
**Q: What does a browser automation benchmark actually measure?**
Four things at once, per engine: how often it reaches real page content across a set of targets (detection-pass rate), how much memory and CPU it uses, how its fingerprint scores on probes like reCAPTCHA v3 and CreepJS, and whether the connection leaks the real IP (checked by comparing the IP the site sees with the proxy IP, and by looking at WebRTC). Measuring capability and cost together is the point - the best stealth engine is often the most resource-hungry.
**Q: Why do headless browsers score worse in these benchmarks?**
Because headless mode leaves detectable tells - missing or default window/screen metrics, rendering differences, and other headless-specific signals - that a real detector reads. On the same engine, the headless variant typically lands well below the headful one on detection-pass rate, which is why benchmarks run both.
**Q: Why does the benchmark insist on a clean proxy for every engine?**
Because IP reputation usually outweighs the engine. A home or datacenter IP already flagged by past automation will fail targets no matter how good the fingerprint is, so you would be measuring the IP rather than the engine. Giving each engine its own clean proxy isolates the variable you actually want to compare.
---
## How Do You Choose an Anti-Detect Browser Tool?
URL: https://scrappey.com/qa/web-scraping-apis/how-to-choose-an-anti-detect-browser-tool
**Choosing an anti-detect browser tool comes down to matching the tool's strengths to the detection layer you actually face - no single tool is best at everything, and none is truly undetectable.** The eight most-discussed open-source options each cover a different slice: custom browser builds (Camoufox on Firefox, CloakBrowser on Chromium) harden fingerprints at the C++ level; patched Playwright builds (Patchright, XDriver) hide the automation protocol; Selenium wrappers (SeleniumBase, Botasaurus) add UC-mode stealth and human-like behaviour; and frameworks/engines (Scrapling, Obscura) trade real rendering for speed. This is the practical decision companion to the technical tool comparison, built from the open-source source-code analysis at pim97/anti-detect-browser-tools-tech-comparison (which is sponsored by Scrappey).
### Quick facts
- **Tools compared:** Camoufox, Patchright, SeleniumBase, Botasaurus, XDriver, CloakBrowser, Scrapling, Obscura
- **C++ engine stealth:** Camoufox (Firefox) and CloakBrowser (Chromium) set fingerprint values natively
- **Lightest:** Scrapling HTTP tier (~10 MB) and Obscura (~30 MB) vs ~200 MB for real browsers
- **Decision rule:** Match the tool to the detection layer; combine tools for hard targets
- **Universal truth:** IP reputation and TLS fingerprint matter more than fingerprint polish
### The capability matrix
The fastest way to choose is to compare what each tool actually hardens. The C++ builds set values inside the engine (so injected JavaScript cannot detect the patch); the Playwright patches focus on the automation protocol; the HTTP-tier tools impersonate the TLS handshake instead of rendering a real page. Memory footprint splits the field sharply - real browsers sit near 200 MB while the HTTP/engine tools are an order of magnitude lighter.

| Tool | Type | webdriver flag | Runtime.enable | FP rotation | Human mouse | TLS impersonation | Memory |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Camoufox** | Firefox build | C++ native | Juggler (no CDP) | high | med | no | ~200 MB |
| **Patchright** | Playwright patch | yes | yes | low | low | no | ~200 MB |
| **SeleniumBase** | Selenium + UC | yes | yes | low | med | no | ~200 MB |
| **Botasaurus** | Selenium wrapper | yes | no | low | high | no | ~200 MB |
| **XDriver** | Playwright CDP patch | yes | yes | no | low | no | ~200 MB |
| **CloakBrowser** | Chromium build | C++ native | yes | med | high | no | ~200 MB |
| **Scrapling** | All-in-one framework | via Patchright/Camoufox | via Patchright | no | no | yes (curl_cffi HTTP) | ~10 MB HTTP |
| **Obscura** | Rust V8 engine | JS shim | n/a (custom engine) | small pool | no | optional | ~30 MB |

"FP rotation" is fingerprint rotation; "TLS impersonation" reproduces a browser-like JA3/JA4 handshake. Real browsers (top six) all render genuine layout and canvas/WebGL; Obscura has no layout engine, so layout-probe checks (getBoundingClientRect) return zeros.
### Match the tool to the detection layer
Detection happens in five stacked layers, and tools differ by which they cover. **Layer 1, protocol** - automation tells like CDP's Runtime.enable timing (hardened by Patchright, XDriver, CloakBrowser; sidestepped entirely by Camoufox via Firefox's Juggler). **Layer 2, fingerprinting** - canvas/WebGL/audio/screen values, set natively only by the C++ builds. **Layer 3, behavioural** - mouse and timing patterns (Botasaurus and CloakBrowser lead). **Layer 4, network** - TLS fingerprint, WebRTC/DNS leakage, and IP reputation. **Layer 5, layout/rendering** - real-browser-only checks that headless engines without a layout engine fail. The practical mapping:

| Your situation | Pick | Why |
| --- | --- | --- |
| **Truly native fingerprint spoofing + rotation** | Camoufox | C++-level values JS cannot detect; statistically accurate profiles |
| **Maximum Chromium stealth, free** | Patchright | Protocol-level CDP hardening (no Runtime.enable tell) |
| **Chromium C++ stealth + Playwright API** | CloakBrowser | 33 source-level patches + one-flag humanize |
| **Human-like mouse behaviour** | Botasaurus | Best Bezier-curve mouse implementation |
| **Built-in Turnstile/reCAPTCHA (UC mode)** | SeleniumBase | Built-in Turnstile/reCAPTCHA handling in UC mode |
| **Drop-in for existing Playwright code** | XDriver / CloakBrowser | No code changes; replace the import |
| **All-in-one fetch + parse + crawl** | Scrapling | TLS-impersonated HTTP tier + adaptive selectors + spider |
| **Lightweight high-concurrency** | Obscura | ~30 MB / ~85 ms page load, single Rust binary, no Chrome |

### Realistic success rates
The honest numbers from the analysis, grouped by how strong the target's protection is. Notice how much the proxy column moves the result - clean residential IPs more than double the success rate against the two hardest tiers, which is the analysis's central point: **IP reputation usually matters more than the tool**.

| Protection level | Tool alone | + residential proxies |
| --- | --- | --- |
| **Basic (simple checks)** | 90%+ | 99%+ |
| **Medium** | 60-80% | 90%+ |
| **Enterprise** | 20-40% | 70-85% |
| **Custom ML-based** | under 20% | 50-70% |

For the hardest tiers the recommended pattern is to combine layers - for example native fingerprint rotation (Camoufox) with protocol stealth (Patchright), always behind residential proxies. Validate any setup against public detector pages (Sannysoft, BrowserScan, CreepJS, Pixelscan) before trusting it in production.
### The hard truth - and where a managed API fits
The analysis is blunt about the ceiling: **no tool is truly undetectable, and detection is an arms race**. Three realities dominate - a flagged or datacenter IP sinks even a perfect fingerprint; the TLS handshake signature is nearly impossible to fully fake from inside a real browser; and a steady scraping rhythm accumulates into a behavioural signal no mouse-curve can hide. That is why the maintenance cost of a self-hosted stealth stack grows over time as detection evolves. Open tools remain the right pick for learning, testing, and full self-hosted control. For production at scale against hard targets, many teams instead push the hard parts onto a server: a managed web-data API such as Scrappey handles fingerprinting, residential routing, and TLS matching behind one request, trading the control of running your own browser for not having to maintain it as detection changes. The full per-tool source-code breakdowns live in the comparison repository.
### Example
```text
Decision shortcut - pick by the layer that's blocking you:

  Failing on automation tells (Runtime.enable, webdriver)
    -> Patchright / XDriver / CloakBrowser (protocol stealth)
  Failing on canvas / WebGL / audio fingerprint
    -> Camoufox or CloakBrowser (C++ native values)
  Failing on mouse / timing behaviour
    -> Botasaurus or CloakBrowser (human movement)
  Failing on a verification workflow
    -> SeleniumBase (UC mode)
  Need speed / high concurrency, pages are easy
    -> Scrapling HTTP tier (~10 MB) or Obscura (~30 MB)

  Hard enterprise target:
    Camoufox (fingerprint rotation) + Patchright (protocol)
    + residential proxies   # IP reputation dominates
```
### FAQ
**Q: Which anti-detect browser tool is the best?**
There is no single best - it depends on the detection layer you face. For native fingerprint rotation use Camoufox; for Chromium C++ stealth use CloakBrowser; for protocol stealth with the Playwright API use Patchright; for UC-mode stealth use SeleniumBase; for human-like behaviour use Botasaurus; for an all-in-one framework use Scrapling; for lightweight high-concurrency use Obscura. On heavily-protected public pages, teams typically combine fingerprint rotation, protocol-level browser tooling, and residential proxies.
**Q: What matters more, the tool or the proxy?**
Usually the proxy. IP reputation is the single biggest factor: the best stealth tool fails from a flagged datacenter IP, while a modest tool on a clean residential IP often succeeds. In the measured success rates, adding residential proxies more than doubles the result against enterprise-grade protection (from 20-40% to 70-85%).
**Q: Which tools are the lightest on resources?**
The non-browser options. Scrapling's HTTP tier (TLS impersonation via curl_cffi) runs around 10 MB and Obscura's Rust V8 engine around 30 MB, versus roughly 200 MB for any real-browser tool. The trade-off is that the lightweight tools do not render real layout, so they fail layout- and canvas-based checks that need a genuine browser.
**Q: When should I use a managed API instead of these tools?**
When you scrape hard targets at scale and do not want to maintain a browser stack as detection keeps changing. The open tools give you control and are ideal for learning and self-hosting; a managed API like Scrappey handles fingerprinting, verification workflows, residential routing, and TLS on its own servers so you do not have to keep up with the arms race yourself.
---
## What Is a User Agent?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-user-agent
**A user agent is a short text string a client sends in the User-Agent HTTP header to tell a server what software is making the request.** Every time a browser loads a page it announces itself - the browser name and version, the rendering engine, and the operating system. A real Chrome request looks like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36. Servers read this to decide what to send back, and bot-detection systems read it as the very first signal of whether a request came from a real browser or a script.
### Quick facts
- **Lives in:** The User-Agent HTTP request header
- **Format:** Product tokens: browser, engine, OS, version
- **Default for scripts:** python-requests/2.x, curl/8.x, Go-http-client - instant giveaways
- **Used for:** Content negotiation, analytics, and bot detection
- **On its own:** Weak signal - easily spoofed, so detectors cross-check it
### How a user agent works
The user agent travels in the User-Agent request header, one of the standard headers sent with every HTTP request. A server can branch on it - serving a lighter page to an old browser, a mobile layout to a phone, or a block page to something it does not recognize. The string is purely advisory: nothing stops a client from sending whatever it wants, which is exactly why a scraper can set a Chrome user agent even though no Chrome is involved. The default user agents that HTTP libraries send are the problem. Out of the box, Python's requests library sends python-requests/2.31.0, curl sends curl/8.4.0, and Go sends Go-http-client/1.1. Any of those is a one-line rule for a site to block, because no human visitor ever sends them.
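As a small illustration of the point about defaults, the sketch below uses only Python's standard library (urllib rather than requests, so it runs without extra packages) to show what an unconfigured client announces and how a script overrides it. The URL is a placeholder and no request is actually sent.

```python
import urllib.request

# An unconfigured opener announces itself as "Python-urllib/<version>".
default_opener = urllib.request.build_opener()
default_headers = dict(default_opener.addheaders)
print(default_headers["User-agent"])  # something like Python-urllib/3.12

# The header is advisory: a script can claim to be any browser it likes.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/126.0.0.0 Safari/537.36"
)
req = urllib.request.Request(
    "https://example.com",  # placeholder URL; nothing is fetched here
    headers={"User-Agent": CHROME_UA},
)

# urllib stores header names capitalized as "User-agent" internally.
claimed_ua = req.get_header("User-agent")
print(claimed_ua)
```

The first print is the one-line giveaway a site can block on; the second shows that the claim is free to make, which is why detectors do not trust it on its own.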
### Why user agents matter for web scraping
Setting a realistic user agent is the single cheapest improvement you can make to a scraper, and the most common rookie mistake is forgetting it. But a believable string is necessary, not sufficient. Modern bot detection treats the user agent as one claim among dozens and checks whether the rest of the request agrees with it. If you claim to be Chrome 126 on Windows but your TLS handshake fingerprint matches a Python library, your header order is alphabetical (browsers use a fixed, non-alphabetical order), and your Client Hints are missing, the user agent becomes evidence against you rather than cover. A coherent identity beats a fancy user agent every time.
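To make the cross-checking idea concrete, here is a deliberately tiny, hypothetical detector written from the server's point of view. The function name, the three rules, and the sample headers are assumptions for illustration only; real detection systems weigh far more signals (TLS in particular) and do not work from a simple rule list like this.

```python
import re


def coherence_problems(headers: dict) -> list:
    """Return reasons the headers contradict the User-Agent they carry."""
    problems = []
    ua = headers.get("User-Agent", "")
    ua_chrome = re.search(r"Chrome/(\d+)", ua)

    if ua_chrome:
        hints = headers.get("Sec-CH-UA")
        if hints is None:
            problems.append("claims Chrome but sends no Sec-CH-UA Client Hints")
        elif ua_chrome.group(1) not in set(re.findall(r'v="(\d+)"', hints)):
            problems.append("Chrome version differs between User-Agent and Sec-CH-UA")

    # Dicts keep insertion order, so this reflects the order headers arrived in.
    names = [name.lower() for name in headers]
    if len(names) > 2 and names == sorted(names):
        problems.append("headers arrive in alphabetical order")

    return problems


CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36")

script_like = {
    "Accept": "*/*",
    "Accept-Language": "en-US",
    "User-Agent": CHROME_UA,
}
browser_like = {
    "Sec-CH-UA": '"Chromium";v="126", "Google Chrome";v="126"',  # illustrative value
    "User-Agent": CHROME_UA,
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
}

print(coherence_problems(script_like))
print(coherence_problems(browser_like))
```

Both requests send the same user agent string; only the one whose other headers agree with it comes back clean.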
### Rotating and matching user agents
Two rules cover most real-world use. First, rotate user agents from a pool of current, real strings so a long job does not send ten thousand requests under the exact same identity - but keep the pool fresh, because a Chrome 90 string in 2026 is as suspicious as a default one. Second, and more important, match every other signal to the user agent you send: the Sec-CH-UA Client Hints, the Accept-Language header, the navigator properties a headless browser exposes, and the TLS fingerprint all have to tell the same story. Mismatched signals are exactly what bot-detection systems look for.
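One way to apply both rules at once is to rotate whole header profiles rather than bare user agent strings. The sketch below assumes a hand-maintained pool; the profile values are illustrative rather than captured from real browsers, and `pick_profile` is a hypothetical helper name.

```python
import random

# Each profile bundles headers that belong together (values are illustrative).
PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
        "Sec-CH-UA": '"Chromium";v="126", "Google Chrome";v="126"',
        "Sec-CH-UA-Platform": '"Windows"',
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Sec-CH-UA": '"Chromium";v="125", "Google Chrome";v="125"',
        "Sec-CH-UA-Platform": '"macOS"',
        "Accept-Language": "en-GB,en;q=0.9",
    },
]


def pick_profile(rng=random):
    """Rotate by whole profile so headers never mix across identities."""
    return dict(rng.choice(PROFILES))  # copy, so callers can add headers safely


headers = pick_profile()
print(headers["User-Agent"])
print(headers["Sec-CH-UA-Platform"])
```

This only keeps the HTTP headers consistent with each other. The TLS fingerprint is set by the client library, not by headers, so it still has to be matched separately.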
### Example
```python
import requests

# A realistic, current user agent (and matching headers)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
resp = requests.get('https://example.com', headers=headers)

# Note: the user agent alone is not enough - TLS and header order
# still reveal 'requests'. A scraping API matches all signals for you.
```
### FAQ
**Q: What does a user agent string look like?**
A typical desktop Chrome user agent is 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36'. It lists a compatibility token, the rendering engine, the OS, and the browser version. The 'Mozilla/5.0' prefix is a historical artifact that nearly every browser still sends.
**Q: Is changing my user agent enough to avoid blocks?**
No. Setting a realistic user agent removes the most obvious giveaway, but detectors cross-check it against your TLS fingerprint, header order, Client Hints, and behavior. If those contradict the user agent you claim, the mismatch flags you. A good user agent is the floor, not the ceiling.
**Q: Should I rotate user agents when scraping?**
Rotating from a pool of current, real user agents helps spread a large job across more plausible identities. But rotating the user agent while leaving every other signal identical does little - the strings have to be current, and the rest of the fingerprint has to match each one.
**Q: Where is the user agent set?**
It is the value of the User-Agent HTTP request header. In code you set it explicitly (a headers dict in Python, a header option in curl). Browsers set it automatically based on the build. Servers and analytics tools read it from the incoming request.
---
## What Is Rate Limiting?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-rate-limiting
**Rate limiting is a control that caps how many requests a single client can make to a server within a fixed time window.** A site might allow 60 requests per minute per IP address, or 1,000 per hour per API key. Go over the limit and the server stops serving you - usually with an HTTP 429 Too Many Requests response, sometimes a temporary block. Sites use it to protect their infrastructure from overload, keep one user from hogging capacity, and slow down automated traffic.
### Quick facts
- **Measured in:** Requests per second, minute, hour, or day
- **Keyed by:** IP address, API key, account, or session
- **Typical signal:** HTTP 429, sometimes 503; Retry-After header
- **Common algorithms:** Token bucket, leaky bucket, fixed/sliding window
- **Scraper fix:** Throttle, back off on 429, spread load across IPs
### How rate limiting works
A server tracks how many requests have come from a given key - most often an IP address or an API token - and compares that count against an allowance. The two most common implementations are the token bucket and the sliding window. A token bucket hands each client a bucket that refills at a steady rate (say, one token per second up to a cap); each request spends a token, and when the bucket is empty, requests are rejected until it refills. A sliding window counts requests over the trailing N seconds. When you exceed the allowance, the server returns a 429 status, often with a Retry-After header telling you how long to wait (either a number of seconds or an HTTP date). Well-behaved clients read that header and pause; clients that ignore it and keep hammering frequently earn a longer, harder block.
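The two algorithms are short enough to sketch in full. The classes below are a minimal, single-process illustration rather than how any particular server implements its limiter; the class names and the injectable `clock` parameter (added so the behavior can be checked without real waiting) are assumptions.

```python
import time
from collections import deque


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`; a request spends one."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Credit the tokens earned since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a real server would answer 429 here


class SlidingWindow:
    """Allows at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.stamps = deque()

    def allow(self):
        now = self.clock()
        # Forget requests that have slid out of the trailing window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False
```

The token bucket tolerates a burst of up to `capacity` requests and then settles to the refill rate, while the sliding window never admits more than `limit` requests in any `window`-second span.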
### Why rate limiting matters for web scraping
Rate limiting is the wall most scrapers hit first, before any CAPTCHA or fingerprint check. Fire requests as fast as your code can loop and you will trip a per-IP limit within seconds, and from then on you get a stream of 429s instead of data. The naive fix - add a sleep between requests - works for small jobs but does not scale: one IP at a polite rate is still one IP, and the whole job runs at that single IP's allowance. The real fix is to spread the load. Distribute requests across a pool of rotating proxies so no single IP exceeds the per-IP cap, while keeping each individual IP's rate plausibly human.
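The arithmetic behind "one IP = one allowance" can be sketched with a small planner. This is a simplified model, not a production scheduler: `schedule` is a hypothetical helper, the addresses come from the 192.0.2.0/24 documentation range, and it only plans send times without sending anything.

```python
import heapq


def schedule(num_requests, proxies, per_ip_interval):
    """Plan (send_time, proxy) pairs so that no proxy fires more often
    than once every `per_ip_interval` seconds."""
    # Min-heap of (next_free_time, proxy); every proxy is free at t=0.
    free_at = [(0.0, proxy) for proxy in proxies]
    heapq.heapify(free_at)

    plan = []
    for _ in range(num_requests):
        send_time, proxy = heapq.heappop(free_at)  # proxy that is free soonest
        plan.append((send_time, proxy))
        heapq.heappush(free_at, (send_time + per_ip_interval, proxy))
    return plan


one_ip = schedule(60, ["192.0.2.1"], per_ip_interval=1.0)
ten_ips = schedule(60, [f"192.0.2.{i}" for i in range(1, 11)], per_ip_interval=1.0)

print("one IP finishes at t =", one_ip[-1][0])
print("ten IPs finish at t =", ten_ips[-1][0])
```

With each address held to one request per second, sixty requests finish at t = 59 s on a single IP and at t = 5 s across ten, while every individual IP stays at the same polite rate.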
### Staying under the limit
A robust scraper handles rate limits in layers. Respect the Retry-After header when you get a 429 and back off exponentially - wait 1s, then 2s, then 4s - instead of retrying instantly. Cap your concurrency so you are not sending fifty parallel requests to a site that allows ten. Spread traffic across many IPs via a rotating proxy pool, and add small random jitter to your timing so the pattern looks organic rather than a metronome. A managed scraping API folds all of this together: it pools IPs, paces requests per target site, and retries on 429 automatically, so you send one call and get the data without tuning backoff by hand. See also throttling and request retries.
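The concurrency cap and the jitter can be sketched together with asyncio. The `fetch` argument is an assumed async callable supplied by the caller (for example a thin wrapper around an HTTP client), and the default limits are placeholders rather than recommended values.

```python
import asyncio
import random


async def fetch_all(urls, fetch, max_concurrency=10, max_jitter=0.05):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with gate:
            # Small random pause so requests do not tick like a metronome.
            await asyncio.sleep(random.uniform(0, max_jitter))
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(url) for url in urls))


# Usage sketch with a stand-in fetcher (no network involved):
async def fake_fetch(url):
    await asyncio.sleep(0.01)
    return f"content of {url}"


pages = asyncio.run(fetch_all([f"https://example.com/{i}" for i in range(5)], fake_fetch))
print(len(pages))
```

The semaphore bounds how many requests are open at once; it does not by itself enforce a requests-per-second rate, so it is usually combined with backoff on 429.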
### Example
```python
import time

import requests


def get_with_backoff(url, max_tries=5):
    delay = 1.0
    for attempt in range(max_tries):
        r = requests.get(url, timeout=10)
        if r.status_code != 429:
            return r
        # Honor Retry-After when it is a number of seconds; it can also be
        # an HTTP date, in which case fall back to our own backoff delay.
        retry_after = r.headers.get('Retry-After', '')
        wait = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2  # exponential backoff
    raise RuntimeError('rate limited after retries')


# At scale, rotate IPs instead of just sleeping - one IP = one allowance.
```
### FAQ
**Q: What is the difference between rate limiting and a 429 error?**
Rate limiting is the policy - the rule that caps requests per time window. A 429 Too Many Requests error is the response a server sends when you break that rule. The 429 often includes a Retry-After header telling you how long to wait before the limit resets.
**Q: How do scrapers get around rate limits without breaking them?**
By spreading load rather than overwhelming one endpoint: distributing requests across a pool of rotating IPs so no single address exceeds its per-IP allowance, capping concurrency, and backing off when a 429 appears. The goal is to stay under each IP's limit, not to overpower the limit itself.
**Q: What algorithms power rate limiting?**
The common ones are token bucket and leaky bucket (which refill or drain at a steady rate), and fixed-window or sliding-window counters (which tally requests over a time span). Token bucket is popular because it allows short bursts while enforcing a steady average rate.
**Q: Why do sites use rate limiting?**
To protect server capacity from overload, ensure fair access so one client cannot starve others, control infrastructure cost, and slow down automated abuse like credential stuffing or aggressive scraping. It is a core reliability tool, not only an anti-bot measure.
---
## What Is a CAPTCHA?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-captcha
**A CAPTCHA is a challenge a website uses to tell a human visitor apart from an automated script.** The name stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It might ask you to type distorted letters, click every image with a traffic light, tick an "I'm not a robot" box, or - increasingly - nothing visible at all, while the page silently scores your behavior in the background. The goal is always the same: let people through while stopping bots, spam, and automated abuse.
### Quick facts
- **Stands for:** Completely Automated Public Turing test to tell Computers and Humans Apart
- **Common types:** Image grids, distorted text, checkbox, slider, invisible/scored
- **Modern examples:** reCAPTCHA v2/v3, hCaptcha, Cloudflare Turnstile, FunCaptcha
- **Returns:** A token the site verifies server-side before allowing access
- **Used to stop:** Spam, credential stuffing, scalping, and automated scraping
### How a CAPTCHA works
A CAPTCHA embeds a small widget in the page, tied to the site by a public site key. When the widget decides you have passed - because you clicked the right images, or because your behavior scored as human - it issues a token: a long, opaque string. The page attaches that token to its next request, and the site's server checks it against the CAPTCHA provider's API before letting the request through. Older CAPTCHAs put a visible puzzle in front of you. Newer ones (reCAPTCHA v3, Cloudflare Turnstile) are invisible: instead of a puzzle they watch signals like mouse movement, timing, browser fingerprint, and IP reputation, then score how human the session looks (reCAPTCHA v3, for example, returns a value from 0.0 to 1.0). A high score passes silently; a low score typically triggers a fallback challenge or a block, depending on how the site is configured.
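The server-side check can be sketched for reCAPTCHA, whose verification endpoint takes the site's secret key and the token and answers with JSON. The helper names and the 0.5 score threshold are assumptions chosen for illustration; other providers use different endpoints and response fields.

```python
import json
import urllib.parse
import urllib.request

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret_key, token, remote_ip=None):
    """Build the POST a site's backend sends to check a token."""
    fields = {"secret": secret_key, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip
    body = urllib.parse.urlencode(fields).encode()
    return urllib.request.Request(VERIFY_URL, data=body, method="POST")


def token_passes(verdict, min_score=0.5):
    """Interpret the provider's JSON verdict; `score` is only present for v3."""
    if not verdict.get("success"):
        return False
    return verdict.get("score", 1.0) >= min_score


def verify(secret_key, token):
    """Send the check. Needs network access and a real secret key."""
    request = build_verify_request(secret_key, token)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return token_passes(json.load(resp))
```

Only `verify` touches the network. The point of the split is that the site, not the browser, makes the final decision: the token is just a claim until the provider confirms it.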
### Types of CAPTCHA
The main families are text CAPTCHAs (type the warped characters), image CAPTCHAs ("select all squares with a bus"), checkbox CAPTCHAs (the reCAPTCHA "I'm not a robot" tick, which is really a behavioral check disguised as a click), slider or puzzle CAPTCHAs (drag a piece into place), and invisible or scored CAPTCHAs that show nothing unless you look suspicious. The trend is away from puzzles a human finds annoying and toward silent scoring, because the puzzle itself was never the real test - the real test is whether the surrounding signals are consistent with a genuine browser session.
### Why CAPTCHAs matter for web scraping
A CAPTCHA is the most visible layer of bot defense, and most non-trivial scraping projects meet one eventually. Hitting one stalls the request until the challenge is resolved. The durable approach is to not trigger it in the first place by behaving like ordinary traffic: use quality residential proxies, send a coherent browser fingerprint, keep request rates moderate, and reuse session cookies so repeated visits share one consistent session rather than appearing as a thousand strangers. When a challenge does appear, a CAPTCHA solver completes it and returns a token. Note the distinction: the CAPTCHA is the test, the solver is the software that completes it. A managed scraping API focuses on the durable side - applying those same measures on its own servers so that fewer challenges appear in the first place.
### Example
```python
import requests

# Let a scraping API fetch the page and handle any CAPTCHA inline
resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={
        'cmd': 'request.get',
        'url': 'https://example.com/protected',
    }
)

# Proxy, fingerprint, and CAPTCHA challenge are resolved server-side
html = resp.json()['solution']['response']
```
### FAQ
**Q: What does CAPTCHA stand for?**
CAPTCHA stands for 'Completely Automated Public Turing test to tell Computers and Humans Apart.' It is named after Alan Turing's idea of a test to distinguish a machine from a person - here automated and run at web scale to filter bots from real visitors.
**Q: What is the difference between a CAPTCHA and a CAPTCHA solver?**
A CAPTCHA is the challenge a website presents - the puzzle or silent behavioral check. A CAPTCHA solver is separate software that completes that challenge automatically and returns a passing token. The site issues the CAPTCHA; the solver answers it.
**Q: What are the main types of CAPTCHA?**
Text CAPTCHAs (read distorted characters), image CAPTCHAs (select matching pictures), checkbox CAPTCHAs (the 'I'm not a robot' tick), slider/puzzle CAPTCHAs, and invisible or scored CAPTCHAs that judge your behavior silently. Modern sites lean heavily on the invisible, scored kind.
**Q: Why am I getting CAPTCHAs when scraping?**
Usually because something in your traffic looks automated: a datacenter IP with no history, a default or mismatched user agent, a non-browser TLS fingerprint, or an inhuman request rate. Fix those signals first - most CAPTCHAs are triggered by them, not by random chance.
---
## What Are Request Retries?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-request-retries
**Request retries are the practice of automatically re-sending an HTTP request that failed, instead of giving up on the first error.** Networks drop packets, servers return temporary 503s, and rate limiters fire 429s - many failures are transient and succeed on a second or third attempt. Retry logic detects which failures are worth retrying, waits a sensible interval (ideally growing each time), and re-sends the request, turning a flaky connection into a reliable one without manual intervention.
### Quick facts
- **Retry on:** Timeouts, connection resets, 429, 500, 502, 503, 504
- **Do not retry on:** Most 4xx (400, 401, 403, 404) - retrying will not help
- **Backoff:** Exponential (1s, 2s, 4s...) plus random jitter
- **Safety:** Safe for idempotent requests (GET); care needed for POST
- **Cap:** A max-attempts limit prevents infinite loops
### How retry logic works
A retry wrapper sits around your HTTP call. When a request fails, it inspects the failure to decide whether retrying could help. Transient errors - a network timeout, a connection reset, or a 5xx server error - are worth retrying because the cause is temporary. A 429 rate-limit is retriable too, after waiting for the limit to reset. Most 4xx client errors are not: a 404 means the page is not there, and a 401 means you are not authenticated, so re-sending the identical request just wastes attempts. When a retry is warranted, the wrapper waits, then re-sends, up to a maximum number of attempts before it gives up and raises the error.
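The decision step described above can be isolated into one small function. This is a sketch under the assumption that network failures surface as Python's built-in `TimeoutError` and `ConnectionError` families; HTTP libraries often raise their own exception types, which would need mapping first. `should_retry` is a hypothetical helper name.

```python
RETRIABLE_STATUS = {429, 500, 502, 503, 504}


def should_retry(status=None, error=None):
    """Decide whether re-sending the same request could plausibly help."""
    if error is not None:
        # Timeouts and connection resets are transient network failures.
        return isinstance(error, (TimeoutError, ConnectionError))
    # 404, 401 and other client errors fail the same way on every attempt.
    return status in RETRIABLE_STATUS


print(should_retry(status=503))
print(should_retry(status=404))
print(should_retry(error=ConnectionResetError()))
```

Keeping the classification separate from the waiting and re-sending makes it easy to adjust the retriable set per target without touching the loop.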
### Exponential backoff and jitter
Retrying instantly is the wrong move - if a server is overloaded, an immediate retry adds to the load that caused the failure. The standard pattern is exponential backoff: wait 1 second, then 2, then 4, then 8, doubling the delay each attempt so a struggling server gets room to recover. On top of that, add jitter - a small random offset to each wait - so that many clients retrying at once do not all fire again at the exact same instant (the "thundering herd" problem). When the server sends a Retry-After header, honor it: it tells you exactly how long to wait, which beats any guess. This same backoff logic underpins polite rate-limit handling.
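Both halves of this section fit in two small helpers: one computes the wait for a given attempt, the other reads a Retry-After value, which can be either a number of seconds or an HTTP date. The function names, the 60-second cap, and the half-second jitter are assumptions for illustration.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5, rng=random):
    """Exponential delay for a 0-based attempt, plus a small random offset."""
    return min(cap, base * 2 ** attempt) + rng.uniform(0, jitter)


def parse_retry_after(value, now=None):
    """Seconds to wait for a Retry-After value, or None if it is unreadable."""
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # caller should fall back to backoff_delay()
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

A caller would typically prefer `parse_retry_after` when the header is present and readable, and use `backoff_delay` otherwise.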
### Why retries matter for web scraping
At scale, transient failures are not the exception - they are constant. A job fetching a million pages will hit thousands of timeouts and temporary blocks purely by volume, and a scraper without retries simply loses that data. Good retry logic recovers it automatically. The one thing to watch is idempotency: a GET is safe to retry because re-fetching a page has no side effect, but retrying a POST that places an order could place it twice, so non-idempotent requests need care (an idempotency key, or no retry). A managed scraping API builds retries in - it detects soft failures like a block page returned with a 200 status, rotates to a fresh IP, and retries transparently, so you get the data instead of an error. See also self-healing scrapers.
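The idempotency-key idea is easiest to see against a simulated server. Everything below is hypothetical: `FakeOrderServer` stands in for an API that deduplicates writes by key, and whether a real API supports such a key (often sent as an `Idempotency-Key` header) depends on that API.

```python
import uuid


class FakeOrderServer:
    """Stand-in for an API that deduplicates writes by idempotency key."""

    def __init__(self):
        self.orders = {}  # idempotency key -> order id
        self.calls = 0

    def place_order(self, item, idempotency_key):
        self.calls += 1
        if idempotency_key not in self.orders:
            self.orders[idempotency_key] = f"order-{len(self.orders) + 1}"
        order_id = self.orders[idempotency_key]
        if self.calls == 1:
            # Simulate the response being lost AFTER the order was recorded.
            raise TimeoutError("response lost")
        return order_id


def place_order_with_retry(server, item, max_tries=3):
    key = str(uuid.uuid4())  # generated once, reused for every attempt
    for _ in range(max_tries):
        try:
            return server.place_order(item, idempotency_key=key)
        except TimeoutError:
            continue  # safe to retry: the server recognizes the key
    raise RuntimeError("order not confirmed")


server = FakeOrderServer()
print(place_order_with_retry(server, "widget"))
print(len(server.orders))
```

The first call times out after the order is stored, the retry reuses the same key, and the server returns the existing order instead of creating a second one. Generating a fresh key per attempt would defeat the purpose.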
### Example
```python
import random
import time

import requests

RETRIABLE = {429, 500, 502, 503, 504}


def fetch(url, max_tries=4):
    for attempt in range(max_tries):
        try:
            r = requests.get(url, timeout=10)
            if r.status_code not in RETRIABLE:
                return r  # success or a non-retriable error
        except requests.RequestException:
            pass  # timeout / connection reset - retry
        sleep = (2 ** attempt) + random.uniform(0, 0.5)  # backoff + jitter
        time.sleep(sleep)
    raise RuntimeError(f'failed after {max_tries} attempts')
```
### FAQ
**Q: Which HTTP errors should I retry?**
Retry transient failures: network timeouts, connection resets, and 5xx server errors (500, 502, 503, 504), plus 429 rate limits after a wait. Do not retry most 4xx client errors (400, 401, 403, 404) - they signal a problem with the request itself that re-sending will not fix.
**Q: What is exponential backoff?**
Exponential backoff is a retry strategy where the wait between attempts doubles each time - 1s, 2s, 4s, 8s - instead of staying constant. It gives an overloaded server progressively more time to recover and prevents a tight retry loop from making the problem worse.
**Q: Why add jitter to retries?**
Jitter is a small random delay added to each backoff interval. Without it, many clients that failed at the same moment would all retry at the same moment, hammering the server in synchronized waves (the 'thundering herd'). Jitter spreads those retries out over time.
**Q: Is it safe to retry any request?**
Idempotent requests like GET are safe to retry because repeating them has no side effect. Non-idempotent requests like a POST that creates a record need care - a blind retry could duplicate the action. Use an idempotency key, or avoid retrying writes, to stay safe.
---
## What Is a Web Unblocker?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-web-unblocker
**A web unblocker is a managed service that sits between your scraper and a target site, automatically handling the proxies, browser rendering, and verification needed to retrieve a public page reliably.** Instead of you assembling a proxy pool, a fingerprint engine, and CAPTCHA handling yourself, you send a URL to the unblocker and it returns the page content - choosing the right IP, rendering the page, handling verification, and retrying on failure behind the scenes. It is the "just give me the HTML" layer of a modern scraping stack.
### Quick facts
- **Also called:** Unblocker, scraping proxy, web data API, unlocker
- **Handles:** Proxy selection, fingerprinting, challenges, retries, geo-targeting
- **Interface:** Send a URL, get HTML/JSON back - one request
- **Billing:** Usually per successful request or per GB of traffic
- **Best for:** Reliably reaching well-defended public pages at scale
### How a web unblocker works
A web unblocker exposes a single endpoint. You send it a target URL (and optional settings like country or whether to render JavaScript), and it orchestrates everything required to fetch that page successfully. Internally it picks an appropriate IP from a large pool - residential or mobile for hard targets - attaches a coherent browser fingerprint that matches that IP, sends correct headers and a realistic user agent, and where needed renders the page in a real browser. If the page returns a verification step, the unblocker handles it; if a request soft-fails (an error page returned with a 200 status), it rotates and retries. You get back the finished content and never see the machinery.
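The soft-failure step is the least obvious part, so here is a rough sketch of what detecting one might look like. The marker phrases, the length threshold, and both function names are assumptions for illustration; a real unblocker's detection is proprietary and far more target-specific.

```python
# Hypothetical phrases that tend to appear on block or verification pages.
BLOCK_MARKERS = ("verify you are human", "access denied", "unusual traffic")


def looks_blocked(status, html, min_length=500):
    """Heuristic: True when a response is a block page rather than content."""
    if status in (403, 429, 503):
        return True
    text = html.lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        return True
    # A 200 with almost no body is often an interstitial, not the real page.
    return status == 200 and len(html) < min_length


def fetch_with_rotation(url, fetchers):
    """Try each fetcher (for example, one per proxy) until one returns real content.

    Each fetcher is an assumed callable: url -> (status, html).
    """
    for fetch in fetchers:
        status, html = fetch(url)
        if not looks_blocked(status, html):
            return html
    raise RuntimeError("every route was blocked")
```

A heuristic like this can misfire on genuine pages that happen to contain a marker phrase, which is one reason tuning it per target is ongoing work.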
### Unblocker vs raw proxies
A raw proxy gives you a different IP and nothing else - you still own the hard parts: rotating addresses, matching fingerprints to each one, handling verification, detecting soft failures, and retrying. That works, but it is a substantial engineering project to build and maintain as defenses change. A web unblocker bundles all of those concerns behind one request and keeps them updated as target sites evolve. The trade-off is control versus convenience: raw proxies are cheaper per GB and fully under your control, while an unblocker costs more but removes the maintenance burden and usually delivers a higher success rate on heavily-protected public pages. Many teams use cheap datacenter proxies for easy targets and an unblocker only for the harder ones.
### Where an unblocker fits
The term "web unblocker" is a generic product category, not a single brand - several vendors offer one under names like unblocker, unlocker, or web data API. They all solve the same problem: turning the messy, ever-shifting work of reliably retrieving a public page into a stable API call. Scrappey is this kind of service. You send a URL to one endpoint and it returns the HTML or parsed JSON, having handled proxy rotation, fingerprinting, JavaScript rendering, and verification in a single call - so your code focuses on using the data, not on the retrieval plumbing.
### Example
```python
import requests

# One call: the unblocker picks the IP, matches a fingerprint,
# renders JS, resolves any challenge, and retries on soft failure.
resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={
        'cmd': 'request.get',
        'url': 'https://example.com/protected',
        'autoparse': True,
    },
    timeout=120,  # rendering and retries can take a while; fail rather than hang
)
resp.raise_for_status()  # surface HTTP-level errors before parsing
data = resp.json()['solution']['response']  # finished content
```
### FAQ
**Q: What is the difference between a web unblocker and a proxy?**
A proxy only changes your IP address - you still handle fingerprinting and retries yourself. A web unblocker bundles the proxy with fingerprint matching, JavaScript rendering, and automatic retries behind one request, so you send a URL and get back the page content.
**Q: Is a web unblocker the same as a web scraping API?**
They overlap heavily. 'Web unblocker' emphasizes the access layer - reliably retrieving a public page. A 'web scraping API' often adds parsing, structured output, and rendering on top. In practice a full scraping API includes unblocker functionality; the terms are frequently used interchangeably.
**Q: When should I use a web unblocker instead of raw proxies?**
Use raw proxies for easy targets where you only need a different IP and want the lowest cost. Reach for an unblocker on heavily-protected public pages where building and maintaining your own fingerprinting and retry logic would cost more engineering time than the service does.
**Q: How is a web unblocker billed?**
Most unblockers bill per successful request or per gigabyte of traffic, and many only charge for requests that actually return usable content. This aligns cost with results, unlike raw proxies which typically bill for all bandwidth regardless of whether a request succeeded.
---
## What Is a CSS Selector?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-a-css-selector
**A CSS selector is a pattern that picks out specific elements in an HTML document by matching their tag, class, id, attributes, or position.** Originally built to apply styles in CSS, selectors are now the most common way scrapers locate the data they want on a page. For example, .price matches every element with class "price"; #main matches the element with id "main"; and div.product > a[href] matches links that are direct children of a product div. Parsing libraries accept the same core selector syntax that browsers do.
### Quick facts
- **Targets by:** Tag, class (.), id (#), attribute ([ ]), and position
- **Comes from:** CSS styling - reused for element selection in scraping
- **Used in:** BeautifulSoup (select), Playwright, Puppeteer, Cheerio, Scrapy
- **Combinators:** descendant (space), child (>), sibling (+, ~)
- **Alternative:** XPath - more powerful, more verbose
### CSS selector syntax
Selectors are built from a few primitives. A bare tag name (a, div, h2) matches elements of that type. A dot prefixes a class (.product), a hash prefixes an id (#header), and square brackets match attributes ([data-id] for presence, a[href^="https"] for a value that starts with "https"). You combine these to narrow the match: div.card h3 means an h3 anywhere inside a div with class "card", while div.card > h3 means an h3 that is a direct child. Sibling combinators (+ for the next element, ~ for any later sibling) and pseudo-classes (:first-child, :nth-child(2)) handle position. Chaining without a space (a.btn.primary) requires all conditions on one element.
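The primitives above can be tried side by side on a small document. This is a minimal sketch using BeautifulSoup's select(); the HTML snippet and the `texts` helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: one h3 is a direct child of the card, one is nested deeper.
html = (
    '<div class="card">'
    '<h3>Direct child</h3>'
    '<section><h3>Nested deeper</h3></section>'
    '<a class="btn primary" href="https://example.com/buy" data-id="7">Buy</a>'
    '<a class="btn" href="/help">Help</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element the selector matches, in document order."""
    return [el.get_text() for el in soup.select(selector)]


descendants = texts("div.card h3")         # h3 anywhere inside the card
children = texts("div.card > h3")          # only h3 that is a direct child
both_classes = texts("a.btn.primary")      # both classes on one element
https_links = texts('a[href^="https"]')    # attribute value starts with "https"
has_data_id = texts("a[data-id]")          # attribute is present
next_sibling = texts("a.primary + a")      # element immediately after a.primary
second_child = texts("div.card > :nth-child(2)")  # second child element

print(descendants, children, both_classes, next_sibling, second_child)
```

Comparing `descendants` with `children` shows the difference between the space and `>` combinators: the first picks up both headings, the second only the one that sits directly under the card.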
### How scrapers use CSS selectors
Extraction is two steps: fetch the HTML, then query it with selectors. In Python, BeautifulSoup's select() and select_one() take a CSS selector and return matching elements; Scrapy and parsel offer .css(); in Node, Cheerio mirrors jQuery's $('.price'); and browser automation tools like Playwright and Puppeteer use selectors to both find and interact with elements (page.click('button.submit')). The skill is writing selectors that are specific enough to grab the right data but loose enough to survive small markup changes - anchoring on a stable class or data attribute rather than a brittle chain of div > div > div that breaks the moment the layout shifts.
### CSS selectors vs XPath
CSS selectors and XPath are the two query languages for HTML, and most tools support both. CSS selectors are shorter and more readable for the common cases - selecting by class, id, or attribute - which is why they are the default for most scraping. XPath is more powerful: it can walk back up to a parent, select an element by its text content, or apply functions, none of which plain CSS can do. A practical rule: reach for CSS first because it is cleaner, and drop to XPath when you need to match on text or navigate upward through the tree. Getting selectors right matters more than which language you pick - and a parsing layer that returns structured fields, like an autoparse step, removes the need to hand-write selectors for common page types.
### Example
```python
from bs4 import BeautifulSoup
html = '<div class="product"><a href="/p/1">Widget</a>' \
'<span class="price">$19.99</span></div>'
soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('div.product > a').get_text() # 'Widget'
price = soup.select_one('.price').get_text() # '$19.99'
link = soup.select_one('a[href^="/p/"]')['href'] # '/p/1'
print(name, price, link)
```
### FAQ
**Q: What is the difference between a CSS selector and XPath?**
Both locate elements in HTML. CSS selectors are shorter and read more cleanly for selecting by tag, class, id, or attribute. XPath is more powerful - it can select by text content, navigate to a parent element, and use functions - but it is more verbose. Most scrapers use CSS by default and XPath when they need its extra reach.
**Q: How do I select an element by class with a CSS selector?**
Prefix the class name with a dot: '.price' matches every element with class 'price'. To require a tag too, write 'span.price'. To require multiple classes on one element, chain them with no space: '.btn.primary' matches elements that have both classes.
**Q: Which scraping libraries support CSS selectors?**
Most of them: BeautifulSoup (select/select_one), Scrapy and parsel (.css()), Cheerio in Node, and browser automation tools like Playwright, Puppeteer, and Selenium. The syntax is the same one browsers use, so a selector you test in DevTools generally works in your scraper - provided the element exists in the HTML your scraper actually receives, not only in the JavaScript-rendered DOM.
**Q: What makes a CSS selector robust?**
Anchor on stable hooks - a meaningful class name, an id, or a data attribute - rather than a long chain of generic tags like 'div > div > div', which breaks the moment the layout changes. Specific enough to match only your target, loose enough to survive minor markup edits.
---
## What Is an XPath Selector?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-an-xpath-selector
**XPath (XML Path Language) is a query language for navigating the tree structure of an HTML or XML document to select elements by their path, attributes, or text content.** Where a CSS selector matches patterns, XPath describes a route through the document: //div[@class="price"] selects every div whose class attribute is exactly "price" anywhere in the tree (contains(@class,"price") is the usual looser match when an element carries several classes), and //a[contains(text(),"Next")] selects links whose visible text contains "Next" - something plain CSS cannot do. It is the more powerful of the two main element-selection languages used in scraping.
### Quick facts
- **Stands for:** XML Path Language
- **Selects by:** Path, attribute, text content, and position
- **Starts with:** // (anywhere) or / (absolute from root)
- **Unique powers:** Match by text, walk to parent/ancestor, use functions
- **Used in:** lxml, Scrapy, Selenium, Playwright, Puppeteer
### XPath syntax
An XPath expression reads as a path through the document tree. // means "search anywhere from here," while a single / steps to a direct child. //div selects every div; //div[@class="card"] filters by attribute; //ul/li[1] takes the first li child of a ul (XPath indexes from 1, not 0). Predicates in square brackets are the heart of XPath: [@id="main"], [contains(@class,"btn")], and [text()="Buy now"] all filter the current matches. Axes let you move in any direction - /parent::, /following-sibling::, /ancestor:: - so you can select an element and then climb to its container, which CSS cannot express. Functions like contains(), starts-with(), and normalize-space() handle messy real-world markup.
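The pieces of syntax above can be checked against a small document. This is a minimal sketch with lxml; the markup is invented for illustration, and the button text carries stray whitespace on purpose to show normalize-space().

```python
from lxml import html

# Invented markup for illustration only.
doc = html.fromstring(
    '<div id="main">'
    '<ul><li class="item first">Alpha</li><li class="item">Beta</li></ul>'
    '<div class="card"><button class="btn btn-buy">  Buy now </button></div>'
    '</div>'
)

# // searches at any depth; /text() reads the text nodes of the matches
all_items = doc.xpath('//li/text()')

# Positions start at 1, so li[1] is the first li child of the ul
first_item = doc.xpath('//ul/li[1]/text()')

# contains() matches part of an attribute value, useful with several classes
buttons = doc.xpath('//button[contains(@class, "btn")]')

# normalize-space() trims and collapses whitespace before comparing
buy_buttons = doc.xpath('//button[normalize-space()="Buy now"]')

# Axes: climb from the button to its parent div, or step to later siblings
container_class = doc.xpath('//button/parent::div/@class')
after_first = doc.xpath('//li[1]/following-sibling::li/text()')

print(all_items, first_item, container_class, after_first)
```

The parent:: step is the part with no counterpart in the selector syntax covered earlier: it starts from the button and selects the element that contains it.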
### How scrapers use XPath
In Python, lxml and Scrapy expose .xpath() on a parsed document, returning matching nodes you can read text or attributes from. Selenium and Playwright accept XPath to locate elements for clicking or reading. The classic use case is data that has no clean class to grab - a price that sits in an unlabeled span next to a label, where the only reliable anchor is "the span that follows the element containing the text 'Price'." XPath expresses that directly: //*[text()="Price"]/following-sibling::span[1], where [1] keeps only the nearest following span. That ability to select relative to text and to walk the tree in any direction is why XPath survives on pages where CSS selectors run out of road.
### XPath vs CSS selectors
XPath and CSS selectors target the same elements but trade off differently. CSS is shorter, more readable, and faster to write for the common cases (by class, id, attribute), which is why it is the default. XPath wins when you need to select by visible text, navigate upward to a parent or ancestor, or apply a function - capabilities CSS simply lacks. The cost is verbosity and a steeper learning curve. A pragmatic workflow uses CSS for the 80% of selections that are straightforward and switches to XPath for the awkward 20% where text-matching or tree-walking is the only stable hook. As with CSS, brittle absolute paths (/html/body/div[3]/div[2]) break on any layout change - anchor on attributes or text instead. A managed parsing step can also return structured fields directly, sparing you hand-written selectors for common page types.
### Example
```python
from lxml import html
doc = html.fromstring(
    '<div class="product"><span>Price</span>'
    '<span class="value">$19.99</span></div>'
)
# Select by attribute
price = doc.xpath('//span[@class="value"]/text()')[0] # '$19.99'
# Select relative to text - XPath can do this, CSS cannot
next_to_label = doc.xpath('//span[text()="Price"]/following-sibling::span/text()')[0]
print(price, next_to_label)
```
### FAQ
**Q: What is XPath used for in web scraping?**
XPath locates elements in an HTML document so a scraper can extract their text or attributes. It is especially useful when there is no clean class or id to target - for example selecting an element by its visible text, or selecting a value relative to a nearby label, which CSS selectors cannot express.
**Q: What is the difference between // and / in XPath?**
A double slash (//) searches anywhere in the document from the current node, so //div finds every div at any depth. A single slash (/) selects a direct child one level down, so /html/body/div matches only a div that is an immediate child of body. Use // for flexible matching, / for precise paths.
**Q: Is XPath better than CSS selectors?**
Neither is universally better. XPath is more powerful - it can match by text, walk to a parent, and use functions - while CSS selectors are shorter and easier to read for common selections. Most scrapers use CSS by default and reach for XPath only when they need its extra capabilities.
**Q: Why does my XPath break when the site changes?**
Usually because it relies on an absolute positional path like /html/body/div[3]/div[2], which depends on the exact layout. Any markup change shifts the positions and the path no longer matches. Anchor instead on stable attributes or text content, e.g. //div[@class='price'], so the expression survives layout edits.
---
## What Is JavaScript Rendering?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-javascript-rendering
**JavaScript rendering is the process of executing a page's JavaScript in a real browser engine so that content built on the client side appears before you extract it.** Many modern sites send a near-empty HTML shell and then build the visible page with JavaScript - fetching data, assembling the DOM, and injecting content after load. A plain HTTP request only sees the empty shell. Rendering runs that JavaScript (in a headless browser) so the finished page exists to scrape, the same way it would in a normal browser.
### Quick facts
- **Needed when:** Content is built client-side (SPAs, lazy-loaded data)
- **Runs in:** A headless browser engine (Chromium, Firefox, WebKit)
- **Tools:** Playwright, Puppeteer, Selenium, or a rendering API
- **Cost:** Slower and heavier than a raw HTTP request
- **Faster alternative:** Call the underlying JSON/XHR API directly when one exists
### How JavaScript rendering works
A rendering scraper launches a real browser engine - usually headless Chromium via Playwright or Puppeteer - and points it at the URL. The browser does what any browser does: downloads the HTML, runs the scripts, makes the background XHR/fetch calls those scripts trigger, and builds the final DOM. The scraper then waits for the content to be ready (a specific element to appear, or the network to go idle) and reads the rendered HTML. This is heavier than a raw request because you are running an entire browser per page - more CPU, more memory, more time - but it is the only way to see content that does not exist until JavaScript creates it.
### When you actually need rendering
Rendering is essential for client-side apps, but it is overused. The test is simple: fetch the raw HTML and look for your data. If it is already in the response - many sites server-render or embed data in a `<script>` JSON blob - you do not need a browser at all, and a plain HTTP request is an order of magnitude faster and cheaper. If the raw HTML is an empty shell and your data only appears after scripts run, you need rendering. A middle path is often best: open the network tab, find the JSON API the page's JavaScript is calling, and request that endpoint directly. Calling the underlying API skips the browser entirely and returns clean structured data.
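The "look in the raw HTML first" test can be sketched with the standard library alone. The example below assumes the page embeds its data in a script tag with type="application/json", which is one common pattern but not the only one; the sample HTML, the `JsonScriptFinder` class, and the `embedded_json` helper are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonScriptFinder(HTMLParser):
    """Collect the contents of <script type="application/json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self.blobs.append("")

    def handle_data(self, data):
        if self._in_json_script:
            self.blobs[-1] += data  # script text may arrive in several chunks

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_script = False


def embedded_json(raw_html):
    """Return every JSON document embedded in the raw (unrendered) HTML."""
    finder = JsonScriptFinder()
    finder.feed(raw_html)
    return [json.loads(blob) for blob in finder.blobs if blob.strip()]


# Sample response body: an empty app shell plus an embedded data blob.
raw_html = (
    '<html><body><div id="root"></div>'
    '<script type="application/json">'
    '{"product": {"name": "Widget", "price": 19.99}}'
    '</script></body></html>'
)

data = embedded_json(raw_html)
needs_browser = not data  # no embedded data found: rendering may be required
print(data, needs_browser)
```

If `embedded_json` returns the fields you need, the page can be scraped with a plain HTTP request even though its visible markup is an empty shell.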
### Rendering and anti-bot detection
Running a headless browser solves the content problem but creates a detection one. Automation frameworks leak signals - a navigator.webdriver flag set to true, missing or inconsistent browser properties, a TLS or browser fingerprint that does not match the user agent the browser claims - and headless detection looks for exactly these. So rendering at scale is not just "run a browser"; it means running a browser whose fingerprint is coherent, behind a residential IP, at a human pace. That coherence is hard to maintain across thousands of sessions, which is why a managed scraping API that renders JavaScript and matches a believable fingerprint in the same call is often more reliable than a self-hosted headless fleet.
### Example
```python
# Render in a headless browser to get client-side content
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/app')
    page.wait_for_selector('.product')  # wait until JS injects the data
    html = page.content()               # fully rendered DOM
    browser.close()
# Or send the URL to a scraping API with render=true and skip the fleet.
```
### FAQ
**Q: When do I need JavaScript rendering to scrape a page?**
When the data you want is not in the raw HTML response and only appears after the page's JavaScript runs - typical of single-page apps and lazy-loaded content. If the data is already present in the initial HTML, you do not need rendering and a plain HTTP request is faster and cheaper.
**Q: What is the difference between rendering and a plain HTTP request?**
A plain HTTP request downloads the HTML the server sends and stops. JavaScript rendering also runs that HTML in a real browser engine, executing scripts and background API calls so client-side content is built before you read it. Rendering is far heavier but sees content a raw request never will.
**Q: Is there a faster alternative to rendering JavaScript?**
Often yes: find the JSON API the page's JavaScript calls (visible in the browser network tab) and request that endpoint directly. It skips the browser entirely and returns clean structured data, which is faster and lighter than rendering - when such an endpoint exists and is accessible.
**Q: Does running a headless browser get me blocked?**
It can, because automation leaves signals like navigator.webdriver=true and fingerprint inconsistencies that headless-detection systems look for. Rendering reliably at scale requires a coherent fingerprint, a residential IP, and human-like pacing - not just launching a browser in headless mode.
---
## What Are Regular Expressions (Regex)?
URL: https://scrappey.com/qa/web-scraping-apis/what-are-regular-expressions
**A regular expression (regex) is a compact pattern that describes a set of strings, used to find, match, and extract text.** The pattern \d{3}-\d{4} matches a phone-number fragment like "555-1234"; [\w.-]+@[\w.-]+ matches an email-shaped string. In web scraping, regex is the right tool for pulling structured fragments - prices, dates, IDs, emails - out of text you have already extracted, and the wrong tool for parsing HTML structure itself, which selectors handle.
### Quick facts
- **Matches:** Text patterns: digits, words, repetition, position
- **Core tokens:** \d digit, \w word char, . any, + one-or-more, * zero-or-more
- **Capture:** Parentheses ( ) capture the matched part for extraction
- **Good for:** Prices, dates, emails, IDs, cleaning extracted text
- **Bad for:** Parsing HTML tree structure - use CSS/XPath instead
### Regex basics
A regex is built from literal characters plus metacharacters that mean "a class of characters" or "a repetition." Character classes: \d is any digit, \w any word character (letter, digit, underscore), \s whitespace, and . any character except a newline by default; square brackets define a custom class like [A-Z]. Quantifiers control repetition: + means one or more, * zero or more, ? optional, and {2,4} a specific count range. Anchors ^ and $ tie the match to the start or end of the string (or of each line in multiline mode). Parentheses create capture groups - the part of the match you actually want to pull out. So price:\s*\$(\d+\.\d{2}) finds "price: $19.99" and captures just "19.99".
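Each building block above can be exercised with Python's re module. The sample line and the variable names below are made up for illustration.

```python
import re

line = "SKU AB-1234  price: $19.99"

# Character class + quantifier: every run of one or more digits
digits = re.findall(r"\d+", line)

# Custom class with an exact count: two capitals, a hyphen, four digits
sku = re.search(r"[A-Z]{2}-\d{4}", line).group()

# ? makes the preceding token optional
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colr")]

# Anchors: ^ ties the match to the start, $ to the end
starts_with_sku = bool(re.search(r"^SKU", line))
ends_with_cents = bool(re.search(r"\.\d{2}$", line))

# Capture group: match the whole "price: $19.99" but keep only the number
price = re.search(r"price:\s*\$(\d+\.\d{2})", line).group(1)

print(digits, sku, optional_u, starts_with_sku, ends_with_cents, price)
```

Note that `\d+` splits "19.99" into two matches because the dot is not a digit; the capture-group pattern is what keeps the price in one piece.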
### How regex fits into web scraping
The clean division of labor is: use CSS selectors or XPath to navigate the HTML and grab the right element, then use regex on that element's text to extract the precise value. You select the .price span, then regex pulls the number out of "Only $19.99 today!". Regex also shines at cleanup - stripping whitespace, normalizing phone formats, splitting a combined string - and at scanning free-form text where there is no element boundary at all, like finding every email or URL in a blob of page copy. Used this way, on text rather than on markup, regex is fast, dependency-free, and exactly the right tool.
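The text-side jobs described above (pulling a value out of selected text, cleaning it up, and scanning free-form copy) could look like the sketch below. The helper names and patterns are illustrative assumptions; the email pattern in particular is a loose "email-shaped" match, not a validator.

```python
import re


def extract_price(text):
    """Pull the first dollar amount out of already-selected element text."""
    match = re.search(r"\$(\d+(?:\.\d{2})?)", text)
    return float(match.group(1)) if match else None


def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_phone(raw):
    """Reduce a formatted phone number to its digits."""
    return re.sub(r"\D", "", raw)


def find_emails(blob):
    """Scan free-form copy for email-shaped strings."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", blob)


price = extract_price("Only $19.99 today!")
title = collapse_whitespace("  Blue \n   Widget\t(large) ")
phone = normalize_phone("(555) 123-4567")
emails = find_emails("Write to sales@example.com or support@example.org.")

print(price, title, phone, emails)
```

Each helper takes plain text as input, which reflects the division of labor: a selector finds the element, and the regex only ever sees that element's text.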
### When not to use regex
The classic mistake is trying to parse HTML structure with regex - matching tags, walking nested elements, extracting an attribute by pattern-matching the raw markup. HTML is not a regular language: it nests arbitrarily, tags can be malformed, attributes reorder, and whitespace varies, so a regex that works on one page silently breaks on the next. Use a real HTML parser (which builds a proper tree) for structure, and reserve regex for the leaf-level text inside the elements that parser hands you. If your regex is matching <div or href=, that is the signal to switch to a selector. Keeping regex on the text side and selectors on the structure side keeps both robust.
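The failure mode is easy to reproduce. In this sketch the same link is written two ways, differing only in attribute order and quoting; a regex written against the first form misses the second, while a parser-based lookup handles both. The markup and the `LinkCollector` class are made up, and the standard-library HTMLParser stands in for whichever real parser you use.

```python
import re
from html.parser import HTMLParser

# Two renderings of the same link; only attribute order and quoting differ.
page_a = '<a href="/p/1" class="product">Widget</a>'
page_b = "<a class='product' href='/p/1'>Widget</a>"

# Brittle: pattern-matches the raw markup of page_a.
brittle = re.compile(r'<a href="([^"]+)" class="product">')


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that carry the class "product"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "product" in classes:
            self.links.append(attrs.get("href"))


def parse_links(markup):
    collector = LinkCollector()
    collector.feed(markup)
    return collector.links


regex_a = brittle.findall(page_a)
regex_b = brittle.findall(page_b)   # silently empty: the markup changed shape
parser_a = parse_links(page_a)
parser_b = parse_links(page_b)

print(regex_a, regex_b, parser_a, parser_b)
```

The regex does not raise an error on the second page; it just returns nothing, which is what makes this kind of breakage hard to notice in production.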
### Example
```python
import re
text = 'Contact: sales@example.com - Price: $19.99 (was $24.99)'
email = re.search(r'[\w.-]+@[\w.-]+\.\w+', text).group() # 'sales@example.com'
price = re.search(r'\$(\d+\.\d{2})', text).group(1) # '19.99' (captured)
all_prices = re.findall(r'\$\d+\.\d{2}', text) # ['$19.99', '$24.99']
print(email, price, all_prices)
```
### FAQ
**Q: Should I use regex to parse HTML?**
No. HTML nests arbitrarily and varies in formatting, so regex patterns that match markup break easily and unpredictably. Use a real HTML parser with CSS selectors or XPath to navigate structure, then apply regex only to the text inside the elements you have extracted.
**Q: What is a capture group in regex?**
A capture group is a part of a pattern wrapped in parentheses, marking the substring you want to pull out of a larger match. In '\$(\d+\.\d{2})' the parentheses capture just the number, so matching '$19.99' lets you retrieve '19.99' separately from the dollar sign.
**Q: What is regex good for in web scraping?**
Extracting structured fragments from text you have already selected - prices, dates, phone numbers, emails, IDs - and cleaning or normalizing that text. It is also useful for scanning free-form copy where there is no element boundary, such as finding every URL in a block of text.
**Q: What do \d, \w, and + mean in a regex?**
\d matches any digit (0-9), \w matches any word character (letter, digit, or underscore), and + means 'one or more of the preceding token'. So \d+ matches one or more digits, like '42' or '1000'. These are among the most common building blocks of a pattern.
---
## What Is OCR in Web Scraping?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-ocr-web-scraping
**OCR (optical character recognition) is technology that converts text shown inside an image into machine-readable text characters.** Some data on the web is not real text - it is a picture of text: a price baked into a product image, a phone number rendered as a graphic to deter scraping, a scanned document, or a chart label. A normal scraper sees only an image file and no characters. OCR reads the pixels, recognizes the letters and numbers, and outputs them as a string you can store, search, and process.
### Quick facts
- **Stands for:** Optical Character Recognition
- **Converts:** Text-in-an-image into selectable, machine-readable text
- **Common engines:** Tesseract, plus cloud and vision-model OCR services
- **Used for:** Scanned PDFs, image-rendered text, charts, screenshots
- **Accuracy depends on:** Image resolution, contrast, font, and layout
### How OCR works
An OCR engine takes an image and runs it through several stages. First it preprocesses the image - converting to grayscale, increasing contrast, deskewing, and removing noise - so the characters stand out cleanly. Then it segments the image into regions, lines, words, and individual character shapes. Finally it classifies each shape against learned character models and assembles the result into text, often with confidence scores per character. Classic engines like Tesseract use trained character recognition; modern OCR increasingly uses neural networks and vision-language models that read messy, real-world images - skewed receipts, stylized fonts, text over busy backgrounds - far more accurately than older template-based methods. Output quality tracks input quality: a crisp, high-resolution image with good contrast reads near-perfectly, while a low-res or cluttered one produces errors.
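To make the preprocessing stage concrete, here is a toy sketch of contrast stretching and binarization on a tiny grayscale grid (nested lists of 0-255 values). It only illustrates what those two steps do to pixel values; a real pipeline would normally rely on an imaging library and the OCR engine's own preprocessing, and the function names and threshold here are assumptions.

```python
def stretch_contrast(pixels):
    """Rescale values so the darkest pixel becomes 0 and the brightest 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to stretch
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Map each pixel to 0 (ink) or 255 (background)."""
    return [[0 if v < threshold else 255 for v in row] for row in pixels]


# A faint vertical stroke (value 100) on a grey background (value 140).
image = [
    [140, 100, 140],
    [140, 100, 140],
    [140, 100, 140],
]

clean = binarize(stretch_contrast(image))
for row in clean:
    print(row)
```

Before stretching, the stroke and the background differ by only 40 levels; afterwards they sit at the two extremes, which is the "characters stand out cleanly" effect the segmentation stage depends on.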
### Why OCR matters for web scraping
OCR fills the gap where data exists on the page but not as text. Common cases: e-commerce sites that render prices or specs as images specifically so simple scrapers cannot read them; contact details shown as graphics; scanned PDFs and document archives with no text layer; infographics and charts where the numbers live only in the picture; and screenshots captured during a scrape. Without OCR, all of that is invisible to extraction logic. With it, a scraper can pull the value out of the image and treat it like any other field. Pair OCR with a screenshot step - render the page, capture the relevant region, and OCR it - and you can recover data that resists every text-based selector.
### OCR in a scraping pipeline
OCR is a post-processing step, not a fetch step. The flow is: retrieve the page and its images (rendering with JavaScript if the images load client-side), identify the image regions that hold text, pass those to an OCR engine, then validate and clean the output - OCR commonly confuses "0" with "O" and "1" with "l", so numeric fields deserve a sanity check. Reach for OCR only when the data genuinely isn't available as text; if a value exists as real characters anywhere in the HTML or an underlying API, extract that instead, because it is exact and far cheaper than recognizing pixels. A managed scraping API that handles rendering and screenshots in the same call makes the capture half of an OCR pipeline straightforward, leaving you just the recognition step.
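The validate-and-clean step could be sketched like this. The look-alike map is partial and illustrative, the price bounds are arbitrary assumptions you would tune per field, and `clean_price` is a hypothetical helper rather than part of any OCR library.

```python
import re

# Characters OCR often returns in place of digits (illustrative, not exhaustive).
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def clean_price(ocr_text, min_price=0.01, max_price=10_000.0):
    """Extract a dollar amount from OCR output, repairing digit look-alikes.

    Returns None when no amount is found or the value falls outside the
    plausible range, so the caller can flag the record for review.
    """
    match = re.search(r"\$\s*([\dOolI]+\.[\dOolI]{2})", ocr_text)
    if not match:
        return None
    value = float(match.group(1).translate(DIGIT_LOOKALIKES))
    return value if min_price <= value <= max_price else None


repaired = clean_price("Only $l9.O9")      # 'l' and 'O' read in place of digits
missing = clean_price("Call for pricing")
implausible = clean_price("$99999.00")     # parses, but fails the range check

print(repaired, missing, implausible)
```

The substitution is applied only inside the matched amount, so letters in the surrounding words (the "O" and "l" in "Only") are left alone.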
### Example
```python
import re

import pytesseract
from PIL import Image

# Text rendered as an image (e.g. a price baked into a product graphic)
img = Image.open('product_price.png')
text = pytesseract.image_to_string(img).strip()  # e.g. 'Only $19.99'

# OCR confuses 0/O and 1/l - validate numeric fields before trusting them
price = re.search(r'\$(\d+\.\d{2})', text)
print(price.group(1) if price else 'no price found')
```
### FAQ
**Q: What does OCR stand for?**
OCR stands for Optical Character Recognition. It is the technology that reads text contained inside an image - a photo, scan, or graphic - and converts it into machine-readable characters that software can store, search, and process.
**Q: When do scrapers need OCR?**
When the target data is displayed as an image rather than as text: prices or contact details rendered as graphics to deter scraping, scanned PDFs with no text layer, chart and infographic labels, or screenshots. In all of these a normal scraper sees only pixels, and OCR is what recovers the characters.
**Q: How accurate is OCR?**
It depends on the image. Clean, high-resolution images with good contrast and a standard font read with very high accuracy. Low-resolution, skewed, stylized, or cluttered images produce errors - commonly confusing 0 with O and 1 with l - so numeric and ID fields should be validated after recognition.
**Q: Should I use OCR if the data exists as real text?**
No. If a value is available as actual text anywhere in the HTML or an underlying API, extract that directly - it is exact, fast, and cheap. Use OCR only as a fallback for data that genuinely exists solely as an image, since recognizing pixels is slower and can introduce errors.
---
## Is Web Scraping Legal?
URL: https://scrappey.com/qa/web-scraping-apis/is-web-scraping-legal
**Scraping publicly available data is generally legal, but legality depends on *what* you collect, *how* you collect it, and *what you do with it* — not on web scraping as an activity in itself.** Courts in several jurisdictions have repeatedly found that accessing information a website makes public does not, on its own, break the law. The risk lives in the details: collecting personal data, ignoring a site's Terms of Service, copying copyrighted content, or hammering a server hard enough to disrupt it can each create liability even when the underlying scraping is fine.
**Not legal advice:** This is a plain-English overview for developers. Laws differ by country and change over time; consult a qualified lawyer for your specific use case.
### Quick facts
- **Short answer:** Public data: generally legal. Depends on what & how.
- **Biggest risk areas:** Personal data, copyright, ToS, server load
- **Key US case:** hiQ Labs v. LinkedIn (scraping public data held not to violate the CFAA)
- **Key EU law:** GDPR — personal data has obligations even when public
- **Safe default:** Public, non-personal data; honor robots.txt & rate limits
### The short answer
There is no law called "the web scraping law." Web scraping is automated reading of web pages, and reading public information is not illegal. What can be illegal is a *specific* combination of facts around a scrape. The four questions that actually decide legality are:
- **Is the data public**, or behind a login / paywall you had to break through?
- **Does it contain personal data** about identifiable people?
- **Is the content copyrighted**, and are you republishing it?
- **Did you agree to Terms** that prohibit scraping, and did your scraper harm the site?
Get those right and most scraping of public, non-personal data sits comfortably on the legal side. Get them wrong and even "just reading a page" can turn into a contract, privacy, or copyright problem.
### United States: the CFAA and "public" data
The headline US statute people worry about is the **Computer Fraud and Abuse Act (CFAA)**, which criminalizes accessing a computer "without authorization." The key question has been whether scraping a public website counts.
In **hiQ Labs v. LinkedIn**, the Ninth Circuit held that scraping data a site makes *publicly available* (no login required) does not violate the CFAA: there is no "authorization" to exceed when the data is open to everyone. The Supreme Court's **Van Buren v. United States** decision narrowed the CFAA in a compatible direction, focusing it on cases where someone gets past an access barrier they were not authorized to cross, rather than on violating usage policies.
The practical takeaway: **public means public**. Data behind a password, paywall, or technical access control is a different story — breaking through an access gate is where CFAA exposure starts. Note that hiQ ultimately lost on *contract* grounds (breach of the site's Terms), which is exactly why the "how" matters as much as the "what."
### Personal data: GDPR and CCPA
The fact that personal data is *visible* on a public page does not make it free to collect and store. Under the EU/UK **GDPR**, processing personal data (names, emails, profiles) requires a lawful basis, and data subjects have rights regardless of where the data was found. The US **CCPA/CPRA** imposes similar obligations in California.
- **Aggregating public personal data at scale** is one of the most litigated and regulated areas of scraping.
- **Non-personal data** — prices, product specs, sports scores, public filings — carries far less privacy risk.
If your scrape can avoid personal data entirely, it sidesteps the single largest category of legal risk. When you do need it, document a lawful basis and minimize what you keep.
### Copyright and database rights
Scraping *facts* (a price, a temperature, a stock level) is generally safe — facts aren't copyrightable. Copying **creative expression** — articles, photos, reviews, descriptions — and republishing it can infringe copyright even if you scraped it from a public page. The EU additionally recognizes a *sui generis database right* protecting substantial extractions from a database.
Using scraped content for **analysis, indexing, or internal research** tends to be lower risk than **republishing it verbatim** in competition with the source. When in doubt, store and transform the data rather than mirroring the original work.
### Terms of Service, robots.txt and server load
Even when statutes don't bite, a site's **Terms of Service** can. If you clicked "I agree" or accessed an area gated by terms that ban automated collection, scraping may be a *breach of contract* — the ground hiQ actually lost on. Anonymous access to a fully public page is a weaker basis for a ToS claim, but the safest path is simply to respect the rules you're on notice of.
Two technical courtesies also reduce both legal and practical risk:
- **Honor robots.txt** and published crawl guidance.
- **Rate-limit yourself.** A scraper that degrades a site's service can move you from "reading public data" toward trespass-to-chattels or computer-misuse territory. Polite, paced requests matter — both legally and so you don't get 429-rate-limited or blocked.
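Both courtesies can be sketched with the standard library. The robots.txt content, the user-agent string, and the `polite_fetch_all` helper below are hypothetical; the rules are parsed from an inline string so the example runs without network access, and the sleep function is injected so the demonstration can record the waits instead of actually pausing.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-bot"  # hypothetical crawler name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Fall back to a 1-second pause when the site publishes no Crawl-delay.
delay = robots.crawl_delay(USER_AGENT) or 1.0


def polite_fetch_all(urls, fetch, delay_seconds, sleep=time.sleep):
    """Fetch only URLs robots.txt allows, pausing between requests."""
    results = []
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay out of this path
        if results:
            sleep(delay_seconds)  # pace every request after the first
        results.append(fetch(url))
    return results


urls = [
    "https://example.com/products",
    "https://example.com/private/data",
    "https://example.com/prices",
]

waits = []  # record the pauses instead of sleeping in this demo
fetched = polite_fetch_all(urls, fetch=lambda u: u, delay_seconds=delay, sleep=waits.append)
print(fetched, waits)
```

In a real crawler you would call `set_url()` and `read()` on the parser to load the live robots.txt, leave `sleep` at its default, and pass a fetch function that performs the HTTP request.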
### A practical checklist for staying on the right side
None of this is legal advice, but these habits keep most scraping projects defensible:
- **Scrape public data** — don't break through logins, paywalls, or access controls.
- **Avoid or minimize personal data**; if you must collect it, have a lawful basis and a retention limit.
- **Use facts, not verbatim creative content**; transform rather than republish.
- **Respect robots.txt and Terms** you're genuinely on notice of.
- **Rate-limit and identify yourself** where appropriate; never disrupt the target's service.
- **Check the laws of your jurisdiction** and the target's — and ask a lawyer for anything high-stakes.
Tools like a managed web scraping API help on the *how* by pacing requests and managing infrastructure for **publicly accessible** data — but the legal responsibility for *what* you collect and how you use it always stays with you.
### FAQ
**Q: Is web scraping legal?**
Scraping publicly available, non-personal data is generally legal in most jurisdictions, and courts (e.g. the hiQ Labs case in the US) have held that accessing public data does not by itself violate computer-access laws like the CFAA. Legality depends on what you collect (avoid personal and copyrighted data), how you collect it (do not break through logins or overload servers), and whether you respect the site’s Terms of Service. This is not legal advice.
**Q: Is it legal to scrape data behind a login?**
It is much riskier. Breaking through a password, paywall, or other access control is where laws like the US CFAA come into play, because you are accessing data that is not public. Public pages that require no login are far safer to scrape than anything gated behind authentication.
**Q: Can I get sued for web scraping?**
Yes, even when no criminal law is broken. The common civil claims are breach of contract (violating a site’s Terms of Service), copyright infringement (republishing creative content), privacy violations (mishandling personal data under GDPR/CCPA), and trespass-to-chattels (overloading a server). Scraping public, non-personal facts politely avoids most of these.
**Q: Does robots.txt make scraping illegal if I ignore it?**
robots.txt is a voluntary standard, not a law, so ignoring it is not automatically illegal. But it signals the site owner’s wishes, can support a Terms-of-Service or trespass claim, and ignoring it often leads to your traffic being blocked. Honoring robots.txt and rate-limiting your requests is the safer, more sustainable approach.
**Q: Is scraping personal data legal?**
Personal data being publicly visible does not make it free to collect. Under GDPR (EU/UK) and CCPA/CPRA (California), processing personal data carries legal obligations regardless of where it was found. Aggregating public personal data at scale is one of the most regulated and litigated areas of scraping — minimize personal data or get legal advice before collecting it.
---
## How to Scrape Website Data to Excel
URL: https://scrappey.com/qa/web-scraping-apis/scrape-website-data-to-excel
**To scrape website data into Excel, fetch the page through a scraping API that returns structured JSON, load the rows into a Python list of dictionaries, then write them to an .xlsx file with pandas (DataFrame.to_excel) or openpyxl.** This works even on JavaScript-rendered tables and pages behind browser verification, where Excel's built-in Power Query "Get Data from Web" returns an empty preview because it only sees the raw HTML, not the rendered DOM. The pattern is always the same three steps: get JSON, shape it into rows, write the workbook.
### Quick facts
- **Best Python library:** pandas (one-line to_excel) or openpyxl (fine-grained cell control)
- **Output format:** .xlsx (native Excel) or .csv for no-code import via Data > From Text/CSV
- **Why not Power Query:** It reads raw HTML only; JS-rendered or verification-gated tables come back empty
- **Engine needed:** pip install openpyxl (pandas uses it as the .xlsx writer engine)
- **Scrappey output:** autoparse:true returns parsed JSON ready to drop into a DataFrame
### Write scraped rows to .xlsx with pandas
**The fastest path is pandas: collect your scraped records as a list of dictionaries and call DataFrame.to_excel("out.xlsx", index=False).** pandas uses openpyxl under the hood as the .xlsx engine, so pip install pandas openpyxl is all you need.
The key idea is that a scraping API gives you JSON, and JSON maps cleanly onto rows and columns. Each dictionary becomes one Excel row; the dictionary keys become the header row. For a product table you might collect {"name": ..., "price": ..., "url": ...} per item, append each to a list, then hand the whole list to pd.DataFrame(rows). Setting index=False keeps Excel from adding an extra unnamed column. To split data across multiple sheets in one workbook, open a pd.ExcelWriter and call to_excel once per sheet with different sheet_name values. If a page already exposes a clean HTML <table>, pandas.read_html() can pull every table into DataFrames in a single line.
### When you need openpyxl directly
**Use openpyxl directly when you want Excel-specific formatting, multiple sheets built incrementally, or to append rows to an existing workbook without reloading everything into memory.** pandas is great for a clean one-shot dump; openpyxl gives you per-cell control.
With openpyxl you create a Workbook, grab the active sheet, write a header with ws.append([...]), then loop your scraped records calling ws.append([row["name"], row["price"]]) for each. From there you can bold the header (cell.font = Font(bold=True)), set column widths (ws.column_dimensions["A"].width = 40), freeze the top row (ws.freeze_panes = "A2"), or add a new sheet per category with wb.create_sheet. This streaming append style also lets you handle pagination: keep appending rows as you fetch each page, then call wb.save() once at the end so the entire crawl lands in a single workbook.
### No-code route, blocks, and empty tables
**If you would rather not write Excel code, write the rows to a CSV file and open it in Excel via Data > From Text/CSV; if your table comes back empty, the page is rendering with JavaScript and needs a real browser fetch first.**
One Excel-specific CSV gotcha: write the file as UTF-8 with a BOM so accented characters and currency symbols display correctly. In Python that is open("out.csv", "w", newline="", encoding="utf-8-sig") with csv.DictWriter, or simply df.to_csv("out.csv", index=False, encoding="utf-8-sig"). CSV does flatten structure, so nested or list-valued fields get stringified; prefer .xlsx for nested data. The other common failure is an empty file because the page renders client-side or returns a blocking response. Fetching through a scraping API that renders dynamic content and routes through residential proxies returns the fully rendered HTML or parsed JSON, which you then pass to pandas. For delimiters, escaping, and JSON output, see the sibling guide on exporting scraped data to CSV and JSON.
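A rough sketch of the CSV route using only the standard library, assuming the rows are already a list of dictionaries. The helper names `flatten_row` and `write_excel_csv` and the sample rows are hypothetical; encoding list or dict values as JSON strings is one possible way to handle nested fields, not the only one.

```python
import csv
import json


def flatten_row(row: dict) -> dict:
    """JSON-encode list or dict values so each CSV cell holds a single string."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, (list, dict)):
            flat[key] = json.dumps(value, ensure_ascii=False)
        else:
            flat[key] = value
    return flat


def write_excel_csv(rows: list[dict], path: str) -> None:
    """Write rows as UTF-8 with a BOM so Excel detects the encoding."""
    if not rows:
        return
    fieldnames = list(rows[0].keys())  # header row comes from the first record
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow(flatten_row(row))


rows = [
    {"name": "Café crème", "price": "€3.50", "tags": ["drink", "hot"]},
    {"name": "Jalapeño dip", "price": "€2.00", "tags": []},
]
write_excel_csv(rows, "out.csv")
```

The `utf-8-sig` codec is what adds the BOM. Reading the file back in Python with the same codec strips it again.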
### Example
```python
# pip install requests pandas openpyxl
import requests
import pandas as pd

# 1. Fetch the page through Scrappey. autoparse returns structured JSON;
#    the raw rendered body is always at solution.response.
API_KEY = "YOUR_API_KEY"
resp = requests.post(
    f"https://publisher.scrappey.com/api/v1?key={API_KEY}",
    json={
        "cmd": "request.get",
        "url": "https://example.com/products",
        "proxyCountry": "UnitedStates",
        "session": "excel-export",
        "autoparse": True,
    },
    timeout=180,
)
resp.raise_for_status()
solution = resp.json()["solution"]

# 2. Shape the data into a list of dicts (one dict == one Excel row).
#    Replace this with the fields your target page returns.
data = solution.get("response")
# Guard: if the body came back as a raw string rather than parsed JSON,
# there are no structured products to read.
products = data.get("products", []) if isinstance(data, dict) else []
rows = []
for item in products:
    rows.append({
        "name": item.get("title"),
        "price": item.get("price"),
        "url": item.get("link"),
    })

# 3a. One-line write with pandas (uses openpyxl as the .xlsx engine).
df = pd.DataFrame(rows)
df.to_excel("scraped_products.xlsx", index=False, sheet_name="Products")
print(f"Wrote {len(df)} rows to scraped_products.xlsx")

# 3b. No-code alternative: CSV with a BOM so Excel renders UTF-8 correctly,
#     then open via Data > From Text/CSV.
df.to_csv("scraped_products.csv", index=False, encoding="utf-8-sig")
```
### FAQ
**Q: Why does Excel Get Data from Web return an empty table?**
Excel Power Query fetches the raw HTML of a page, not the JavaScript-rendered DOM. If the table is built client-side or the page is gated behind browser verification, Power Query sees an empty or partial document. Fetching through a scraping API that renders JavaScript and returns structured JSON, then writing the rows to .xlsx in Python, solves this.
**Q: Do I need a special library to write .xlsx files in Python?**
Yes. pandas does not write .xlsx on its own; DataFrame.to_excel delegates to a writer engine, most commonly openpyxl (XlsxWriter is the other supported option). Install both with pip install pandas openpyxl. If you only need plain CSV, no extra library is required because csv is in the standard library.
**Q: How do I get paginated results into a single Excel workbook?**
Loop over the pages, fetch each one through the API, and keep appending the parsed rows to one list (or append directly to an openpyxl sheet with ws.append). After the loop finishes, build the DataFrame or call wb.save() once. That puts every page into a single workbook instead of one file per page.
---
## What Are Claude Skills?
URL: https://scrappey.com/qa/web-scraping-apis/what-are-claude-skills
**Claude Skills are reusable capability packages - a folder containing a SKILL.md file plus optional scripts and reference files - that Claude discovers and loads on demand to perform a specific task.** Anthropic shipped Skills (officially "Agent Skills") in 2025 as an open standard. The SKILL.md file has YAML frontmatter (at minimum a name and a description) followed by Markdown instructions. Claude reads only the short description until your request matches it, then pulls in the full instructions, then any bundled files - a pattern called progressive disclosure that keeps the context window small. Skills run in Claude Code (from a .claude/skills/ directory), the Claude apps, and the Claude API / Agent SDK.
### Quick facts
- **Format:** Folder with SKILL.md (YAML frontmatter + Markdown) + optional scripts/files
- **Official name:** Agent Skills (Anthropic, 2025); follows the open Agent Skills standard
- **Frontmatter:** name (<=64 chars, lowercase + hyphens), description (<=1024 chars)
- **Loading:** Progressive disclosure - metadata always (~100 tokens), body on trigger, files as referenced
- **Runs in:** Claude Code (.claude/skills/), Claude apps, Claude API / Agent SDK
### Skills vs MCP servers vs tool calls
Skills, MCP servers, and tool use all give a model new abilities, but they sit at different layers and it is easy to confuse them. The short version: a **tool call** is a single function, an **MCP server** is a live connection that exposes many tools over a shared protocol, and a **Skill** is a package of instructions and code that Claude reads off the filesystem.
| Mechanism | What it is | Lifecycle | Best for |
| --- | --- | --- | --- |
| **Skill** | Filesystem folder + SKILL.md (instructions + optional scripts) | Reusable across conversations and projects | Domain expertise, multi-step procedures, repeatable workflows |
| MCP server | A service exposing tools over the Model Context Protocol | Persistent connection, stateful | External services, real-time data, shared tools across clients |
| Tool use / function calling | A JSON schema for one function in the API request | Per-request only | Structured input/output for a single operation |
The three compose: a Skill can instruct Claude to call an MCP tool, and an MCP server can be the thing a Skill shells out to. A Skill is the cheapest way to teach Claude a procedure, because its body only enters the context window when the task is relevant.
### Anatomy of a SKILL.md
A Skill is a directory whose only required file is SKILL.md. Everything else - reference docs, example files, executable scripts - is optional and loaded lazily.
The YAML frontmatter is small on purpose. The two fields that matter are name (max 64 characters, lowercase letters, numbers and hyphens) and description (max ~1024 characters). The description is the single most important line you write: it is the only text Claude sees for every Skill at all times, and Claude matches your request against it to decide whether to load the rest. State both what the Skill does and when to use it. Claude Code adds optional fields such as allowed-tools (tools the Skill may use without prompting) and invocation controls, but those are extensions on top of the open standard.
Progressive disclosure is the design that makes Skills cheap. There are three levels: **metadata** (name + description, always loaded, roughly 100 tokens per Skill), **instructions** (the SKILL.md body, loaded only when the Skill is triggered), and **resources** (bundled files and scripts, read or executed only when the instructions reference them). A Skill that bundles fifty reference files costs almost nothing in context until one of those files is actually needed - and scripts run via the shell and return only their output, so their source never enters the context window.
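The first disclosure level can be illustrated with a short Python sketch that reads only the frontmatter of a SKILL.md and checks it against the limits described above. This is a simplification under stated assumptions: it handles flat `key: value` lines only (real frontmatter is YAML and needs a YAML parser), and `split_skill`, `validate`, and the sample text are hypothetical names for illustration, not part of any Anthropic tooling.

```python
import re

# Frontmatter sits between the first two '---' lines; everything after is the body.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?(.*)\Z", re.DOTALL)
NAME_RE = re.compile(r"^[a-z0-9-]{1,64}$")  # lowercase letters, digits, hyphens


def split_skill(skill_md: str) -> tuple[dict, str]:
    """Return (metadata, body). Only the metadata needs to be always loaded."""
    match = FRONTMATTER_RE.match(skill_md)
    if match is None:
        raise ValueError("SKILL.md must start with a frontmatter block")
    frontmatter, body = match.groups()
    meta = {}
    for line in frontmatter.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    return meta, body


def validate(meta: dict) -> list[str]:
    """List any problems with the two required fields."""
    problems = []
    if not NAME_RE.fullmatch(meta.get("name", "")):
        problems.append("name must be 1-64 lowercase letters, digits, or hyphens")
    description = meta.get("description", "")
    if not description or len(description) > 1024:
        problems.append("description must be 1-1024 characters")
    return problems


SAMPLE = """---
name: web-fetch
description: Fetch a web page as clean Markdown. Use when the user asks to read a URL.
---
# Web Fetch Skill
Long instructions that stay out of context until the Skill triggers.
"""

meta, body = split_skill(SAMPLE)
problems = validate(meta)
```

Keeping `meta` and `body` separate mirrors the design: a host can index many Skills by metadata alone and read `body` only for the one that matches.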
### Building a web-scraping Skill
A common, high-value Skill gives Claude reliable access to live web content. On its own, an AI agent can only reason over what is already in its context or its training data; a web-scraping Skill lets it fetch a current page, convert it to clean Markdown, and continue. Because a Skill is just instructions plus code, its reliability is inherited entirely from whatever it calls.
This is where the design choice matters. A naive Skill that does a plain HTTP GET works on simple sites and then fails the moment it hits a JavaScript-rendered page or a site behind verification - it silently returns a challenge page instead of content, and Claude has no way to know. A Skill that shells out to a managed web-data API instead inherits that API's residential proxies and real-browser rendering, and gets back clean Markdown ready for the model. The code example below is a minimal SKILL.md that does exactly this against the Scrappey API; the same shape works for any HTTP-callable backend.
Scrappey publishes a ready-made Skill for this (the scrappey-skill repository, auto-discovered by Claude Code and other SKILL.md-compatible agents) alongside a CLI, so you can drop it into .claude/skills/ rather than writing the wrapper yourself.
### Example
````markdown
---
name: web-fetch
description: Fetch any web page (including JavaScript-rendered or verification-gated sites) and return it as clean Markdown. Use when the user asks to read, summarize, or extract content from a URL.
allowed-tools: Bash
---
# Web Fetch Skill
When the user gives you a URL to read or summarize, fetch it through the
Scrappey API so the page renders and any verification is handled, then work
from the returned Markdown.
Run this (the API key is in the SCRAPPEY_API_KEY environment variable):
```bash
curl -s -X POST "https://publisher.scrappey.com/api/v1?key=$SCRAPPEY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "cmd": "request.get", "url": "<TARGET_URL>", "markdown": true }' \
  | jq -r '.solution.markdown'
```
The Markdown is in `.solution.markdown`. Summarize or extract from that text;
do not parse raw HTML yourself.
````
### FAQ
**Q: Are Claude Skills the same as MCP servers?**
No. A Skill is a folder of instructions and code that Claude reads from the filesystem and loads on demand; an MCP server is a running service that exposes tools to Claude over the Model Context Protocol. They compose - a Skill can tell Claude to call an MCP tool - but a Skill needs no server or network connection of its own.
**Q: How does Claude decide when to use a Skill?**
Claude always has every Skill's name and description in context (roughly 100 tokens each). When your request matches a description, Claude loads that Skill's full SKILL.md body, and then any files the body references. This is why the description should say both what the Skill does and when to use it.
**Q: Can a Skill scrape websites that block bots?**
A Skill is only instructions plus code, so its reliability comes from what it calls. A Skill that does a plain HTTP request will fail on JavaScript-heavy or verification-gated pages. A Skill that calls a managed web-data API inherits that API's browser rendering and residential proxies, and returns clean Markdown the model can use.
---
## What Are AI Agent Tools?
URL: https://scrappey.com/qa/web-scraping-apis/what-are-ai-agent-tools
**AI agent tools are the callable functions an autonomous LLM agent uses to act on the world - searching, fetching web pages, running code, querying APIs - rather than only generating text.** An agent runs a loop: the model reads its context, picks a tool, the runtime executes it, the result is fed back to the model, and the loop repeats until the task is done. Tools reach the model three main ways: native tool use / function calling (JSON schemas passed in the API request), MCP servers (a shared protocol so one tool works across many clients), and framework tools (LangChain, LlamaIndex, CrewAI). Web access is the most common tool category, because most useful tasks need live data the model was never trained on.
### Quick facts
- **Definition:** Functions an LLM agent can invoke to act: search, scrape, browse, run code, call APIs
- **Delivery:** Native tool use / function calling, MCP servers, framework tools (LangChain/CrewAI/LlamaIndex)
- **Most common category:** Web access - search + fetch page content as text or Markdown
- **Agent loop:** read context -> choose tool -> execute -> observe result -> repeat
- **Main failure mode:** Web tools blocked by anti-bot or broken by JavaScript rendering on protected sites
### The agent loop and where tools fit
An LLM on its own only produces text. An *agent* is an LLM wrapped in a loop that can take actions. On each turn the model is shown the conversation plus a list of available tools (each described by a name, a short description, and a JSON schema for its inputs). The model either answers or emits a tool call; the runtime executes that call, appends the result to the context, and runs the model again. The loop ends when the model decides it has enough to answer.
Tools are therefore the agent's entire interface to anything outside the model: a calculator, a SQL query, a file write, a web search, a page fetch. The model never runs code itself - it asks the runtime to, and reads the result. Good tool design is mostly good descriptions: the model chooses a tool the same way it chooses words, by matching your description text against the task.
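The loop described above can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: `fake_model` replaces a real LLM call, the message and reply dictionaries are a hypothetical shape rather than any provider's API format, and the toy `calculator` tool only adds integers.

```python
def calculator(expression: str) -> str:
    """Toy tool: add integers written as 'a+b+c'."""
    return str(sum(int(part) for part in expression.split("+")))


# Registry the runtime uses to look up a tool by the name the model emits.
TOOLS = {"calculator": calculator}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM: request a tool once, then answer from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "answer", "text": f"The result is {last['content']}."}
    return {"type": "tool_call", "name": "calculator", "input": {"expression": "2+40"}}


def run_agent(model, user_message: str, max_turns: int = 5) -> str:
    """Run the read -> choose tool -> execute -> observe loop until an answer."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = model(messages)
        if reply["type"] == "answer":
            return reply["text"]
        # The runtime, not the model, executes the call and records the result.
        result = TOOLS[reply["name"]](**reply["input"])
        messages.append({"role": "tool", "name": reply["name"], "content": result})
    raise RuntimeError("agent did not finish within max_turns")


answer = run_agent(fake_model, "What is 2 + 40?")
```

The `max_turns` cap is a design choice worth keeping in real agents too: it bounds cost if the model keeps calling tools without converging.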
### How tools reach the model: tool use vs MCP vs frameworks
There are three common ways to wire a tool to an agent, and they are not exclusive.
| Path | What you provide | Reaches | When to use |
| --- | --- | --- | --- |
| Native tool use / function calling | A JSON schema per function in the API call | One model/provider | You control the agent code and want the shortest path |
| MCP server | A server exposing tools over a shared protocol | Any MCP client (Claude Code, Cursor, etc.) | One tool, many clients; no per-client glue code |
| Framework tool | A class/function in LangChain, LlamaIndex, CrewAI | Agents built in that framework | You already build agents in that framework |
A Claude Skill is a fourth, complementary layer: not a tool itself, but packaged instructions that tell the model how and when to use the tools it already has.
### Web-access tools are the hard part
Most agent tools are simple: a calculator always returns the right answer, a file read either succeeds or throws. Web-access tools are different, because the web actively resists automated clients. A "fetch this URL" tool that does a plain HTTP request works on a static blog and then returns an almost-empty shell on any site that renders content with JavaScript, or a verification page on any site behind bot protection.
The failure is quiet, which is the dangerous part: the tool returns *something*, the model treats it as the page, and the agent confidently reasons over a challenge screen. Reliable web-access tooling therefore needs real browser rendering and residential IPs underneath - the same infrastructure a web scraping API provides. Pointing an agent's fetch and search tools at a managed web-data API that returns clean Markdown is the most reliable way to give an agent web access; see web scraping for LLMs for the full pipeline.
### Example
```json
{
  "name": "fetch_page",
  "description": "Fetch a web page and return its content as clean Markdown. Handles JavaScript rendering and verification automatically. Use whenever you need the current content of a URL.",
  "input_schema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The absolute URL to fetch"
      }
    },
    "required": ["url"]
  }
}
```
### FAQ
**Q: What is the difference between a tool and an MCP server?**
A tool is a single callable function the model can invoke. An MCP server is a running service that exposes one or more tools over the Model Context Protocol, so any MCP-capable client can discover and call them without custom integration code. A tool is the capability; MCP is one standard way of delivering it.
**Q: Do AI agents need web scraping tools?**
For most real tasks, yes. An LLM only knows its training data and what is in its context, so any task involving current prices, news, documentation, or live state needs a tool that fetches the web. Search finds the page; a fetch or scrape tool returns its content as text the model can read.
**Q: Why do agent web tools fail on some sites?**
A plain HTTP fetch cannot run JavaScript, so it returns an empty shell on modern sites, and it gets a verification page on sites with bot protection. The agent often cannot tell the difference between real content and a challenge page, so reliable web tools need real browser rendering and clean IPs underneath.
---
## What Is llms.txt?
URL: https://scrappey.com/qa/web-scraping-apis/what-is-llms-txt
**llms.txt is a proposed web standard - a Markdown file published at a site's root (/llms.txt) that gives large language models a curated, clean map of the site's most important content.** It was proposed by Jeremy Howard (Answer.AI) in September 2024. The format is plain Markdown: an H1 with the site name, a blockquote summary, then H2 sections each holding a list of links to key pages, optionally annotated. A companion /llms-full.txt can inline the full text of those pages. The motivation is concrete: a normal HTML page is mostly navigation, ads, and scripts that waste an LLM's limited context window, and crawlers may not execute the JavaScript that renders the real content. llms.txt is a hint about what matters - it is not access control, which is what robots.txt is for.
### Quick facts
- **Proposed:** Jeremy Howard / Answer.AI, September 2024
- **Location:** /llms.txt at the domain root; optional /llms-full.txt with full text
- **Format:** Markdown - H1 name, blockquote summary, H2 link sections
- **Purpose:** Curate clean, token-efficient content for LLMs; advisory, not access control
- **vs robots.txt:** robots.txt controls what crawlers may access; llms.txt suggests what content matters
### What goes in an llms.txt file
The format is deliberately minimal so both humans and models can read it. The structure, in order: a single H1 with the project or site name; an optional blockquote giving a one-line summary; optional free-form Markdown with context; then any number of H2 sections, each containing a bullet list of links in the form [name](url): optional note. A section named Optional by convention marks links a model can skip if it is short on context.
The sibling file /llms-full.txt takes this further by inlining the actual Markdown content of the listed pages into one document, so a model can ingest an entire docs site in a single fetch instead of crawling page by page. The code example below shows a typical llms.txt.
### llms.txt vs robots.txt vs sitemap.xml
All three are root-level files about how machines should treat your site, but they answer different questions and are consumed by different clients.
| File | Question it answers | Format | Consumed by |
| --- | --- | --- | --- |
| llms.txt | Which content matters most, in clean form? | Markdown, human-readable | LLMs / AI tools (advisory) |
| robots.txt | What may a crawler access? | Directives (User-agent / Allow / Disallow) | Search and AI crawlers |
| sitemap.xml | What URLs exist and when did they change? | XML, machine-readable | Search engine crawlers |
The key distinction: robots.txt grants or denies access, so AI-specific bot directives belong there; a sitemap is an exhaustive URL index for crawlers; and llms.txt is a curated, opinionated summary aimed at a model's context window. They are complementary, not substitutes.
### Adoption in 2026 and how to generate one
An honest read of adoption: llms.txt is widely published by developer-tool and documentation sites, and a growing number of AI dev tools read it, but it is not a universally honored standard - many models and agents still fetch and parse live HTML rather than looking for /llms.txt first. Treat it as a low-cost, upside-only signal, not a guarantee that a model will use it. Its value also overlaps with retrieval and MCP-based access, where an agent fetches your content on demand.
Generating one means producing clean Markdown for your key pages - which is the same job as scraping a site for LLMs. For a static site you can template it from your own content; for a JavaScript-rendered site you first need each page rendered and converted to Markdown, which a web-data API that returns Markdown does in one call. Once you have the per-page Markdown, assembling llms.txt (the link index) and llms-full.txt (the inlined text) is straightforward.
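Assuming you already hold per-page Markdown, the assembly step might look like the sketch below. The function names, the `sections` structure, and the sample page are hypothetical. The layout used for llms-full.txt (one H2 per page with a source line) is an assumption for illustration, since the text above only says the file inlines page content.

```python
def build_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Assemble the link index: H1, blockquote summary, then H2 link lists."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for page in pages:
            note = f": {page['note']}" if page.get("note") else ""
            lines.append(f"- [{page['name']}]({page['url']}){note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def build_llms_full(title: str, sections: dict) -> str:
    """Inline every page's Markdown into one document (assumed layout)."""
    parts = [f"# {title}"]
    for pages in sections.values():
        for page in pages:
            parts.append(
                f"## {page['name']}\n\nSource: {page['url']}\n\n{page['markdown'].strip()}"
            )
    return "\n\n".join(parts) + "\n"


sections = {
    "Docs": [
        {
            "name": "Quickstart",
            "url": "https://acme.dev/docs/quickstart",
            "note": "make your first charge",
            "markdown": "Install the SDK, then create a charge.\n",
        }
    ]
}
llms_txt = build_llms_txt("Acme Docs", "Acme is a payments API.", sections)
llms_full = build_llms_full("Acme Docs", sections)
```

Both builders take the same `sections` mapping, so one crawl feeds both files and the two stay in sync.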
### Example
```markdown
# Acme Docs
> Acme is a payments API for marketplaces. This file lists the docs an LLM should read first.
## Docs
- [Quickstart](https://acme.dev/docs/quickstart): create an account and make your first charge
- [Authentication](https://acme.dev/docs/auth): API keys, OAuth, and webhook signing
- [Payments API](https://acme.dev/docs/payments): create, capture, refund
## Optional
- [Changelog](https://acme.dev/changelog): version history
- [Status](https://acme.dev/status): live uptime
```
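Going the other way, a consumer can read a file like the one above back into a structure. The parser below is a minimal sketch that assumes the simple layout described in this entry (one H1, one blockquote, H2 sections with bullet links); `parse_llms_txt` and `required_links` are hypothetical helper names, and free-form Markdown between the summary and the first section is ignored.

```python
import re

# Matches "- [name](url)" with an optional ": note" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Parse title, summary, and per-section link lists from llms.txt text."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(match.groupdict())
    return doc


def required_links(doc: dict) -> list[dict]:
    """Links outside the 'Optional' section, for when context is tight."""
    return [
        link
        for section, links in doc["sections"].items()
        if section != "Optional"
        for link in links
    ]


SAMPLE = """# Acme Docs
> Acme is a payments API for marketplaces.
## Docs
- [Quickstart](https://acme.dev/docs/quickstart): create an account
## Optional
- [Changelog](https://acme.dev/changelog)
"""

parsed = parse_llms_txt(SAMPLE)
must_read = required_links(parsed)
```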
### FAQ
**Q: Does llms.txt control whether AI crawlers use my site?**
No. Access control belongs in robots.txt (including AI-specific bot user-agents) and in your server rules. llms.txt is purely advisory content curation - it tells a cooperating model which pages matter and provides them in clean form, but it does not block or permit any crawler.
**Q: Do LLMs actually read llms.txt?**
Adoption is growing, especially among documentation and developer-tool sites and the AI coding tools that consume them, but it is not universally honored. Many models and agents still fetch live HTML. Publishing llms.txt is low-cost and upside-only, but do not assume every model will look for it.
**Q: How is llms.txt different from a sitemap?**
A sitemap is an exhaustive, machine-readable XML list of every URL for search crawlers. llms.txt is a short, curated, human-readable Markdown file pointing only at the content that matters most to a language model, optimized for a limited context window rather than full coverage.
---
## Web Scraping for LLMs and RAG
URL: https://scrappey.com/qa/web-scraping-apis/web-scraping-for-llms
**Web scraping for LLMs is the process of fetching web pages and converting them into clean, chunkable text (usually Markdown) that can be embedded into a vector store for retrieval-augmented generation (RAG) or passed directly into an agent's context.** The pipeline is: fetch, clean (strip navigation, ads, and scripts), convert to Markdown, chunk, embed, store. The quality of every later step is capped by the first: if the fetch is blocked or the page never renders, retrieval has nothing real to work with. That is why fetch reliability - anti-bot handling and JavaScript rendering - rather than chunk size, is the limiting factor for most real-world RAG quality.
### Quick facts
- **Pipeline:** fetch -> clean -> Markdown -> chunk -> embed -> vector store
- **Output format:** Markdown - preserves headings/lists/tables, strips nav/scripts, token-efficient
- **Typical chunk:** 300-800 tokens with 10-20% overlap; split on headings where possible
- **Limiting factor:** Fetch reliability (anti-bot, JS rendering), not chunk tuning
- **Freshness:** Re-crawl on a schedule; dedupe by content hash
### The pipeline: from URL to vector store
A retrieval pipeline turns a list of URLs into searchable context in five steps. **Fetch** the page (the hard step, covered below). **Clean and convert** to Markdown, dropping navigation, footers, cookie banners, and scripts while keeping headings, lists, tables, and code. **Chunk** the Markdown into passages. **Embed** each chunk into a vector. **Store** the vectors with their source metadata in a vector database.
Markdown is the preferred intermediate format because it preserves document structure that raw text loses and strips the markup noise that raw HTML carries - a page that is 60 KB of HTML is often 4 KB of Markdown, which means more real content per token. Most managed web-data APIs can return Markdown directly, so the fetch and convert steps collapse into a single call (see the code example).
### Chunking and embedding without losing structure
Chunking is where most retrieval quality is won or lost. A few rules that hold up in practice:
- **Split on structure, not character count.** Break on Markdown headings first, then on paragraphs, so a chunk is a coherent idea rather than an arbitrary slice. This is exactly why you convert to Markdown before chunking.
- **Keep chunks at roughly 300-800 tokens with 10-20% overlap.** Too large and the embedding blurs multiple topics; too small and it loses context. Overlap stops an answer from being cut in half at a boundary.
- **Attach metadata to every chunk** - source URL, page title, heading path, fetch date. You need it for citations, for filtering, and for re-crawl freshness.
- **Dedupe by content hash.** Boilerplate (the same footer or sidebar across a site) otherwise floods your index with near-identical chunks and crowds out real answers.
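Three of these rules (split on headings, attach metadata, dedupe by hash) can be sketched together. This is an illustrative splitter under stated assumptions, not a production chunker: it does not enforce a token budget or overlap, it treats any line starting with `#` as a heading (so it would misread `#` comments inside fenced code), and the function names and sample pages are hypothetical.

```python
import hashlib
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(md: str, url: str) -> list[dict]:
    """Split Markdown at headings and tag each chunk with its heading path."""
    chunks, path, buf = [], [], []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({
                "text": text,
                "url": url,
                "heading_path": " > ".join(title for _, title in path),
                "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            })
        buf.clear()

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous chunk under the old heading path
            level = len(match.group(1))
            # Pop headings at the same or deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buf.append(line)
    flush()
    return chunks


def dedupe(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk for each content hash."""
    seen, unique = set(), []
    for chunk in chunks:
        if chunk["hash"] not in seen:
            seen.add(chunk["hash"])
            unique.append(chunk)
    return unique


page_a = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Footer\nContact us."
page_b = "# Pricing\nPlans start small.\n## Footer\nContact us."
all_chunks = split_by_headings(page_a, "https://example.com/a") + split_by_headings(
    page_b, "https://example.com/b"
)
unique_chunks = dedupe(all_chunks)
```

Hashing only the chunk text, not the URL, is what lets the repeated footer collapse to a single entry across pages.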
None of this matters if the text you chunked was wrong - which is the failure mode the next section covers.
### Why pipelines that work in dev break in production
A RAG pipeline tested against a handful of friendly URLs almost always works. The same pipeline pointed at real targets degrades, and the cause is rarely the chunking - it is the fetch. Two things go wrong. First, many pages render their content with JavaScript, so a plain HTTP fetch returns a near-empty shell and you embed nothing useful. Second, sites behind bot protection (such as Cloudflare or DataDome) return a verification or challenge page instead of content.
The insidious part is that both failures return a *200 OK with a body*, so a naive pipeline embeds the challenge page as if it were the article. Now your vector store is poisoned: retrieval surfaces "please enable JavaScript" and "verify you are human" as context, and the model answers from garbage. Retrieval accuracy is capped by scrape reliability, full stop.
The fix is to fetch through infrastructure that renders the page in a real browser, then returns clean Markdown - so the text entering your pipeline is the text a human would see. A managed web-data API does this in one call and, on pay-per-success pricing, you are not billed for the requests that fail, which pairs well with the retry loops a production crawler needs. For the broader context on agent-driven access, see AI agent tools and choosing a scraping API for LLM data.
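Whatever fetcher you use, a cheap guard before the embed step helps keep shells and challenge pages out of the index. The check below is a heuristic sketch only: the marker phrases, the 200-character floor, and the 2,000-character window are assumptions picked for illustration and would need tuning against your own targets.

```python
# Phrases that often appear on verification or "enable JavaScript" pages.
# This list is an assumption for the sketch, not an exhaustive signature set.
CHALLENGE_MARKERS = (
    "verify you are human",
    "enable javascript",
    "checking your browser",
    "just a moment",
)


def looks_like_real_content(text: str, min_chars: int = 200, marker_window: int = 2000) -> bool:
    """Return False for near-empty bodies and for short pages with challenge phrases."""
    body = text.strip()
    if len(body) < min_chars:
        return False  # near-empty shell: nothing worth embedding
    if len(body) < marker_window:
        lowered = body.lower()
        # Only short pages are checked for markers, so a long article that
        # merely mentions one of these phrases is not rejected.
        return not any(marker in lowered for marker in CHALLENGE_MARKERS)
    return True


challenge_page = "Just a moment... Please verify you are human to continue. " * 5
article = "Pricing starts at $10 per month for the starter plan. " * 60

keep_challenge = looks_like_real_content(challenge_page)
keep_article = looks_like_real_content(article)
```

Pages that fail the check are better retried or logged than silently dropped, so a blocked source shows up as a fetch problem rather than as a gap in retrieval.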
### Example
```python
import hashlib

import requests

ENDPOINT = "https://publisher.scrappey.com/api/v1"
API_KEY = "YOUR_API_KEY"


def fetch_markdown(url: str) -> str:
    """Fetch a URL as clean Markdown via a real browser with verification handling."""
    resp = requests.post(
        ENDPOINT,
        params={"key": API_KEY},
        json={"cmd": "request.get", "url": url, "markdown": True},
        timeout=180,
    )
    resp.raise_for_status()
    return resp.json()["solution"]["markdown"]


def chunk(md: str, size: int = 600, overlap: int = 80) -> list[str]:
    """Split on blank lines first, then pack into ~size-token-ish chunks."""
    paras = [p for p in md.split("\n\n") if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        # Flush only when there is something to flush, so an oversized first
        # paragraph does not produce an empty chunk.
        if cur.strip() and len(cur) + len(p) > size * 4:  # ~4 chars/token heuristic
            chunks.append(cur.strip())
            cur = cur[-overlap * 4:]  # carry the tail forward as overlap
        cur += "\n\n" + p
    if cur.strip():
        chunks.append(cur.strip())
    return chunks


for url in ["https://example.com/docs/quickstart"]:
    md = fetch_markdown(url)
    for c in chunk(md):
        cid = hashlib.sha256(c.encode()).hexdigest()[:16]  # dedupe key
        # embed(c) -> vector_store.upsert(id=cid, text=c, metadata={"url": url})
        print(cid, len(c), "chars")
```
### FAQ
**Q: What format is best for feeding web pages to an LLM?**
Markdown. It preserves document structure (headings, lists, tables) that plain text loses, while stripping the markup noise that raw HTML carries. The result is far more real content per token, which matters directly for both context-window cost and retrieval quality.
**Q: Why does my RAG pipeline return empty or garbage chunks?**
Almost always the fetch step, not the chunking. JavaScript-rendered pages return a near-empty shell to a plain HTTP request, and bot-protected sites return a verification page - often with a 200 OK status and a body. The pipeline then embeds that shell or challenge page as if it were content. Render in a real browser so the text you embed is the text a human sees.
**Q: How often should I re-scrape sources for RAG?**
It depends on how fast the source changes - docs and pricing pages monthly or on change, news daily, reference content rarely. Store a fetch date and a content hash per chunk so you can re-crawl on a schedule and skip pages whose hash has not changed, which keeps both cost and index churn down.
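A rough sketch of the hash-and-skip idea, tracked per page rather than per chunk for brevity. The `index_state` dict and the function names are hypothetical; a real crawler would keep this state in a database.

```python
import hashlib
from datetime import datetime, timezone


def page_hash(text: str) -> str:
    """Stable fingerprint of the fetched page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(url: str, text: str, index_state: dict) -> bool:
    """Record the fetch and report whether the page content changed.

    index_state maps url -> {"hash": ..., "fetched_at": ...}.
    """
    new_hash = page_hash(text)
    previous = index_state.get(url)
    changed = previous is None or previous["hash"] != new_hash
    index_state[url] = {
        "hash": new_hash,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
    return changed
```

Only pages where `needs_reindex` returns True need to be re-chunked and re-embedded; the rest just get a fresh fetch date.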
---
## Web Scraping to Google Sheets
URL: https://scrappey.com/qa/web-scraping-apis/web-scraping-to-google-sheets
**To get scraped data into Google Sheets you either write rows from code with the gspread library and a Google service account, or pull a published feed into a cell with the built-in IMPORTDATA / IMPORTHTML functions.** The code path gives you full control over authentication, multiple tabs, formatting, and scheduled refreshes; the no-code path is fastest when your data already lives at a public URL as CSV or a plain HTML table. Either way the hard part is usually the scrape itself - JavaScript-rendered tables and browser verification pages return empty rows to naive fetchers - so a scraping API that handles rendering and proxies in one call feeds Sheets the cleanest data.
### Quick facts
- **Two paths:** Python + gspread + service account (full control), or Apps Script / IMPORTDATA (no-code)
- **Auth (code path):** Google Cloud service account JSON key; share the Sheet with the account email
- **Best for:** Recurring scrapes, dashboards, and team-shared data without a database
- **IMPORTDATA limit:** Reads only public CSV/TSV at a URL; no JS rendering, ~50 IMPORT functions per sheet
- **Cleanest input:** Scrappey returns JSON or autoparsed tables you can write straight to rows
### Method 1: Python with gspread and a service account
**The most reliable path is to scrape with an API, shape the result into rows, and write them with gspread authenticated by a Google service account.** Set up auth once: in Google Cloud, create a service account, enable the Google Sheets API and Google Drive API, download the JSON key, then open your Sheet and share it (Editor) with the service account email found in that JSON. After that, gspread opens the spreadsheet by key and writes a 2D list in a single batched update() call, which is far faster and friendlier to API quotas than cell-by-cell writes.
- **Batch, do not loop:** build one list of rows and send it once; avoid an update_cell() call per field.
- **Multiple tabs:** use worksheet() or add_worksheet() to split datasets across sheets in the same workbook.
- **Append vs overwrite:** append_rows() adds to the bottom for incremental runs; clear then write for full refreshes.
- **Headers:** write a header row first so downstream formulas and pivots have named columns.
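The bullets above can be sketched as three small helpers. This assumes `ws` is a gspread Worksheet; the function names are illustrative. Passing `range_name` and `values` as keywords is a defensive choice, since the positional order of `update()` differs between gspread major versions.

```python
def to_rows(items, columns):
    """Turn a list of dicts into a 2D list of cell values, one row per item."""
    return [[item.get(col, "") for col in columns] for item in items]


def write_full_refresh(ws, items, columns):
    """Overwrite the tab: header row plus data in one batched update."""
    ws.clear()
    ws.update(range_name="A1", values=[list(columns)] + to_rows(items, columns))


def append_incremental(ws, items, columns):
    """Add new rows below the existing data in one batched call."""
    ws.append_rows(to_rows(items, columns), value_input_option="USER_ENTERED")
```

Use `write_full_refresh` when each run replaces the dataset and `append_incremental` when each run adds new readings; both send a single API request regardless of row count.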
### Method 2: No-code with IMPORTDATA, IMPORTHTML, and Apps Script
**If your data is already a public CSV or a plain HTML table, you can pull it into a cell with no code at all.** =IMPORTDATA("https://example.com/data.csv") loads a public CSV or TSV; =IMPORTHTML("https://example.com/page","table",1) grabs the first HTML table on a page. These are great for simple, static feeds but they have real limits: they cannot run JavaScript, cannot send custom headers or handle browser verification, and a sheet is capped at roughly 50 IMPORT-family functions. For anything dynamic, write a Google Apps Script function that calls a scraping API with UrlFetchApp.fetch(), parses the JSON, and writes rows with getRange().setValues(). Bind that function to a time-driven trigger to refresh on a schedule directly inside Sheets.
### Get clean data in first: handle JS and verification
**Both methods only work if the scrape returns real data, and that is where most Sheets tutorials quietly fail.** Power Query, IMPORTHTML, and a plain requests.get() all return an empty preview when a site renders its table with JavaScript or shows a browser verification interstitial. A scraping API solves this by rendering the page in a real browser and routing through residential proxies, then handing you the finished HTML or an autoparsed table. With Scrappey you send one POST, set autoparse: true to get structured rows back, and reuse a session so paginated requests share cookies. Loop through pages, collect rows into one list, and write that list to Google Sheets in a single batched call - one workbook, every page, no empty previews.
### Example
```python
import requests
import gspread
# 1) Scrape with Scrappey (handles JS rendering + proxies in one call)
API_KEY = "YOUR_API_KEY"
resp = requests.post(
    "https://publisher.scrappey.com/api/v1?key=" + API_KEY,
    json={
        "cmd": "request.get",
        "url": "https://example.com/products",
        "proxyCountry": "UnitedStates",
        "session": "sheets-job",
        "autoparse": True,
    },
    timeout=180,
)
resp.raise_for_status()
solution = resp.json()["solution"]
# autoparse returns structured data; fall back to raw HTML if you parse yourself
items = solution.get("parsed") or []
# 2) Shape into rows (header first)
rows = [["name", "price", "url"]]
for it in items:
    rows.append([it.get("name", ""), it.get("price", ""), it.get("url", "")])
# 3) Authenticate to Google with a service account JSON key.
# In Google Cloud: enable Sheets API + Drive API, create a service account,
# download the key, then share the Sheet (Editor) with the account email.
gc = gspread.service_account(filename="service_account.json")
sh = gc.open_by_key("YOUR_SPREADSHEET_ID")
ws = sh.sheet1
# 4) Write all rows in ONE batched call (fast, quota-friendly).
# Keyword arguments avoid the positional-order change between gspread versions.
ws.clear()
ws.update(range_name="A1", values=rows)
print("wrote " + str(len(rows) - 1) + " rows to Google Sheets")
```
### FAQ
**Q: How do I authenticate gspread with Google Sheets?**
Create a service account in Google Cloud, enable the Google Sheets API and Google Drive API, and download the JSON key. Then open your spreadsheet and share it with Editor access to the service account email listed in that JSON file. In code, call gspread.service_account(filename="service_account.json") and open the sheet by its key. The key is the long ID in the spreadsheet URL.
**Q: Can I scrape directly into Google Sheets without any code?**
Yes, if the data is already a public CSV or a plain HTML table. Use =IMPORTDATA("url") for a public CSV or TSV, or =IMPORTHTML("url","table",1) for the first HTML table on a page. These cannot run JavaScript, send headers, or handle browser verification, and a sheet allows only about 50 IMPORT functions. For dynamic pages, use an Apps Script function that calls a scraping API and writes rows with setValues().
**Q: Why does IMPORTHTML or Power Query return an empty table?**
Because the page builds its table with JavaScript after load, or shows a browser verification interstitial, so the raw HTML those tools fetch contains no rows. Render the page in a real browser first. A scraping API like Scrappey renders the page and routes through residential proxies, then returns finished HTML or autoparsed rows you can write straight into Sheets.
---
## How to Export Scraped Data to CSV and JSON (Python)
URL: https://scrappey.com/qa/web-scraping-apis/export-scraped-data-to-csv-json
**Export scraped data to CSV when you need flat, spreadsheet-ready rows, and to JSON when you need to preserve nested structure.** In Python, the built-in csv module writes rows with correct quoting and escaping, while json serializes objects directly. Because a web scraping API like Scrappey already returns JSON, the cleanest pipeline is: parse the response, write JSON for the raw structured record, and flatten to CSV only the fields you want in a spreadsheet.
### Quick facts
- **Best for tabular data:** CSV (one row per record, opens in Excel/Sheets)
- **Best for nested data:** JSON or JSON Lines (preserves arrays and objects)
- **Excel encoding fix:** Write UTF-8 with BOM (utf-8-sig) so accents render
- **Large datasets:** Stream row-by-row or use JSON Lines, never load all in RAM
- **Escaping handled by:** Python's csv module quotes commas, quotes, and newlines
### CSV vs JSON: which format to use
**Pick CSV for flat tabular records and JSON when fields are nested.** CSV is one row per item with a fixed set of columns; it opens directly in a spreadsheet and is ideal for prices, product listings, or contact rows. JSON keeps arrays and nested objects intact, so it is the right choice when a record has variable-length lists (tags, images, reviews) or sub-objects that would not fit cleanly into columns.
| Need | Use | Why |
| --- | --- | --- |
| Open in Excel/Google Sheets | CSV | Native, one row per record |
| Preserve nested fields | JSON | Arrays and objects survive intact |
| Append millions of records | JSON Lines | One JSON object per line, append-safe |
A common mistake is forcing nested data into CSV: a list field gets stringified or dropped. If a record has both flat and nested parts, write the full record to JSON for fidelity and a flattened subset to CSV for analysis. See exporting to Excel for the .xlsx variant.
### Encoding and escaping pitfalls
**The two failures that corrupt CSV exports are wrong encoding and unescaped delimiters; the csv module and UTF-8 handle both if configured right.** Always open files with encoding="utf-8" and newline="" so the csv module controls line endings. When the file is destined for Excel on Windows, use encoding="utf-8-sig" to write a byte-order mark (BOM) so accented characters and non-Latin scripts render instead of showing mojibake.
- **Commas and quotes in values:** never join fields by hand. The csv writer wraps any field containing a comma, double-quote, or newline in quotes and doubles internal quotes automatically.
- **Inconsistent keys:** use csv.DictWriter with extrasaction="ignore" so a record with a missing or extra key does not throw.
- **JSON unicode:** set ensure_ascii=False in json.dump to keep readable UTF-8 instead of \uXXXX escapes, and indent=2 only for human-readable files (omit it for compact machine files).
CSV flattens structure: a nested list becomes a string. If you must keep a nested field in CSV, serialize just that cell with json.dumps() so it round-trips losslessly.
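To make the escaping behaviour concrete, here is a small round-trip sketch using only the standard library. The sample record is invented for illustration; it deliberately contains a comma, double quotes, a newline, and a nested list.

```python
import csv
import io
import json

record = {
    "name": 'Notebook, 5" x 8"',
    "note": "line one\nline two",
    "tags": ["office", "paper"],
}

buf = io.StringIO(newline="")  # in-memory stand-in for open(..., newline="")
writer = csv.DictWriter(buf, fieldnames=["name", "note", "tags"], extrasaction="ignore")
writer.writeheader()
row = dict(record, tags=json.dumps(record["tags"]))  # serialize the nested cell
writer.writerow(row)

# Reading it back recovers the original values, including the nested list.
buf.seek(0)
restored = next(csv.DictReader(buf))
restored["tags"] = json.loads(restored["tags"])
```

The writer quotes every awkward field and doubles the internal quotes, so `restored` matches `record` without any hand-written escaping.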
### Streaming large datasets
**For large jobs, write each record as it arrives instead of building one giant list in memory.** Keep the output file open and call writer.writerow() (CSV) or write one JSON object per line (JSON Lines / NDJSON) per scraped item. This caps memory at one record and lets the job resume or append without rewriting the whole file.
JSON Lines is the scalable cousin of JSON: each line is a standalone JSON object, so you can append with mode "a", stream-read it back line by line, and never need to load the entire array. Plain json.dump of a list, by contrast, requires the whole dataset in RAM and a full rewrite to add records.
- **Append, do not overwrite:** open in append mode for incremental runs; write the CSV header only once.
- **Deduplicate:** track a key (URL or id) in a set, or post-process with sort -u / pandas drop_duplicates().
- **Pagination:** reuse one Scrappey session across pages so cookies and session state persist, and flush each page to disk before fetching the next.
Pair this with dynamic content scraping when the source renders rows in the browser, so the data you stream out is already complete.
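The streaming, header-once, and dedupe points can be combined in one helper. This is a sketch under assumptions: the field list, the use of `url` as the dedupe key, and the function name `stream_records` are illustrative choices.

```python
import csv
import json
import os

FIELDS = ["url", "name", "price"]


def stream_records(records, csv_path, jsonl_path, seen=None):
    """Append records one at a time to CSV and JSON Lines, skipping repeated URLs."""
    seen = set() if seen is None else seen
    # Write the CSV header only when the file is new or empty.
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    written = 0
    with open(csv_path, "a", encoding="utf-8", newline="") as cf, \
         open(jsonl_path, "a", encoding="utf-8") as jf:
        writer = csv.DictWriter(cf, fieldnames=FIELDS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        for rec in records:  # records may be a generator: one item in memory
            if rec["url"] in seen:
                continue
            seen.add(rec["url"])
            writer.writerow(rec)
            jf.write(json.dumps(rec, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Call it once per scraped page and pass the same `seen` set each time, so each page is flushed to disk before the next fetch and duplicates across pages are dropped.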
### Example
```python
import csv
import json
import requests
API = "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY"
# 1. Fetch a page through Scrappey (handles JS, proxies, verification).
# autoparse=true asks Scrappey to return structured data when it can.
r = requests.post(API, json={
    "cmd": "request.get",
    "url": "https://example.com/products",
    "proxyCountry": "UnitedStates",
    "session": "export-demo",
    "autoparse": True,
}, timeout=180)
r.raise_for_status()
# The HTML / parsed body lives here. Scrappey already returns JSON natively.
body = r.json()["solution"]["response"]
# Pretend we extracted these records (replace with your own parsing).
records = [
    {"name": "Cafe au lait mug", "price": 12.5, "tags": ["kitchen", "ceramic"]},
    {"name": "Notebook, A5", "price": 6.0, "tags": ["office"]},
]
# 2. Write full structured records to JSON Lines (nested fields preserved,
# append-safe, streams one object per line for large jobs).
with open("data.jsonl", "a", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
# 3. Write a flat, spreadsheet-ready CSV. utf-8-sig adds a BOM so Excel shows
# accents correctly; newline="" lets the csv module control line endings.
fields = ["name", "price", "tags"]
with open("data.csv", "w", encoding="utf-8-sig", newline="") as f:
    w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    w.writeheader()
    for rec in records:
        row = dict(rec)
        # CSV cannot hold a list, so serialize the nested field to a string.
        row["tags"] = json.dumps(row["tags"], ensure_ascii=False)
        w.writerow(row)  # quoting/escaping handled automatically
print("wrote data.jsonl and data.csv")
```
### FAQ
**Q: When should I export to CSV instead of JSON?**
Use CSV when your records are flat and you want to open them in Excel or Google Sheets: one row per item, fixed columns. Use JSON (or JSON Lines) when records contain nested arrays or objects you need to preserve, since CSV flattens or stringifies nested fields. Many pipelines write JSON for full fidelity and a flattened CSV for analysis.
**Q: Why do accented characters look wrong when I open my CSV in Excel?**
Excel on Windows assumes a legacy encoding unless the file starts with a UTF-8 byte-order mark. Write the file with encoding="utf-8-sig" in Python so the BOM is added, and accents and non-Latin scripts will render correctly. Also pass newline="" when opening the file so the csv module controls line endings.
**Q: How do I export a very large scrape without running out of memory?**
Write each record to disk as it is scraped instead of collecting everything in one list. For CSV, keep the file open and call writer.writerow() per item; for JSON, use JSON Lines (one JSON object per line) so you can append in mode "a" and read it back line by line. Reuse one Scrappey session across paginated requests and flush each page before fetching the next.
---
## How to Scrape Prices: Build a Price Monitor That Survives Anti-Bot
URL: https://scrappey.com/qa/web-scraping-apis/how-to-scrape-prices
**To scrape prices reliably you fetch each product page through a residential proxy in the right country, parse the current price out of the page (or let a scraping API return it as structured data), store each reading with a timestamp so you have history, and re-check on a schedule.** The mechanics are simple; the hard part is that price pages defend against automated traffic, prices are localized by country and currency, and the price you want (sale price, not the struck-through original) often loads via JavaScript. A code-first web scraping API handles the fetch layer - JavaScript rendering, rotating residential IPs, and browser verification - so your code only deals with parsing, storage, and scheduling.
### Quick facts
- **Five moving parts:** Targets, reliable fetch, price parse, history store, schedule + alerts
- **Why it is hard:** Active blocking, per-country prices/currency, JS-rendered sale prices
- **Fetch layer:** Residential proxy in the buyer geo + JS rendering, one API call
- **Cadence:** Hourly for hot SKUs, daily for the long tail
- **Store:** One timestamped row per reading - never overwrite, keep history
### The five parts of a price monitor
A price monitor is five small jobs wired together: pick targets, fetch the page, parse the price, store the reading, and schedule plus alert.
**1. Pick targets.** Keep a list of product URLs (or marketplace identifiers) you care about. Tag each as hot (fast-moving, competitive SKUs) or long tail.
**2. Fetch reliably.** This is where DIY trackers break. A plain request from a datacenter IP gets blocked, gets the wrong country price, or misses a price that loads via JavaScript. Route each request through a residential proxy in the buyer geography and render the page. A scraping API does all of this in one call.
**3. Parse the price.** Pull the *current* price, not the struck-through original. Many stores expose a clean price in embedded JSON (schema.org Product/Offer markup) which is more stable than scraping a styled price element; auto-parse output or that JSON block is your friend.
**4. Store history.** Append a timestamped row per reading - never overwrite. History is what lets you draw trends and detect drops.
**5. Schedule and alert.** Re-run on a cadence and compare to the last stored value to fire alerts. See the next two sections.
### Fetch and parse: residential proxies, geo, and the right number
The fetch step decides whether your monitor works at all, because retailers challenge automated traffic and localize prices.
**Geo and currency.** A product page served to a US IP and a German IP can show different prices, currency, and availability. Set the proxy country to match the market you sell in - if you track US pricing, fetch from a US residential IP. Getting this wrong silently poisons your data.
**Reliability at scale.** Price pages on large retailers sit behind verification challenges and rate limits, so a naive monitor returns errors instead of prices. A managed fetch layer that handles residential IP routing, browser rendering, and per-site request pacing keeps the monitor stable and the data complete. Holding a consistent session per target keeps each product's readings on one coherent connection, which makes the collected price series cleaner.
**Grabbing the correct number.** Prefer structured data: most stores embed an application/ld+json Product object with a clean offers.price and priceCurrency. That survives layout redesigns far better than a CSS selector aimed at a price element. With autoparse set to true, Scrappey returns parsed fields for many product pages; otherwise parse the embedded JSON yourself. Normalize to a number and store the currency alongside it.
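When the price arrives as display text rather than a clean JSON number, it needs normalizing before storage. The sketch below is one possible heuristic, not a complete locale parser: it assumes the last `.` or `,` followed by one or two digits is the decimal separator and treats every other separator as a thousands mark. The function name is illustrative, and `Decimal` is used instead of `float` to avoid rounding drift in money values.

```python
import re
from decimal import Decimal


def normalize_price(raw: str) -> Decimal:
    """Parse a displayed price such as '$1,299.00' or '1.299,00 EUR' into a Decimal.

    Raises decimal.InvalidOperation if the text contains no digits.
    """
    # Keep digits and separators, then drop stray separators at either end
    # (for example the dot in a currency abbreviation like "Rs.").
    digits = re.sub(r"[^\d.,]", "", raw).strip(".,")
    match = re.search(r"[.,](\d{1,2})$", digits)
    if match:
        whole = re.sub(r"[.,]", "", digits[: match.start()])
        return Decimal(f"{whole or '0'}.{match.group(1)}")
    return Decimal(re.sub(r"[.,]", "", digits))
```

The heuristic is ambiguous for inputs like `1,500`, which it reads as fifteen hundred; knowing the store's locale and storing the currency code next to the value remains the safer path.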
### Schedule, store history, and alert
Re-check on a cadence, append every reading to durable storage, and compare each new price to the last one to decide when to alert.
**Cadence.** Tier your SKUs. Hot, competitive items justify hourly checks; the long tail is fine daily or weekly. Polite, tiered cadence costs less and is far gentler on the target site than hammering every URL every minute - see throttling and polite crawling. Run the loop from cron, a scheduler, or a queue worker; every Scrappey plan allows up to 200 concurrent requests.
**Store history.** Each run writes one row per SKU: url, price, currency, timestamp. A timestamped table (SQLite, Postgres, or even CSV to start) gives you trend charts and an audit trail. You can later export to CSV/JSON or push to Google Sheets for the business team.
**Alert.** After storing, compare the new price to the previous stored value. Fire a webhook, email, or Slack message when it crosses a threshold (drops more than X percent, or undercuts your own price). That diff-on-write is the entire alerting engine. For a survey of off-the-shelf options, see the price-monitoring tools roundup.
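The store-then-compare loop can be sketched with SQLite from the standard library. The table name, column layout, and the 5 percent default threshold are assumptions for illustration; the alert itself (webhook, email, Slack) is left to the caller.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS price_history (
    url TEXT NOT NULL,
    price REAL NOT NULL,
    currency TEXT NOT NULL,
    fetched_at TEXT NOT NULL
)
"""


def record_price(conn, url, price, currency, drop_pct=5.0):
    """Append one reading; return the percent drop if it meets the threshold, else None."""
    conn.execute(SCHEMA)
    prev = conn.execute(
        "SELECT price FROM price_history WHERE url = ? AND currency = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (url, currency),
    ).fetchone()
    # Always append, never update: the history is the product.
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?, ?)",
        (url, price, currency, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    if prev and prev[0] > 0:
        change = (prev[0] - price) / prev[0] * 100
        if change >= drop_pct:
            return round(change, 2)
    return None
```

A non-None return value is the signal to fire an alert; comparing only within the same currency keeps readings from different markets from triggering false drops.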
### Example
```python
import json
import re
import requests
API = 'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY'
# A few product pages to watch. Tag cadence in your own scheduler.
TARGETS = [
    'https://www.example-store.com/products/widget-pro',
    'https://www.example-store.com/products/widget-lite',
]
def extract_price(html):
    """Prefer schema.org Product JSON-LD; it survives redesigns."""
    pattern = r'<script[^>]+ld\+json[^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, re.S):
        try:
            data = json.loads(block)
        except ValueError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            offers = item.get('offers') if isinstance(item, dict) else None
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict) and offers.get('price'):
                return float(offers['price']), offers.get('priceCurrency', '')
    return None, None
def fetch_price(url):
    resp = requests.post(API, json={
        'cmd': 'request.get',
        'url': url,
        'proxyCountry': 'UnitedStates',  # match the market you sell in
        'session': 'price-monitor-us',  # reuse IP + cookies per target
        'autoparse': True,
    }, timeout=180)
    resp.raise_for_status()  # fail loudly instead of parsing an error body
    html = resp.json()['solution']['response']
    return extract_price(html)
if __name__ == '__main__':
    for url in TARGETS:
        price, currency = fetch_price(url)
        # Append one timestamped row per reading - keep full history.
        print(url, price, currency)
```
### FAQ
**Q: How do I scrape the sale price and not the original crossed-out price?**
Read the structured data first. Most product pages embed a schema.org Product object in an application/ld+json script tag with a single offers.price field that already reflects the current sale price. That is far more reliable than targeting a styled price element, where the original and discounted prices both appear in the markup and layout changes break your selector.
**Q: How often should I re-check prices?**
Tier your cadence. Hourly per SKU is realistic for hot, competitive items when you rotate residential proxies and reuse a session per target; the long tail is fine daily or weekly. Hammering every URL every minute from one IP is unreliable and wastes requests, so pace per site and only check fast-moving items frequently.
**Q: Do I need a different proxy country for each market I track?**
Yes. Many stores localize price, currency, and availability by the country they detect from your IP, so scraping a US store from a European IP returns misleading numbers. Set the proxy country to the market you sell in for each target, and store the currency next to every price so you never mix markets.
---
# HTTP Errors
Status codes scrapers hit constantly. Each entry explains what the code means, why it shows up in scraping, and how to recover from it.
## What Is the 429 Status Code (429 Error)?
URL: https://scrappey.com/qa/http-errors/what-is-a-429-error
**HTTP 429 Too Many Requests is the status code a server returns when a client has sent more requests in a given window than the server's rate limit allows.** A rate limit is simply a cap on how many requests you're allowed to make in a set period of time. The response often includes a Retry-After header — a hint that tells the client how long to wait before trying again. For web scrapers, 429 is the most common form of soft block: the server hasn't decided you're a bot, it's just telling you that you're hitting it too fast and need to slow down.
### Quick facts
- **Status code:** 429
- **Category:** 4xx Client Error
- **Standard header:** Retry-After (seconds or HTTP date)
- **Common causes:** Burst traffic, per-IP rate limits, API quota exhaustion
- **Right response:** Back off, honor Retry-After, slow request rate
### What triggers a 429
Servers set rate limits to protect themselves from abuse and from accidentally being overwhelmed (a denial-of-service, where so many requests pile up that the server can't serve anyone). A 429 is usually triggered by one of three things: too many requests per second from a single IP address, too many requests per minute or hour against one specific URL (endpoint), or using up a per-account quota on an authenticated API. The cutoff point varies wildly — a public API might allow 60 requests per minute, while a protected page might block you after just 10 per minute or fewer. Services like Cloudflare and AWS WAF, and most CDNs (the networks that cache and deliver website content), come with rate-limiting rules built in, so even sites that wrote no custom logic can still serve 429s. Scrapers most often hit them when they launch many workers in parallel without coordinating the total request rate across all of them.
### How to read a 429 response
Always check the Retry-After header first. It's either a number of seconds (Retry-After: 30) or a specific date and time (Retry-After: Tue, 26 May 2026 14:30:00 GMT). Honor whatever it says. If the header is missing, fall back to exponential backoff - wait 1s, then 2s, then 4s, doubling each time, capping at a few minutes. Look at the response body too: some APIs include a JSON object listing the current limit, how much quota you have left, and when it resets, which is far more useful than guessing. Keep logs of when you got 429s, against which endpoint, and from which IP - that record is the data you'll need to tune how many requests you send at once (your concurrency).
### How scrapers handle 429 correctly
The wrong answer is to retry instantly or switch IPs and keep hammering — that turns a gentle rate limit into a hard ban. The right answer has three parts. First, slow your request rate per IP: if you're seeing 429s at 5 requests per second, drop to 2, and slow down further if they keep coming. Second, spread the load across more IPs using proxy rotation (cycling through a pool of IP addresses), but make each IP's pace look human — one request every few seconds, not one every millisecond. Third, queue and retry with backoff: a failed request goes to the back of the line with a delay based on Retry-After plus a little randomness (jitter), so your retries don't all fire at the same instant. Production scrapers treat 429 as routine and plan for it, rather than treating it as an error worth alerting on.
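Coordinating the total request rate across parallel workers is the step most often skipped. The sketch below is one simple way to do it inside a single process: every worker thread calls `acquire()` before each request, and the limiter hands out evenly spaced time slots. The class name is illustrative, and the injectable `clock` and `sleep` exist only to make the behaviour easy to test; workers spread across several processes or machines would need a shared store instead.

```python
import threading
import time


class SharedRateLimiter:
    """Space out requests from all worker threads to at most `rate` per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def acquire(self):
        """Reserve the next free time slot, then wait until it arrives."""
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)  # sleep outside the lock so others can reserve
        return delay
```

With one limiter per target site (or per proxy IP), adding more workers raises throughput only up to the configured rate, and lowering `rate` after a burst of 429s slows every worker at once.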
### How to fix a 429 in Python (requests)
The fix has three layers: **honor Retry-After**, **back off exponentially with jitter** when it's missing, and only then **spread load across proxies**. The function below implements the first two: it parses Retry-After whether the server sends seconds (Retry-After: 30) or an HTTP date, waits the right amount, and caps total attempts so a hard block doesn't loop forever. When retries run out it raises, which is the cue to move to the third layer.
The most common mistake - and the reason a lot of "too many requests python" searches end in frustration - is a fixed time.sleep(5) retry loop that ignores Retry-After, retries at the same pace after every 429, and gets the IP banned. Adaptive backoff plus a realistic User-Agent (the default python-requests/2.x header is itself a rate-limit trigger on many sites) fixes the majority of cases. See the full snippet in the code example below.
If 429s persist after backoff the limit is probably per-IP, not per-account. At that point you need rotating proxies so each IP stays under the threshold — or a managed API that pools IPs and paces requests for you.
### Example
```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import requests
HEADERS = {  # a real browser UA - the default python-requests UA invites 429s
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0 Safari/537.36",
    "Accept": "text/html,application/json",
    "Accept-Language": "en-US,en;q=0.9",
}
def retry_after_seconds(resp):
    """Honor Retry-After whether it's seconds or an HTTP date."""
    ra = resp.headers.get("Retry-After")
    if not ra:
        return None
    if ra.isdigit():
        return int(ra)
    try:
        when = parsedate_to_datetime(ra)
        return max(0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return None
def get_with_backoff(url, max_retries=5):
    session = requests.Session()
    session.headers.update(HEADERS)
    for attempt in range(max_retries):
        resp = session.get(url, timeout=30)
        if resp.status_code != 429:
            return resp
        wait = retry_after_seconds(resp)
        if wait is None:  # no header -> exponential backoff
            wait = (2 ** attempt) + random.uniform(0, 1)  # + jitter
        print(f"429 received; waiting {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)
    raise RuntimeError("Still rate-limited after retries -- rotate proxies")
resp = get_with_backoff("https://example.com/api")
print(resp.status_code, len(resp.text))
```
### FAQ
**Q: What's the difference between 429 and 503?**
429 is specifically about how fast the client is sending requests. 503 means the server itself is unavailable — overloaded, mid-deployment, or down. Both can come from rate-limit middleware, but 429 is the one that correctly says "you're going too fast."
**Q: Does using a proxy fix 429s?**
It can. If the limit is per-IP, rotating through different IPs spreads your requests out so each IP stays below the threshold. But if the limit is tied to your account or your fingerprint (the unique profile a server builds from your connection), swapping IPs alone won't help. Figure out which kind of limit you're hitting before throwing more infrastructure at it.
**Q: How long should I wait after a 429?**
Honor Retry-After if it's present. Otherwise the standard approach is exponential backoff: start at 1 second and double each time (1s, 2s, 4s...), adding a bit of random jitter so retries don't bunch up. Cap the wait at 5–10 minutes; if you're still getting 429s after that, you're no longer just rate-limited — you're banned.
**Q: Can a 429 turn into a permanent ban?**
Yes. Repeatedly triggering 429s from the same IP is a strong bot signal, and many sites escalate it into long-term IP blocks. Treat a 429 as a cue to slow down, not as a free allowance to keep retrying.
**Q: How do I fix a 429 error in Python?**
Read the Retry-After header and sleep for exactly that long; if it is absent, use exponential backoff with jitter (1s, 2s, 4s… plus a random fraction). Send a real browser User-Agent instead of the default python-requests one, and use a requests.Session so connections and cookies are reused. If 429s continue, the limit is per-IP — rotate proxies so each IP stays under the cap.
**Q: Why do I get 429 with python-requests but not in my browser?**
Two reasons. The default User-Agent (python-requests/2.x) is an obvious bot signal that many sites rate-limit harder, and scripts fire requests far faster than a human clicks. Set browser-like headers and add delays, or route through a scraping API that handles pacing and fingerprints.
---
## What Is the 499 Status Code (499 Error)?
URL: https://scrappey.com/qa/http-errors/what-is-a-499-error
**HTTP 499 Client Closed Request is a non-standard status code, logged by Nginx (and CDNs like Cloudflare) when the client closes the connection before the server finishes sending a response.** Think of it as hanging up the phone before the other person finishes their sentence. The server never gets to send the response — there's no one left to receive it — so 499 only shows up in server or CDN logs, never as a page in your browser. For scrapers, a 499 usually means your own code gave up first: a timeout set too short, a cancelled request, or a worker that was killed while the origin was still busy producing a slow response.
### Quick facts
- **Status code:** 499 (non-standard, Nginx)
- **Meaning:** Client Closed Request
- **Where you see it:** Nginx / CDN access logs, not in the browser
- **Common causes (scraping):** Client timeout too low, cancelled request, slow origin or proxy
- **Right response:** Raise client/read timeouts, lower concurrency, retry with backoff
### What a 499 actually means
Most 4xx codes describe a problem the server found with your request. A 499 is different: it's recorded *after* the client has already disconnected, so it describes the client's behaviour, not the server's. Nginx created the code specifically to separate "the caller hung up" from a genuine server error.
Here's the typical chain of events. Your HTTP library has a timeout — a maximum time it will wait for a reply. If the page takes longer than that, the library gives up and closes the socket (the open network connection). From the origin server's point of view, that looks like a client that vanished mid-request, so it logs a 499. Cloudflare and other reverse proxies (servers that sit in front of the origin and relay traffic) follow the same convention. Because the response never reaches your code, your scraper sees this as a timeout or connection-aborted error — not as a 499. The 499 itself only ever lives in the target's logs.
### Why scrapers run into 499
The most common cause is a read timeout set too low for the page you're fetching. Heavy pages — ones rendered with JavaScript, or sitting behind an anti-bot challenge — can take many seconds to respond. If your client times out at 5–10 seconds, you give up too early and generate 499s. Other common triggers:
- A proxy that is slow or flaky and stalls the connection.
- A worker process that gets killed mid-request — by running out of memory (OOM), by autoscaling, or by a deploy — while requests are still in flight.
- Cancelling requests in your own code: an aborted async task, or a closed browser tab in a headless (no visible window) run.
Concurrency makes it worse. Hitting a slow origin with many requests at once makes each response slower still, which makes more of your timeouts fire, which produces 499s in bulk.
### How to fix and avoid 499
Start by raising your client's read and connect timeouts so they comfortably exceed the target's worst-case response time. Then retry any timed-out request with exponential backoff and jitter — wait a bit longer before each retry, with a little randomness so your retries don't all fire at the same moment. Lower your concurrency so the origin (and your proxy) can finish answering before your timeouts trip.
Next, audit your proxies: swap out slow or unstable endpoints, and prefer reliable residential proxies for protected sites. Make sure your workers aren't being killed mid-request by memory limits or aggressive autoscaling. Finally, if the origin is slow because it's actively challenging you, the real fix isn't per-request tweaks — it's to use a scraping API that handles the wait and the unblock for you.
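The timeout and retry advice above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not a definitive implementation: `retry_on_timeout`, its parameters, and the 10s/60s timeouts in the usage comment are illustrative names and values, and the request is passed in as a callable so the retry logic does not depend on any one HTTP library.

```python
import random
import time


def retry_on_timeout(fetch, attempts=4, base=1.0, retry_on=(TimeoutError,),
                     sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially on timeouts.

    fetch    -- zero-argument callable that performs one request (hypothetical hook)
    retry_on -- exception types that mean "the client gave up or the link dropped"
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout to the caller
            # 1s, 2s, 4s... plus up to `base` seconds of jitter so retries spread out
            sleep(base * 2 ** attempt + random.uniform(0, base))


# Usage with requests (assumed values; tune them to the target's worst case):
#   import requests
#   resp = retry_on_timeout(
#       lambda: requests.get(url, timeout=(10, 60)),  # (connect, read) seconds
#       retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError),
#   )
```

Passing the timeout as a `(connect, read)` tuple lets you keep the connect timeout short while giving slow pages a generous read timeout, which is usually the one that causes 499s.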
### FAQ
**Q: Is 499 a real HTTP status code?**
Not in the official spec. It's a non-standard code introduced by Nginx to mean "Client Closed Request," and it's widely used in server and CDN logs. Browsers never display it because the client has already disconnected by the time it's logged.
**Q: Why do I see 499 when scraping but my request just timed out?**
They're two views of the same event. Your HTTP client reports a timeout or aborted connection; the origin's Nginx logs that same abort as a 499. Increase your timeout and the 499s usually disappear.
**Q: Does a 499 mean I'm blocked?**
Not directly. A 499 is about a closed connection, not a refusal. But it often appears alongside blocking: a site that's slow-walking or challenging your request can push your client past its timeout, producing 499s as a side effect.
**Q: How do I stop getting 499 errors?**
Raise your read and connect timeouts, retry with backoff, reduce concurrency, and use reliable proxies so responses arrive before your client gives up. If a site is slow because it's anti-bot-challenging you, offload the wait to a scraping API.
---
## What Is the 403 Status Code (403 Forbidden Error)?
URL: https://scrappey.com/qa/http-errors/what-is-a-403-error
**HTTP 403 Forbidden means the server understood your request but refuses to answer it.** The difference from 401 is simple: 401 means "we don't know who you are, log in first," while 403 means "we know who you are and the answer is still no." For scrapers, 403 is the classic anti-bot block — the server decided your request looks automated and cut it off before sending the page.
### Quick facts
- **Status code:** 403
- **Category:** 4xx Client Error
- **Common causes (scraping):** Bot detection, missing/wrong headers, blocked IP, geo restriction
- **Common causes (general):** Insufficient permissions, expired auth, IP allowlist
- **Typical body:** "Access Denied", "Forbidden", or a Cloudflare challenge page
### Why 403s happen in scraping
On a normal API, a 403 just means you logged in but aren't allowed to see this particular thing. In scraping it almost always means an anti-bot system spotted you. Common triggers: your IP belongs to a known datacenter range (servers, not homes), a missing or odd User-Agent (the string a browser sends to identify itself), a TLS fingerprint that doesn't match that User-Agent (TLS is the encryption layer behind https, and its handshake quietly reveals which client you really are), too many requests too fast, a geo block on your country, or a behavioral signal picked up on an earlier page. Cloudflare in particular returns 403 with its own branded challenge page as the body — if you see "Cloudflare" and a ray ID in the HTML, that's the source. The 403 is just the symptom; the real decision happened a layer above it.
### How to diagnose a 403
Start by reading the response body — that's where the truth is. A short plain "Forbidden" page usually means a simple WAF rule (a web application firewall, basic pattern-matching at the edge); a Cloudflare/Akamai/DataDome branded page means a dedicated bot-detection service. Check the response headers for cf-ray, x-amzn-waf-action, server: AkamaiGHost, or x-datadome to see which vendor it is. Then check what you're sending: is the User-Agent realistic? Is Accept-Language present? Does your TLS fingerprint match the browser you claim to be? Finally, try the same URL through a residential proxy in the target's main country — if that works, the block was based on your IP; if it still fails, the block is based on your fingerprint.
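The header and body checks above can be sketched as a small classifier. This is a rough heuristic, not a definitive detector: `identify_blocker` and the 500-character cut-off are assumptions made for illustration, and the header names are only the ones listed in the paragraph above.

```python
# (header name, substring its value must contain or None, vendor label)
VENDOR_HEADERS = [
    ("x-datadome", None, "DataDome"),
    ("x-amzn-waf-action", None, "AWS WAF"),
    ("server", "akamaighost", "Akamai"),
    # cf-ray appears on every response served through Cloudflare, so it is
    # checked last: it shows Cloudflare is in the path, not that it blocked you.
    ("cf-ray", None, "Cloudflare"),
]


def identify_blocker(headers, body=""):
    """Guess which layer produced a 403 from response headers and body text."""
    lowered = {k.lower(): str(v).lower() for k, v in headers.items()}
    for name, needle, vendor in VENDOR_HEADERS:
        value = lowered.get(name)
        if value is not None and (needle is None or needle in value):
            return vendor
    text = body.lower()
    for vendor in ("Cloudflare", "Akamai", "DataDome"):
        if vendor.lower() in text:
            return vendor  # vendor-branded block page
    if len(text) < 500 and "forbidden" in text:  # 500 is an arbitrary cut-off
        return "plain WAF rule or origin"
    return "unknown"
```

With `requests`, a call would look like `identify_blocker(resp.headers, resp.text)`. Treat the result as a hint for which fix to try first, then confirm it with the residential-proxy test.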
### How to recover from a 403
The fix depends on the cause. IP-based 403s clear by rotating through residential or mobile proxy addresses. Header-based 403s clear with realistic headers — copy a real browser's headers from DevTools exactly. Fingerprint-based 403s need a real browser stack: Playwright driving a real browser engine, or a managed scraping API that maintains a consistent browser configuration. A 403 that redirects to a CAPTCHA needs a solver. A geo-based 403 needs a proxy in the right country. One rule above all: never retry a 403 without changing something — repeat 403s from the same identity only make the block stronger.
### Fixing a 403 in Python requests
Work through four layers in order — most 403s are solved by the first two:
- **Send a realistic User-Agent.** The default python-requests/2.x string is the single most common cause of a 403 that works fine in a browser.
- **Send the full browser header set** — Accept, Accept-Language, Accept-Encoding, and Referer. A request with only a User-Agent still looks nothing like a real browser.
- **Persist cookies with a requests.Session().** Many sites set a cookie on first load and 403 any request that arrives without it.
- **If 403s survive all of the above**, the site is fingerprinting your TLS/HTTP2 handshake (typical of Cloudflare and DataDome). Plain requests can't change that — switch to curl_cffi with impersonate="chrome", a headless browser, or a managed scraping API.
The code example below shows the headers-plus-session approach and the curl_cffi fallback side by side.
### Example
```python
import requests

# Layers 1-3: realistic headers plus a session that persists cookies.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # "br" needs a Brotli decoder installed (pip install brotli);
    # without one, a Brotli-compressed body arrives undecoded.
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)
resp = session.get("https://example.com", timeout=30)
print(resp.status_code)  # often 200 once headers + cookies look real

# Layer 4: still 403? The site is probably fingerprinting your TLS handshake.
# curl_cffi impersonates a real browser's TLS/JA3, which requests can't do.
# pip install curl_cffi
from curl_cffi import requests as cffi

resp = cffi.get("https://example.com", impersonate="chrome", timeout=30)
print(resp.status_code)
```
### FAQ
**Q: What's the difference between 401 and 403?**
401 Unauthorized means you haven't proven who you are yet — log in and try again. 403 Forbidden means you're already authenticated (or no login is needed) but still aren't allowed in. For scraping without a login, 403 is the one you'll actually run into.
**Q: Does a 403 mean my IP is banned?**
Sometimes. The quickest test is to send the exact same request through a different IP. If the second attempt works, the first IP is on a block list. If it still fails, the block is based on something other than your IP — usually your fingerprint or headers.
**Q: Can changing User-Agent fix a 403?**
It can satisfy the most basic rules but not modern detection. Real systems cross-check the User-Agent against your TLS fingerprint, the order of your headers, signals collected by JavaScript, and your behavior. Swapping the UA alone matters little; making the UA line up consistently with everything else is what counts.
**Q: Is a 403 the same as a Cloudflare block?**
Cloudflare blocks usually do return 403 with their challenge page in the body, but plenty of other sources return 403 too. Read the response body and headers to identify which vendor (Cloudflare, Akamai, DataDome, PerimeterX) is doing the blocking.
**Q: How do I fix a 403 Forbidden error in Python requests?**
Set a real browser User-Agent and the full header set (Accept, Accept-Language, Referer), and use a requests.Session() so cookies persist. If you still get 403, the site is fingerprinting your TLS handshake — switch to curl_cffi with impersonate="chrome", a headless browser, or a scraping API that emulates a real browser.
---
## What Is the 503 Status Code (503 Service Unavailable Error)?
URL: https://scrappey.com/qa/http-errors/what-is-a-503-error
**HTTP 503 Service Unavailable means the server can't handle your request right now, usually because it's overloaded, under maintenance, or deliberately turning traffic away at its outer edge.** Think of it as a shop with a "back in 5 minutes" sign on the door. For scrapers, a 503 most often means a CDN (the network of edge servers that sits in front of a website) or a bot-detection layer chose to drop your request instead of passing it to the real server, or that an upstream service is being rate-limited. A well-behaved server may add a `Retry-After` header telling you when to try again, but the header is optional and often missing.
### Quick facts
- **Status code:** 503
- **Category:** 5xx Server Error
- **Common causes (general):** Server overload, scheduled maintenance, upstream timeouts
- **Common causes (scraping):** Cloudflare "Just a moment" challenge, WAF interstitials, edge throttling
- **Right response:** Honor Retry-After, exponential backoff, escalate if persistent
### Why 503s happen
A 503 comes from one of two places. The first is the real server genuinely struggling: it's overloaded, restarting, mid-deployment, or its database timed out. These 503s are temporary and fix themselves within seconds or minutes. The second — the one scrapers see all the time — is bot detection. Cloudflare historically returned 503 alongside its "Checking your browser…" interstitial (a holding page shown while it inspects you); modern Cloudflare uses 403 for outright blocks and saves 503 for protecting capacity, but the old pattern still shows up. Akamai, AWS WAF, and Imperva also return 503 in various managed-block setups. The quickest way to tell the two apart: read the response body. A vendor-branded page means bot detection; a plain "Service Unavailable" or your own server's error template means real overload.
### How scrapers should respond to 503
Treat 503 as worth retrying, but not blindly. If a `Retry-After` header is present, wait at least that long. Otherwise use exponential backoff with jitter, meaning you double the wait each time and add a small random offset so many retries don't all fire at once: 2s, 4s, 8s, 16s, capping at a few minutes. If the body shows a bot-detection interstitial, retrying as the same "person" won't help: rotate your proxy (use a different exit IP) and refresh your fingerprint (the bundle of browser traits sites use to recognize you) before trying again. If 503s come in bursts and then recover, that's probably genuine origin overload and your retries will eventually get through. If your scraper gets 503s constantly but a normal browser doesn't, you're being blocked, not throttled.
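The wait logic described above can be sketched as follows. It is a minimal sketch under stated assumptions: the function names, the 2-second base, and the 300-second cap are illustrative choices. `Retry-After` can carry either a number of seconds or an HTTP date, so both forms are handled.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the seconds a Retry-After value asks for, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, base=2.0, cap=300.0):
    """Exponential backoff (2s, 4s, 8s...) capped at `cap`, plus up to 1s of jitter."""
    return min(cap, base * 2 ** attempt) + random.uniform(0, 1)


def wait_before_retry(retry_after_header, attempt, now=None):
    """Prefer the server's hint; fall back to backoff when there is none."""
    hinted = parse_retry_after(retry_after_header, now)
    return hinted if hinted is not None else backoff_delay(attempt)
```

With `requests`, the header value would come from `resp.headers.get("Retry-After")`; `attempt` is the zero-based retry counter your own loop keeps.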
### Avoiding 503s in the first place
Spread your requests out over time. Set concurrency limits (how many requests run at once) per-domain rather than globally — pointing 50 parallel workers at a single site reliably triggers 503s even on sturdy servers. Reuse TCP/TLS connections: a single HTTP/2 connection can carry many requests, and opening a fresh connection for every request is both slower and more obviously bot-like (TLS is the encryption layer behind https). Cache aggressively — if you scraped a page in the last 24 hours, don't fetch it again. And keep monitoring: a sudden jump in 503s from a target that was stable usually means it just rolled out new rate limits or a new bot-detection rule, and your strategy needs to adapt.
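A per-domain concurrency limit can be sketched with one semaphore per host. This is an illustration only: `DomainLimiter`, the default of 4 in-flight requests, and the `fetch` coroutine it wraps are assumed names and values, and `fetch` stands in for whatever async HTTP call your crawler already uses.

```python
import asyncio
from urllib.parse import urlsplit


class DomainLimiter:
    """Cap in-flight requests per host instead of across the whole crawl."""

    def __init__(self, per_domain=4):
        self._per_domain = per_domain
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlsplit(url).netloc.lower()
        if host not in self._semaphores:
            # Created lazily, so it is built inside the running event loop.
            self._semaphores[host] = asyncio.Semaphore(self._per_domain)
        return self._semaphores[host]

    async def run(self, url, fetch):
        """Await fetch(url) once a slot for that URL's host is free."""
        async with self._semaphore_for(url):
            return await fetch(url)
```

You can still launch many tasks at once with `asyncio.gather`; the limiter only makes sure no single site sees more than `per_domain` of them at the same time.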
### FAQ
**Q: Is a 503 always temporary?**
By definition, yes — 503 means "come back later." In practice, the bot-detection kind sticks around until you change something about how your scraper presents itself, so it isn't temporary the way the spec assumes.
**Q: What's the difference between 503 and 502?**
502 Bad Gateway means a proxy or gateway received an invalid response from the upstream server it was relaying to. 503 means the server (or the edge in front of it) is declining to serve right now. Both are 5xx errors and both are worth retrying, but a persistent 502 usually points to a broken upstream rather than temporary load.
**Q: Should I treat 503 as success-eventually or as failure?**
Treat it as likely to succeed for the first few retries, then as a failure. Most production scrapers retry 3–5 times with backoff, then mark the URL as failed and move on. Failing fast keeps the queue moving.
**Q: Does a 503 count against my scraping API quota?**
It depends on the provider, and policies differ on which 503s are billable: a genuine upstream outage, a transient failure, and a bot-detection block may each be treated differently. Read the provider's billing docs, because this is one of the bigger sources of surprise charges.
---
## What Is a 200 Status Code?
URL: https://scrappey.com/qa/http-errors/what-is-a-200-status-code
**HTTP 200 OK is the standard "success" status code: the server got your request, handled it, and sent back the response you expected.** For a GET request (asking for a page), 200 means the page content is in the reply. For a POST request (sending data), it means the action completed. 200 is the default "everything worked" signal. But for web scrapers there's a catch: a 200 does not always mean the page actually contains the data you wanted.
### Quick facts
- **Status code:** 200
- **Category:** 2xx Success
- **Default success response:** Body contains the requested resource
- **Common gotcha (scraping):** 200 + bot-detection HTML body ("soft block")
### What a 200 OK means
The HTTP spec (the rulebook for how browsers and servers talk) says a 200 means the request was understood, accepted, and the response body holds the result. For a GET, that body is the content at the URL you asked for. For a POST, it's usually a summary of what the action did. In theory, servers should never send 200 when something went wrong — that's the job of the 4xx codes (your mistake) and 5xx codes (the server's mistake). In practice, many servers return 200 anyway and just put an error message inside the body, because that's easier than setting the correct code. Because servers follow the rules loosely, a scraper can't trust a 200 as proof of success without also checking what's actually in the body.
### Why 200 isn't always success for scrapers
Bot-detection systems often answer with a 200 even when they're blocking you — serving a challenge page, a "please enable JavaScript" notice, or an empty layout instead of the real content. Your HTTP client sees status 200 and calls it a win. Your parser then runs over the wrong HTML and either crashes or quietly pulls out nothing. This is called a soft block, and it's the sneakiest failure in scraping: if you only watch status codes, you never notice it happened. Solid production scrapers check two things after every fetch: the status code AND a structural signal that the expected content is really there (a known CSS selector, a specific JSON field, or a minimum response size).
### How to validate a 200 response correctly
Use three layered checks. First, confirm the status code is 200; if it's anything else, stop here and treat it as a failure. Second, confirm the response body is at least a reasonable size: a real product page is rarely under 5KB, so a 1KB "200 OK" is almost certainly a block page. Third, confirm at least one expected element exists. For example, `soup.select_one('.product-title')` should return an element, not None. If any of the three checks fails, treat the request as failed, queue a retry with a different proxy or fingerprint, and bump a separate "soft block" counter. That separate counter lets you tell "the site is broken" apart from "we're being detected."
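The three checks can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: `validate_fetch`, the 5,000-byte floor, and the `product-title` class in the usage note are assumptions, and the small `HTMLParser` subclass stands in for a fuller CSS-selector lookup such as BeautifulSoup's `select_one`.

```python
from html.parser import HTMLParser


class _ClassFinder(HTMLParser):
    """Records whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.found = True


def validate_fetch(status, body, required_class, min_bytes=5000):
    """Return 'ok', 'http_error', or 'soft_block' for one fetched page."""
    if status != 200:
        return "http_error"  # check 1: status code
    if len(body.encode("utf-8")) < min_bytes:
        return "soft_block"  # check 2: suspiciously small body
    finder = _ClassFinder(required_class)
    finder.feed(body)
    return "ok" if finder.found else "soft_block"  # check 3: expected element
```

A caller would run `validate_fetch(resp.status_code, resp.text, "product-title")` after every fetch and increment its soft-block counter whenever the result is `"soft_block"`.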
### FAQ
**Q: Does 200 mean my scraper worked?**
Only if the body also contains what you asked for. A 200 that returns a "please verify you're human" page in the body is a failed scrape — it just looks successful at the HTTP layer.
**Q: What's the difference between 200 and 204?**
200 OK means success with content in the body. 204 No Content means success with an empty body — common for a DELETE, or for a PUT request that has nothing to send back.
**Q: Can I get a 200 from a CAPTCHA page?**
Yes — most CAPTCHA challenge pages return 200, with the challenge HTML sitting in the body. To catch this you have to check the content itself, not just the status code.
**Q: Should my scraper retry on 200?**
Only if your after-fetch validation fails. If the status is 200 and the body looks correct, you're done. If the status is 200 but the body looks like a block page, retry with a different identity (a new proxy or fingerprint).
---
## What Is Cloudflare Error 1015?
URL: https://scrappey.com/qa/http-errors/what-is-cloudflare-error-1015
**Cloudflare error 1015 "You are being rate limited" means a website is blocking you because you sent too many requests too quickly.** The site owner set up a rate-limiting rule inside Cloudflare (the service that sits in front of many websites), and your traffic tripped it. This is different from a normal HTTP 429 from the origin — the block happens at Cloudflare's edge (its global network of servers, sitting between you and the real site) inside the WAF (Web Application Firewall, the layer that filters traffic), so your request never reaches the actual server. The page shows a Cloudflare-branded error with a ray ID (a unique reference code for that one request) and a note to wait and try again.
### Quick facts
- **Error code:** 1015
- **Layer:** Cloudflare edge (WAF / Rate Limiting Rules)
- **Underlying HTTP status:** 429 (sometimes 403)
- **Configured by:** Site owner, not Cloudflare
- **Typical trigger:** Per-IP request rate exceeded a threshold
### What triggers Cloudflare 1015
Site owners write rate-limiting rules in their Cloudflare dashboard — essentially "if one IP makes more than N requests in M seconds against path X, block it for T minutes." When your scraper crosses that line, Cloudflare's edge serves the 1015 page instead of passing the request along. Typical rules look like 10 requests per 10 seconds for a login page, 100 per minute for catalog pages, or tighter limits on search and pricing endpoints. The ray ID on the error page identifies your specific blocked request — you could hand it to the site owner if you have a legitimate reason, but for unaffiliated scraping that isn't an option.
### How to recover from a 1015
The block is tied to your IP and lasts a set amount of time. Switching to a different IP through proxy rotation (cycling requests across many addresses) clears it right away. The clean approach is to watch the response body for the 1015 marker, flag the current IP as "cooling off" for the block duration (the error message often tells you how long), and send the next requests through a different IP. If you don't rotate proxies, your only choice is to wait — the block usually clears in 5–60 minutes depending on the rule. Note that hammering the same IP during the cooldown extends the block on many setups.
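The "cooling off" bookkeeping described above can be sketched as a small pool that remembers when each proxy may be used again. Everything here is an assumption made for illustration: the `ProxyPool` and `looks_like_1015` names, the body markers used for detection, and the idea that proxies are identified by plain strings.

```python
import time


def looks_like_1015(status, body):
    """Heuristic: rate-limit status plus the 1015 marker text in the body."""
    text = body.lower()
    return status in (429, 403) and "1015" in text and "rate limited" in text


class ProxyPool:
    """Hands out proxies that are not currently cooling off."""

    def __init__(self, proxies, clock=time.monotonic):
        self._clock = clock
        self._ready_at = {proxy: 0.0 for proxy in proxies}

    def cool_off(self, proxy, seconds):
        """Bench a proxy for `seconds`, e.g. the block duration from the error page."""
        self._ready_at[proxy] = self._clock() + seconds

    def pick(self):
        """Return the first proxy whose cooldown has expired, or None if all are benched."""
        now = self._clock()
        for proxy, ready in self._ready_at.items():
            if ready <= now:
                return proxy
        return None
```

When `pick()` returns None, every IP is cooling off and the only option is to wait. The sketch always prefers the first available proxy; a production pool would also spread load across the ones that are ready.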
### How to avoid 1015 in the first place
Slow down how often each IP hits the site. Thresholds vary by site because the owner chooses them, but as a rough guide many rules trip at around 100 requests per minute per IP, so staying well under that on every IP in your pool keeps you safer. Spread the work across a larger proxy pool so each individual IP looks like ordinary traffic. Add jitter (small random delays) between requests, since perfectly evenly spaced requests are an obvious machine pattern that bot-detection rules pick up on. And cache: if you grabbed a page recently, don't fetch it again. Most 1015 problems come from scrapers that pile too many simultaneous requests onto a single proxy without measuring the rate.
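Per-IP pacing with jitter can be sketched in a few lines. The function name and the 50% jitter default are assumptions for illustration; the jitter is only ever added, never subtracted, so the average rate stays below the budget you pass in.

```python
import random


def paced_delay(max_per_minute, jitter=0.5):
    """Seconds to wait before the next request on one IP.

    The base interval spends the per-minute budget evenly; the random extra
    (up to `jitter` times the base) breaks up the regular rhythm.
    """
    base = 60.0 / max_per_minute
    return base + random.uniform(0, base * jitter)
```

For a budget of 30 requests per minute on one proxy, each call returns a wait between 2 and 3 seconds, which you would pass to `time.sleep` (or `asyncio.sleep`) before sending the next request through that proxy.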
### FAQ
**Q: Is Cloudflare 1015 the same as HTTP 429?**
The underlying status code is usually 429 (the standard "too many requests" response), but 1015 is Cloudflare's own branded error page. The cause is the same — too many requests — yet knowing it's a Cloudflare rule tells you the block is at the edge and tied to your IP, which shapes how you recover.
**Q: How long does a 1015 block last?**
It depends on the site's rule. Common durations are 1 minute, 10 minutes, and 1 hour. The error page sometimes states the duration; if not, you learn it by testing.
**Q: Will rotating proxies fix 1015?**
Yes — the block is per-IP, so a fresh IP starts with a clean request counter. The catch is that rotating too aggressively can trip other Cloudflare rules (like a managed challenge or bot fight mode), so rotation on its own isn't a full strategy.
**Q: Can a VPN solve Cloudflare 1015?**
Briefly — a VPN gives you a new IP and the counter resets. But VPN IP ranges are well-known and often face stricter rate limits than residential IPs (real home internet addresses). For sustained scraping, residential or mobile proxies are more reliable.
---
## What Is a 402 Error?
URL: https://scrappey.com/qa/http-errors/what-is-a-402-error
**HTTP 402 Payment Required is the status code a server sends to say: "I won't do this until a payment, billing, or quota problem is fixed."** It was set aside in the original HTTP spec for future digital-payment ideas that never fully arrived. In modern web scraping you mostly meet it from paid APIs (Stripe, OpenAI, scraping APIs themselves) when an account has run out of credits, a free quota is used up, or a paid plan has lapsed, and occasionally from content sites enforcing paywalls.
### Quick facts
- **Status family:** 4xx — client error
- **Most common cause:** API key out of credits, plan expired, or paywall
- **Retry safe?:** No — server explicitly refused; retrying without fixing billing wastes requests
- **Where you see it:** API responses, paywalled article endpoints, metered crawl services
- **Fix:** Top up credits, rotate to a working key, or upgrade plan
### What 402 actually means in scraping
The 402 code was originally reserved for "digital cash" payments that never caught on. Today it is a general billing signal — usually meaning "you ran out of credits." If your scraping pipeline calls a paid third-party API (Bright Data, Scrappey, an LLM, a CAPTCHA solver) and gets a 402, that provider has decided your account can no longer pay for the work. The response body usually includes a plain-English message naming the quota or plan that ran out.
### How to detect and recover
Treat 402 as a permanent failure, not a temporary glitch. Automatic backoff and retry (waiting, then trying again) will not help: the server isn't overloaded; it is refusing on purpose because of an account policy. In production: alert immediately, pause the job, and send the remaining work to a backup provider if you have one. Log the response body so the on-call engineer can see exactly which quota was hit. Some providers offer softer warnings as you near the limit, such as usage alerts or throttling instead of an immediate 402; switch those on rather than waiting to hit the wall.
### When the target site returns 402
A small number of paywalled publishers return 402 instead of 401 or 403 to signal a metered article (one you only get a limited number of for free). Scrapers should respect this — getting around a metered paywall to grab paid content is a clear terms-of-service breach. If you have legitimate access (an institutional licence or paid subscription), log in through the proper flow and reuse the session cookie — the token the site gives you after login to prove you're signed in — rather than trying to bypass the 402 directly.
### Example
```python
import sys

import requests

resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={'cmd': 'request.get', 'url': 'https://example.com'},
    timeout=120,
)
if resp.status_code == 402:
    # Billing problem: retrying will not help, so stop and alert on-call.
    print('402 Payment Required:', resp.text)
    sys.exit(2)
```
### FAQ
**Q: Is 402 the same as 401?**
No. 401 Unauthorized means your credentials are missing or invalid — the server doesn't know who you are. 402 means your credentials are fine, but the account can't pay for the work: quota exhausted, plan expired, or card declined.
**Q: Should I retry a 402?**
Not automatically. The server is refusing on policy, so retrying changes nothing until you fix the billing issue. Worse, each retry still costs you: it spends time and request budget on calls that cannot succeed.
**Q: Why is 402 so rare compared to 401 and 403?**
It was reserved in the HTTP spec without a concrete implementation, so many services use 403 for billing failures instead. It is now mostly used by modern APIs and some paywall systems.
---
## What Is a 404 Error?
URL: https://scrappey.com/qa/http-errors/what-is-a-404-error
**HTTP 404 Not Found is the server's way of saying "I understood your request, but there is nothing at this address."** The server is working fine; it just has no page, file, or data at the URL you asked for. On the normal web a 404 is straightforward: the page is gone or never existed. In scraping it is trickier: some anti-bot systems (tools that detect and block automated traffic) send a fake 404 to hide the fact they are blocking you, and JavaScript-heavy sites can show a 404-looking page that is actually fine once the browser runs its scripts.
### Quick facts
- **Status family:** 4xx — client error
- **Honest meaning:** URL does not exist on this server
- **Suspicious meaning:** Anti-bot system returning 404 instead of 403 to obscure the block
- **Retry safe?:** Usually no — but worth trying with a different IP or fingerprint if you suspect cloaking
- **Detection trick:** Compare response from a browser vs your scraper; if browser works, it is a block
### When 404 is honest
Most 404s are real: a typo in the URL, a product that has been delisted, an old article taken down, or a path that never existed. When this happens, record the 404, mark the URL as dead in your work queue, and move on. Repeatedly hitting a dead URL just wastes requests and pushes the target site's rate limiter (the system that throttles clients sending too many requests) to flag your IP.
### When 404 is a block
Some anti-bot stacks deliberately return 404 to scrapers instead of 403, on the theory that "page not found" is less useful to you than "you are blocked": it gives you less to react to. Sites behind Cloudflare or DataDome, and a handful of in-house systems, can be configured to do this. The giveaway: the page loads fine in a real browser on your machine but consistently 404s from your scraper. The fix is the same as for any block: a cleaner IP reputation, a more realistic browser fingerprint (the set of signals that make your traffic look like a normal browser), and a slower request rate.
### When 404 is a rendering problem
Single-page apps (sites that load one HTML page and then build every view with JavaScript) often serve the same 404-shaped HTML shell for every URL, with the real content filled in by the browser after a follow-up fetch. If you scrape the raw HTML you see "404" or an empty body; if you actually run the JavaScript, the page loads normally. The clue is a mismatched content-type or a near-empty response body. In that case, switch to a JS-rendering API (one that runs the page's scripts for you) or request the underlying XHR endpoint (the background data request the page makes) directly.
### Example
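```python
import requests

def diagnose_404(url):
    """Compare a browser-like request with a bare one to spot a cloaked block."""
    # Placeholder UA: paste a full User-Agent string from a real browser.
    headers = {'User-Agent': 'Mozilla/5.0 (real browser UA)'}
    r1 = requests.get(url, headers=headers, timeout=30)  # browser-like request
    r2 = requests.get(url, timeout=30)                   # bare client request
    # Browser-like UA succeeds where the bare client 404s: likely a cloaked block.
    if r1.status_code == 200 and r2.status_code == 404:
        return 'cloaked_block'
    # Both requests 404: the URL is most likely genuinely missing.
    if r1.status_code == 404 and r2.status_code == 404:
        return 'real_404'
    return 'inconclusive'
```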
### FAQ
**Q: Should I retry 404s in a crawl?**
Usually no: mark the URL dead and move on. It is worth one retry through a different IP or with a real browser fingerprint if you suspect the site is disguising blocks as 404s.
**Q: Why would a site return 404 instead of 403?**
To hide that they are blocking you. A 403 tells the scraper "you are detected, try harder." A 404 tells it "nothing here, give up." It is a deliberate tactic, not a bug.
**Q: How do I crawl an SPA that returns 404 for the raw HTML?**
Either render the JavaScript (with Playwright or a JS-rendering scraping API) or figure out the XHR endpoint the SPA calls to load its data and request that directly, which is usually cheaper and faster than full rendering.
---
## What Is a 520 Error?
URL: https://scrappey.com/qa/http-errors/what-is-a-520-error
**HTTP 520 is a non-standard Cloudflare status code meaning the origin server returned a response Cloudflare cannot interpret.** Cloudflare is a service that sits in front of many websites and forwards traffic to the site's real server (the "origin"). A 520 means the origin answered, but the answer was broken: it may have closed the connection mid-response, returned malformed headers (the metadata at the top of an HTTP reply), sent an empty body where Cloudflare expected content, or crashed entirely. From a scraper's point of view it is a 5xx — a server-side problem, not a block — but the cause varies enough that diagnosing it takes care.
### Quick facts
- **Status family:** 5xx — server error (Cloudflare-specific)
- **Meaning:** Origin returned empty, unknown, or invalid response to Cloudflare
- **Common causes:** Origin crash, malformed headers, premature close, origin firewall blocking Cloudflare
- **Retry safe?:** Yes — with exponential backoff; transient in most cases
- **Distinguishes from:** 521 (origin down), 522 (origin timeout), 523 (origin unreachable)
### What 520 actually indicates
520 is Cloudflare's catch-all: it uses this code when an origin failure does not match any of its more specific 5xx codes. The origin replied, but the reply violated HTTP in some way — oversized headers, a connection reset before the response body finished, a malformed status line, or an empty body where a length was promised. It is not a block aimed at your scraper. Cloudflare returns 520 to all clients when the origin misbehaves.
### How to handle 520 in a scraper
Retry with exponential backoff — wait a little longer before each attempt (1s, 2s, 4s, 8s, max 5 attempts). Most 520s clear within a minute as the origin recovers. If a specific URL returns 520 consistently for an hour, the origin is likely broken — log it, alert, and move on rather than wasting crawl budget. Do not rotate proxies on a 520; the issue is server-side, so a new IP changes nothing.
### When 520 hides a block
Rare but real: a few sites configure their origin firewall to drop scraper traffic at the TCP level (the connection layer, below HTTP), which reaches Cloudflare as a 520 instead of a proper 403. The tell is that the 520s correlate with your traffic specifically — a real browser request from the same IP succeeds. In that case treat it as a soft block: improve your TLS fingerprint (the signature of your client's encrypted-connection setup) and headers, then retry.
### Example
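```python
import time

import requests


def fetch_with_520_retry(url, max_attempts=5, timeout=30):
    """GET url, retrying with exponential backoff while Cloudflare returns 520."""
    for attempt in range(max_attempts):
        r = requests.get(url, timeout=timeout)
        if r.status_code != 520:
            return r
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # waits of 1s, 2s, 4s, 8s between attempts
    raise RuntimeError(f'520 persisted after {max_attempts} attempts')
```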
### FAQ
**Q: Is 520 the same as 502 Bad Gateway?**
Conceptually similar, but 520 is Cloudflare-specific. 502 is the standard code meaning "the upstream server returned an invalid response." 520 is what Cloudflare uses when the failure does not fit any of its more specific 5xx codes (521, 522, 523, 524, 525, 526).
**Q: How long should I retry a 520?**
Five attempts with exponential backoff (each wait double the last) keeps the total wait well under a minute. Beyond that, the origin is persistently broken, and continuing to retry burns crawl budget without helping.
**Q: Does 520 mean my scraper triggered something?**
Almost never. A 520 is origin-side and affects all clients equally. If only your scraper sees 520s while normal browser traffic succeeds, you are likely looking at a TCP-level block dressed up as a 520 — fix your fingerprint, not your retry logic.
---
## What Is the 401 Status Code (401 Unauthorized)?
URL: https://scrappey.com/qa/http-errors/what-is-a-401-error
**HTTP 401 Unauthorized means the server doesn't know who you are because your request didn't include valid login credentials.** Think of it as a doorman asking "who are you?" — you haven't shown ID yet. This is different from a 403, where the doorman knows you and still won't let you in. A 401 response usually includes a WWW-Authenticate header that names the login method the server expects (Basic, Bearer, etc.). You'll also see it written as "HTTP 401" or just "401 error" / "401 status code."
### Quick facts
- **Status code:** 401
- **Meaning:** Unauthorized
- **Category:** 4xx Client Error
- **Common causes (scraping):** Missing/expired token, wrong API key, no session cookie
- **Right response:** Supply valid credentials (token, cookie, or API key) and retry; for disguised blocks use a real-browser unblock
### What a 401 Unauthorized means
A 401 means the server can't identify you — you haven't proven who you are. That's the key difference from a 403, where the server already knows you but still refuses. The response usually carries a WWW-Authenticate header naming the login scheme the server wants: Basic (username/password), Bearer (a token), and so on. In short: a 401 status code means the request is missing valid authentication credentials.
### Why scrapers see 401
Scrapers hit a 401 when the page or API requires you to be logged in. Common causes: a missing or expired session cookie (the small token that proves you already logged in), an absent or wrong API key / Bearer token, or a login step the scraper never completed. Some sites also return a 401 instead of a 403 when an anti-bot layer decides the caller isn't a logged-in human.
### How to fix a 401 error
Give the server the credentials it asks for: refresh an expired token, reuse the cookies from a session that's already logged in, or add the API key header. For pages behind a login, log in once and reuse that session rather than authenticating on every request. If the 401 is really a bot block in disguise on a public page, handle it like a 403 — send realistic headers, use a clean residential IP, and present a real-browser fingerprint via Web Access API.
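As a rough sketch of "give the server the credentials it asks for", the helpers below read the scheme names out of a `WWW-Authenticate` value and build a matching `Authorization` header. The function names and the simplified parsing (it only extracts scheme names and ignores parameters such as `realm`) are assumptions for illustration, not part of any particular library.

```python
import base64
import re


def expected_schemes(www_authenticate):
    """Return the auth scheme names listed in a WWW-Authenticate value, lowercased."""
    # Blank out quoted strings so commas inside them do not confuse the split.
    cleaned = re.sub(r'"[^"]*"', '""', www_authenticate)
    schemes = []
    for part in cleaned.split(","):
        tokens = part.strip().split(None, 1)
        # A challenge starts with a bare scheme name; parameters look like key=value.
        if tokens and "=" not in tokens[0]:
            schemes.append(tokens[0].lower())
    return schemes


def authorization_header(schemes, token=None, username=None, password=None):
    """Build an Authorization header for the first scheme we hold credentials for."""
    for scheme in schemes:
        if scheme == "bearer" and token:
            return {"Authorization": f"Bearer {token}"}
        if scheme == "basic" and username is not None and password is not None:
            raw = f"{username}:{password}".encode("utf-8")
            return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    return {}  # no usable credentials for any scheme the server offered
```

Merge the returned dictionary into the headers of the retried request. An empty result means the scraper holds nothing the server will accept, so retrying without new credentials is pointless.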
### FAQ
**Q: Is 401 a client or server error?**
It's a client-side (4xx) error, meaning the problem is in your request — the server is pointing at something you sent (or failed to send).
**Q: Does a 401 mean I'm blocked when scraping?**
Not necessarily. A 401 points at your request, not a ban. But anti-bot layers sometimes return it instead of a 403, so check the response body and headers to see which case you're in.
**Q: How do I fix a 401 error?**
The cause is always the same: your request lacks valid authentication credentials. So the fix is to supply them — correct the missing or wrong part of the request (token, cookie, or API key), then retry.
**Q: What's the difference between 401 and 403?**
A 401 Unauthorized means you haven't proven who you are — provide credentials and retry. A 403 Forbidden means the server knows you (or doesn't need to) but still won't allow it. For scraping that doesn't involve logging in, 403 is usually the one you'll run into.
---
## What Is the 405 Status Code (405 Method Not Allowed)?
URL: https://scrappey.com/qa/http-errors/what-is-a-405-error
**HTTP 405 Method Not Allowed means the page exists, but it won't accept the HTTP method (the verb, like GET or POST) you used to ask for it.** A common example: you send a GET request to a URL that only handles POST. The server replies with 405 and lists the methods it does accept in a response header called Allow. You may also see this written as “HTTP 405” or just “405 error” / “405 status code.”
### Quick facts
- **Status code:** 405
- **Meaning:** Method Not Allowed
- **Category:** 4xx Client Error
- **Common causes (scraping):** Wrong verb (GET vs POST), unsupported HEAD, WAF blocking the method
- **Right response:** Switch to a method listed in the Allow header and match the expected body; for disguised blocks use a real-browser unblock
### What a 405 Method Not Allowed means
Every HTTP request uses a method (also called a verb) that says what you want to do: GET reads a page, POST submits data, and so on. A 405 means the URL is real, but it doesn't accept the verb you sent — for example, sending GET to a URL that only allows POST. The server tells you which verbs *are* allowed in the Allow header. In short: the endpoint doesn't accept the HTTP method you used.
### Why scrapers see 405
Scrapers usually trigger a 405 by using the wrong verb: doing a GET on a form action that requires POST, POSTing to a resource that is read-only, or using HEAD (a GET that returns headers but no body) where it isn't supported. It can also appear when a WAF — a web application firewall that filters incoming traffic — rewrites or rejects methods it considers unusual.
### How to fix a 405 error
Start with the Allow header in the 405 response — it spells out exactly which methods are permitted. Switch to a supported verb, and make sure your content type and request body match what the endpoint expects. To get this exactly right, open the real site in your browser's DevTools and copy the method, headers, and payload the browser actually sends. If a WAF is the thing rejecting your method, routing the request through a real-browser flow with Web Access API avoids the mismatch.
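The first step above, reading the Allow header, can be sketched roughly as follows. The helper names and the short list of candidate verbs are assumptions for illustration. The result is only a hint for what to try next, because the request body and content type still have to match what the endpoint expects.

```python
def allowed_methods(allow_header):
    """Parse an Allow header value such as 'GET, HEAD, POST' into a list of verbs."""
    return [m.strip().upper() for m in allow_header.split(",") if m.strip()]


def pick_method(attempted, allow_header, preferred=("GET", "POST", "HEAD")):
    """Return a supported verb to try next, or None if none of the preferred ones is listed."""
    allowed = allowed_methods(allow_header)
    for method in preferred:
        # Skip the verb that already produced the 405.
        if method in allowed and method != attempted.upper():
            return method
    return None
```

The default `preferred` tuple deliberately leaves out verbs such as DELETE and PUT, so an automated retry never switches to a destructive method just because the server lists it.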
### FAQ
**Q: Is 405 a client or server error?**
It's a client-side error (the 4xx family), which means the server is flagging something about your request rather than a fault on its own end.
**Q: Does a 405 mean I'm blocked when scraping?**
Not necessarily. A 405 points at your request, not at a ban. That said, some anti-bot layers return a 405 instead of a 403, so check the response body and headers to be sure.
**Q: How do I fix a 405 error?**
The cause is that the endpoint doesn't accept the HTTP method you used, so the fix targets exactly that: correct the offending part of the request (usually the verb), then retry.
**Q: What's the difference between 405 and 404?**
A 404 means the URL doesn't exist at all. A 405 means the URL does exist but doesn't support the method you used — the resource is there, your verb is wrong.
---
## What Is the 406 Status Code (406 Not Acceptable)?
URL: https://scrappey.com/qa/http-errors/what-is-a-406-error
**HTTP 406 Not Acceptable means the server can't return content matching your Accept headers.** When you make a request, your client sends "Accept" headers that say what formats and languages it can handle (via Accept, Accept-Language, or Accept-Encoding). A 406 means the server has nothing that fits what you asked for, so this back-and-forth — called content negotiation — failed. It's also written "HTTP 406" or just "406 error" / "406 status code."
### Quick facts
- **Status code:** 406
- **Meaning:** Not Acceptable
- **Category:** 4xx Client Error
- **Common causes (scraping):** Missing/empty Accept headers, content-negotiation mismatch, anti-bot header check
- **Right response:** Send realistic, consistent Accept-* headers; for disguised blocks use a real-browser unblock
### What a 406 Not Acceptable means
Your request's Accept headers (Accept, Accept-Language, or Accept-Encoding) tell the server what you can handle — for example, JSON instead of HTML, or English instead of French. A 406 means the server has no version that matches those demands, so content negotiation failed. In short, a 406 status code is the server saying it can't return content matching your Accept headers.
### Why scrapers see 406
This is common with bare HTTP clients — simple request libraries that, unlike a browser, send no Accept header, an empty one, or a combination no real browser would. Some anti-bot setups deliberately return 406 to requests whose Accept-* headers look automated, because real browsers send a very specific, consistent set.
### How to fix a 406 error
Send realistic, browser-like Accept, Accept-Language, and Accept-Encoding headers: copy them verbatim from a real browser and keep them consistent with your User-Agent (the string that identifies your client). Even dropping Accept-Encoding so that responses arrive uncompressed and are easier to parse can trigger a 406. When 406 is an anti-bot signal, a full real-browser request via Web Access API resolves it.
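A minimal sketch of filling in missing or empty Accept-* headers before a request goes out. The header values are illustrative assumptions, not a guaranteed match for any specific browser version, so replace them with the ones your own browser shows in DevTools. The helper name is hypothetical.

```python
# Illustrative values only: copy the real ones from your own browser's DevTools.
BROWSER_ACCEPT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
ACCEPT_NAMES = {name.lower() for name in BROWSER_ACCEPT_HEADERS}


def with_accept_headers(headers):
    """Return a copy of headers with any missing or empty Accept-* header filled in."""
    merged = {}
    seen = set()
    for name, value in headers.items():
        if name.lower() in ACCEPT_NAMES and not str(value).strip():
            continue  # drop empty Accept-* values so the default replaces them
        merged[name] = value
        seen.add(name.lower())
    for name, value in BROWSER_ACCEPT_HEADERS.items():
        if name.lower() not in seen:
            merged[name] = value
    return merged
```

Only advertise encodings your client can actually decode. For example, `br` in Accept-Encoding is only safe if Brotli support is installed for your HTTP library.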
### FAQ
**Q: Is 406 a client or server error?**
It's a client-side (4xx) error, meaning the problem is on your end — the server is pointing at something in your request.
**Q: Does a 406 mean I'm blocked when scraping?**
Not necessarily. A 406 points at your request, not a ban — but anti-bot layers sometimes return it instead of a 403, so check the response body and headers to be sure.
**Q: How do I fix a 406 error?**
The cause is that the server can't return content matching your Accept headers, so the fix targets that. Correct the offending part of the request, then retry.
**Q: What's the difference between 406 and 415?**
406 is about the response: the server can't produce what your Accept headers asked for. 415 (Unsupported Media Type) is about the request: the server won't accept the Content-Type — the format you said your request body is in — that you sent.
---
## What Is the 409 Status Code (409 Conflict)?
URL: https://scrappey.com/qa/http-errors/what-is-a-409-error
**HTTP 409 Conflict means your request clashes with the resource's current state, so the server refuses it.** The server understood what you asked but can't do it because it would conflict with how the data stands right now — common cases are trying to create something that already exists, editing a record that someone else has changed since you last read it, or two writes hitting the same resource at the same moment. You'll also see it written as "HTTP 409" or just "409 error" / "409 status code."
### Quick facts
- **Status code:** 409
- **Meaning:** Conflict
- **Category:** 4xx Client Error
- **Common causes (scraping):** Duplicate create, stale-version edit, racing concurrent writes
- **Right response:** Re-fetch the current state, make the write idempotent, then retry; blind retries repeat the conflict
### What a 409 Conflict means
A 409 means the server won't complete your request because it would conflict with the resource's current state. Typical triggers: creating a record that already exists (a duplicate create), editing an item against a stale version (you're working from an old copy someone has since updated), or two writes racing each other to change the same thing. In short, a 409 status code says your request conflicts with the current state of the resource.
### Why scrapers see 409
Plain read-only scraping rarely triggers a 409. It almost always appears when you're writing or submitting data: creating something that already exists, posting a form twice, or calling an API that uses optimistic-concurrency checks. (Optimistic concurrency means the server tags each version of a resource with an ETag, a fingerprint of that version, and your request sends it back via If-Match. If the version has moved on, the write is rejected: the HTTP standard's code for a failed If-Match is 412 Precondition Failed, but many APIs answer with a 409 instead.) Highly parallel jobs that fire the same action at once can also race into 409s.
### How to fix a 409 error
Make the operation idempotent (safe to repeat without side effects) or check the resource's state before you write. For versioned resources, honor ETag/If-Match and re-fetch the latest version before retrying. De-duplicate submissions and serialize conflicting writes so two workers never act on the same resource at the same time. A 409 is a state/logic problem, not a block — blindly retrying just repeats the same conflict.
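The "re-fetch the latest version before retrying" advice can be sketched with an in-memory stand-in for a versioned resource. `FakeResource` and `update_with_refetch` are hypothetical names. A real API would carry the ETag and If-Match values in HTTP headers, and might answer a stale write with a different status code.

```python
import hashlib


class FakeResource:
    """In-memory stand-in for a versioned API resource (hypothetical, for illustration)."""

    def __init__(self, value):
        self.value = value

    def etag(self):
        # Fingerprint of the current version.
        return hashlib.sha256(repr(self.value).encode("utf-8")).hexdigest()[:12]

    def get(self):
        return self.value, self.etag()

    def put(self, new_value, if_match):
        # Reject the write when the caller's copy is stale.
        if if_match != self.etag():
            return 409
        self.value = new_value
        return 200


def update_with_refetch(resource, change, max_attempts=3):
    """Apply change(value) with If-Match semantics, re-reading on every conflict."""
    for _ in range(max_attempts):
        value, etag = resource.get()  # always start from the latest version
        status = resource.put(change(value), if_match=etag)
        if status != 409:
            return status
    return 409
```

The key design choice is that each retry recomputes the change from a fresh read, so a conflict is resolved against the newest state instead of being replayed unchanged.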
### FAQ
**Q: Is 409 a client or server error?**
It's a client-side error (the 4xx family). The server is telling you something in your request is the problem, not that the server itself failed.
**Q: Does a 409 mean I'm blocked when scraping?**
Almost never. A 409 is a state or logic conflict, not a block, and plain read-only scraping rarely triggers it. If one does show up on a simple GET, read the response body and headers to confirm what you're actually dealing with.
**Q: How do I fix a 409 error?**
The cause is that your request conflicts with the resource's current state, so the fix targets exactly that: re-fetch the latest version, adjust or de-duplicate the write, then retry. Retrying the unchanged request just repeats the conflict.
**Q: What's the difference between 409 and 422?**
A 409 Conflict is about the resource's state — for example a version clash or a uniqueness clash. A 422 Unprocessable Entity is about your payload failing validation. Put simply: 409 = state problem, 422 = data problem.
---
## What Is the 422 Status Code (422 Unprocessable Entity)?
URL: https://scrappey.com/qa/http-errors/what-is-a-422-error
**HTTP 422 Unprocessable Entity means the server understood your request perfectly but refused to act on it because the data inside failed a validation check.** The format is correct — valid JSON, all the right pieces — but the actual values don't pass the server's rules: a required field is missing, a value is the wrong type, or it breaks a business rule (like a date in the past). Think of it as a form that's filled in neatly but with an answer the system won't accept. It's also written "HTTP 422" or just "422 error" / "422 status code."
### Quick facts
- **Status code:** 422
- **Meaning:** Unprocessable Entity
- **Category:** 4xx Client Error
- **Common causes (scraping):** Missing required field, wrong type, value out of allowed range
- **Right response:** Read the validation errors in the response body and fix the payload; changing IPs or headers won't help
### What a 422 Unprocessable Entity means
The request's format is fine, but the server can't process what's inside it. This is almost always a validation failure on the payload (the data you sent in the request body): a field is missing, a value is the wrong type, or a value breaks a business rule. In short, a 422 status code means the request was well-formed but failed validation.
### Why scrapers see 422
You'll hit a 422 when you send data to an API or submit a form and one of the fields is invalid: a value of the wrong type, a bad date, or a required parameter you left out. It's about meaning, not formatting, so fixing it like a 400 'bad request' won't help: the body parsed fine, the values just didn't pass validation.
### How to fix a 422 error
Start by reading the response body — most APIs return JSON that lists exactly which fields failed and why. Then fix your payload to match what the server expects: correct the data types, include every required field, and stay within allowed ranges. A reliable trick is to copy a real, working submission captured in your browser's DevTools (the built-in developer tools, opened with F12), including any hidden form fields and tokens. Because a 422 is a problem with the data in your request, changing IPs or headers won't fix it.
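Reading the response body can be sketched as below. Error formats differ between APIs, so the two JSON shapes handled here (a `detail` list with `loc`/`msg` entries, and an `errors` object keyed by field name) are assumptions. Check what the target API actually returns and adapt the parsing. The function name is hypothetical.

```python
import json


def validation_errors(body_text):
    """Return (field, message) pairs from a 422 body, or [] if the shape is unknown."""
    try:
        body = json.loads(body_text)
    except ValueError:
        return []
    if not isinstance(body, dict):
        return []
    pairs = []
    # Shape 1 (assumed): {"detail": [{"loc": ["body", "price"], "msg": "..."}]}
    detail = body.get("detail")
    if isinstance(detail, list):
        for item in detail:
            if isinstance(item, dict):
                field = ".".join(str(p) for p in item.get("loc", []))
                pairs.append((field, str(item.get("msg", ""))))
    # Shape 2 (assumed): {"errors": {"price": ["must be positive"]}}
    errors = body.get("errors")
    if isinstance(errors, dict):
        for field, messages in errors.items():
            if not isinstance(messages, list):
                messages = [messages]
            for message in messages:
                pairs.append((field, str(message)))
    return pairs
```

Logging these pairs next to the payload that produced them usually makes it obvious which field to correct before the next attempt.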
### FAQ
**Q: Is 422 a client or server error?**
It's a client-side error (the 4xx family) — the server is telling you the problem is in your request, not on its end.
**Q: Does a 422 mean I'm blocked when scraping?**
Not necessarily. A 422 points at your request, not a ban. That said, some anti-bot systems return it instead of a 403, so read the response body and headers to be sure.
**Q: How do I fix a 422 error?**
The cause is that the request was well-formed but failed validation, so the fix targets exactly that: correct the part of the request that's invalid, then retry.
**Q: What's the difference between 422 and 400?**
A 400 Bad Request usually means the server couldn't even parse the request — the syntax is broken. A 422 means it parsed the request fine, but the values failed validation. Simply put: 400 = can't read it, 422 = read it, didn't like it.
---
## What Is the 451 Status Code (451 Unavailable For Legal Reasons)?
URL: https://scrappey.com/qa/http-errors/what-is-a-451-error
**HTTP 451 "Unavailable For Legal Reasons" means a server is refusing to give you a page because the law — not a technical problem — says it cannot.** Common triggers are government censorship, GDPR or other geo restrictions (rules that limit who can see content based on their country), copyright takedowns, and sanctions. The number is a nod to Ray Bradbury's novel *Fahrenheit 451*, about burning books. You'll also see it written as "HTTP 451," "451 error," or "451 status code."
### Quick facts
- **Status code:** 451
- **Meaning:** Unavailable For Legal Reasons
- **Category:** 4xx Client Error
- **Common causes (scraping):** Geo/GDPR restriction, censorship, copyright takedown, sanctions
- **Right response:** Send the request from a region where the content is allowed, usually via a geo-targeted proxy
### What a 451 Unavailable For Legal Reasons means
A 451 means the page is being withheld for a legal reason rather than a broken request or a down server. That could be government censorship, GDPR or other geo restrictions, a copyright takedown, or sanctions. The status number references Ray Bradbury's *Fahrenheit 451*. In short: the resource is blocked for legal or geo reasons.
### Why scrapers see 451
Scrapers usually hit 451 because of geo-restriction — the content simply isn't allowed in the country your IP address belongs to. News sites blocking visitors from the EU over GDPR, region-locked catalogues, and blocks on sanctioned countries all show up as 451. It's a deliberate legal or geographic gate, not a bot-detection score, so it has nothing to do with how convincing your scraper looks.
### How to fix a 451 error
Make the request from a region where the content is allowed. Route through residential proxies or a mobile proxy located where the content is legally available — these give you an IP address that looks like it's in that country. Test the same URL from several countries to map which regions are blocked. Keep the legal context in mind: a 451 is the site asserting a real legal restriction. Technically, though, the fix is almost always geo-targeted proxying, which Web Access API can handle automatically.
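Testing the same URL from several countries can be sketched like this. `fetch_status` is a hypothetical caller-supplied function that requests the URL through a proxy in the given country and returns the HTTP status code. How you select a country depends on your proxy provider, so that part is left out.

```python
def map_regions(url, countries, fetch_status):
    """Return {country: 'blocked' | 'allowed' | 'other:<code>'} for one URL."""
    results = {}
    for country in countries:
        # Caller-supplied: fetch the URL via a proxy located in that country.
        status = fetch_status(url, country)
        if status == 451:
            results[country] = "blocked"
        elif 200 <= status < 300:
            results[country] = "allowed"
        else:
            results[country] = f"other:{status}"
    return results
```

Statuses other than 451 and 2xx are kept separate on purpose: a 403 or a timeout from one country says nothing about a legal block and should be investigated on its own.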
### FAQ
**Q: Is 451 a client or server error?**
It's a client-side (4xx) error, meaning the server is pointing at something about your request — in this case, where it's coming from.
**Q: Does a 451 mean I'm blocked when scraping?**
Not in the bot-detection sense. A 451 is a legal or geographic gate tied to where your request comes from, not a judgment about whether your traffic looks automated. Check the response body for the stated reason, then retry from an allowed region.
**Q: How do I fix a 451 error?**
The cause is a legal or geo block, so the fix has to target that. Send the request from an allowed region — usually via a geo-targeted proxy — then retry.
**Q: What's the difference between 451 and 403?**
A 403 Forbidden is a general refusal, often from bot detection or missing permissions. A 451 is specifically a legal or geo block. So a 451 tells you the barrier is jurisdictional: change your region, not your fingerprint (the technical traits that make your scraper look like a bot).
---
## What Is the 502 Status Code (502 Bad Gateway)?
URL: https://scrappey.com/qa/http-errors/what-is-a-502-error
**HTTP 502 Bad Gateway means one server, acting as a middleman, got a broken reply from another server behind it.** Many websites sit behind a gateway or proxy — a front-door server that forwards your request to the real "upstream" server that does the work. A 502 means that front door reached the upstream server but got back garbage it couldn't use. Because it's a 5xx (server-side) error, the fault is on their end, not yours, and it's usually temporary. You'll also see it written "HTTP 502" or just "502 error" / "502 status code."
### Quick facts
- **Status code:** 502
- **Meaning:** Bad Gateway
- **Category:** 5xx Server Error
- **Common causes (scraping):** Flaky proxy, overloaded/down origin, CDN can't reach backend
- **Right response:** Retry with exponential backoff and jitter; if it persists, change proxy pool or lower concurrency
### What a 502 Bad Gateway means
Think of the gateway or proxy as a receptionist who passes your request to a back office. A 502 means the receptionist got the message through to the back office (the "upstream" server) but the reply came back broken or unreadable. So the failure happens between the edge (the front-facing server) and the origin (the real backend) — not in your request. Because it's a 5xx error, it's a server-side problem and usually clears up on its own.
### Why scrapers see 502
When scraping, a 502 usually points to something flaky on the path to the site: an unreliable proxy, an origin server that's overloaded, or a CDN (content delivery network — the cache layer that fronts many sites) that can't reach the backend. It can also show up briefly when an anti-bot edge mishandles or drops a request it finds suspicious. Unlike a 403 (which is a hard "access denied"), a 502 is generally worth retrying.
### How to fix a 502 error
Retry with exponential backoff and jitter — that means waiting a bit longer between each attempt and adding a small random delay so retries don't all hit at once. Most 502s clear on their own. If they persist, the gateway between you and the origin is the real problem: swap to a more reliable proxy pool, lower concurrency (fewer requests at once) so you're not overwhelming a fragile origin, and check whether your proxy provider — not the target site — is the one returning the 502. Persistent 502s that only appear on protected URLs can mean an edge is dropping you; in that case a real-browser unblock flow via Web Access API is more stable.
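Exponential backoff with jitter, as described above, might look roughly like this. The function names, the default base, cap, and jitter values, and the injected `fetch` and `sleep` callables are assumptions chosen so the sketch runs without a network.

```python
import random
import time


def backoff_delays(max_attempts=5, base=1.0, cap=30.0, jitter=0.5, rng=random.random):
    """Yield one delay per retry: exponential growth, capped, plus a small random extra."""
    for attempt in range(max_attempts - 1):  # no wait is needed after the final attempt
        yield min(cap, base * (2 ** attempt)) + rng() * jitter


def fetch_with_retry(fetch, max_attempts=5, retry_on=(502,), sleep=time.sleep):
    """Call fetch() until it returns a status outside retry_on or attempts run out."""
    status = fetch()
    for delay in backoff_delays(max_attempts):
        if status not in retry_on:
            break
        sleep(delay)
        status = fetch()
    return status
```

The random extra keeps parallel workers from retrying in lockstep against an origin that is already struggling. Passing `sleep` as a parameter makes the retry logic easy to test without real waiting.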
### FAQ
**Q: Is 502 a client or server error?**
It's a server-side error (the 5xx family). The problem is between servers — the front-door gateway and the backend it talks to — not in your request.
**Q: Does a 502 mean I'm blocked when scraping?**
Usually not. A 502 is typically a temporary, server-side hiccup, so it's worth retrying. It can occasionally line up with edge blocking, but it isn't a clear block signal the way a 403 is.
**Q: How do I fix a 502 error?**
The cause is a gateway or proxy getting a broken reply from the upstream server, so the fix targets that path. Retry with backoff (waiting longer between attempts); if the 502s persist, switch to a more reliable proxy pool and lower your concurrency.
**Q: What's the difference between 502 and 503?**
A 502 Bad Gateway means an upstream server handed the gateway a broken response. A 503 Service Unavailable means the server itself is up but can't handle the request right now — usually due to overload or maintenance. Both are 5xx (server-side) errors and are usually temporary.
---
## What Is Cloudflare Error 521?
URL: https://scrappey.com/qa/http-errors/what-is-cloudflare-error-521
**HTTP 521 Web Server Is Down is the error Cloudflare shows when it cannot reach the website's actual server.** Cloudflare sits in front of many sites as a middleman (it speeds them up and filters traffic). When you request a page, Cloudflare tries to fetch it from the origin server - the real machine that hosts the site. If that origin refuses or drops the connection, Cloudflare can't get the page and returns a 521. Cloudflare itself is working fine; the problem is behind it. You'll see Cloudflare's branded 521 page with a ray ID (a unique reference code for that request). It's also written "HTTP 521" or just "521 error" / "521 status code."
### Quick facts
- **Status code:** 521 (Cloudflare)
- **Meaning:** Web Server Is Down
- **Category:** 5xx (Cloudflare edge)
- **Common causes (scraping):** Origin down, origin blocking Cloudflare IPs, origin misconfig
- **Right response:** Retry with backoff; there is nothing in your headers or IP to change
### What a 521 Web Server Is Down means
Cloudflare tried to open a connection to the origin server - the real machine hosting the site - and failed. Cloudflare itself is up; the server behind it refused or dropped the connection. You'll see Cloudflare's branded 521 page with a ray ID (a reference code for the failed request). In short, a 521 means Cloudflare couldn't open a connection to the origin server.
### Why scrapers see 521
A 521 is an origin-side problem: the site's own server is down, is blocking the IP ranges Cloudflare connects from, or is misconfigured. For a scraper this is usually temporary and not a personal block aimed at you - though it can sometimes show up alongside aggressive edge rules (extra filtering Cloudflare applies at its network edge before traffic reaches the origin).
### How to fix a 521 error
Retry with backoff - wait a bit and try again, increasing the gap each time. A 521 usually clears once the origin recovers, and there's nothing in your headers or IP to change, because the failure happens between Cloudflare and the origin, not in your request. If you only get 521 through certain proxies, test the site directly (no proxy) to rule out a proxy problem. When you need reliable access to Cloudflare-fronted sites despite these intermittent edge errors, Web Access API with WAF handling takes care of the retries and challenges for you.
### FAQ
**Q: Is 521 a client or server error?**
It's a server-side error (the 5xx family). The problem is between servers - Cloudflare and the origin - not in your request.
**Q: Does a 521 mean I'm blocked when scraping?**
Not usually. A 521 is typically temporary and server-side, though it can occasionally coincide with edge blocking.
**Q: How do I fix a 521 error?**
The cause is that Cloudflare couldn't open a connection to the origin server, so the fix targets that: retry with backoff (waiting longer between attempts). If the 521 only appears through certain proxies, test the site directly to rule out a proxy problem.
**Q: What's the difference between Cloudflare 521 and 520?**
A 520 is a generic 'unknown error' - the origin replied, but with something Cloudflare can't make sense of. A 521 is more specific: Cloudflare couldn't connect to the origin at all because it's down or refusing connections.
---
## What Is Cloudflare Error 1020 (Access Denied)?
URL: https://scrappey.com/qa/http-errors/what-is-cloudflare-error-1020
**Cloudflare Error 1020 "Access Denied" means a Cloudflare firewall (WAF) rule on the site has blocked your request outright.** Unlike Error 1015, which is a temporary rate limit, 1020 is a deliberate *rule match*: the site owner (or a Cloudflare Managed Ruleset) decided that traffic with your characteristics shouldn't be served at all. It is Cloudflare's own error page, returned with an HTTP 403, not a status code from the origin server.
### Quick facts
- **Error:** 1020 (Cloudflare)
- **HTTP status:** 403 Forbidden
- **Meaning:** A WAF / firewall rule blocked the request
- **Common causes:** Bad IP reputation, datacenter IPs, bot-like headers/TLS, country or ASN rules
- **Not the same as:** 1015 (rate limit) — 1020 is a rule match, not "too fast"
### What triggers Cloudflare 1020
Error 1020 fires when a request matches a **firewall rule** — a WAF Custom Rule, a Managed Ruleset, or a "block" action in Bot Management. The usual triggers:
- **Poor IP reputation.** Datacenter, VPN, and recycled proxy IPs carry low trust scores; many sites block them on the first request.
- **Bot-like fingerprint.** Missing browser headers, a python-requests or curl User-Agent, or a TLS/JA3 handshake that doesn't match the claimed browser.
- **Geo / ASN rules.** The site blocks whole countries, hosting providers (AWS, GCP, OVH ranges), or specific ASNs.
- **Custom rule match.** Hitting a path, header, or query pattern the owner explicitly blocked.
Because it's a rule match, waiting does *not* clear a 1020 the way it clears a 1015. You have to stop matching the rule.
### How to fix Cloudflare 1020 when scraping
The block is on *who you look like*, so fix the signals in this order:
- **Use clean residential or mobile IPs.** Datacenter IPs are the number-one 1020 trigger. Rotating residential proxies with good reputation usually clear it.
- **Send a complete, consistent browser profile** — full headers *and* a matching TLS fingerprint. Spoofing the User-Agent alone fails because the JA3/JA4 handshake still says "script." See TLS fingerprinting.
- **Render like a real browser** when the rule checks for JavaScript execution — a full headless browser or a managed API that runs one.
**Why a single fix rarely works.** 1020 rules usually combine signals (IP *and* fingerprint *and* behavior). Fixing one leaves the others matching. A managed scraping API that aligns IP reputation, headers, and TLS together handles this more consistently than fixing one signal at a time.
### 1020 vs 1015 vs a 403
**Error 1015**
- Temporary rate limit
- "You are being rate limited"
- Clears on its own after a cooldown
- Fix: slow down, rotate IPs
**Error 1020**
- Firewall / WAF rule match
- "Access denied"
- Does **not** clear by waiting
- Fix: change IP reputation + fingerprint
**Plain 403**
- From the origin server, not Cloudflare
- Permissions, auth, or origin WAF
- No Cloudflare ray-ID branding
- Fix: headers, cookies, auth
Confirm it's really 1020 by looking for the Cloudflare-branded page and a Ray ID. If there's no Cloudflare branding, treat it as an ordinary 403 Forbidden instead.
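The check described above can be approximated in code. This is a heuristic sketch: the substrings it looks for ("cloudflare", "ray id", "1020") are assumptions about what the branded error page contains, and the page markup can change. The function name and labels are hypothetical.

```python
def classify_forbidden(status, body):
    """Label a response: 'cloudflare-1020', 'cloudflare-other', 'origin-403', or 'not-403'."""
    if status != 403:
        return "not-403"
    text = body.lower()
    # Heuristic for the Cloudflare-branded error page: branding plus a Ray ID.
    branded = "cloudflare" in text and "ray id" in text
    if branded and "1020" in text:
        return "cloudflare-1020"
    if branded:
        return "cloudflare-other"
    return "origin-403"
```

Routing on the label keeps the remedies apart: a `cloudflare-1020` result calls for better IP reputation and fingerprint, while `origin-403` points at headers, cookies, or auth.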
### FAQ
**Q: Why am I getting Cloudflare Error 1020?**
A firewall rule on the site matched your request. The usual causes are a low-reputation IP (datacenter, VPN, or recycled proxy), a bot-like fingerprint (missing browser headers or a TLS handshake that does not match your User-Agent), or a geo/ASN rule that blocks your network. It is a deliberate block, not a rate limit.
**Q: How is Error 1020 different from 1015?**
1015 is a temporary rate limit that clears on its own once you slow down. 1020 is a firewall rule match that says "access denied" and does not clear by waiting — you have to stop matching the rule by improving your IP reputation and browser fingerprint.
**Q: Does waiting fix a 1020 error?**
Usually no. Because 1020 is a rule match rather than a cooldown, the same request will keep being denied until what the rule keys on changes — most often the IP reputation and the request fingerprint.
**Q: Can rotating proxies alone fix 1020?**
Sometimes, if the rule is purely IP-based. But most 1020 rules combine IP reputation with fingerprint and behavior checks, so clean IPs plus a real browser profile (headers + matching TLS) are needed together. A managed scraping API aligns all three.
---
# Proxies
Proxy types, rotation strategies, and the tradeoffs between residential, datacenter, and mobile IP pools.
## What Is a Residential Proxy?
URL: https://scrappey.com/qa/proxies/what-is-a-residential-proxy
**A residential proxy sends your web traffic through a real home internet connection — a regular broadband or fiber line — instead of through a datacenter.** So the IP address the target website sees belongs to an actual ISP customer in a real neighborhood, which makes your request look like it came from an ordinary person browsing at home. Scrapers use residential proxies to reach sites that block or slow down traffic from datacenter IPs.
### Quick facts
- **Also known as:** Resi proxies, peer proxies, ISP proxies (variant)
- **IP source:** Real residential ISP customers (opt-in P2P, paid panels)
- **Typical pricing:** $3–$15 per GB of traffic
- **vs. datacenter:** Roughly 10–50x more expensive per GB, and much harder to detect
- **Common providers:** Bright Data, Oxylabs, Smartproxy, IPRoyal
### How residential proxies work
These proxy networks are made up of real devices on real home connections. The home users usually join in one of two ways: through an opt-in SDK (a small piece of code) bundled into a free app — they agree to share some bandwidth so the app stays free — or through a paid program that pays them to share. When your scraper makes a request, it goes to the provider's gateway (their entry point). The gateway picks an available home device that matches your location and rotation settings, and that device forwards the request to the target site over its own home connection. The site then sees a home IP — no datacenter ASN (the network ID that marks an address as belonging to a hosting company), no known proxy range, nothing obvious to flag. The trade-off: each request takes an extra hop through a possibly-slow home line, so it's slower — but the IP looks far more trustworthy.
### When residential proxies are the right tool
Reach for residential proxies when the target site blocks or rate-limits datacenter IPs, when you need to appear in a specific country or city, or when you need to confirm that a real visitor sees the same thing your scraper does. The most common use cases are e-commerce, travel, social media, sneaker drops, ad verification, and SEO research. Skip them when plain datacenter IPs already work — residential proxies cost 10–50x more per GB, and you pay by bandwidth used, not by number of requests. That means an image-heavy page can cost a lot more than a simple text-only API call.
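Because billing is per GB, a rough cost estimate is page weight times page count times the per-GB price. The sketch below assumes 1 GB = 1,000 MB and uses made-up page sizes and a price picked from the range quoted above; `estimate_cost` is a hypothetical helper, not a provider API.

```python
def estimate_cost(avg_page_mb, pages, price_per_gb):
    """Estimate proxy spend in dollars for a crawl billed by bandwidth."""
    total_gb = avg_page_mb * pages / 1000  # assumes 1 GB = 1,000 MB
    return total_gb * price_per_gb

# Illustrative numbers only: 100,000 pages at $8/GB.
text_api = estimate_cost(avg_page_mb=0.05, pages=100_000, price_per_gb=8)   # ~50 KB JSON responses
image_page = estimate_cost(avg_page_mb=3.0, pages=100_000, price_per_gb=8)  # ~3 MB rendered pages

print(f"text-only: ${text_api:.2f}, image-heavy: ${image_page:.2f}")
```

With these assumed inputs, the same 100,000 requests cost about $40 as small API calls and about $2,400 as image-heavy pages, which is why blocking images and other heavy assets matters on per-GB plans.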
### Ethical and legal concerns
Where the home IPs come from has long been a sore point. Some networks grew their pools through SDKs bundled into apps where users never really understood they were sharing their connection. Recent FTC actions and changes to app-store policies have pushed the industry toward clearer, genuine opt-in. Reputable providers now explain how they recruit the people sharing their connections (their "peers") and let those people opt out. As a buyer, favor providers that document this consent and have a solid track record. And remember that the IP doing your scraping belongs to a real person whose connection sits in the middle of every request — so keep your request rates polite.
### FAQ
**Q: How are residential proxies different from datacenter proxies?**
Datacenter proxies come from cloud or hosting companies. They're fast and cheap, but easy to detect because their IP ranges are publicly known to belong to data centers. Residential proxies come from real home connections — slower and more expensive, but they carry the trust of normal consumer traffic, so sites are far less likely to flag them.
**Q: Are residential proxies legal?**
Using them is legal in most places. The legal gray areas are about two things: how the proxy network got its IPs (did the home users actually consent?) and what you do with them. Scraping publicly available data is generally fine; bypassing a login or other authentication is not.
**Q: How do I pick a residential proxy provider?**
Look at pool size (millions of IPs is the baseline), geo coverage for the countries you care about, and the real success rate on your specific target sites — run a trial to check. Also compare the pricing model (charged per-GB vs. per-request) and how clearly they explain where their IPs come from and how they get user consent.
**Q: Do residential proxies guarantee I won't be blocked?**
No. A residential IP makes you look more trustworthy, but it doesn't change your fingerprint, your request headers, or how your scraper behaves. Sites that fingerprint visitors will still catch a leaky Playwright scraper even when it runs on a residential IP.
---
## What Is a Rotating Proxy?
URL: https://scrappey.com/qa/proxies/what-is-a-rotating-proxy
**A rotating proxy is a proxy service that automatically gives each request — or each new session — a different outbound IP address, picked from a pool of many IPs.** A proxy is a relay that sits between you and the website, so the site sees the proxy's IP instead of yours. With rotation, instead of all your traffic coming from one IP, the target site sees requests arriving from many different IPs — the same pattern you'd get from a crowd of real users. This is the standard way scrapers get around per-IP rate limits (caps on how many requests one IP can make) and IP-based bot blocks.
### Quick facts
- **Also known as:** Backconnect proxies, IP rotation, proxy gateway
- **Rotation modes:** Per-request, per-session (sticky), time-based
- **Pool types:** Residential, datacenter, mobile, ISP
- **Primary benefit:** Distributes request load across many IPs, defeats per-IP limits
### How rotating proxies work
You point all your traffic at one gateway hostname from your proxy provider — something like `gw.example.com:8000` — and behind the scenes that gateway picks an outbound IP from its pool for each connection. There are two modes. **Per-request rotation** hands you a fresh IP for every single HTTP call; it's the best fit for stateless scraping, meaning each page stands alone and nothing carries over between requests. **Sticky sessions** keep the same IP for a set window (commonly 1, 10, or 30 minutes) so you can finish a multi-step flow on one IP — logging in, or walking from a search-results page to a detail page. Most providers offer both, and you choose by either connecting to a specific port or adding a username token like `user-session-abc123` that pins your session to one outbound IP.
### Why scrapers rotate
Per-IP rate limits are the cheapest and most common defense a site can put up — a few rules in Cloudflare or in nginx (a popular web server), and any single IP making too many requests gets cut off. Rotation spreads your traffic across many IPs so none of them crosses that threshold. It also helps with IP-reputation blocks (sites that ban an IP once it looks suspicious): if one IP gets flagged, your next request comes from a clean one and you keep going. Rotation is not a cure-all — it does nothing against fingerprint-based detection (sites profiling your browser/TLS signature), behavioral tracking across sessions, or account-level rate limits — but for the large class of sites that block on IP alone, it's the single most effective lever.
### How to rotate well
Three rules. **First, match the rotation mode to the workflow:** per-request for stateless catalog crawling, sticky for login flows and multi-page sequences. Switching IPs mid-flow looks suspicious — a logged-in user's IP doesn't change on every click. **Second, rotate within one geography:** a session that starts in Germany and ends in Brazil is an obvious tell. Most providers let you limit the pool to a country, region, or city. **Third, size the pool to the workload.** If you send 10,000 requests per minute through a 100-IP pool, each IP still averages 100 requests per minute — plenty to trip a rate limit. As a rule of thumb, your pool size should comfortably exceed your peak request rate divided by the target site's per-IP limit.
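The pool-sizing rule of thumb can be written down as a quick calculation. This is a minimal sketch: `min_pool_size` and `per_ip_rate` are hypothetical helpers, and the `headroom` factor of 1.5 is an assumed safety margin, not a figure from any provider.

```python
import math

def per_ip_rate(peak_rpm, pool_size):
    """Average requests per minute each IP carries if load is spread evenly."""
    return peak_rpm / pool_size

def min_pool_size(peak_rpm, per_ip_limit_rpm, headroom=1.5):
    """Smallest pool that keeps each IP under the target's per-IP limit."""
    # IPs needed if traffic were spread perfectly evenly, rounded up.
    base = math.ceil(peak_rpm / per_ip_limit_rpm)
    # headroom is an assumed safety factor so IPs can rotate out and cool off.
    return math.ceil(base * headroom)

# The example from the text: 10,000 requests/minute through 100 IPs.
print(per_ip_rate(10_000, 100))   # average load per IP
print(min_pool_size(10_000, 60))  # pool needed if the target allows 60/minute per IP
```

Real traffic is never spread perfectly evenly, so treat the result as a floor and watch per-IP block rates before trimming the pool.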
### FAQ
**Q: Per-request vs. sticky rotation — which one?**
Use per-request for stateless scraping, where every URL is independent and nothing needs to carry over. Use sticky for any workflow that relies on cookies, logins, or step-by-step navigation — those need the same IP across the whole sequence to look like one natural user.
**Q: How big should my IP pool be?**
Big enough that each IP's share of the work stays under the target's per-IP rate limit. For example, if the target allows 60 requests per minute per IP and you need 1,000 per minute, you need at least 17 IPs carrying traffic at once (1,000 ÷ 60 ≈ 16.7, rounded up), plus extras so IPs can rotate and cool off.
**Q: Are rotating proxies always residential?**
No. Datacenter, mobile, and ISP proxy pools can all be rotated. Rotation is a feature of the gateway, not of the IP type underneath it. Residential rotation costs more but holds up better against detection.
**Q: Will rotating proxies fix CAPTCHAs?**
Sometimes. If the CAPTCHA appeared because of a per-IP signal, rotating to a fresh IP clears it. But if the CAPTCHA is driven by your browser fingerprint or behavior, rotation alone won't help — you also need to vary your browser fingerprint and request pattern.
---
## What Is Proxy Web Scraping?
URL: https://scrappey.com/qa/proxies/what-is-proxy-web-scraping
**Proxy web scraping means sending your scraper's traffic through proxy servers — middleman machines that forward your requests for you — so the target website sees the proxy's IP address instead of yours.** Think of a proxy as a stand-in that knocks on the door so your real identity stays hidden. Proxies are the foundational tool for any scraper working at scale: they let you rotate IPs, target specific countries, and spread one job across many identities, so you can make millions of requests without hitting the per-IP rate limits (caps on how many requests one address may send).
### Quick facts
- **Also known as:** Proxy scraping, IP-rotated scraping
- **Proxy types:** Datacenter, residential, mobile, ISP
- **Connection types:** HTTP/HTTPS, SOCKS5
- **Primary benefit:** Distributes requests across IPs to work within per-IP rate limits
### Why proxies are mandatory for serious scraping
A scraper without proxies has just one IP address. Send 100 requests in a minute and the target's rate limiter notices. Do it again and that IP gets flagged. Do it a third time and the IP lands on a permanent block list — and since it's your home or office connection, your normal browsing now suffers too. Proxies solve all of this. Your real IP never reaches the target, requests spread across hundreds or thousands of IPs, and a single flagged IP is just one to drop from the pool. At any serious volume, proxies aren't optional — they're how scraping is built.
### Choosing the right proxy type
There are four common types, trading off cost, speed, and how easy they are to detect:
- **Datacenter proxies** ($0.50–$2/GB) are cheapest and fastest, but easiest to spot — their IPs belong to cloud providers like AWS, OVH, and DigitalOcean, and anti-bot vendors have those network ranges (ASNs — the ID number that groups a provider's IPs) memorized. Fine for unprotected sites and APIs that don't care.
- **Residential proxies** ($3–$15/GB) route through real home internet connections. Slower and pricier, but they carry the trust of an ordinary consumer. Use them on sites that block datacenter IPs.
- **Mobile proxies** (4G/5G, $10–$50/GB) are the most expensive and hardest to block. Mobile carriers route thousands of users behind one shared IP (NAT — many devices sharing a single public address), so blocking that IP would also block real customers.
- **ISP proxies** give datacenter speed with residential-style IPs — handy for high-throughput work against medium-difficulty sites.
### How to integrate proxies into a scraper
Most providers give you a single gateway address with a username and password. In Python's `requests` library, pass `proxies={'http': 'http://user:pass@gw:port', 'https': 'http://user:pass@gw:port'}`. In Playwright, pass a `proxy` option to `launch()` or per context. To keep the same outbound IP for a whole session (a "sticky session"), you usually encode it in the username: `user-session-abc123:pass` holds you on one IP until the session ends. For production, wrap all this in a retry layer that detects blocks, retires bad IPs, and reports which IPs succeed against which targets. Without that visibility, you're paying for a pool you can't tune.
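The gateway setup described above might look like the helper below. It is a sketch under assumptions: `build_proxies` is a hypothetical function, the gateway host is a placeholder, and the `-session-<id>` username suffix is one common convention that varies by provider.

```python
from urllib.parse import quote

def build_proxies(user, password, gateway, session_id=None):
    """Return a requests-style proxies dict for a gateway such as 'gw.example.com:8000'."""
    # Assumption: the provider pins a sticky session via a "-session-<id>" username suffix.
    # The exact format differs between providers, so check their documentation.
    if session_id is not None:
        user = f"{user}-session-{session_id}"
    # Percent-encode credentials so characters like '@' or ':' cannot break the URL.
    url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{gateway}"
    return {"http": url, "https": url}

proxies = build_proxies("user", "pass", "gw.example.com:8000", session_id="abc123")
print(proxies["https"])
```

The resulting dict can be passed as the `proxies=` argument of `requests.get()`. Leaving `session_id` unset gives whatever the gateway does by default, which is usually per-request rotation.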
### FAQ
**Q: Free proxies vs. paid proxies — is the difference real?**
Yes, enormously. Free proxy lists are mostly hacked servers, traps, or IPs that are already blocked. They're slow, unreliable, and a security risk: whoever runs them can perform a MITM attack (man-in-the-middle — secretly reading or altering your traffic as it passes through). For anything beyond casual experimentation, paid proxies are the only sensible choice.
**Q: Do I need proxies for small scraping jobs?**
Usually not. If you're pulling 100 pages from a friendly site, your home IP is fine. Proxies become necessary once you hit protected sites, exceed a few hundred requests per hour, or need results from a specific country.
**Q: Can my proxy provider see my scraped data?**
For HTTPS sites (the encrypted version of HTTP), no. TLS — the encryption layer behind https — scrambles the request and response between your client and the target, so the proxy sees only the destination hostname and an unreadable payload. For plain HTTP sites (rare today), the proxy can see everything, so don't send sensitive data over HTTP regardless of the proxy.
**Q: How many proxies do I need?**
Enough that your peak request rate divided by the pool size stays under the target's per-IP rate limit. For most production scrapers, a pool of 1,000+ rotating residential IPs is the minimum; for high-volume work, tens of thousands.
---
## What Is a Mobile Proxy?
URL: https://scrappey.com/qa/proxies/what-is-a-mobile-proxy
**A mobile proxy sends your scraper's requests out through real 4G or 5G mobile-carrier IP addresses** — networks like T-Mobile, Vodafone, O2, and AT&T. Mobile carriers use carrier-grade NAT, meaning many real customers share the same public IP at the same time. Because of that sharing, a site can't safely flag one person behind a mobile IP without risking everyone else on it. So anti-bot systems give mobile IPs the highest trust score of any proxy type.
### Quick facts
- **Source:** Real 4G/5G SIM cards on mobile carrier networks
- **Trust score:** Highest of any proxy type — carrier NAT shields individuals
- **Typical cost:** ~$10–15/GB (premium vs ~$3–10/GB for residential)
- **Best for:** Hard DataDome and PerimeterX targets, sneaker drops, social
- **Common providers:** Bright Data Mobile, Smartproxy Mobile, IPRoyal, Soax
### Why mobile IPs have the highest trust
Mobile carriers use carrier-grade NAT (CGNAT) — a setup where many subscribers share one public IP at the same time. So if a site blocks a single mobile IP, it might also be blocking dozens or hundreds of real customers sitting behind that same shared address. For any consumer-facing product, that collateral damage is unacceptable. This is why anti-bot systems like DataDome and PerimeterX rate mobile IP reputation very highly and almost never block them outright. The same scraper fingerprint often gets a clean 200 OK on a mobile IP but a 403 (blocked) on a residential one.
### How mobile proxy providers source IPs
The usual approach is surprisingly physical: providers run racks of real Android phones, each with its own SIM card on a genuine carrier contract. Every phone is a node. Your request hits the provider's API, gets routed to one of these phones, goes out over that phone's mobile data connection, and the response comes back to you. Many providers let you choose "rotating mobile" (a different phone, and so a different IP, for each request) or "sticky mobile" (the same phone for a set period of time).
All that hardware is why mobile is pricey. Each node is a physical SIM with its own data plan. The 3–5× premium over residential proxies pays for the phones, the carrier contracts, and the work of keeping it all running.
### When mobile is worth the cost
Reach for mobile when residential proxies are almost good enough but keep failing:
- **Hard DataDome customers** (high-value e-commerce, ticketing) — here the IP's reputation dominates the trust score, and a mobile IP tips it in your favor.
- **PerimeterX-protected sneaker / streetwear sites** — these judge your fingerprint harshly, and a mobile IP effectively resets that judgment.
- **Social platform scraping** (mobile-first social apps) — these apps are used mostly on phones, so a mobile IP looks like exactly the kind of visitor they expect.
- **Account creation workflows** where datacenter and even residential IPs trigger a "verify your phone number" gate.
Skip mobile for: unprotected public APIs (you'd be burning budget for no benefit) and Akamai-protected sites that score you across many requests — rotating mobile IPs break the _abck cookie's gradual trust-building, so use a static ISP proxy instead.
### Example
```python
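from curl_cffi import requests

# Hard DataDome / PerimeterX targets: mobile proxy + Chrome TLS impersonation.
# "user", "pass", "mobile-proxy", and "port" are placeholders for your provider's values.
r = requests.get(
    "https://hard-target.com/api/listings",
    impersonate="chrome131",  # present a Chrome 131 TLS/HTTP fingerprint
    proxies={
        "https": "http://user:pass@mobile-proxy:port"
    },
    timeout=30,
)
# Status code plus the DataDome response header, if the site sends one
print(r.status_code, r.headers.get("x-datadome"))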
```
### FAQ
**Q: Why are mobile proxies more expensive than residential?**
Every mobile IP comes from a real phone with a real SIM card on a real carrier data plan. The provider has to run physical hardware (racks of Android phones), pay the carrier contracts, and eat the cost of replacing SIMs when they get throttled or banned. Residential proxies are cheaper because they come from peer-to-peer networks where ordinary users opt in to share their connection for compensation — so each extra IP costs the provider far less.
**Q: Do mobile proxies rotate automatically?**
Most providers offer both styles. Rotating gives you a fresh phone (and IP) on every request. Sticky keeps you on the same phone for a window you choose, usually 10–60 minutes. Sticky is a must for anything session-based — login, cart, checkout — because rotating IPs mid-session would invalidate your cookies and break the flow.
**Q: Can mobile proxies be detected?**
A site can tell an IP is mobile by looking up its ASN (the network registry record that says which carrier owns the IP), but knowing it's mobile is not the same as flagging it. The anti-bot sees something like "T-Mobile US mobile network" and treats it as trustworthy, not suspicious. It only becomes a problem if one mobile IP starts sending obviously bot-like traffic, in which case that specific IP gets a temporary trust penalty.
**Q: Are 4G and 5G proxies different?**
For scraping, they work the same — to the destination site both are just mobile carrier IPs. 5G coverage is still patchy in many areas, so most "mobile proxy" pools today are mostly 4G. The site you're hitting can't tell which one you used, because either way it just sees a carrier IP.
---
## What Is an ISP Proxy?
URL: https://scrappey.com/qa/proxies/what-is-an-isp-proxy
**An ISP proxy (also called a "static residential" proxy) is a fixed IP address that physically sits in a datacenter but is registered to a consumer internet provider.** Every IP belongs to an autonomous system, identified by its autonomous system number (ASN), the ID for the network that owns it. A website checking your IP sees a residential provider like "Comcast" or "Deutsche Telekom", which carries the high trust those home networks get, but the IP itself stays the same over time because it runs on stable datacenter hardware. That mix is especially useful on sites that judge you across a whole session rather than a single request, multi-request trust scoring most of all.
### Quick facts
- **Trust score:** High — uses residential ASN reputation
- **Stability:** Static — same IP for the duration of your lease
- **Typical cost:** ~$1.50–5 per IP per month (vs ~$3–10/GB residential)
- **Best for:** Multi-request scoring, account-tied work, continuous monitoring
- **Common providers:** Bright Data ISP, Oxylabs ISP, IPRoyal, Massive ISP
### How ISP proxies work
A proxy provider partners with a consumer ISP (or gets a sub-allocation of addresses from one) so it can run a block of IPs registered under that ISP's ASN. The IPs physically live in the provider's datacenter, but the public records that say who owns them — BGP announcements (how networks tell the internet which IPs they route) and WHOIS (the public registry of IP ownership) — list the consumer ISP. So when an anti-bot system looks up the ASN, the IP appears to be an ordinary Comcast or BT home connection, not a datacenter IP.
Because the IP never changes and the provider owns the hardware, there is no rotation by default. You lease one specific IP for a fixed term (monthly or yearly), and for that period it is yours alone or shared by a small group.
### Why multi-request scoring rewards ISP static
Multi-request bot managers score you across many requests, not just one: trust builds up as the same client keeps making consistent, successful requests. Akamai's bot manager is the usual example: its `_abck` cookie starts as untrusted (`~-1~`) and flips to trusted (`~0~`) after the first successful sensor.js POST (the request that submits the data the sensor script collects about your browser), then stays trusted as long as the session looks consistent. If you swap to a different residential IP partway through, that built-up trust resets, because the system now sees a brand-new, unproven client. A static ISP IP keeps the same identity for hours or days, which is exactly how a real person browses from home.
For scraping that depends on being logged in (account workflows, a persistent shopping cart), ISP static is the only sensible choice, since the session cookies have to stay attached to one stable identity.
### When to use ISP vs residential vs mobile
**ISP static:** long sessions, multi-request trust scoring, account-tied scraping — anything where trust needs to build up across multiple requests.
**Rotating residential:** lots of independent requests where each one stands alone as its own session — search scraping, SERP collection (search-results pages), listing snapshots.
**Mobile:** the toughest anti-bot targets, where the IP's reputation by itself decides whether you get through.
On cost: ISP at roughly $2 per IP per month, supporting hundreds of long-lived sessions, is far cheaper than the per-gigabyte pricing of rotating residential for the same job.
### Example
```python
from curl_cffi import requests

# Anti-bot-protected site: ISP static IP + Chrome TLS + warm-up.
# The proxy URL is a placeholder; substitute your provider's host, port, and credentials.
s = requests.Session(impersonate="chrome131")
isp_proxy = {"https": "http://user:pass@isp-proxy:port"}

# 1. Warm up on the homepage so _abck flips to ~0~
s.get("https://akamai-target.com/", proxies=isp_proxy)

# 2. Same session, same IP: trust accumulated, _abck trusted
for page in range(1, 11):
    r = s.get(
        f"https://akamai-target.com/listings?page={page}",
        proxies=isp_proxy,
    )
    print(page, r.status_code)
```
### FAQ
**Q: How is an ISP proxy different from a residential proxy?**
A residential proxy borrows real consumer devices in a peer-to-peer network — your request actually leaves through someone's home router. An ISP proxy is datacenter hardware that is simply registered under a residential ASN. Both look "residential" to the sites you visit; ISP is faster and more stable, while residential gives you a wider, more varied pool of IPs.
**Q: Is an ISP proxy the same as a datacenter proxy?**
The hardware is the same, but the ASN reputation is completely different. A plain datacenter proxy is registered under AWS, GCP, DigitalOcean, and similar — well-known datacenter ASNs that anti-bots flag instantly. An ISP proxy is registered under Comcast, BT, Deutsche Telekom, and the like — residential ASNs that anti-bots trust. Same machine; what matters is whose name is on the BGP announcement.
**Q: Can I rotate ISP proxies?**
You can rotate across a pool of ISP IPs that you control, but you should not switch IPs in the middle of a session on vendors that score across requests. The whole point of ISP static is staying stable — rotating throws that away. If you actually need rotation, use rotating residential or mobile instead.
**Q: How many ISP IPs do I need?**
For account-tied scraping: one IP per account. For session-based crawling: enough that your parallel sessions never have to share an IP at the same time. For most production setups, 50–200 ISP IPs in a region cover the common workloads.
---
## What Is a Datacenter Proxy?
URL: https://scrappey.com/qa/proxies/what-is-a-datacenter-proxy
**A datacenter proxy is an IP address that lives inside a commercial cloud or hosting company — AWS, GCP, Azure, DigitalOcean, OVH, Hetzner.** Instead of routing your request through someone's home internet, it routes through a server in a data center. The upside: it is cheap (~$0.50–$1.50/GB), fast (sub-100ms response from major regions), and comes with unlimited bandwidth. The downside: it is the least trusted proxy type. Every major anti-bot (the software a site uses to spot and block automated traffic) keeps a blocklist of known datacenter network ranges — identified by their ASN, the ID number assigned to a block of IPs owned by one operator — and flags this traffic before the request ever reaches the site. Use them for unprotected public APIs and academic targets. For anything guarded by a real anti-bot, the cost savings vanish because your success rate collapses.
### Quick facts
- **Trust score:** Very low — flagged on sight by major anti-bot vendors
- **Typical cost:** ~$0.50–$1.50/GB — cheapest proxy type
- **Speed:** Highest — direct datacenter routing, ~50–100ms
- **Detection mechanism:** ASN-based blocklists (AWS, GCP, Azure, DO, OVH, Hetzner all known)
- **Best for:** Unprotected APIs, academic targets, large static-HTML sites
### How datacenter IPs are detected
Three checks fire before the page even loads:
- **ASN lookup at the CDN edge.** An ASN is the ID for a block of IPs owned by one operator; the CDN edge is the server that handles your request first, before it reaches the real site. Major anti-bot vendors keep comprehensive lists of hosting-provider ASNs (AS16509 AWS, AS15169 GCP, AS8075 Azure, AS14061 DigitalOcean, plus OVH, Hetzner, Linode, Vultr, etc.). AWS WAF specifically maintains the HostingProviderIPList with ASN-based inclusion. Match → blocked.
- **Published cloud subnets.** A subnet is just a chunk of IP addresses. AWS, GCP, and Azure publish their public IP ranges as official JSON feeds. Anti-bot vendors pull these in directly and update their blocklists in near real time.
- **Reverse-DNS pattern matching.** Reverse DNS turns an IP back into a hostname. Many datacenter IPs answer with names like `ec2-54-83-...` or `compute-1.amazonaws.com`. Even if you slip past the ASN check, that hostname gives the IP away.
One widely-cited figure: roughly **99% of traffic from known datacenter ranges is bot traffic**. So a site can block on ASN knowing that the false-positive rate (a real user who happens to come from AWS) is effectively zero.
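The three checks above can be sketched in a few lines. This is an illustration under assumptions, not how any vendor actually implements it: the subnet list uses documentation-only ranges as stand-ins for the published cloud feeds, the ASN and reverse-DNS name are passed in rather than looked up, and `looks_like_datacenter` is a hypothetical name.

```python
import ipaddress
import re

# Stand-ins for published cloud subnets (documentation-only ranges, not real cloud IPs).
CLOUD_SUBNETS = [ipaddress.ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")]
# Hosting-provider ASNs named above: AWS, GCP, Azure, DigitalOcean.
HOSTING_ASNS = {16509, 15169, 8075, 14061}
# Reverse-DNS names that give away cloud hosts (ec2-... prefixes, *.amazonaws.com).
RDNS_PATTERN = re.compile(r"(^ec2-|\.amazonaws\.com$)")

def looks_like_datacenter(ip, asn=None, rdns=None):
    """Return the name of the first check that flags the IP, or None if none do."""
    addr = ipaddress.ip_address(ip)
    if asn in HOSTING_ASNS:
        return "asn"
    if any(addr in net for net in CLOUD_SUBNETS):
        return "subnet"
    if rdns and RDNS_PATTERN.search(rdns.lower()):
        return "rdns"
    return None
```

A real system would resolve the ASN and reverse-DNS name itself and refresh the subnet list from the providers' feeds. The sketch only shows why passing one check does not help when another still matches.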
### When datacenter proxies are actually fine
Datacenter is the right tool when:
- **The target has no anti-bot at all.** Public APIs, government open data, academic sites, and large static-content sites that simply do not care about scraping.
- **The target is your own infrastructure.** Checking your own production endpoints from another region, geofence testing, or load testing — datacenter works because your own systems do not block it.
- **You can authenticate.** Once you hold an API token, the ASN check usually falls away because authenticated requests are trusted differently. Datacenter is fine for authenticated API integrations.
- **You are testing.** Burning cheap IPs to size up a target before you commit to a residential budget.
Anywhere else — behind any real anti-bot system — datacenter gets blocked almost instantly, no matter how perfect your TLS fingerprint (TLS is the encryption layer behind https, and its handshake leaves a recognisable signature) is.
### The proxy ladder by trust
Proxy types form a ladder from cheapest/least trusted up to most trusted/most expensive:
- **Datacenter** — ~$0.50–$1.50/GB. Unprotected targets only.
- **ISP / static residential** — ~$1.50–5/IP/month. Datacenter hardware that is announced to the internet under residential ASNs, so it looks like a home connection. Multi-request trust scoring rewards them.
- **Residential** — ~$3–10/GB. Peer-to-peer networks of real consumer devices. The default choice for general anti-bot work.
- **Mobile / 4G–5G** — ~$10–15/GB. Real carrier IPs sitting behind carrier-grade NAT (many phones share one IP, so blocking it would hit innocent users), which earns the highest trust score. For the hardest anti-bot targets.
Rule of thumb: match proxy cost to target difficulty. Spending mobile-tier money on an unprotected academic dataset is waste; using datacenter on a protected retail site is also waste — every request fails.
### Example
```python
# Datacenter is fine when there's no anti-bot. Confirm first.
from curl_cffi import requests

# 1. Cheap sanity check: does the target even care about my IP?
r1 = requests.get(
    "https://target.com/api/public",
    impersonate="chrome131",
    proxies={"https": "http://user:pass@datacenter-proxy:port"},
)

if r1.status_code == 200 and "challenge" not in r1.text.lower():
    # No anti-bot detected, datacenter is fine, save money
    print("Datacenter OK, proceeding")
else:
    # Anti-bot detected: escalate to residential or ISP
    r2 = requests.get(
        "https://target.com/api/public",
        impersonate="chrome131",
        proxies={"https": "http://user:pass@residential-proxy:port"},
    )
    print("Escalated to residential:", r2.status_code)
```
### FAQ
**Q: Why does AWS get flagged so hard?**
Because running a bot on AWS costs almost nothing extra — scrapers, vulnerability scanners, and abusers all live there. Anti-bots see steady bot traffic from AWS ranges and assume any new request from that range belongs to the same crowd. AWS WAF's own HostingProviderIPList includes all known hosting-provider ASNs on exactly this basis.
**Q: Can I get unflagged datacenter IPs?**
Some providers sell "premium" or "private" datacenter IPs that sit outside the well-known ASN ranges. They work for a while, but eventually the range gets identified and added to blocklists. By then the cost-per-success ends up similar to residential, so the value is weak.
**Q: What about ISP proxies — aren't those datacenter too?**
Physically, yes — the same server hardware. The difference is how the IP is announced to the internet. ISP proxies are announced under consumer ISP ASNs (Comcast, BT, Deutsche Telekom) via BGP, the routing system that tells the world who owns which IPs. So anti-bots see "Comcast residential" instead of "AWS datacenter". Same hardware, different network announcement — and that is the whole point of the ISP proxy product.
**Q: When does a datacenter IP get noticed even on an unprotected site?**
At very high volume. A single IP firing 500+ requests per minute is trivially caught by any rate-limit logic, no matter how trusted it otherwise is. The advantage of higher-tier proxies is not just trust but the larger pool of IPs, which lets you spread requests out and stay under any single IP's rate limit naturally.
---
## What Is a DNS Leak?
URL: https://scrappey.com/qa/proxies/what-is-a-dns-leak
**A DNS leak is when your computer looks up website names through its own DNS resolver instead of through the proxy, which exposes the real network hiding behind that proxy.** DNS (Domain Name System) is the phonebook that turns a name like example.com into an IP address. Even if your proxy is correctly carrying your HTTP traffic, if that name lookup goes out over your real connection instead, your ISP and the DNS server can see exactly which sites you visit - and the resolver's location can give away your true region. The classic cause is using socks5:// (the client does the lookup itself) when you meant socks5h:// (the lookup happens inside the tunnel).
### Quick facts
- **What leaks:** The hostname lookup, sent to your real DNS resolver outside the proxy
- **socks5 vs socks5h:** socks5 resolves DNS client-side (leaks); socks5h resolves at the proxy (safe)
- **Why it matters:** Reveals target hosts and a resolver geolocation that can mismatch the exit IP
- **Also affects:** WebRTC and QUIC paths can do their own lookups outside the proxy
- **Fix:** Force tunnel-side resolution, or run a controlled resolver bound to the proxy
### How the leak happens
When a scraper requests https://example.com through a proxy, two things need to travel through the tunnel: the TCP/TLS connection (the encrypted link behind https) and the DNS lookup that turns example.com into an IP. With an HTTP proxy or a socks5h:// proxy, the client hands the *hostname* to the proxy and lets the proxy resolve it - so nothing leaks. With a plain socks5:// proxy, many clients resolve the hostname **locally first**, then ask the proxy to connect to the resulting IP. That local lookup goes to your machine's own DNS resolver, over your real connection.
The same problem shows up with system-level proxying that doesn't cover UDP, with split-tunnel VPN configs (where some traffic skips the tunnel), and with libraries whose DNS path is separate from their HTTP path. The HTTP request looks perfectly proxied while the DNS query quietly slips out the real network interface.
### Why a DNS leak deanonymizes a scraper
There are two distinct harms. First, **disclosure**: whoever runs your DNS resolver (your ISP, a public resolver, or a corporate network) now has a log of every hostname you scraped, even though the page content itself went through the proxy. If you were relying on the proxy to keep those apart, that defeats the purpose.
Second, and more relevant to anti-bot detection, is **geo incoherence**: the resolver that did the lookup has its own geographic location. If your proxy exit is in Brazil but your DNS resolver is a German ISP, anyone correlating where the lookup came from with the connection can spot the mismatch. This stacks on top of the timezone/IP mismatch family of signals: the story your network tells stops being internally consistent.
### Closing the leak
The fixes, in order of preference:
- **Use socks5h:// not socks5://** - the h forces the hostname to be resolved at the proxy. This one-character change fixes the most common leak in curl, Python requests/httpx, and most scraping stacks.
- **Use an HTTP/HTTPS proxy** - HTTP proxies always receive the hostname (in the CONNECT line), so resolution happens proxy-side by design.
- **Run a controlled local resolver** bound to the tunnel, so even client-side lookups go through the proxy. Some anti-detect browsers ship a built-in resolver for exactly this.
- **Contain UDP** - QUIC/HTTP3 and WebRTC can do their own out-of-band lookups; disable them or tunnel UDP (SOCKS5 UDP ASSOCIATE) so nothing escapes (see WebRTC leaks).
Verify with a DNS-leak test that reports which resolver answered. If the resolver's country matches your proxy exit, the tunnel is clean; if it matches your real ISP, you are leaking.
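The first fix can be applied in code before a proxy URL ever reaches the HTTP client. The sketch below is illustrative only: `force_remote_dns` is a hypothetical helper name, not part of any library, and it assumes proxy URLs are passed around as plain strings.

```python
from urllib.parse import urlsplit, urlunsplit


def force_remote_dns(proxy_url: str) -> str:
    """Return proxy_url with a socks5:// scheme upgraded to socks5h://.

    Hypothetical helper for illustration; every other scheme passes through.
    """
    parts = urlsplit(proxy_url)
    if parts.scheme.lower() != "socks5":
        return proxy_url
    # Swap only the scheme; credentials, host, and port stay as they were.
    return urlunsplit(
        ("socks5h", parts.netloc, parts.path, parts.query, parts.fragment)
    )


# Build the proxies mapping from a URL that would otherwise resolve locally.
proxies = {"https": force_remote_dns("socks5://user:pass@proxy:1080")}
print(proxies["https"])
```

Normalising the scheme in one place means a mistyped `socks5://` in a config file cannot silently reintroduce the leak.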
### Example
```bash
# The one-character fix: socks5h instead of socks5
# LEAKS - curl resolves example.com locally, then proxies the IP
curl --proxy socks5://user:pass@proxy:1080 https://example.com
# SAFE - the 'h' forces resolution inside the proxy tunnel
curl --proxy socks5h://user:pass@proxy:1080 https://example.com
# Python requests / httpx - same rule
# proxies={'https': 'socks5h://user:pass@proxy:1080'} # safe
# proxies={'https': 'socks5://user:pass@proxy:1080'} # leaks DNS
# Verify which resolver answered (country should match the proxy exit):
curl --proxy socks5h://user:pass@proxy:1080 https://dnsleaktest.example/api
```
### FAQ
**Q: What is the difference between socks5 and socks5h?**
With socks5 the client resolves the hostname to an IP locally and asks the proxy to connect to that IP - so the DNS lookup leaks to your real resolver. With socks5h the client sends the hostname to the proxy and the proxy resolves it, so nothing leaks. For scraping behind a proxy, almost always use socks5h.
**Q: Does an HTTP proxy leak DNS like socks5 does?**
No. HTTP and HTTPS proxies receive the target hostname directly (in the request line or the CONNECT request), so the proxy does the resolution by design. The leak is specific to SOCKS5 clients that resolve locally, plus side channels like WebRTC and QUIC.
**Q: Can a DNS leak get my scraper blocked, or just expose me?**
Both. The immediate harm is disclosure - your resolver sees the hostnames. For anti-bot detection, the leak can create a geolocation mismatch between the DNS resolver and the proxy exit IP, which adds to the timezone/IP coherence signals that anti-bot systems already score.
---
## What Is IP Rotation?
URL: https://scrappey.com/qa/proxies/what-is-ip-rotation
**IP rotation is the practice of cycling outgoing requests through a pool of many IP addresses instead of sending them all from one.** Rather than 10,000 requests leaving from a single IP - an obvious pattern any site can rate-limit or block - rotation spreads them across hundreds or thousands of addresses so each one carries only a small, plausible share of the traffic. It is the core technique behind large-scale scraping, powered underneath by a rotating proxy pool.
### Quick facts
- **What rotates:** The source IP address of each request
- **Powered by:** A proxy pool (residential, datacenter, mobile, or ISP)
- **Rotation triggers:** Per request, per session, on block, or on a timer
- **Main benefit:** No single IP exceeds a per-IP rate limit or ban threshold
- **Pairs with:** Matching fingerprints, sticky sessions, geo-targeting
### How IP rotation works
Your scraper sends requests through a proxy service that owns or brokers a large pool of IP addresses. The proxy assigns a source IP to each outgoing request, then swaps it for a different one according to a policy. The two main policies are per-request rotation (every request exits from a fresh IP - good for many independent page fetches) and sticky sessions (a chosen IP is held for several minutes so a multi-step flow like a login or a paginated sequence stays on one address - because changing IP mid-session looks like a hijacked account). Some setups also rotate on demand: keep an IP until it starts returning errors or rate-limit responses, then retire it and pull the next one from the pool.
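The three policies described above (per request, sticky session, rotate on block) can be sketched in a few lines. This is a simplified model, not how any particular proxy service works: `ProxyRotator`, its method names, and the 300-second default are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class ProxyRotator:
    """Hypothetical rotator modelling the three policies from the text."""

    def __init__(self, pool: List[str], sticky_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not pool:
            raise ValueError("proxy pool is empty")
        self._pool = list(pool)
        self._next = 0
        self._sticky_seconds = sticky_seconds
        self._clock = clock
        self._sessions: Dict[str, Tuple[str, float]] = {}  # id -> (proxy, expiry)

    def per_request(self) -> str:
        """Per-request rotation: each call returns the next IP in the pool."""
        if not self._pool:
            raise RuntimeError("every proxy in the pool has been retired")
        proxy = self._pool[self._next % len(self._pool)]
        self._next += 1
        return proxy

    def sticky(self, session_id: str) -> str:
        """Sticky session: keep one IP for a flow until its hold time lapses."""
        now = self._clock()
        held = self._sessions.get(session_id)
        if held and held[1] > now and held[0] in self._pool:
            return held[0]
        proxy = self.per_request()
        self._sessions[session_id] = (proxy, now + self._sticky_seconds)
        return proxy

    def retire(self, proxy: str) -> None:
        """Rotate on demand: drop an IP that started returning blocks."""
        if proxy in self._pool:
            self._pool.remove(proxy)
```

A login flow would call `sticky("login")` for every step, while independent page fetches call `per_request()`; `retire()` is the hook for rate-limit or error responses.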
### Why IP rotation matters for web scraping
Almost every defense a site uses to slow scrapers is keyed to the IP address: rate limits, ban lists, and reputation scores all count per IP. Send everything from one address and you concentrate all of that risk in a single point - one rate limit caps your whole job, one ban kills it. Rotation spreads the load: across a thousand IPs, each one carries only a small share of requests, so a single blocked address costs you one IP instead of the whole run. The quality of the pool matters as much as the rotation itself - residential and mobile IPs tied to real consumer connections are far harder to flag than datacenter ranges, which sites can identify and block in bulk.
### Rotation is not enough on its own
Rotating IPs while keeping every other signal identical is a common trap. If a thousand requests arrive from a thousand different IPs but all carry the same TLS fingerprint, the same user agent, and the same header order, a detector simply clusters them by fingerprint and sees one bot wearing a thousand hats. Effective rotation pairs a fresh IP with a coherent identity - a fingerprint, headers, and timing that match the kind of device that IP claims to be - and respects session continuity where the site expects it. A managed scraping API handles this end to end: it rotates IPs, matches the fingerprint to each one, and holds sticky sessions where a flow needs them.
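To see why a thousand IPs with one fingerprint still look like one client, here is a rough sketch of the clustering step from the detector's side. The `Request` fields, the function names, and the `min_ips` threshold are invented for illustration; real detectors use far richer features.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Request(NamedTuple):
    ip: str
    tls_fingerprint: str
    user_agent: str
    header_order: Tuple[str, ...]


Fingerprint = Tuple[str, str, Tuple[str, ...]]


def cluster_by_fingerprint(requests: Iterable[Request]) -> Dict[Fingerprint, Set[str]]:
    """Group source IPs under the non-IP identity they all share."""
    clusters: Dict[Fingerprint, Set[str]] = defaultdict(set)
    for req in requests:
        key = (req.tls_fingerprint, req.user_agent, req.header_order)
        clusters[key].add(req.ip)
    return dict(clusters)


def suspicious_fingerprints(requests: Iterable[Request],
                            min_ips: int = 100) -> List[Fingerprint]:
    """Fingerprints seen from an implausibly large number of distinct IPs."""
    return [key for key, ips in cluster_by_fingerprint(requests).items()
            if len(ips) >= min_ips]
```

Feeding it a thousand requests that differ only in `ip` yields a single cluster holding a thousand addresses, which is the "one bot wearing a thousand hats" pattern.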
### Example
```python
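import requests

# Manual rotation: a different proxy IP per request
proxy_pool = ['http://user:pass@p1:8000', 'http://user:pass@p2:8000']
# Placeholder targets; replace with pages you are permitted to fetch
urls = ['https://example.com/page/1', 'https://example.com/page/2']

for i, url in enumerate(urls):
    # Round-robin: consecutive requests leave through different proxies
    proxy = proxy_pool[i % len(proxy_pool)]
    r = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=20)

# Rotating the IP alone is not enough - the fingerprint must rotate too.
# A scraping API pairs each fresh IP with a matching browser identity.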
```
### FAQ
**Q: What is the difference between IP rotation and a rotating proxy?**
IP rotation is the practice - the strategy of changing source IPs across requests. A rotating proxy is the infrastructure that implements it: a proxy endpoint backed by a pool that hands out a different IP automatically. You do IP rotation by using a rotating proxy.
**Q: How often should I rotate IPs?**
It depends on the task. For many independent page fetches, rotate every request. For multi-step flows (login, cart, pagination) use a sticky session that holds one IP for several minutes, because switching IP mid-session looks like account hijacking. Match the rotation cadence to how a real user would behave.
**Q: Is IP rotation enough on its own?**
It helps a lot but is not sufficient alone. If every rotated request shares the same browser fingerprint, user agent, and header order, detectors cluster them and block the pattern regardless of IP. Rotation must be paired with a matching, coherent fingerprint per IP to be effective.
**Q: What kind of IPs are best for rotation?**
Residential and mobile IPs, tied to real consumer ISPs and carriers, are hardest to flag because they look like ordinary visitors. Datacenter IPs are faster and cheaper but easier to identify and block in bulk. The right mix depends on how aggressively the target site filters traffic.
---
# Anti-Bot
How modern bot-detection systems work — fingerprinting, behavioral signals, and the challenges that block automated traffic.
## What Is Cloudflare Turnstile?
URL: https://scrappey.com/qa/anti-bot/what-is-cloudflare-turnstile
**Cloudflare Turnstile is a service that checks whether a visitor is a real human, but without showing the kind of puzzle a normal CAPTCHA does.** Instead of asking you to click images or type warped letters, it quietly runs checks in your browser — looking at your browser's fingerprint (the unique mix of settings that identify it), watching how you behave on the page, making the browser do a small math puzzle (proof-of-work), and scoring all of it with machine learning. If the visitor passes, Turnstile hands out a token (a short pass that proves the check succeeded). Sites put Turnstile on forms and protected pages, so any scraper or bot has to produce a valid token to get through.
### Quick facts
- **Vendor:** Cloudflare
- **Replaces:** Cloudflare's older hCaptcha-based challenge
- **User experience:** Mostly invisible — a brief "Verifying" widget, no puzzle
- **Token TTL:** Usually 5 minutes
- **How it works:** Real-browser verification widget, scored server-side
### How Turnstile works
Turnstile loads a small piece of JavaScript (a widget) from `challenges.cloudflare.com`. That widget quietly runs a batch of tests. It reads browser APIs that reveal hardware and software details — canvas, WebGL, audio, and `navigator` properties (all common fingerprinting sources). It watches subtle human signals like mouse movement timing, focus events, and timing jitter (tiny natural variations in how events arrive). And it runs a small proof-of-work calculation in the background — a deliberate bit of busywork that a real browser can easily do but that costs bots at scale. The results go to Cloudflare, which scores the visitor and, if the score is high enough, returns a token. That token rides along as a hidden form field when the page is submitted, and Cloudflare double-checks it on the server through a separate API call. Bots either fail one check outright or score too low, so Turnstile keeps them stuck at the widget.
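For a site owner, the server-side double-check is a single POST. The sketch below assumes Cloudflare's siteverify endpoint and its `secret`, `response`, and `remoteip` fields, and assumes the result carries `success` and `hostname` keys; check these against the current Turnstile documentation before relying on them. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Endpoint and field names are assumptions based on Cloudflare's public docs.
SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def build_siteverify_body(secret: str, token: str,
                          remote_ip: Optional[str] = None) -> bytes:
    """Form-encode the fields the verification call expects."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip
    return urllib.parse.urlencode(fields).encode("ascii")


def token_accepted(result: dict, expected_hostname: str) -> bool:
    """Accept only a successful result issued for our own hostname."""
    return bool(result.get("success")) and result.get("hostname") == expected_hostname


def verify_token(secret: str, token: str, expected_hostname: str,
                 remote_ip: Optional[str] = None) -> bool:
    """Send the token to the verification endpoint (performs a network call)."""
    request = urllib.request.Request(
        SITEVERIFY_URL,
        data=build_siteverify_body(secret, token, remote_ip),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return token_accepted(json.load(response), expected_hostname)
```

The token itself arrives in the hidden form field the widget adds (named `cf-turnstile-response` in Cloudflare's documentation, as far as this sketch assumes). Comparing the returned hostname is one way to reject a token that was solved for a different site.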
### Turnstile vs Cloudflare Bot Management — what's the difference
This is the most common point of confusion. Cloudflare ships **two separate bot-protection products** that people often mix up:
| | Turnstile | Bot Management |
| --- | --- | --- |
| **What it is** | A CAPTCHA replacement widget | An ML-driven scoring system |
| **Where it fires** | On specific forms / endpoints you choose | On every request to your zone |
| **Tier** | Free | Enterprise add-on |
| **Cookie evidence** | `cf_clearance` after solve | `__cf_bm` on every request |
| **Header evidence** | Widget script from `challenges.cloudflare.com` | `cf-mitigated: challenge` when blocked |
| **How verification works** | The widget runs and is scored | The underlying fingerprint is scored |
Two terms used above: a **zone** is a domain Cloudflare protects, and a **token** is the pass Turnstile issues when verification succeeds. A site can run **both** products at once: Bot Management scores every request, and only when your score is borderline does it pop up a Turnstile widget as a light, low-friction challenge. So a Turnstile token alone is not sufficient if the underlying score is already low.
### Why Turnstile differs from old CAPTCHAs
Old CAPTCHAs were image puzzles: one self-contained task a human or a solving service could complete and be done. Turnstile is continuous: it scores the whole browser environment, not just one click. An automated client that produces a token from a Playwright instance with an inconsistent fingerprint will get a low score back, and the form rejects the token anyway. The challenge is also tied tightly to the exact page: a token issued for one site's widget won't work on `example.com`, because Cloudflare checks the sitekey (the site's unique Turnstile ID) and the origin (the domain it was solved on). So services that solve and resell tokens have to do it at the right scope, with the right fingerprint, inside the right session.
### How automated browsers interact with Turnstile
On sites you own or are permitted to access, automated tooling typically interacts with Turnstile in two parts. First, it needs a real browser with consistent fingerprint signals: for example Chrome run with `--headless=new`, or a headful browser (one that renders a real window) running under display virtualization so it behaves like a normal desktop. Second, the Turnstile widget runs inside that browser, is scored naturally, and produces a token. API-only approaches that never run a real browser produce low-score tokens that the server rejects. Integrated tools that bundle the browser and proxy into one service tend to be more consistent, because every layer is configured to work together.
### FAQ
**Q: Is Turnstile a CAPTCHA?**
Cloudflare calls it a CAPTCHA alternative. To a normal user there's no puzzle, just a brief verification. To a scraper it behaves like a CAPTCHA — you still need a valid token to get through, and producing one takes the same kind of solver setup a CAPTCHA would.
**Q: Does Turnstile work without JavaScript?**
No. If JavaScript is turned off, Turnstile can't run and the protected form won't submit. Plain HTTP scrapers (ones that just fetch HTML and don't execute JavaScript) can't pass Turnstile on their own — they need a JavaScript-capable client.
**Q: How long does a Turnstile token last?**
Usually 5 minutes from the moment it's issued. After that it expires and a fresh challenge has to be completed. Sites can set a shorter or longer window.
**Q: Why is my Turnstile token being rejected?**
Usually one of three reasons: it has expired, it was solved for a different sitekey or origin (so it doesn't match this page), or it scored too low for the site's threshold. A low score normally means the fingerprint that produced the token didn't look human enough.
**Q: I see __cf_bm on every response but no Turnstile widget — what does that mean?**
The site is running Cloudflare Bot Management (or the simpler Bot Fight Mode), which scores every request silently in the background. No Turnstile widget means your current score is good enough to let you through quietly. Hurt that score — rotate to a bad IP, change your User-Agent — and the widget or an outright block will start appearing.
**Q: Why does the cf_clearance cookie stop working after I rotate proxies?**
The cf_clearance cookie is tied to the exact IP address and User-Agent that solved the challenge. Change either one and the cookie is invalidated, so you get challenged again. Keep both stable for the whole session — it's the same rule as Akamai's _abck cookie.
---
## What Is Anti-Bot Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-anti-bot-detection
**Anti-bot detection is the set of techniques websites use to tell automated traffic apart from real human visitors — and then block, challenge, or slow down the automated half.** Instead of relying on one clue, it stacks several together: IP reputation (whether an address has a history of abuse), browser fingerprinting (identifying details your browser leaks), TLS analysis (TLS is the encryption layer behind https), behavioral signals like how you move and click, and machine-learning scoring. The result is a risk score attached to every request. Cloudflare, DataDome, PerimeterX, Akamai, and Imperva are the dominant vendors, and most large sites use at least one of them.
### Quick facts
- **Common vendors:** Cloudflare, DataDome, PerimeterX (HUMAN), Akamai, Imperva
- **Signal categories:** IP, TLS, HTTP, browser fingerprint, behavior, history
- **Action taken:** Allow, challenge (CAPTCHA), throttle, block
- **Detection layer:** Four layers: network → JS → WASM → behavioural
### The four layers of anti-bot detection
Modern bot-protection products check every request against four separate layers. **Failing any one layer is usually enough to get blocked** — the layers act like a row of gates, not a points total. You have to clear all of them or you get nothing through.
| Layer | What's inspected | Fires before… |
| --- | --- | --- |
| **1. Network** | TLS Client Hello (JA4), HTTP/2 SETTINGS frame, TCP options, IP reputation, ASN | HTML is served |
| **2. JavaScript** | Canvas / WebGL / AudioContext fingerprints, navigator properties, Function.toString() inspection, extension probes | XHR / API calls fire |
| **3. WebAssembly** | WASM SIMD CPU profile, SharedArrayBuffer timer precision, hyphenation dictionary checks | Challenge token is issued |
| **4. Behavioural** | Mouse movement Bezier curves, scroll cadence, keypress timing, click-to-event latency | Score is finalised over multiple requests |
Each layer runs at a different moment. Layer 1 inspects the raw connection — including the TLS Client Hello, the first handshake message a browser sends, summarised as a JA4 fingerprint — before any HTML is even sent back. Layer 2 runs JavaScript in the page to probe the browser itself. Layer 3 leans on WebAssembly (compiled code that runs in the browser) for low-level CPU and timing checks. Layer 4 watches how you actually behave over several requests. So a scraper using curl_cffi (which only handles Layer 1) will pass against Layer 1-only vendors like older Imperva but fail against anything that loads sensor.js. A patched browser (Layers 1+2) will pass Akamai's static checks but fail DataDome's behavioural ML.
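The "row of gates" model can be written down directly: the layers are checked in order and the first failure ends the request. The signal keys and per-layer checks below are invented placeholders, chosen only to make the control flow concrete; they are not any vendor's actual logic.

```python
from typing import Callable, Dict, List, Optional, Tuple

Signals = Dict[str, object]
Check = Callable[[Signals], bool]

# (layer name, check) pairs in the order the layers fire.
# Every key here is a made-up stand-in for a family of real signals.
LAYERS: List[Tuple[str, Check]] = [
    ("network", lambda s: s.get("ja4_matches_browser") is True),
    ("javascript", lambda s: s.get("navigator_webdriver") is False),
    ("webassembly", lambda s: s.get("timer_precision_ok") is True),
    ("behavioural", lambda s: s.get("mouse_path_humanlike") is True),
]


def first_failed_layer(signals: Signals) -> Optional[str]:
    """Return the first gate the client fails, or None if it clears all four."""
    for name, check in LAYERS:
        if not check(signals):
            return name  # gates, not a points total: one failure blocks
    return None


# A TLS-impersonating HTTP client clears the network gate, then stops
# at the first layer that needs JavaScript to run.
http_client_only = {"ja4_matches_browser": True}
print(first_failed_layer(http_client_only))
```

Because the function returns on the first failure, strong signals in later layers can never compensate for a weak one earlier, which is the difference from a summed score.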
### The five-vector coherence test
On top of the four detection layers, vendors run a separate **identity-coherence check**. The idea is simple: a real visitor's details should all tell the same story. These five vectors must agree:
- **IP** — geolocation, ASN type (residential / datacenter / mobile)
- **Timezone** — Intl.DateTimeFormat().resolvedOptions().timeZone
- **Accept-Language** — HTTP header
- **WebRTC** — candidate IP exposed by STUN/TURN
- **DNS** — resolver used (matches ISP or VPN?)
Here is what coherent looks like: an IP in São Paulo, a timezone of America/Sao_Paulo, an Accept-Language: pt-BR, a WebRTC candidate that matches the proxy, and a Brazilian ISP DNS resolver — every signal points to the same person in the same place. Now the giveaway: a US datacenter IP with a Tokyo timezone, English Accept-Language, and a WebRTC leak that reveals the operator's real home IP. That mismatch is the most common scraping signature and is trivially blocked. **Proxy-rotation tools that change only the IP fail this test every time**, because they leave the other four vectors pointing elsewhere.
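The coherence test amounts to comparing each vector against the story the IP tells. A minimal sketch, assuming two tiny lookup tables stand in for real geolocation data; the `Identity` fields and function name are hypothetical, and the IP addresses are documentation-range placeholders.

```python
from dataclasses import dataclass
from typing import List

# Tiny illustrative tables; a real check would use full geo databases.
TIMEZONE_COUNTRY = {"America/Sao_Paulo": "BR", "Asia/Tokyo": "JP",
                    "America/New_York": "US"}
LANGUAGE_COUNTRY = {"pt-BR": "BR", "ja-JP": "JP", "en-US": "US"}


@dataclass
class Identity:
    ip: str
    ip_country: str
    timezone: str
    accept_language: str
    webrtc_ip: str
    dns_country: str


def incoherent_vectors(identity: Identity) -> List[str]:
    """List every vector that disagrees with the IP's country or address."""
    problems = []
    if TIMEZONE_COUNTRY.get(identity.timezone) != identity.ip_country:
        problems.append("timezone")
    # Only the first language tag is compared, with any q-value stripped.
    primary = identity.accept_language.split(",")[0].split(";")[0].strip()
    if LANGUAGE_COUNTRY.get(primary) != identity.ip_country:
        problems.append("accept-language")
    if identity.webrtc_ip != identity.ip:
        problems.append("webrtc")
    if identity.dns_country != identity.ip_country:
        problems.append("dns")
    return problems


sao_paulo = Identity("198.51.100.10", "BR", "America/Sao_Paulo",
                     "pt-BR,pt;q=0.9", "198.51.100.10", "BR")
rotated_ip_only = Identity("203.0.113.5", "US", "Asia/Tokyo",
                           "en-US,en;q=0.8", "192.0.2.44", "US")
print(incoherent_vectors(sao_paulo), incoherent_vectors(rotated_ip_only))
```

Changing only `ip` and `ip_country` on an otherwise fixed identity leaves the remaining vectors pointing at the old location, which is exactly what the returned list exposes.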
### What separates good detection from bad
Bad detection blocks on a User-Agent regex ("deny anything with `Bot` in the name"), which is trivially circumvented and catches almost nothing real. Mediocre detection blocks datacenter IP ranges and JA3 hashes (a fingerprint of the TLS handshake) of known scraping libraries; this catches the lazy 80% of scrapers but misses anything running a real browser behind proxies. Good detection is what the major commercial vendors sell. They pool signals from thousands of customer sites, update their models as new automation patterns appear, and correlate identities across requests, so passing one challenge is not permanent: the same fingerprint is recognised on later requests. That is why detection is a continuously evolving field rather than a fixed system.
### What detection systems weigh most
Three factors dominate. First, identity consistency: detection compares the IP type (residential, mobile, or datacenter), the browser environment, and whether the fingerprint matches configurations that actually exist in the wild. Second, behavioural realism: detection scores request timing, session-level cookie continuity, and whether request gaps look human or mechanically even. Third, model freshness: when a site adopts a new detection vendor, its scoring changes, which is why detection accuracy is best understood as an evolving capability rather than a fixed rule set. For authorized data collection on sites you own or are permitted to access, managed scraping APIs absorb this ongoing change so you do not maintain the configuration yourself.
### FAQ
**Q: How do sites know I'm using a bot?**
Usually it's a combination of signals working together — a datacenter IP, headless-browser leaks, a TLS fingerprint that doesn't match a real browser, unrealistic request timing — rather than any single giveaway. Modern systems score the whole picture, not individual clues.
**Q: Is anti-bot detection ever perfectly accurate?**
No detection system is flawless, and none stays static. Vendors improve their models continuously, so the signals a given configuration produces are scored differently over time. Detection is best understood as an evolving capability rather than a fixed, solved system.
**Q: Which anti-bot vendor is the most aggressive?**
DataDome and PerimeterX (HUMAN) tend to score automated traffic most aggressively. Cloudflare is everywhere and improving fast. Akamai is strong on financial and travel sites. Which one a given site uses depends on the target and changes over time.
**Q: Does respecting robots.txt affect bot detection?**
Not directly — anti-bot vendors don't read robots.txt. Respecting it is good practice and lowers your legal risk, but the detection layer scores technical signals regardless of what robots.txt says.
**Q: Why is a request blocked even though my fingerprint passes a fingerprint-test site?**
Fingerprint-test sites usually only check Layer 2 (JavaScript). Your block happened at Layer 1 (TLS / IP) or in the five-vector coherence test — neither of which the test site inspects. A quick tell: if you get a 403 before any JavaScript runs, the failure is at Layer 1.
**Q: Do all vendors run all four layers?**
No. Imperva and AWS WAF default to Layer 1 plus a light Layer 2. Akamai, Cloudflare Bot Management, and PerimeterX run all four. DataDome leans heavily on Layer 4 behavioural ML. F5 Shape runs Layers 1 + 2 plus a custom VM that defies easy categorisation. The vendor cheatsheet entry maps which layers each one weights heaviest.
---
## What Is TLS Fingerprinting (JA3/JA4)?
URL: https://scrappey.com/qa/anti-bot/what-is-tls-fingerprinting
**TLS fingerprinting is a way to recognize what software made a connection just by looking at how it sets up encryption — before the server reads a single byte of your request.** TLS is the encryption layer behind https, and every connection starts with a "handshake" where the client announces which ciphers, extensions, elliptic curves, and ALPN values it supports. The exact list and order are turned into a short fixed ID (a hash): JA3 since 2017, JA4+ since 2023. Anti-bot vendors run this check at the CDN edge (the network layer in front of a site) and compare your hash to a catalogue of every popular HTTP library — so a scraper using Python requests can be flagged before its headers even arrive.
### Quick facts
- **Standard:** JA4+ (2023, FoxIO) — replaces JA3 (2017, Salesforce)
- **Computed at:** CDN edge (Cloudflare Rust crate, Akamai EdgeWorker)
- **Fields hashed:** TLS version, cipher suites, extensions, curves, ALPN
- **Flags clients like:** requests, httpx, urllib3, curl (default), Go net/http
- **Matched by:** curl_cffi, tls-client, real browsers, Camoufox
### How TLS fingerprinting works
The first message of a TLS handshake is the Client Hello. It lists, in a fixed order, the protocol version, the cipher suites the client supports, named curves, point formats, signature algorithms, and the TLS extensions it sends. The catch: each software library fills in this list differently, so the combination acts like a signature for the tool that built it. **Python requests looks nothing like Chrome; Go net/http looks nothing like Firefox; curl looks like none of them.** Anti-bot vendors hash these fields and compare the result to known browser baselines.
This hash is computed at the CDN edge, before any HTTP byte is parsed, before any HTML is served, before any JavaScript runs. That is why a perfect User-Agent spoof from requests still gets blocked — the TLS layer already gave the game away.
### JA3 vs JA4 — what changed in 2023
The first widely adopted fingerprint was **JA3** (Salesforce, 2017): an MD5 hash (a short fixed-length code) of SSLVersion,Cipher,Extensions,EllipticCurve,EllipticCurvePointFormat. It worked for years but had two problems: browsers began varying the Client Hello on purpose (Chrome shuffles its extension order, and GREASE values are random placeholders browsers insert deliberately), which made an order-sensitive hash change from run to run, and a single MD5 hid which field had actually changed.
**JA4+** (FoxIO, 2023) is the replacement. Instead of one MD5 it produces a structured fingerprint with separate parts: the TLS Client Hello (JA4), the HTTP request headers (JA4H), the TLS server response (JA4S), and TCP-level options and window size (JA4T). Each part can be read on its own, ignores GREASE so it stays stable, and is human-readable rather than one opaque hash.
| Fingerprint | What it covers | Status |
| --- | --- | --- |
| JA3 | TLS Client Hello (MD5) | Legacy, still seen on older WAFs |
| JA4 | TLS Client Hello, GREASE-stable | Current standard at major vendors |
| JA4H | HTTP request headers (order and content) | Required to pass Akamai, Cloudflare |
| JA4S | Server-side TLS response fingerprint | Used by vendors to detect MITM proxies |
| JA4T | TCP options + window size | Rare, DataDome experimental |
If you are still targeting JA3 only on a 2026 site, your fingerprint can look correct in the testing tool tls.peet.ws yet still be blocked in production. Match JA4 + JA4H together.
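The order problem is easy to reproduce. The sketch below builds a JA3-style string from the five fields named above and hashes it with MD5, then shows a simplified order-stable variant that sorts the extensions first. It is an illustration under stated assumptions: GREASE handling is left out, the numeric values are sample inputs, and `order_stable_hash` only mimics the sorting idea behind JA4; it is not the real JA4 format.

```python
import hashlib
from typing import Sequence


def _join(values: Sequence[int]) -> str:
    """Dash-join decimal values, the way a JA3 string lists each field."""
    return "-".join(str(v) for v in values)


def ja3_string(version: int, ciphers: Sequence[int], extensions: Sequence[int],
               curves: Sequence[int], point_formats: Sequence[int]) -> str:
    """SSLVersion,Cipher,Extensions,EllipticCurve,EllipticCurvePointFormat."""
    return ",".join([str(version), _join(ciphers), _join(extensions),
                     _join(curves), _join(point_formats)])


def ja3_style_hash(*fields) -> str:
    """MD5 of the JA3-style string; sensitive to the order extensions arrive in."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()


def order_stable_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """Simplified stand-in for the JA4 idea: sort extensions before hashing."""
    return ja3_style_hash(version, ciphers, sorted(extensions), curves, point_formats)


# Same client, two connections, extensions sent in a different order.
first = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
second = (771, [4865, 4866, 4867], [11, 0, 65281, 23, 10], [29, 23, 24], [0])
print(ja3_style_hash(*first) == ja3_style_hash(*second))        # order-sensitive
print(order_stable_hash(*first) == order_stable_hash(*second))  # order-stable
```

Sorting trades a little information (the original order) for a fingerprint that survives deliberate shuffling.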
### HTTP/2 framing: the layer on top of TLS
Even if your TLS handshake matches Chrome exactly, the HTTP/2 connection that runs on top of it has its own fingerprint. When Chrome opens an HTTP/2 connection it sends a **specific SETTINGS frame** — the opening message that declares connection parameters (HEADER_TABLE_SIZE=65536, INITIAL_WINDOW_SIZE=6291456, MAX_HEADER_LIST_SIZE=262144) — followed by a **WINDOW_UPDATE frame** with a delta of 15663105. The frame order and exact values change with each Chrome major version.
Go's net/http2 and Python's httpx use different SETTINGS values and a different frame order. Vendors hash that SETTINGS frame and compare. curl_cffi ships with the Chrome-matching values baked in; tls-client and noble-tls need them set by hand. This is the second-most-common reason a request with a correct JA4 still gets a 403.
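To make the SETTINGS fingerprint concrete, the sketch below encodes the values quoted above into the HTTP/2 wire format (a 9-byte frame header followed by 6 bytes per setting) and derives an order-sensitive text signature. The setting identifiers come from the HTTP/2 specification; the signature format and function names are invented for illustration, and any settings Chrome sends beyond the three quoted in the text are omitted.

```python
import struct
from typing import List, Tuple

# Setting identifiers defined by the HTTP/2 specification.
HEADER_TABLE_SIZE = 0x1
INITIAL_WINDOW_SIZE = 0x4
MAX_HEADER_LIST_SIZE = 0x6

# Values quoted in the text for Chrome.
CHROME_SETTINGS: List[Tuple[int, int]] = [
    (HEADER_TABLE_SIZE, 65536),
    (INITIAL_WINDOW_SIZE, 6291456),
    (MAX_HEADER_LIST_SIZE, 262144),
]
CHROME_WINDOW_UPDATE = 15663105


def frame(frame_type: int, payload: bytes, flags: int = 0, stream_id: int = 0) -> bytes:
    """Prefix a payload with the 9-byte HTTP/2 frame header."""
    length = struct.pack(">I", len(payload))[1:]  # 24-bit length
    return length + struct.pack(">BBI", frame_type, flags, stream_id & 0x7FFFFFFF) + payload


def settings_frame(settings: List[Tuple[int, int]]) -> bytes:
    """SETTINGS (type 0x4): a 16-bit identifier and 32-bit value per entry."""
    payload = b"".join(struct.pack(">HI", ident, value) for ident, value in settings)
    return frame(0x4, payload)


def window_update_frame(increment: int) -> bytes:
    """WINDOW_UPDATE (type 0x8): a 31-bit window size increment."""
    return frame(0x8, struct.pack(">I", increment & 0x7FFFFFFF))


def settings_signature(settings: List[Tuple[int, int]], window_increment: int) -> str:
    """Order-sensitive text form of the opening frames (hypothetical format)."""
    pairs = ";".join(f"{ident}:{value}" for ident, value in settings)
    return f"{pairs}|{window_increment}"


print(settings_signature(CHROME_SETTINGS, CHROME_WINDOW_UPDATE))
print(settings_frame(CHROME_SETTINGS).hex())
```

Two clients that send the same three values in a different order produce different signatures, which is why matching the values alone is not enough.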
### QUIC and HTTP/3 — the next surface
Cloudflare and Google now serve a large share of traffic over HTTP/3, which runs on QUIC, a transport built on UDP instead of TCP. QUIC has its own handshake, its own SETTINGS-frame equivalent, and therefore its own fingerprint surface, and almost no scraping library supports it yet. Most scrapers ignore the server's Alt-Svc advertisement of HTTP/3 and quietly stay on HTTP/2, and that fallback is itself a signal.
For now QUIC-aware fingerprinting rarely shows up in production blocks, but expect it to matter by late 2026, especially on Google properties and Cloudflare's premium tier. The fix is the same as everywhere else: pick a library that completes the handshake the way Chrome does, end to end.
### Why this matters for scraping
Say your scraper sends a User-Agent header claiming to be Chrome 131 on macOS, but its JA4 hash matches Python urllib3. You now get flagged *faster* than if you had spoofed nothing at all, because the mismatch between what you claim and what your TLS shows *is* the signal. Spoofing headers without also spoofing TLS is the single most common rookie mistake in modern scraping.
JA4+ also covers JA4H (HTTP header order and content), JA4X (X.509 certificates, the certificates that prove identity in TLS), and JA4T (TCP options + window). The profile an anti-bot builds reaches well beyond the cipher list.
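The claim-versus-evidence comparison can be sketched as a lookup plus one equality test. Everything here is hypothetical: the fingerprint IDs are made-up placeholders rather than real JA4 values, and the User-Agent parsing is deliberately crude.

```python
from typing import Dict, Optional

# Hypothetical catalogue: fingerprint ID -> the client family that produces it.
KNOWN_FINGERPRINTS: Dict[str, str] = {
    "fp-chrome-131": "chrome",
    "fp-firefox-133": "firefox",
    "fp-python-urllib3": "python-urllib3",
}


def claimed_family(user_agent: str) -> Optional[str]:
    """Very rough guess at which browser a User-Agent claims to be."""
    ua = user_agent.lower()
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return None  # makes no browser claim (e.g. a library's default UA)


def verdict(user_agent: str, tls_fingerprint: str) -> str:
    """Compare what the headers claim with what the handshake shows."""
    observed = KNOWN_FINGERPRINTS.get(tls_fingerprint)
    claimed = claimed_family(user_agent)
    if observed is None:
        return "unknown"
    if claimed is None:
        return "unclaimed"   # a tool that is not pretending to be a browser
    return "consistent" if claimed == observed else "mismatch"


chrome_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36")
print(verdict(chrome_ua, "fp-python-urllib3"))
```

The spoofed header is what turns an "unclaimed" library into a "mismatch", which is why header-only spoofing can make things worse.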
### How clients match a browser's TLS
To present a TLS handshake consistent with a real browser, scrapers use an HTTP client that reproduces that browser's TLS. **curl_cffi** wraps curl-impersonate, a patched build of curl that uses BoringSSL (Chrome's own TLS library) with Chrome's exact cipher list and HTTP/2 SETTINGS frames. A single impersonate="chrome131" argument gives you a JA4 that is indistinguishable from real Chrome. **tls-client** (Go), **noble-tls**, and **hrequests** do the same in their own ecosystems. Real browsers — Chrome, Firefox, Camoufox — speak real TLS by definition, so they pass without any tricks.
Keep your impersonation profiles current. A Chrome 120 fingerprint in 2026 is itself suspicious, because real users have long since updated.
### Example
```python
from curl_cffi import requests

# One argument replaces the entire TLS stack with Chrome's
r = requests.get(
    "https://protected-site.com/api",
    impersonate="chrome131",
    # Placeholder proxy URL: substitute real credentials, host, and port
    proxies={"https": "http://user:pass@residential-proxy:port"},
    timeout=20,
)
print(r.status_code, len(r.text))
```
### FAQ
**Q: What is the difference between JA3 and JA4?**
JA3 (2017) hashed the TLS extensions in the exact order they were sent. Chrome started shuffling that order in 2022, which broke JA3 — the same browser produced different hashes across sessions. JA4 (2023) sorts the extensions alphabetically and removes GREASE (deliberate random padding) before hashing, so the fingerprint stays stable no matter how the order is randomised. JA4+ extends the same idea to HTTP headers, certificates, and TCP options.
**Q: Can I just spoof my User-Agent to fix this?**
No — and it usually makes detection faster, not slower. The User-Agent header is read only after the TLS handshake. If your User-Agent claims Chrome but your TLS hash matches Python urllib3, that mismatch is itself a strong bot signal. You have to spoof the TLS layer as well.
**Q: Does using HTTPS hide the fingerprint?**
No — the TLS Client Hello is sent unencrypted at the very start of every HTTPS connection. That handshake is the thing being fingerprinted, and it happens before any encrypted data flows.
**Q: Which Python library should I use?**
curl_cffi is the most popular drop-in replacement for requests. It supports Chrome, Firefox, and Safari impersonation profiles and handles HTTP/2 SETTINGS frames correctly. It is the default first-step upgrade for any Python scraper running into a real anti-bot.
**Q: Is JA3 still useful, or do I have to target JA4 now?**
For older WAFs (web application firewalls) such as Imperva or AWS WAF Bot Control on default settings, JA3 alone is still enough to score the request. But for Akamai, Cloudflare Bot Management, DataDome, and PerimeterX in 2026, JA4 and JA4H are scored together — matching only JA3 produces a "wrong-shape Chrome" signal that gets ranked as a bot.
**Q: Do I need to handshake over QUIC to handle modern anti-bot TLS checks today?**
Not yet. Cloudflare still accepts HTTP/2 from real Chrome on the vast majority of sites, and most scraping libraries don't support QUIC anyway. But the gap between "scrapers use HTTP/2" and "real users use HTTP/3" is starting to become a signal on Cloudflare's premium tier — worth keeping an eye on.
---
## What Is Canvas Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-canvas-fingerprinting
**Canvas fingerprinting is a way for a website to identify your device by asking the browser to draw a tiny invisible image, then turning the resulting pixels into a short ID (a hash).** Here is the trick: the exact same drawing instructions come out slightly differently on different GPUs, graphics drivers, and operating systems. That variation is enough to tell one device apart from another, and consistent enough to instantly expose a headless browser (a browser with no visible window, often used by bots) or a software-only renderer.
### Quick facts
- **Introduced:** 2012 academic paper, mainstream by 2016
- **How it varies:** GPU model, driver version, sub-pixel rendering, OS font stack
- **Used by:** Akamai, DataDome, PerimeterX, Cloudflare Turnstile
- **Tells on:** Headless Chrome, SwiftShader, Mesa llvmpipe (known hashes)
- **Varies with:** Real GPU, Camoufox, CloakBrowser (C++ rendering patches)
### How canvas fingerprinting works
A script adds a hidden <canvas> element (a drawing surface) to the page, draws shapes, gradients, and text on it using canvas.getContext("2d"), then calls canvas.toDataURL() to read the pixels back out. The exact pixels depend on several things: the GPU brand and model, the driver version, how the OS renders fonts (Windows ClearType vs macOS CoreText), and canvas DPI scaling (how pixels map to the display). Hashing that pixel data gives a short identifier that stays the same for one device across visits, and differs from almost every other device.
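The hashing step can be sketched in a few lines of Python. This is an illustration under stated assumptions: the two pixel buffers are made-up stand-ins for what two devices might render, and `canvas_hash` is a hypothetical helper name, not any vendor's function.

```python
import hashlib


def canvas_hash(rgba_pixels: bytes) -> str:
    """Reduce raw pixel bytes to a short, stable identifier."""
    return hashlib.sha256(rgba_pixels).hexdigest()[:16]


# Two made-up 2x1 RGBA buffers: identical except for one anti-aliased
# edge pixel whose alpha value differs by a single step.
device_a = bytes([255, 102, 0, 255, 0, 102, 153, 128])
device_b = bytes([255, 102, 0, 255, 0, 102, 153, 127])

print(canvas_hash(device_a), canvas_hash(device_b))
```

A one-step difference in a single channel is enough to produce an unrelated hash, while the same buffer always hashes to the same value.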
WebGL fingerprinting does the same thing but for 3D rendering, and it gives away even more: it can read the GPU name directly through the WEBGL_debug_renderer_info extension, via gl.getParameter(ext.UNMASKED_RENDERER_WEBGL). Real Chrome returns something like ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0), while headless Chrome with no GPU returns a generic software string or fails to start the WebGL context at all.
### Why scrapers get caught
Headless browsers usually run on servers with no GPU, so they fall back to software rendering — either Chrome's built-in SwiftShader or Mesa llvmpipe on Linux. Both are predictable: they always produce canvas hashes from a small, known set. Anti-bot systems keep a catalogue of these hashes and flag any visitor that matches one. Worse still, if the script asks for the WebGL renderer string and gets SwiftShader Device 0x0000C0DE, that exact value is already on a blocklist.
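Seen from the server side, the two checks described above amount to a substring match and a set lookup. The sketch below is a simplified assumption of how such a check could look. The marker list, the function names, and the placeholder hash are illustrative and not taken from any vendor.

```python
SOFTWARE_RENDERER_MARKERS = ("swiftshader", "llvmpipe")


def is_software_renderer(renderer: str) -> bool:
    """True if the reported WebGL renderer names a software rasterizer."""
    lowered = renderer.lower()
    return any(marker in lowered for marker in SOFTWARE_RENDERER_MARKERS)


def should_flag(renderer: str, canvas_hash: str, known_hashes: set[str]) -> bool:
    """Flag on either signal: a software renderer or a catalogued hash."""
    return is_software_renderer(renderer) or canvas_hash in known_hashes


# Placeholder catalogue entry, not a real hash
catalogue = {"placeholder-hash-of-software-render"}

print(should_flag("SwiftShader Device 0x0000C0DE", "anything", catalogue))
print(should_flag(
    "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)",
    "some-other-hash",
    catalogue,
))
```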
Trying to fix this by overriding toDataURL() in JavaScript backfires. Checking Function.prototype.toString.call(canvas.toDataURL) shows the modified source code instead of the expected [native code] marker — so the patch itself becomes the giveaway that this is a bot.
### How consistent canvas rendering works
Approaches that produce consistent results work deep in the browser's C++ rendering code, not in JavaScript. **CloakBrowser** patches Chromium's rendering layer to add a tiny, consistent amount of noise to the pixels before toDataURL() returns them — every profile gets a different hash, yet the image looks identical to the eye. **Camoufox** does the same for Firefox using patches applied when the browser is built. The gold standard is still **real browsers on real consumer hardware**: an actual laptop with an Intel integrated GPU produces a genuine canvas hash that no blocklist will match.
A 2026 development makes this messier: services like Bablosoft PerfectCanvas now sell *real canvas hashes harvested from real consumer GPUs*, which can be replayed into headless sessions. In response, anti-bots are combining canvas checks with signals that are harder to fake — like WASM SIMD timing (how fast the CPU runs certain math) and the physics of mouse movement (behavioural Bezier curves) — so a clean canvas hash on its own no longer proves a browser is real.
### PerfectCanvas — replaying a real machine's output
The naive approach is to add random noise to the canvas pixels. It fails because anti-bot vendors expect each machine to give *stable* output: a hash that changes on every page load looks more bot-like than a hash that is wrong but consistent. The current best technique is **canvas replay**, made popular by the PerfectCanvas project that ships inside Multilogin and several commercial anti-detect browsers.
It works in two steps: (1) on a real machine, run a large set of known drawing operations and save the resulting pixel data, then (2) when the headless browser is asked to draw any of those same operations, hand back the pre-recorded real pixels instead of the degraded SwiftShader output. The result is byte-for-byte identical to the real machine for any test the vendor knows to run, and it stays consistent within a session — exactly what a real device looks like.
The weak spot is unfamiliar drawing operations. A vendor that suspects replay can request a render the database has never seen — the headless browser then falls back to its real (broken) output, and the mismatch gives it away. The whole contest comes down to whether the replay database covers more operations than the vendor's tests probe.
### Example
```javascript
// A typical canvas fingerprinting routine, of the kind anti-bot scripts run
function canvasFingerprint() {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');
  // Text plus a filled rectangle exercises font rendering and anti-aliasing
  ctx.textBaseline = 'top';
  ctx.font = '14px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillRect(125, 1, 62, 20);
  ctx.fillStyle = '#069';
  ctx.fillText('Cwm fjordbank glyphs vext quiz', 2, 15);
  return canvas.toDataURL(); // hashed and sent to the server
}
```
### FAQ
**Q: Can I disable canvas fingerprinting?**
Real users can't easily turn it off without breaking pages. Privacy extensions like CanvasBlocker randomise the output, but that randomness is itself detectable — "this canvas hash is too random to be real hardware." For scraping you don't want to disable it at all; you want to look like a real device.
**Q: Do all browsers have the same canvas hash?**
No. Even the exact same browser version on the same OS can produce different hashes on different machines, because the GPU and driver affect the pixels. That difference is precisely what makes the technique useful for fingerprinting.
**Q: Is canvas fingerprinting legal under GDPR?**
Under GDPR (the EU's data protection law), a canvas fingerprint is generally treated as personal data because it identifies devices. Sites need a lawful basis (usually the user's consent) to use it for tracking. Using it purely for anti-fraud or bot defense is more commonly accepted under the "legitimate interest" basis.
**Q: What is the difference between canvas and WebGL fingerprinting?**
Canvas uses 2D rendering and is faster; WebGL uses 3D rendering and also exposes the GPU name string. Anti-bots usually combine the two — a contradiction between them (for example, WebGL reports "Intel UHD 620" but the canvas hash matches a software renderer) is itself a strong bot signal.
**Q: Why is canvas noise injection a worse strategy than fixed spoofing?**
Real machines give a stable canvas hash within a session — the same render always produces the same pixels. A scraper that randomises its hash on every page load stands out against the way real users behave. A scraper with a fixed-but-wrong hash at least looks like a single consistent machine.
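The stability argument can be sketched from the detector's point of view. This is a toy model under stated assumptions: the class name, the session IDs, and the hash strings are hypothetical, and a real system would weigh this signal alongside many others.

```python
from collections import defaultdict


class CanvasStabilityCheck:
    """Track which canvas hashes each session has reported."""

    def __init__(self) -> None:
        self._seen: dict[str, set[str]] = defaultdict(set)

    def observe(self, session_id: str, canvas_hash: str) -> bool:
        """Record a hash and return True if the session now looks unstable."""
        hashes = self._seen[session_id]
        hashes.add(canvas_hash)
        return len(hashes) > 1


check = CanvasStabilityCheck()
# A session with a fixed hash never trips the check
print(check.observe("fixed", "aaa"), check.observe("fixed", "aaa"))
# A session whose hash changes between page loads trips it on the second load
print(check.observe("noisy", "aaa"), check.observe("noisy", "bbb"))
```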
---
## What Is DataDome?
URL: https://scrappey.com/qa/anti-bot/what-is-datadome
**DataDome is a bot-protection vendor used on roughly 1,200 enterprise sites, scoring more than 5 trillion signals per day.** Its job is to tell real visitors apart from automated scrapers. Unlike Cloudflare and Akamai, it trains a separate machine-learning model for each protected site (roughly 85,000 in total), runs at the website's own application server rather than at the CDN edge (the network of relay servers that sit in front of a site), and returns a verdict in about 2 ms. That design means its behaviour varies from site to site. What you observe on Grainger.com may differ from what you observe on Le Monde, even with the exact same TLS fingerprint (TLS is the encryption layer behind https), the same browser, and the same proxy.
### Quick facts
- **Detection cookie:** datadome (also dd_cookie_test)
- **Models:** ~85,000 — one per protected site
- **Decision latency:** ~2 ms, real-time, per request
- **Key signals:** IP reputation (25–30%), TLS, WASM boring_challenge, Picasso device FP, 35+ behavioural signals
- **How it is studied:** curl_cffi + mobile/residential proxy, __NEXT_DATA__ extraction
### How DataDome works
When a request hits a DataDome-protected site, that request is forwarded in real time to DataDome's scoring service while the page is being served. The model looks at several signals at once: IP reputation (the history and trustworthiness of your IP address, which by itself accounts for 25–30% of the score), the TLS fingerprint (the unique signature your client leaves when it sets up the https connection), HTTP/2 frame characteristics (low-level details of how your client packages its requests), the datadome cookie if one is present, and any behavioural data the site has collected from you before. A score comes back in roughly 2 ms — fast enough to block a bot inline without slowing down the page for real users.
The **WASM boring_challenge** is DataDome's most distinctive piece. WASM, short for WebAssembly, is compiled code that runs inside the browser at near-native speed. This challenge is a small program written in Rust and compiled to WASM; it runs in the browser as a state machine (a step-by-step process that moves through defined states) and spits out a token proving the work was done. Because it is real code executing against real browser APIs, you cannot solve it without an actual browser environment. Headless detection (spotting browsers with no visible window, the usual sign of automation) happens here too: the WASM probes the CPU using SIMD timing — measuring how fast certain parallel math instructions run — in a way that no stealth-browser JavaScript patch can fake.
### Why its behaviour varies per site
With 85,000 per-site models, DataDome tunes how strict it is for each customer. Le Monde (a news site, light scoring) blocks far less aggressively than Grainger (e-commerce, hard scoring). So the same client configuration can be scored very differently from one customer's site to another. There is no single, universal way it behaves — the model is per-site and can be retrained at any time.
### What scrapers actually do
Three strategies, in priority order:
- **Look for the data in the initial HTML first.** Many DataDome-protected Next.js sites embed the full page state in a __NEXT_DATA__ script tag — confirmed on Grainger.com, where a 110KB JSON blob holds all the product data right in the first HTML response. A tool like curl_cffi plus a residential proxy fetches that HTML directly. DataDome never even runs its WASM check, because no follow-up XHR (background JavaScript request for more data) ever fires.
- **Use mobile or ISP residential proxies for XHR endpoints.** IP reputation carries so much weight that simply switching from a datacenter IP to a mobile-4G one often flips a session from blocked to a 200 OK response with nothing else changed. Rotating residential IPs is risky; static ISP or mobile IPs are the safest choice.
- **Use Camoufox with geoip=True** when the page genuinely runs the WASM challenge. The five identity signals — IP, WebRTC (a browser feature that can leak your real IP), DNS, timezone, and Accept-Language — all have to point to the same location.
Datacenter IPs are not a viable starting point: their poor IP reputation gets them rejected before any fingerprint detail even comes into play.
### Example
```python
import re

import chompjs
from curl_cffi import requests

# Many DataDome-protected sites embed all data in __NEXT_DATA__
r = requests.get(
    "https://target.com/product/123",
    impersonate="chrome131",
    proxies={"https": "http://user:pass@mobile-4g-proxy:port"},
    timeout=20,
)
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', r.text, re.S)
if m is None:
    # Blocked, or the page does not embed its state in the HTML
    raise SystemExit(f"__NEXT_DATA__ not found (HTTP {r.status_code})")
data = chompjs.parse_js_object(m.group(1))
print(data["props"]["pageProps"]["product"])
```
### FAQ
**Q: Is DataDome the same as Cloudflare?**
No. Cloudflare runs at the CDN edge (the relay network in front of a site) and uses one global ML model trained on roughly 20% of all internet traffic. DataDome runs per-site at the application layer, with 85,000 separate models. They look for different things and behave very differently.
**Q: Does a residential proxy alone change how DataDome scores a session?**
On the lightest deployments, IP reputation carries enough weight that it can. On most production e-commerce or ticketing sites, more signals are evaluated: the TLS fingerprint (the https handshake signature), and — for XHR endpoints (background data requests) — whether a real browser context ran the WASM challenge. The point is that DataDome scores many signals together, not the IP alone.
**Q: Why does DataDome respond in 2 ms?**
Because every request is scored on its own, inline, with no warm-up period where trust slowly builds up. The speed matters because the site can't afford to make real users wait while a model thinks. The catch for scrapers: every single request gets scored, not just the first one.
**Q: Does the datadome cookie mean I am whitelisted?**
No. The cookie just marks a session DataDome has seen before; the score is recalculated on every request. A valid cookie that passed on request 1 can still fail on request 50 if your behavioural fingerprint starts to look off. The cookie is a hint, not a free pass.
---
## What Is Akamai Bot Manager?
URL: https://scrappey.com/qa/anti-bot/what-is-akamai-bot-manager
**Akamai Bot Manager is an enterprise tool that websites use to tell real visitors apart from bots, and it guards roughly 30% of the Fortune 500 — airlines, banks, retailers, ticketing.** It works on two fronts. At the network edge it scores your JA4+ TLS handshake (TLS is the encryption layer behind https, and the handshake leaves a fingerprint of which client you are). Then, inside the browser, it runs a fingerprinting script (sensor.js, a ~512 KB scrambled file) that gathers 500+ signals across several requests. Trust builds up over a session rather than being decided in one shot: the _abck cookie starts at ~-1~ (not trusted yet) and only flips to ~0~ once sensor.js finishes its checks. See also browser fingerprinting.
### Quick facts
- **Customers:** ~30% of Fortune 500 — airlines, banks, retail, ticketing
- **Detection cookies:** _abck (state ~-1~ → ~0~), bm_sz
- **Sensor script size:** ~512 KB, re-obfuscated per rotation
- **Distinct probe:** 60 chrome-extension:// fetches — zero passing = instant block
- **Scoring model:** Multi-request, trust accumulates across the session
### How Akamai scores a session
Akamai checks you in two layers. **The first is JA4+ at the EdgeWorker** — code running on Akamai's edge servers. This fires before any HTML is sent, so a suspicious TLS handshake alone can get you blocked. Clear that layer and the page loads with sensor.js built right into it (inlined or nearly so). That script runs the deeper tests: a canvas hash (a fingerprint from drawing an image), the WebGL renderer (your GPU's name), AudioContext, navigator properties, the Battery API, and a probe that tries to fetch 60 known chrome-extension:// URLs. Real Chrome users almost always have a few extensions installed (uBlock Origin, 1Password, LastPass), so some of those fetches succeed; a headless browser has none, so all 60 fail at once — something that essentially never happens for a real person.
The script then POSTs everything it collected to /_bm/data. Only after that POST succeeds does _abck flip from ~-1~ to ~0~. Protected data endpoints (the XHR calls a page makes for content) check this cookie first — if it still says ~-1~, you get a 412 "Pardon Our Interruption" no matter what else looks right.
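Reading the state out of the cookie can be sketched as below. This assumes the state is the second `~`-delimited field of the cookie value, which matches the ~-1~ and ~0~ notation used above. The helper names and the sample values are placeholders, not real cookies.

```python
from typing import Optional


def abck_state(cookie_value: str) -> Optional[int]:
    """Return the integer state field of an _abck value, or None if absent."""
    parts = cookie_value.split("~")
    if len(parts) < 2:
        return None
    try:
        return int(parts[1])
    except ValueError:
        return None


def sensor_accepted(cookie_value: str) -> bool:
    """True once the state has flipped from -1 to 0."""
    return abck_state(cookie_value) == 0


# Placeholder cookie values for illustration only
print(sensor_accepted("PLACEHOLDER~-1~PLACEHOLDER~-1~-1"))
print(sensor_accepted("PLACEHOLDER~0~PLACEHOLDER~-1~-1"))
```

A check like this is mainly useful for logging: it tells you whether a session reached the trusted state before a protected endpoint was requested.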
### Which signals tend to flag a client
**Signals that commonly draw a flag:** headless Chrome (no real GPU, so the WebGL context is null), SwiftShader (its software-GPU device ID 0x0000C0DE is widely recognised), JavaScript patches whose Function.toString() output reveals they have been rewritten, Page.addScriptToEvaluateOnNewDocument injection (the CDP automation protocol leaves visible timing artifacts), datacenter proxies of any kind, and rotating residential IPs mid-session (this wipes out the trust accumulated so far).
**Signals more consistent with a real client:** a TLS handshake and HTTP/2 frame order that match a mainstream browser, a session that keeps one stable IP and one set of cookies, and a browser profile that actually has the extensions and GPU a real machine would have, so the 60-probe check sees the same mix a person's browser would. These are the same coherence properties that any legitimate browser session naturally exhibits.
### Session hygiene that matters
Because Akamai scores you across many requests, it penalizes inconsistency harder than most vendors do:
- Use one ISP static residential IP for the whole session — never switch in the middle.
- Warm up first: visit the homepage, wait 2–3 seconds, scroll, then go to the data URL.
- Set Accept-Language to match the country your proxy is in.
- Reuse cookies across requests — trust grows from keeping the same _abck cookie going.
### Example
```python
# For medium-strength Akamai deployments, this often works
from curl_cffi import requests

PROXIES = {"https": "http://user:pass@isp-residential:port"}

s = requests.Session(impersonate="chrome131")
# Fetch the homepage first so the session picks up its initial cookies
s.get("https://target.com/", proxies=PROXIES, timeout=20)
# Then hit the protected endpoint with the same session, cookies, and IP
r = s.get("https://target.com/api/listings", proxies=PROXIES, timeout=20)
print(r.status_code, len(r.text))
```
### FAQ
**Q: What does the _abck cookie mean?**
It is Akamai's record of how much it trusts your session. The state field reads ~-1~ when you first arrive (untrusted) and flips to ~0~ after sensor.js runs and POSTs valid signals back to Akamai. Protected data (XHR) endpoints look at this state and return a 412 error if it is still ~-1~.
**Q: What is the 60-extension probe?**
It is a test inside sensor.js that fires 60 fetch() requests to known chrome-extension://[id]/manifest.json URLs (uBlock Origin, LastPass, Bitwarden, and so on). Real Chrome users usually have at least a few extensions installed, so some of those requests succeed. A headless browser has none, so all 60 fail at the same time, a result that is very rare for a real user.
**Q: Why does Akamai care more about multi-request consistency than other vendors?**
Because trust builds up over the whole session instead of being judged one request at a time. Every clean interaction raises your score; every oddity lowers it. Rotating IPs partway through resets that accumulated trust. Other vendors (notably DataDome) score each request on its own, so mid-session changes hurt you less there.
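The difference from per-request scoring can be shown with a toy model. Everything here is an assumption made for illustration: the class name, the one-point increments, and the full reset on an IP change are not Akamai's actual scoring rules.

```python
from typing import Optional


class SessionTrust:
    """Toy model of trust that accumulates across one session."""

    def __init__(self) -> None:
        self.score = 0
        self.ip: Optional[str] = None

    def observe(self, ip: str, looks_clean: bool) -> int:
        if self.ip is not None and ip != self.ip:
            self.score = 0  # IP changed mid-session: accumulated trust is lost
        self.ip = ip
        self.score += 1 if looks_clean else -1
        return self.score


session = SessionTrust()
for _ in range(3):
    session.observe("203.0.113.5", looks_clean=True)
print(session.score)                                       # trust built on one IP
print(session.observe("198.51.100.7", looks_clean=True))   # after a mid-session switch
```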
**Q: Is rotating residential ever okay against Akamai?**
Only between sessions, never inside one. Pick a single IP per session and keep it for the entire visit. ISP static residential is the best fit because that IP stays put and doesn't shift under you.
---
## What Is PerimeterX (HUMAN)?
URL: https://scrappey.com/qa/anti-bot/what-is-perimeterx
**PerimeterX, now operating as part of HUMAN Security, is a bot-protection vendor whose biggest asset is its network.** Bot protection means software that sits in front of a website and decides whether each visitor is a real person or an automated script. PerimeterX protects 29,650+ sites — spanning major retail, real-estate, and ticketing platforms — and checks roughly 15 trillion interactions per week across 3 billion devices. The key idea: get detected on one customer's site and your fingerprint (the unique profile of your browser and connection) is flagged across the entire network. That cross-site reputation effect is the strongest in the industry.
### Quick facts
- **Sites protected:** 29,650+ retail, real-estate & ticketing sites
- **Weekly interactions verified:** ~15 trillion across 3 billion devices
- **Detection cookies:** _px3, _pxde, _pxhd
- **Scoring model:** 5-vector unified — TLS + IP + headers + JS FP + behaviour
- **Network effect:** Reputation shared across all customers — fingerprint burns are global
### How PerimeterX scores requests
PerimeterX checks five things at once, and all five must look human at the same time: your TLS fingerprint (TLS is the encryption layer behind https, and its handshake reveals which client you really are), your IP reputation, the order and content of your HTTP headers, your JavaScript fingerprint (values the browser exposes like canvas, WebGL, and audio), and your behaviour (mouse movement, scrolling, how long you linger). Fixing only one or two has little effect, because the score is combined and needs the full set. This is unlike Cloudflare (where TLS plus IP gets you a long way) or DataDome (where IP weight is dominant).
The fingerprint payload is sent in a POST request to collector-PXxxxxxx.perimeterx.net, and the resulting verdict is stored in the _px3 cookie. The "Human Challenge", a press-and-hold button, is the visible fallback shown when your score is borderline. A hard block instead returns a 403 error with no challenge at all.
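The "all five at once" rule from this section can be sketched as a simple gate. This is a deliberate simplification: a real score is presumably continuous rather than boolean, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class VectorChecks:
    """One pass/fail flag per vector named in this section."""
    tls: bool
    ip: bool
    headers: bool
    js_fingerprint: bool
    behaviour: bool


def passes_combined_check(checks: VectorChecks) -> bool:
    """Every vector has to pass; a single failure fails the whole request."""
    return all(getattr(checks, f.name) for f in fields(checks))


def failing_vectors(checks: VectorChecks) -> list[str]:
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


# Only TLS and IP were fixed: the combined check still fails
partial = VectorChecks(tls=True, ip=True, headers=False,
                       js_fingerprint=False, behaviour=False)
print(passes_combined_check(partial), failing_vectors(partial))
```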
### The network effect
Because HUMAN watches signals from 29,650 sites at the same time, a fingerprint flagged on one customer's site is automatically treated as lower-trust everywhere else. This cross-site reputation is what makes the network the company's strongest asset: a single browser profile that appears across many unrelated domains looks different from a real user, who normally visits a small set of sites from one consistent device. Tools such as Camoufox assign each browser instance its own coherent profile, which is how a per-domain isolation model works in practice.
### Why all five vectors are scored together
Because the score is combined, PerimeterX evaluates all five vectors as one picture, and a real browser session is coherent across every one of them:
- **TLS:** the handshake of a mainstream browser, such as a real Chrome or Camoufox instance.
- **IP:** residential or mobile connections behave differently from datacenter ranges, which PerimeterX weights heavily.
- **Headers:** the exact header order and capitalisation a given browser version sends.
- **JS fingerprint:** values that are internally consistent — JS patches whose Function.toString() output reveals they have been rewritten stand out.
- **Behaviour:** the navigation, mouse movement, and timing patterns a person naturally produces.
Managed verification APIs (such as Bright Data or Zyte) maintain this coherence across all five vectors for authorized browser workflows on sites you are permitted to access, billed per request, which often saves engineering time at volume.
### Example
```python
from camoufox.sync_api import Camoufox

# Camoufox with geoip aligns all 5 identity vectors and ships
# a real canvas + WebGL + audio fingerprint
with Camoufox(
    headless=True,
    geoip=True,
    proxy={
        "server": "http://residential:port",
        "username": "user",
        "password": "pass",
    },
    humanize=True,  # adds human-like mouse/scroll cadence
) as browser:
    page = browser.new_page()
    page.goto("https://perimeterx-protected.com/")
    page.wait_for_timeout(2500)  # let _px3 collector POST
    response = page.goto("https://perimeterx-protected.com/api/items")
    data = response.json()
### FAQ
**Q: Is PerimeterX the same as HUMAN Security?**
Yes — they are one company now. HUMAN Security acquired PerimeterX in 2022, and the PerimeterX product now runs as part of HUMAN's Bot Defender. The cookie names (_px3, _pxde) were left unchanged so existing integrations keep working.
**Q: What does the "Human Challenge" press-and-hold button do?**
While you hold the button, it measures how long you hold, how steady the pressure is, your finger movement, and the timing. Clicking the button itself is easy — the real decision comes from the surrounding fingerprint and IP, which also decide whether you ever see the button at all. Failing the challenge usually does not give you a CAPTCHA; it gives you a 403 error with no way forward.
**Q: Why is the network effect so important?**
Because PerimeterX shares reputation across 29,650+ customers, a fingerprint that fails on one customer's site is instantly treated as lower-trust on another. Scrapers that reuse fingerprints (the same canvas hash, the same TLS profile, the same IP) across targets burn through them very quickly. Keeping each target on its own isolated session is essential.
**Q: Does a request without a real browser look the same to PerimeterX?**
No. Because PerimeterX weighs behavioural signals (mouse movement, scrolling, timing) alongside TLS, IP, headers, and JS fingerprint, a plain HTTP client provides none of the behavioural vectors and looks very different from a real browser session. The higher a customer's configuration, the more those behavioural signals matter to the combined score.
---
## What Is Kasada?
URL: https://scrappey.com/qa/anti-bot/what-is-kasada
**Kasada is a bot-defense system that big retailers, ticketing sites, and sneaker drops put in front of their servers to manage automated traffic.** It works as a gatekeeper proxy: it sits between the visitor and the origin (the real application server), so every request passes through it first. Unlike Cloudflare or DataDome, it does not show a CAPTCHA. Instead, it makes the browser solve a JavaScript proof-of-work challenge — a small puzzle that costs real CPU time — before letting anything through. A request that does not pass receives a silent 403 or 429 error with no explanation. A notable 2026 detail: Kasada identifies playwright-stealth by running Function.prototype.toString() on native browser functions to see whether they have been modified. The patterns those patches leave behind are already catalogued.
### Quick facts
- **Detection headers:** x-kpsdk-ct, x-kpsdk-cd
- **Challenge file:** ips.js — polymorphic, renamed every deployment
- **Distinct signal:** Function.prototype.toString() inspection on patched APIs
- **Block style:** Silent 403 / 429 — no challenge page, no explanation
- **Notable compatible tool:** PatchRight (patches to Playwright itself, not runtime JS)
### How Kasada scores requests
Kasada is a gatekeeper proxy: every request flows through it before it reaches the origin (the real server). It serves a JavaScript file — named ips.js, but renamed on each deployment so its name keeps changing (polymorphic). That file hands the browser a proof-of-work challenge: a math puzzle that needs real CPU cycles and real browser APIs to solve. When the browser finishes, it gets a token (x-kpsdk-ct). Each token works only once — sending the same one twice is an instant red flag.
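The two ideas in this section, proof-of-work and single-use tokens, can be sketched generically. This is a hashcash-style illustration, not Kasada's actual algorithm, which is not public. The function names, the challenge string, and the difficulty value are all assumptions.

```python
import hashlib
import itertools


def _digest_value(challenge: str, nonce: int) -> int:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: str, difficulty_bits: int) -> int:
    """Search for a nonce whose digest starts with `difficulty_bits` zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        if _digest_value(challenge, nonce) < target:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Checking costs one hash, while solving costs many."""
    return _digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))


class SingleUseTokens:
    """Server-side record of tokens that have already been spent."""

    def __init__(self) -> None:
        self._spent: set[str] = set()

    def redeem(self, token: str) -> bool:
        if token in self._spent:
            return False  # replayed token
        self._spent.add(token)
        return True


nonce = solve("demo-challenge", 12)
print(verify("demo-challenge", nonce, 12))
```

The asymmetry is the point of the design: the client burns CPU time searching, while the server verifies with a single hash and rejects any token it has seen before.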
The standout 2026 detection trick: Kasada runs Function.prototype.toString() on dozens of built-in browser functions (such as the navigator.webdriver getter, WebGLRenderingContext.getParameter, and HTMLCanvasElement.toDataURL). Calling toString() on a genuine browser function returns a stub like function toDataURL() { [native code] }. But if a stealth tool like playwright-stealth has rewritten that function in JavaScript to hide automation, toString() returns the replacement's JavaScript source instead, and Kasada has the full set of these patched signatures on file.
### Signals Kasada weighs
**playwright-stealth** — every patch it applies leaves a toString() trail. Those signatures are catalogued, which is why JavaScript-layer patching is detectable here.
**undetected-chromedriver** on its own — it changes the webdriver flag, but not the wider set of functions Kasada inspects with toString().
**Datacenter proxies** — Kasada weighs IP reputation heavily. Addresses from cloud providers (AWS, GCP, DigitalOcean ASNs — the network blocks a hosting company owns) carry low trust regardless of the browser configuration.
**Token replay** — x-kpsdk-ct tokens are single-use, so a repeated token is itself a signal.
### How tools interact with it
**PatchRight** is frequently referenced in 2026 because it patches Playwright itself before Chrome starts, so its changes never exist as JavaScript inside the running browser. With nothing modified in the JS runtime, there is nothing for toString() to read at that layer.
**SeleniumBase UC mode** is another option that adjusts the WebDriver flag and can complete the proof-of-work challenge automatically.
**Context that affects outcomes:** IP reputation (residential or ISP static IPs versus datacenter addresses), token handling (each challenge token is single-use), and session distribution all factor into how Kasada scores traffic, on top of the browser configuration itself.
### Example
```python
# PatchRight ships a patched build of Playwright, so there are no
# JS-runtime changes for toString() inspection to read.
from patchright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # Playwright takes proxy credentials as separate fields
    context = browser.new_context(
        proxy={
            "server": "http://residential:port",
            "username": "user",
            "password": "pass",
        }
    )
    page = context.new_page()
    page.goto("https://kasada-protected.com/")
    # PoW solved automatically by real browser execution
    html = page.content()
    browser.close()
### FAQ
**Q: Why does playwright-stealth fail on Kasada?**
Because Kasada runs Function.prototype.toString() on the browser's native functions to read their source. playwright-stealth rewrites those functions in JavaScript to hide automation, so toString() hands back the patched source — which is exactly the signal Kasada is hunting for. The patch that was meant to hide you is what gives you away.
**Q: What is ips.js and why is it renamed?**
ips.js is the JavaScript file that delivers Kasada's challenge. It is renamed on every deployment so blockers cannot match it by filename (it is polymorphic). The challenge logic inside also keeps changing shape, so static deobfuscation tools — which try to unscramble the code automatically — cannot keep up.
**Q: Does Kasada's challenge require a real browser?**
Generally yes. The proof-of-work challenge relies on real browser features (the Crypto API, performance.now() timing, a live page execution context), which is why HTTP-only clients struggle with it. The challenge logic also changes every time ips.js rotates.
**Q: When should I use a managed scraping API instead?**
When your team spends more than about 2 engineer-days a month maintaining browser automation against Kasada-protected sites you are authorized to access. The challenge rotation, the toString() surface, and the proxy-reputation upkeep add up fast. Above that point, a managed API such as Scrappey or Bright Data Web Unlocker is usually cheaper than doing it yourself.
---
## What Is F5 Shape Security?
URL: https://scrappey.com/qa/anti-bot/what-is-f5-shape-security
**F5 Shape Security is the most sophisticated anti-bot product on the market — F5 paid $1 billion to acquire Shape in 2020 and the price reflects what it does.** An anti-bot is software a website runs to tell real human visitors apart from automated scripts. Shape's trick: it runs a tiny custom computer (a "virtual machine") inside your browser using JavaScript. The code your scraper has to execute is not normal JavaScript — it is a private instruction set (bytecode), rebuilt with every deployment, that standard reverse-engineering tools cannot decode. The session tokens it issues (reese84) expire in minutes. For teams collecting data they are permitted to access at scale, the engineering cost of building and maintaining a do-it-yourself integration typically exceeds the cost of a managed API within weeks.
### Quick facts
- **Acquisition:** $1B by F5 Networks (2020)
- **Token cookies:** reese84, TS*, $rsc= URL params
- **Token expiry:** Minutes — tight rotation cadence
- **Architecture:** Custom JS VM with proprietary bytecode
- **DIY viability:** Costly at scale — managed APIs are common
### Why Shape is different
Most anti-bots ship JavaScript that is scrambled (obfuscated) but still standard — given enough effort, you can untangle it. Shape is different. It ships a JavaScript program that *interprets a custom bytecode language* — its own private set of instructions. Your browser downloads both the bytecode and the interpreter that runs it, and those instructions map to no standard browser API. So even with Wireshark, mitmproxy (tools that let you watch the traffic), and a deobfuscator, there is, in any normal sense, no source to read.
The challenge code is also regenerated on every rotation. The bytecode produced this hour does not match last hour's, so any analysis of it goes stale almost as soon as it is finished. This is what makes Shape the most engineering-intensive anti-bot product to work with.
### How teams approach Shape-protected access
**1. Web and mobile endpoints often differ.** Shape is usually deployed only on the website, not the mobile app. The same brand's iOS or Android app often talks to a completely separate API with a different architecture — frequently just simple Bearer-token auth (a token sent in the request header to prove who you are). When you are authorized to access a service's data, understanding which endpoint carries which protections explains why integration effort varies so widely across the same brand.
**2. Managed APIs handle the heavy lifting.** For full-VM cases, building it yourself rarely pays off. Benchmarked success rates (Scrape.do 2025): Bright Data Web Unlocker 98.44%, Zyte 93.14%. A managed provider runs the browser environment, the residential proxies, and the token rotation behind the scenes, so teams accessing data they are permitted to use do not maintain that machinery themselves.
### The economic threshold
A senior scraping engineer costs roughly $700–1,500/day fully loaded (salary plus overhead). Bright Data Web Unlocker is around $3 per 1,000 successful requests; Scrappey's full-browser tier is €1.00 per 1,000. Once Shape is involved, the math almost always favors a managed API. The break-even rule of thumb: if maintaining an in-house Shape integration costs more than two engineer-days per month, that portion is usually better handed to a managed provider.
Token mechanics also explain the maintenance burden: each reese84 token is valid only for a few minutes, so any integration has to re-acquire tokens frequently, and tokens are bound tightly to a single session and IP. These constraints are why DIY maintenance grows expensive over time.
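The break-even rule of thumb above can be made concrete with a little arithmetic. The sketch below is a rough calculation, not a pricing model: it takes the day rate and per-1,000 price quoted in this section and asks how many successful requests per month two engineer-days would buy from a managed API. The function name `break_even_requests` is hypothetical, and proxy and infrastructure costs are deliberately ignored.
```python
def break_even_requests(day_rate_usd: float,
                        engineer_days_per_month: float,
                        price_per_1000_usd: float) -> int:
    """Monthly request volume at which a managed API costs as much as
    the in-house maintenance time (maintenance cost only, nothing else)."""
    maintenance_cost = day_rate_usd * engineer_days_per_month
    return int(maintenance_cost / price_per_1000_usd * 1000)


# Figures quoted in this section: $700-1,500 per day, two engineer-days a
# month, and roughly $3 per 1,000 successful requests.
low = break_even_requests(700, 2, 3.0)
high = break_even_requests(1500, 2, 3.0)
print(f"Two engineer-days buy roughly {low:,} to {high:,} managed requests per month")
```
Below that monthly volume the managed API is the smaller line item. Above it, the comparison depends on how much maintenance the in-house integration really needs.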
### Example
```python
# Illustration: many brands expose a separate, differently-protected mobile API.
# When you are authorized to access a service's data, a simple Bearer-token
# endpoint behaves very differently from a Shape-protected web page:
from curl_cffi import requests
r = requests.get(
    "https://api.brand.com/v2/products",
    impersonate="chrome131",
    headers={
        "Authorization": "Bearer <token from mobile app>",
        "X-App-Version": "4.2.1",
    },
    proxies={"https": "http://user:pass@residential:port"},
)
print(r.status_code, r.json()) # often 200, no Shape, no reese84
```
### FAQ
**Q: Why is Shape so much harder than Cloudflare or Akamai?**
Because the challenge runs inside a proprietary virtual machine — a custom mini-computer Shape builds in the browser. Cloudflare and Akamai ship scrambled but standard JavaScript, which can eventually be reverse-engineered. Shape ships bytecode for a private instruction set, regenerated on every rotation, so the reverse-engineering cycle never finishes — by the time you understand it, it has changed.
**Q: Can a plain HTTP client like curl_cffi work with Shape-protected endpoints?**
For Shape-protected XHR endpoints (the background data calls a page makes), no. The JavaScript VM must run and produce a reese84 token before any protected request is accepted, and curl_cffi cannot run JavaScript. For the rare Shape-protected page where the data is already in the initial HTML, an HTTP client can read it. Web and mobile endpoints often carry different protections, which is why integration effort varies.
**Q: How long do reese84 tokens last?**
Typically a few minutes. Because tokens expire quickly, any integration has to re-acquire them frequently, which is one reason in-house solutions carry high maintenance overhead.
**Q: What sites use F5 Shape?**
Major US banks, airlines, ticketing platforms, and retailers — the kinds of customers willing to pay enterprise-tier anti-bot pricing. There are no maintained public lists, but checking cookies is the fastest way to confirm: open your browser dev tools and look for a reese84 or TS* cookie.
---
## What Is WASM Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-wasm-fingerprinting
**WebAssembly (WASM) fingerprinting is a newer anti-bot technique that identifies a browser by measuring how its actual CPU behaves, instead of trusting what the browser says about itself.** It works two ways: it runs small bursts of low-level math (instructions called SIMD) and times exactly how fast the processor finishes them, and it builds a far more precise stopwatch than browsers normally allow. Both run as compiled machine code, *underneath* the JavaScript layer that Camoufox, CloakBrowser, PatchRight, and similar stealth tools patch. More and more anti-bot systems now run these WASM checks, which means a stealth browser can still be caught even when all of its visible settings look perfect.
### Quick facts
- **Two probe types:** SIMD CPU timing + SharedArrayBuffer timer
- **Timer resolution:** ~100,000 Hz — distinguishes ~6 µs intervals (17× finer than performance.now)
- **Reveals:** CPU microarchitecture (NEON, SSE, AVX), vector register width
- **Reaches below:** Camoufox, CloakBrowser, PatchRight, undetected-chromedriver — all operate at the JS layer, not WASM
- **Status:** Live in DataDome scoring (May 2026), Chrome marked the leak Won't Fix (crbug 40057687)
### WASM SIMD probes the CPU itself
Most code on a web page is JavaScript: the browser reads it and runs it line by line. WebAssembly (WASM) is different. It is pre-compiled code that runs almost as fast as a native app, and it can use **SIMD** instructions — single commands that do the same piece of math on several numbers at once (the kind of work a CPU does for video or graphics). An anti-bot ships a tiny WASM module that runs these instructions in a fixed pattern and times them very precisely.
Why does timing reveal anything? Because different processors finish the exact same work in slightly different ways. Those measurements expose details like how wide the chip's math registers are and which instruction families it supports — NEON on Apple and other ARM chips, SSE on older Intel/AMD, AVX on modern x86 — plus tiny quirks unique to each CPU model. It is a bit like recognising a car from the precise sound of its engine instead of the badge on the front.
Here is the key point for scraping: stealth browsers change what the browser *reports* about itself, but WASM SIMD measures the real silicon. A genuine Mac with an M2 chip cannot be dressed up to look like an Intel laptop, because the timing fingerprint comes from the chip, not the browser. Source: Anthony Manikhouth, DataDome detection engineer, blog.azerpas.com, May 2026.
### SharedArrayBuffer = 17× timer precision
To fingerprint by timing, an anti-bot first needs an accurate stopwatch — and browsers deliberately make their normal one blurry. On ordinary pages Chrome rounds performance.now() to the nearest 100 microseconds, so that hostile sites cannot use ultra-precise timing to spy on the CPU (a family of attacks known as Spectre).
WASM quietly undoes that blur. A single line, new WebAssembly.Memory({initial: 1, maximum: 1, shared: true}).buffer, creates a block of shared memory that a background thread (a Web Worker) can count upward as fast as it can using Atomics.add. (Shared memory has to declare its initial and maximum size, which is why both appear in the call.) Reading that counter acts as a stopwatch ticking about 100,000 times a second: fine enough to tell apart events roughly 6 microseconds apart, about **17× sharper** than the timer Chrome means to hand out. (A microsecond is a millionth of a second.)
With a stopwatch that precise, an anti-bot can measure tiny timing differences in things like canvas drawing, JavaScript execution and animation frames — and those differ between real human machines and bots at the sub-millisecond level. This was reported to Chrome as bug 40057687 and marked Won't Fix. Source: Manuel, brokenbrowser.com.
### The hyphenation-dictionary probe
The third WASM trick has nothing to do with the CPU. It checks which **hyphenation dictionaries** the browser ships — the rules that decide where a long word is allowed to break at the end of a line. The three main browser engines bundle slightly different ones: Blink (the engine inside Chrome), Gecko (inside Firefox) and WebKit (inside Safari). A WASM probe feeds in a known phrase, watches where the browser splits it, and from that works out the real engine underneath — even if the User-Agent string (the line of text a browser sends to identify itself) claims something else.
This catches a specific class of scraper: **Camoufox claiming to be Chrome**. Camoufox is Firefox under the hood, so its Intl hyphenation matches Gecko. A request with a Chrome User-Agent but Gecko hyphenation is unmistakably a Firefox-based stealth tool. Akamai's sensor.js and DataDome's WASM challenge both include this check.
The defence is the same as for the other WASM probes — patch the engine below the JS layer, or use a tool whose hyphenation matches its claimed UA. There is no JS-level workaround.
### What this means for browser automation
The decade-long focus on JavaScript-layer browser configuration has hit a ceiling, because this round of fingerprinting happens below the JavaScript layer. A few practical observations for 2026: WASM probing is expensive to deploy, so it is concentrated on the most heavily defended sites (the machine-learning systems run by the largest anti-bot vendors). On those targets, a browser running below-the-JS-layer configuration may still differ from real consumer hardware. Over time, the closest match to a real browser will come from real consumer hardware on real ISP networks rather than cloud-hosted browser instances.
### Example
```javascript
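// What an anti-bot ships alongside a WASM module to time-fingerprint the CPU.
// Stealth browsers cannot patch this because the work happens natively.
// Shared memory needs explicit initial and maximum page counts.
const sab = new WebAssembly.Memory({ initial: 1, maximum: 1, shared: true }).buffer;
const view = new Int32Array(sab);
// Spin a worker that increments the counter as fast as possible.
// The worker receives the shared buffer through postMessage.
const workerSource = `
  onmessage = (e) => {
    const counter = new Int32Array(e.data);
    while (true) Atomics.add(counter, 0, 1);
  };
`;
const worker = new Worker(URL.createObjectURL(new Blob([workerSource])));
worker.postMessage(sab);
// Read it ~100,000 times per second: a 17× higher-resolution timer
// than performance.now() on non-isolated pages.
// (In practice, wait until the counter starts moving before measuring.)
const before = Atomics.load(view, 0);
// ... do a SIMD op ...
const elapsed = Atomics.load(view, 0) - before;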
```
### FAQ
**Q: Can I patch WASM SIMD timing?**
Not from the browser layer. WASM SIMD runs natively against your CPU — patching it would mean intercepting at the OS or hypervisor level, which breaks the browser. The only workable mitigation is hardware diversity: real machines produce different SIMD fingerprints naturally.
**Q: Do all anti-bots use WASM fingerprinting?**
Not yet universally. DataDome is the documented adopter as of May 2026. Cloudflare, Akamai, and PerimeterX are believed to be experimenting. Adoption is concentrated on the highest-value targets first, because the WASM probe adds payload size and execution time.
**Q: Will Chrome ever fix the SharedArrayBuffer timer?**
The Chrome team marked it Won't Fix in crbug 40057687. The trade-off is between coarsening or removing the shared-memory timer, which would break legitimate uses of shared memory, and accepting the privacy cost of finer timing. The leak is now considered a permanent feature of the platform.
**Q: Does Camoufox handle this?**
Camoufox patches Firefox's JS-visible APIs at the C++ level. It does not modify the WebAssembly execution layer or SIMD timing output. The same is true for CloakBrowser, PatchRight, and every other stealth tool documented in 2026. This is a layer none of them touch yet.
**Q: Why are WASM-based probes so much harder to spoof than JS-based ones?**
Normal anti-bot tricks read JavaScript values that a stealth tool can quietly overwrite. WASM is different: it runs as pre-compiled code the page brought with it, so there are no JavaScript functions to swap out or fake. The probe simply reads the result of running real CPU instructions — and the only way to change that result is to alter the browser engine itself.
---
## What Is HTTP/2 Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-http2-fingerprinting
**HTTP/2 fingerprinting identifies an HTTP client from its SETTINGS frame and frame-level behaviour, independent of the TLS layer.** Think of it like recognising someone by their handwriting rather than their signature: when a client opens an HTTP/2 connection it sends a few low-level setup options, and every implementation fills them in slightly differently. Chrome 148 ships one combination, Python httpx another, Go net/http a third. Akamai and Cloudflare both fingerprint these. TLS is the encryption layer behind https; HTTP/2 sits just above it. So if your HTTP/2 layer does not match the browser you claim to be at the TLS layer, you flag at the network layer before the page even loads.
### Quick facts
- **What it fingerprints:** SETTINGS frame parameters and their order
- **Six parameters:** HEADER_TABLE_SIZE, MAX_CONCURRENT_STREAMS, INITIAL_WINDOW_SIZE, MAX_FRAME_SIZE, MAX_HEADER_LIST_SIZE, ENABLE_PUSH
- **Also fingerprinted:** HPACK header compression, WINDOW_UPDATE sizes, frame order
- **Vendors using it:** Akamai, Cloudflare (both at edge alongside JA4)
- **Bypassed by:** curl_cffi, tls-client — they ship Chrome's exact HTTP/2 stack
### What HTTP/2 leaks
The moment an HTTP/2 connection opens, the client sends a SETTINGS frame — a short message declaring up to six configuration values. Each implementation picks its own defaults, so the SETTINGS frame acts like a name tag:
- **Chrome 148** sends a specific, documented combination — HEADER_TABLE_SIZE: 65536, ordered set of settings.
- **Python httpx** sends HEADER_TABLE_SIZE: 4096, no MAX_HEADER_LIST_SIZE, different ordering.
- **curl** default has yet another signature.
- **Go net/http** has its own set.
The leak does not stop at SETTINGS. Other low-level details — the size of WINDOW_UPDATE frames (how much data the client says it is ready to receive), how HPACK header compression is applied, and the order in which streams are opened — form a secondary fingerprint. These are baked deep into the HTTP/2 client, so you cannot spoof them without rewriting the client itself.
### The SETTINGS frame in detail
The SETTINGS frame is the first thing a client sends once the TLS handshake is done and the HTTP/2 preface (PRI * HTTP/2.0…SM, a fixed greeting that confirms both sides speak HTTP/2) has been exchanged. It lists the client's parameters as (id, value) pairs. Chrome 131 sends these six, in this exact order:
| Setting | Chrome value | Common scraper mismatch |
|---|---|---|
| HEADER_TABLE_SIZE | 65536 | Go: 4096 (default) |
| ENABLE_PUSH | 0 | Go: 1, httpx: 1 |
| MAX_CONCURRENT_STREAMS | 1000 | httpx: 100 |
| INITIAL_WINDOW_SIZE | 6291456 | Go: 65535, httpx: 65535 |
| MAX_FRAME_SIZE | 16384 | (usually correct) |
| MAX_HEADER_LIST_SIZE | 262144 | Not sent by Go default |
Right after SETTINGS, Chrome sends a **WINDOW_UPDATE** frame with a delta of 15663105 (15 MB minus the default 65535) — its way of saying it can buffer a lot of incoming data. Then it sends the request headers in **HPACK-compressed** form, with the pseudo-headers (the leading :-prefixed fields) in the order :method, :authority, :scheme, :path — Firefox uses :method, :path, :authority, :scheme. Anti-bot vendors fingerprint all of it: the SETTINGS values, their order, the WINDOW_UPDATE delta, and the pseudo-header order. curl_cffi bakes Chrome's values in; tls-client and noble-tls require manual configuration.
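To see how these pieces combine into one comparable value, here is a minimal sketch that joins the SETTINGS pairs, the WINDOW_UPDATE delta, and the pseudo-header order into a single string. The layout is modelled on the Akamai-style text format that fingerprint test pages commonly report; treat the exact layout, the fixed "0" priority slot, and the function name `http2_fingerprint` as assumptions for illustration. The Chrome values are the ones listed in the table above.
```python
# Standard SETTINGS identifiers from the HTTP/2 specification.
SETTING_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def http2_fingerprint(settings, window_update, pseudo_headers):
    """Join SETTINGS (in the order sent), the WINDOW_UPDATE delta and the
    pseudo-header order into one comparable string."""
    settings_part = ";".join(
        f"{SETTING_IDS[name]}:{value}" for name, value in settings
    )
    # ":method" -> "m", ":authority" -> "a", and so on.
    order_part = ",".join(h.lstrip(":")[0] for h in pseudo_headers)
    # The "0" slot stands for "no PRIORITY frames" in this simplified sketch.
    return f"{settings_part}|{window_update}|0|{order_part}"


# Chrome values as listed in the table above, in the order sent.
chrome_settings = [
    ("HEADER_TABLE_SIZE", 65536),
    ("ENABLE_PUSH", 0),
    ("MAX_CONCURRENT_STREAMS", 1000),
    ("INITIAL_WINDOW_SIZE", 6291456),
    ("MAX_FRAME_SIZE", 16384),
    ("MAX_HEADER_LIST_SIZE", 262144),
]
# 15 MB minus the protocol's default 65535-byte window.
chrome_window_update = 15 * 1024 * 1024 - 65535

chrome_fp = http2_fingerprint(
    chrome_settings, chrome_window_update,
    [":method", ":authority", ":scheme", ":path"],
)
firefox_order_fp = http2_fingerprint(
    chrome_settings, chrome_window_update,
    [":method", ":path", ":authority", ":scheme"],
)
print(chrome_fp)
print(chrome_fp == firefox_order_fp)  # pseudo-header order alone changes the result
```
Because the string is order-sensitive, two clients that send identical values in a different order still produce different fingerprints.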
### Why this matters more than scrapers realise
Akamai's EdgeWorker checks JA4 (a fingerprint of the TLS handshake) at the TLS layer. The next layer up checks the HTTP/2 SETTINGS frame. If your JA4 says "Chrome 148" but your SETTINGS frame has HEADER_TABLE_SIZE: 4096 (the Python httpx default), Akamai sees the contradiction and assigns the maximum bot score before any HTML is served.
The mismatch is the signal. A clean JA4 with a Python HTTP/2 stack is *worse* than a clean JA4 with a Python TLS stack and HTTP/1.1 — the inconsistency itself is what gives you away.
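The contradiction described above is easy to picture as a dictionary diff. The sketch below compares the SETTINGS a client claims to have (Chrome's, from the table in this section) with the ones it actually sent. The httpx-like values are the defaults this section attributes to httpx, and `settings_mismatches` is a hypothetical helper name, not part of any library.
```python
CHROME_SETTINGS = {
    "HEADER_TABLE_SIZE": 65536,
    "ENABLE_PUSH": 0,
    "MAX_CONCURRENT_STREAMS": 1000,
    "INITIAL_WINDOW_SIZE": 6291456,
    "MAX_FRAME_SIZE": 16384,
    "MAX_HEADER_LIST_SIZE": 262144,
}


def settings_mismatches(expected, observed):
    """Return {name: (expected, observed)} for every parameter that differs.
    A parameter missing on either side shows up as None."""
    names = sorted(set(expected) | set(observed))
    return {
        name: (expected.get(name), observed.get(name))
        for name in names
        if expected.get(name) != observed.get(name)
    }


# Values this section attributes to httpx; MAX_HEADER_LIST_SIZE is not sent.
httpx_like = {
    "HEADER_TABLE_SIZE": 4096,
    "ENABLE_PUSH": 1,
    "MAX_CONCURRENT_STREAMS": 100,
    "INITIAL_WINDOW_SIZE": 65535,
    "MAX_FRAME_SIZE": 16384,
}

diff = settings_mismatches(CHROME_SETTINGS, httpx_like)
for name, (want, got) in diff.items():
    print(f"{name}: claimed browser sends {want}, client sent {got}")
```
An empty result means the two layers tell the same story. Any entry is the kind of inconsistency the paragraph above describes.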
### What works
**curl_cffi** — wraps libcurl built against BoringSSL (Chrome's own TLS library) with Chrome's exact TLS and HTTP/2 parameters. One flag, impersonate="chrome131", handles both layers at once. This is the default first step for any Python scraper hitting a real anti-bot.
**tls-client** (Go, or a Python wrapper) — same idea, built in Go, and tends to be faster at 1k+ concurrent connections.
**akamai-v3-sensor** (Go) — for the hardest Akamai v3 targets that even curl_cffi cannot pass. Production scrapers run a small Go sidecar that uses a Chrome-compatible TLS + HTTP/2 stack at the C level, with Scrapy orchestrating the crawl.
**Quick audit:** request tls.browserleaks.com/json from your scraper and from real Chrome. The response includes the HTTP/2 SETTINGS — diff them. If they differ, your HTTP/2 layer is leaking.
### Example
```python
# Audit your HTTP/2 fingerprint against real Chrome
from curl_cffi import requests
resp = requests.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome131",
)
fp = resp.json()
# .get() avoids a KeyError if the service renames or drops a field.
print("JA4:        ", fp.get("ja4"))
print("HTTP/2:     ", fp.get("akamai"))  # Akamai hash of HTTP/2 settings
print("User-Agent: ", fp.get("user_agent"))
# Compare these values against what real Chrome reports on the same page.
# If JA4 matches but HTTP/2 does not, fix your HTTP/2 stack first.
```
### FAQ
**Q: Is HTTP/2 fingerprinting separate from JA4?**
Yes — they are independent layers. JA4 fingerprints the TLS handshake (the encrypted connection setup); HTTP/2 fingerprinting looks at the SETTINGS frame and stream-level behaviour one layer up. A scraper can match JA4 perfectly and still flag at the HTTP/2 layer. Most production scraping libraries fix both.
**Q: Can I just disable HTTP/2 to avoid this?**
Disabling HTTP/2 forces HTTP/1.1, which is itself unusual in 2026. Most modern browsers prefer HTTP/2 or HTTP/3, so a client that refuses HTTP/2 stands out as an anomaly. Fixing the SETTINGS frame is the right answer, not avoiding the protocol.
**Q: Does requests support HTTP/2?**
No — the standard Python requests library is HTTP/1.1 only. httpx does support HTTP/2, but it ships its own SETTINGS defaults that do not match Chrome. Use curl_cffi or tls-client if you need a Chrome-identical HTTP/2 stack.
**Q: How do I check my HTTP/2 fingerprint?**
The simplest test is a GET to https://tls.browserleaks.com/json from both your scraper and real Chrome on the same machine. Compare the akamai field in the JSON response (Akamai's hash of your HTTP/2 behaviour). If the two differ, your HTTP/2 layer is leaking.
**Q: Is the HTTP/2 fingerprint really separate from the TLS fingerprint?**
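Yes. JA4 captures the TLS Client Hello (the opening message of the encrypted handshake). The HTTP/2 fingerprint captures the SETTINGS frame, the WINDOW_UPDATE delta, and the pseudo-header order, and is commonly reported in Akamai's format. JA4H is a third, separate fingerprint built from the request headers (method, HTTP version, header names and order, cookies), not from the SETTINGS frame. A request can match Chrome on JA4 and still fail the HTTP/2 check, which is common with libraries that wrap a Chrome-impersonating TLS stack around their own, unmatched HTTP/2 implementation.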
---
## What Is a WebRTC IP Leak?
URL: https://scrappey.com/qa/anti-bot/what-is-webrtc-leak
**A WebRTC IP leak is when your browser quietly reveals your real IP address — even though you set up a proxy to hide it. It is the most-overlooked failure mode in browser-based scraping in 2026.** WebRTC (the browser feature behind video calls and peer-to-peer connections) finds your real local and public IP using STUN servers, and it does this even when all your HTTP traffic is routed through a proxy. The leak happens because WebRTC works at the network layer below the HTTP proxy — it talks directly to STUN servers from your real network interface, so the proxy never sees it. Anti-bots use this leaked IP as one input in a five-vector coherence test that all major vendors run.
### Quick facts
- **How it leaks:** RTCPeerConnection ICE candidates expose local + STUN-discovered IPs
- **Bypasses the proxy:** Yes, because an HTTP proxy does not route STUN traffic
- **Vendors checking it:** Cloudflare, PerimeterX, DataDome, Akamai
- **Coherence test:** IP country + timezone + Accept-Language + WebRTC ICE + DNS resolver must all agree
- **Best fix:** Camoufox with geoip=True — auto-aligns all 5 vectors
### How the leak works
WebRTC is the browser API for peer-to-peer audio, video, and data connections — think browser video chat. For two peers to connect when both sit behind a home router (NAT — the address translation that lets many devices share one public IP), WebRTC needs to discover the addresses they can be reached at. It does this with the ICE protocol, which gathers candidate IPs from three places: your local network interface, your public IP via STUN servers (servers whose only job is to tell you what your public IP looks like from the outside), and TURN relays. Any web page can run this:
```javascript
const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] });
pc.createDataChannel('');
pc.createOffer().then(o => pc.setLocalDescription(o));
pc.onicecandidate = (e) => { if (e.candidate) console.log(e.candidate.candidate); };
```
The returned ICE candidates include your real IP, even if all your HTTP traffic is going through a proxy. The proxy hides your HTTP requests, but WebRTC went around it.
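Once you have collected the candidate strings the snippet above logs, checking them for a leak is a small parsing job. The sketch below assumes the usual candidate line layout (foundation, component, transport, priority, address, port, then "typ" and the type) and uses reserved documentation addresses as sample data. The helper names and the sample candidates are hypothetical.
```python
import ipaddress

# RFC 1918 ranges: addresses that only identify a machine on its own LAN.
LAN_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def candidate_address(candidate_line):
    """Return (address, type) from one ICE candidate line."""
    parts = candidate_line.split()
    return parts[4], parts[parts.index("typ") + 1]


def leaked_public_ips(candidate_lines, proxy_exit_ip):
    """List (address, type) for public IPs that are not the proxy exit."""
    exit_ip = ipaddress.ip_address(proxy_exit_ip)
    leaks = []
    for line in candidate_lines:
        address, cand_type = candidate_address(line)
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # not an IP, e.g. an mDNS hostname ending in ".local"
        if any(ip in net for net in LAN_NETWORKS):
            continue  # LAN address: worth noting, but not a public-IP leak
        if ip != exit_ip:
            leaks.append((address, cand_type))
    return leaks


sample = [
    "candidate:1 1 udp 2113937151 192.168.1.23 51234 typ host",
    "candidate:2 1 udp 1677729535 198.51.100.7 51234 typ srflx raddr 192.168.1.23 rport 51234",
    "candidate:3 1 udp 2113937151 3f2a9c1e-aaaa-4bbb-8ccc-1234567890ab.local 51234 typ host",
]
leaks = leaked_public_ips(sample, proxy_exit_ip="203.0.113.50")
print(leaks)
```
A non-empty list means WebRTC exposed a public address other than the proxy's exit IP.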
### The 5-vector coherence test
Modern anti-bots check whether five different signals all tell the same geographic story. If you claim to be in one place, all five should agree:
- **IP country** — the country of your proxy's exit IP.
- **Timezone** — what the browser reports via Intl.DateTimeFormat().resolvedOptions().timeZone.
- **Accept-Language header** — your stated language preferences.
- **WebRTC ICE candidate** — the network the browser is actually connecting from (the leaked IP).
- **DNS resolver location** — which DNS server looked up the page's domain.
Picture a US proxy paired with an Accept-Language of ur-PK (Urdu, Pakistan), a timezone of Asia/Karachi, and a Pakistani WebRTC candidate — it fails immediately. It does not matter how good the proxy is; the contradiction between the vectors is itself the signal. This is why "use a US datacenter proxy and call it a day" stopped working around 2021.
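The coherence idea can be sketched in a few lines. This is not how any vendor implements it. It is a toy check under stated assumptions: the timezone lookup table is hypothetical and tiny, the language check only reads the region part of the first Accept-Language tag (a crude signal on its own), and the WebRTC and DNS countries are assumed to have been resolved already.
```python
from dataclasses import dataclass

# Hypothetical lookup table, only large enough for the example.
TIMEZONE_COUNTRY = {
    "America/New_York": "US",
    "America/Chicago": "US",
    "Asia/Karachi": "PK",
}


def language_country(accept_language):
    """Region of the first language tag, e.g. 'ur-PK,ur;q=0.9' -> 'PK'."""
    first_tag = accept_language.split(",")[0].split(";")[0].strip()
    return first_tag.split("-")[1].upper() if "-" in first_tag else None


@dataclass
class Signals:
    ip_country: str
    timezone: str
    accept_language: str
    webrtc_country: str
    dns_country: str


def incoherent_vectors(s):
    """Names of the vectors whose country disagrees with the exit IP."""
    observed = {
        "timezone": TIMEZONE_COUNTRY.get(s.timezone),
        "accept_language": language_country(s.accept_language),
        "webrtc": s.webrtc_country,
        "dns": s.dns_country,
    }
    return sorted(n for n, country in observed.items() if country != s.ip_country)


# The failing setup described above: US proxy, Pakistani everything else.
bad = Signals("US", "Asia/Karachi", "ur-PK,ur;q=0.9", "PK", "US")
good = Signals("US", "America/New_York", "en-US,en;q=0.9", "US", "US")
print(incoherent_vectors(bad))
print(incoherent_vectors(good))
```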
### Mitigation by tool
**Camoufox** with geoip=True looks up the country of your proxy exit IP, then sets timezone, locale, language, WebRTC ICE policy, and DNS to all match it. That one flag fixes the most common coherence failure in seconds.
**Playwright / Puppeteer** need you to do this by hand: set locale, timezone_id, and Accept-Language yourself, and either disable WebRTC explicitly or route it through the proxy.
**HTTP scraping** (curl_cffi, tls-client) has no WebRTC at all, so this vector never fires, which is part of why HTTP scraping beats browser scraping on many targets.
**Self-test:** browserleaks.com/webrtc shows you exactly what WebRTC exposes from your setup. Run your browser context against it before you deploy.
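For the do-it-by-hand Playwright route, one way to keep the settings consistent is to derive them all from a single country profile. The sketch below only builds the options dictionary. The profile table and the `context_options` helper are hypothetical, the proxy URL is a placeholder, and WebRTC handling is not covered here and still has to be configured separately.
```python
# Hypothetical per-country profiles; extend with the countries your proxies use.
COUNTRY_PROFILES = {
    "US": {
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "accept_language": "en-US,en;q=0.9",
    },
    "DE": {
        "locale": "de-DE",
        "timezone_id": "Europe/Berlin",
        "accept_language": "de-DE,de;q=0.9,en;q=0.8",
    },
}


def context_options(country, proxy_server):
    """Build keyword arguments that keep locale, timezone and
    Accept-Language aligned with the proxy's exit country."""
    profile = COUNTRY_PROFILES[country]
    return {
        "locale": profile["locale"],
        "timezone_id": profile["timezone_id"],
        "extra_http_headers": {"Accept-Language": profile["accept_language"]},
        "proxy": {"server": proxy_server},
    }


opts = context_options("US", "http://proxy.example:8080")
print(opts)
# With Playwright for Python installed, these keys are meant to be passed as:
#   context = browser.new_context(**opts)
```
Keeping the profile in one place means a change of proxy country updates every vector together instead of leaving one behind.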
### Example
```python
from camoufox.sync_api import Camoufox
# geoip=True aligns IP, WebRTC, DNS, timezone, and Accept-Language
# with the proxy exit country — fixes the 5-vector coherence test.
with Camoufox(
    headless=True,
    geoip=True,
    proxy={
        "server": "http://us-residential:port",
        "username": "user",
        "password": "pass",
    },
) as browser:
    page = browser.new_page()
    page.goto("https://browserleaks.com/webrtc")
    print(page.content())  # confirm no leak to your real IP
```
### FAQ
**Q: Does a VPN protect against WebRTC leaks?**
It depends on the VPN — some force WebRTC traffic through the tunnel, but many do not. For scraping, do not rely on VPNs at all. Configure WebRTC at the browser layer instead (Camoufox geoip=True, or explicit Playwright settings) so the behaviour is predictable every time.
**Q: Can I just disable WebRTC entirely?**
You can, but turning it off is itself a tell. Real browsers ship with WebRTC enabled, so a browser with no WebRTC stands out and creates a new anomaly. It is better to make WebRTC agree with your proxy than to switch it off.
**Q: Why does the proxy not route WebRTC?**
HTTP proxies only carry HTTP and HTTPS traffic. WebRTC opens UDP connections (a faster, connectionless network protocol) straight to STUN servers from your real network interface, skipping the HTTP layer completely. SOCKS5 proxies can carry UDP, but most consumer scraping proxies are HTTPS-only.
**Q: How do I test my own setup?**
Open browserleaks.com/webrtc using the same browser context your scraper mimics. The page lists every ICE candidate the browser would expose. If you spot your real public IP, or a local IP from a different country than your proxy's exit, you have a leak.
---
## What Is a DOM Honeypot?
URL: https://scrappey.com/qa/anti-bot/what-is-a-honeypot-dom-trap
**A DOM honeypot is an invisible form field or link that humans never see but bots fill in or click.** The DOM (Document Object Model) is the live structure of the page that the browser builds from the HTML. A honeypot lives in that structure but is hidden from view — so a real person never touches it, while a bot that blindly fills every field or follows every link walks right into it. The moment you interact with it, the site knows you are not human and flags your IP. Honeypots are the cheapest, most reliable bot-detection technique in 2026 — they do not care about your TLS fingerprint (TLS is the encryption layer behind https), your proxy quality, or your browser stealth stack. They catch you because you interacted with something a human visually could not.
### Quick facts
- **Cost to deploy:** Near-zero — a few hidden DOM elements
- **What flags you:** Filling a hidden input, clicking a hidden link, following a hidden href
- **Common patterns:** display:none, visibility:hidden, opacity:0, left:-9999px, tabindex=-1
- **Defeats:** Every "scrape every input" or "click every link" bot regardless of fingerprint
- **Mitigation:** Check element.getBoundingClientRect() and computed style before interacting
### Common honeypot patterns
All of these tricks share one idea: the element exists in the HTML but is hidden from the eye. Here are the classic ones:
```html
<input name="email" type="text" style="display:none">
<input name="email" type="text" style="visibility:hidden">
<input name="email" type="text" style="opacity:0">
<input name="email" type="text" style="width:0; height:0">
<input name="email" type="text" tabindex="-1">
<input name="phone" style="position:absolute; left:-9999px">
<a href="/admin/honeypot" style="display:none">Admin</a>
<!-- after </body>: never rendered, only seen by parsers -->
</body>
<a href="/honeypot-trap">Trap</a>
```
Each line hides the element a different way: display:none removes it from the layout, opacity:0 makes it fully transparent, left:-9999px shoves it far off the left edge of the screen, and tabindex="-1" stops the keyboard from ever reaching it. A bot that fills every input or follows every href triggers the trap. A human never sees these elements at all because the browser does not render them.
### Mitigation in practice
The fix is simple: before you touch any element, ask whether a human could actually see it. This helper does exactly that — it checks the on-screen box and the computed style (the final styling the browser applied), and bails out on anything zero-sized, hidden, or transparent:
```python
def is_visible(element):
    box = element.bounding_box()
    if not box or box["width"] == 0 or box["height"] == 0:
        return False
    # Return a plain object with only the needed properties; a raw
    # CSSStyleDeclaration may not serialize into a usable dict.
    style = element.evaluate(
        "el => { const s = getComputedStyle(el);"
        " return { display: s.display, visibility: s.visibility, opacity: s.opacity }; }"
    )
    if style["display"] == "none":
        return False
    if style["visibility"] == "hidden":
        return False
    if float(style["opacity"]) == 0:
        return False
    return True

for link in page.query_selector_all("a"):
    if is_visible(link):
        # safe to click
        ...
```
If you scrape by parsing raw HTML instead of running a real browser (Scrapy, BeautifulSoup), you have no browser to tell you what is visible, so you must apply the same rules by hand: respect inline style="display:none" and the hidden attribute, filter elements positioned off-screen (left: -9999px), and skip anything that appears outside or after the <body>.
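For the raw-HTML case, the by-hand rules listed above can be sketched with nothing but the standard library. This is a minimal illustration, not a complete visibility engine: it only inspects each link's own inline style and attributes, so hidden ancestors, CSS classes, and external stylesheets are out of scope. The class name, the regular expression, and the sample markup are all hypothetical.
```python
import re
from html.parser import HTMLParser

# Inline-style patterns that mark an element as hidden or pushed off-screen.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|opacity\s*:\s*0(?:\.0+)?\s*(?:;|$)"
    r"|left\s*:\s*-\d{3,}px",
    re.IGNORECASE,
)


class VisibleLinkCollector(HTMLParser):
    """Collect hrefs from <a> tags that pass the static visibility rules."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.body_closed = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if tag != "a" or not href:
            return
        if self.body_closed or "hidden" in attr:
            return  # after </body>, or carries the hidden attribute
        if HIDDEN_STYLE.search(attr.get("style") or ""):
            return  # hidden or off-screen via inline style
        self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self.body_closed = True


def visible_links(html):
    parser = VisibleLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample_html = """
<body>
<a href="/products">Products</a>
<a href="/admin/honeypot" style="display:none">Admin</a>
<a href="/faded" style="opacity:0.5">Faded</a>
<a href="/off" style="position:absolute; left:-9999px">Off</a>
<a href="/secret" hidden>Secret</a>
</body>
<a href="/honeypot-trap">Trap</a>
"""
links = visible_links(sample_html)
print(links)
```
The half-transparent link is kept on purpose: only fully transparent elements are treated as traps here.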
### Where this is deployed
Honeypots show up most on the pages where automation is expected to attack: login forms, account-registration flows, contact-us forms, and comment sections — anywhere bots try to brute-force credentials or scrape in bulk. Every major anti-bot vendor ships them as a free first line of defence layered on top of fingerprint-based scoring. Because they cost nothing to add, plenty of independent sites that roll their own anti-scraping use them too — which is why they are the failure mode that catches the most "I have a perfect TLS fingerprint, why am I still blocked?" scrapers.
### Example
```python
# Always check visibility before interacting — Playwright/Puppeteer
def is_visible_strict(element):
    box = element.bounding_box()
    if not box or box["width"] == 0 or box["height"] == 0:
        return False
    # Return a plain object with only the needed properties; a raw
    # CSSStyleDeclaration may not serialize into a usable dict.
    style = element.evaluate(
        "el => { const s = getComputedStyle(el);"
        " return { display: s.display, visibility: s.visibility, opacity: s.opacity }; }"
    )
    return all([
        style["display"] != "none",
        style["visibility"] != "hidden",
        float(style["opacity"]) > 0,
    ])

# Safe link-following: never touch what you cannot see
for link in page.query_selector_all("a[href]"):
    if is_visible_strict(link):
        url = link.get_attribute("href")
        # crawl it
```
### FAQ
**Q: Why are honeypots so effective when fingerprinting exists?**
Because they catch a different kind of mistake. Fingerprinting judges what your client is (its browser, network, and crypto traits); honeypots judge what your client does. A bot with a flawless fingerprint that clicks a hidden link is still a bot. The two methods cover each other's blind spots, so anti-bots run both.
**Q: Are honeypots legal?**
Yes — they are server-side defensive measures the site puts on its own pages, like installing motion sensors on your own property. Triggering one means you interacted with content the site never meant you to see, which is on you, not them.
**Q: Do honeypots work against AI agents?**
Less well than against naive scrapers. LLM-driven agents (Browser Use, Skyvern, Anthropic Computer Use) read the page much like a person would and tend to ignore non-visible elements. The remaining risk is the LLM reaching for something that looked clickable but was actually off-screen — still possible, just rarer.
**Q: Can I detect a honeypot without interacting?**
Yes — check the element's computed style and on-screen position (its bounding rect) before you do anything with it. If a human could not see it (display:none, off-screen, zero-size, opacity:0, tabindex=-1), treat it as a trap and skip it.
---
## What Is Scraper Data Poisoning?
URL: https://scrappey.com/qa/anti-bot/what-is-data-poisoning
**Data poisoning is when a site decides you are probably a scraper and quietly feeds you wrong data instead of blocking you: fake prices, made-up reviews, incorrect stock counts, slightly altered product descriptions.** The catch is that nothing looks broken. Your scraper still gets an HTTP 200 (the "success" response code), your pipeline saves the data, and you only discover the problem when your competitor-monitoring dashboard tells your CEO something that is not true. This is more damaging than being blocked because the pain moves from the site to you: blocking just gives the site support tickets from real users it caught by mistake, but poisoning sends you straight into bad business decisions.
### Quick facts
- **Visible signal:** None — requests return 200, data looks plausible
- **Common targets:** E-commerce prices, ticketing availability, real estate listings, airline fares
- **Detection technique:** Cross-IP diff — compare same URL from 2+ independent proxy networks
- **Why sites prefer it:** Wastes scraper time and budget without false-positives on real users
- **Mitigation:** Reduce bot-likelihood score, sample-validate against a managed API
### Why this is more dangerous than blocking
When a site blocks you, you know it instantly. A 403 error or a CAPTCHA is a clear signal that something needs fixing. When a site poisons you, there is no signal at all: you keep scraping happily for weeks, your reports slowly drift away from reality, and your customers often notice the bad numbers before you do. Major retailers, airlines, and ticketing platforms all do this. The reason is simple economics. Blocking costs the site money, because real users sometimes get caught in the net of false positives and complain. Poisoning costs the site nothing, because real users never see the fake data — only suspected bots do.
Poisoning lives in the gray area of bot detection. A scraper the site is sure about gets blocked, a visitor it is sure is human gets clean data, and the uncertain middle gets poisoned. So lowering your bot-likelihood score — how suspicious you look — is often enough to move you back into the "clean data" group with no other changes.
### How to detect poisoning
The only dependable way to catch poisoning is to compare results across different IP addresses. Scrape the same URL from two or more residential IPs that live in different proxy networks, then diff (compare) the structured fields you care about — price, stock, ratings. If a field that should be stable comes back different, suspect poisoning. To break ties, add a third source, such as a real browser on your office connection or a managed scraping API like Scrappey, Bright Data, or Zyte, and take the majority answer.
This is costly, so most production scrapers run it as a periodic spot-check rather than on every request. Checking a random 5% of your URLs once a day is a cheap way to catch drift early.
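The tie-break step can be sketched as a small majority vote over one field. This is an illustration under assumptions: the source labels and the `majority_value` helper are hypothetical, and the values are made up.

```python
from collections import Counter


def majority_value(observations):
    """Return (value, suspects) for one field observed via several sources.

    observations: dict mapping a source label to the value it returned.
    value is None when no value is reported by more than half the sources.
    """
    counts = Counter(observations.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(observations):
        return None, sorted(observations)  # no majority: every source is suspect
    suspects = sorted(src for src, v in observations.items() if v != value)
    return value, suspects


price, suspects = majority_value(
    {"proxy_net_a": 19.99, "proxy_net_b": 20.73, "managed_api": 19.99}
)
print(price, suspects)
```

With only two sources a mismatch returns no majority at all, which is why the third source matters.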
### Mitigation
- **Reduce your bot-likelihood score so you don't get poisoned in the first place.** Sites that poison typically poison borderline scores and outright block confident bots. A cleaner fingerprint (residential IP, humanized browser behaviour, warm-up navigation) often promotes you back into the "real user" bucket.
- **Cross-check critical fields.** If your business depends on price accuracy, validate sampled URLs against a managed scraping API periodically — managed providers' aggregate scale makes individual poisoning less likely.
- **Watch for statistical drift.** If your scraped price for a stable SKU shifts by exactly 3.7% overnight with no real change in the market, suspect poisoning before suspecting a bug. Poisoning is often a consistent percentage offset, not random noise — a real bug or a real price change rarely lands on the same clean number every time.
- **Pydantic + Instructor schemas catch structural anomalies.** A schema is a strict description of what valid data should look like. If poisoned data has subtle structural differences — say a price returned as text ("19.99") instead of a number (19.99) — schema validation will flag it.
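The "consistent percentage offset" heuristic from the drift bullet can be sketched with the standard library. The function name, the tolerance, and the sample prices are assumptions for illustration; `trusted` stands for values you obtained from a source you believe is clean.

```python
from statistics import mean, pstdev


def constant_offset(trusted, scraped, tolerance=0.001):
    """Return the shared percentage offset if every scraped price differs
    from its trusted counterpart by (nearly) the same ratio, else None."""
    ratios = [s / t for t, s in zip(trusted, scraped) if t]
    if len(ratios) < 3:
        return None  # too few points to call it a pattern
    avg = mean(ratios)
    if abs(avg - 1.0) <= tolerance:
        return None  # prices match; nothing to explain
    if pstdev(ratios) > tolerance:
        return None  # offsets vary: more like noise, a bug, or real changes
    return round((avg - 1.0) * 100, 2)


trusted = [10.00, 20.00, 50.00, 100.00]
scraped = [10.37, 20.74, 51.85, 103.70]
print(constant_offset(trusted, scraped))
```

A uniform offset across unrelated SKUs is the signal; a single SKU moving by 3.7% on its own proves nothing.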
### Example
```python
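# Periodic poisoning audit: scrape the same URL from 2 proxy networks, then diff.
import requests
URL = "https://target.com/product/sku-123"

def parse_product(html):
    # Placeholder: plug in your own parser; must return {"price": ..., "stock": ...}
    raise NotImplementedError

def alert(message):
    # Placeholder: swap in your own alerting (Slack, email, pager)
    print(message)

def fetch(proxy):
    r = requests.get(URL,
                     proxies={"https": proxy},
                     headers={"User-Agent": "Mozilla/5.0 ..."}, timeout=20)
    r.raise_for_status()  # a failed fetch should not be diffed as if it were data
    return parse_product(r.text)  # returns {"price": ..., "stock": ...}

a = fetch("http://residential-a:port")
b = fetch("http://residential-b:port")
if a["price"] != b["price"] or a["stock"] != b["stock"]:
    alert(f"POSSIBLE POISONING on {URL}: {a} vs {b}")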
```
### FAQ
**Q: How common is data poisoning?**
Among Fortune 500 e-commerce, ticketing, and travel sites it is widely deployed. Among smaller sites and most public-data targets it is far less common, because poisoning takes more sophisticated infrastructure than simply blocking. The rule of thumb: assume it on high-value commercial targets, and do not assume it elsewhere.
**Q: Can I tell if I am being poisoned right now?**
Not from a single request — fake data looks exactly like real data on its own. The only reliable test is to fetch the same URL through two or more independent proxy networks and look for mismatches in fields that should be stable. If your budget allows, run a daily audit on a random 5% sample.
**Q: Does using a residential proxy prevent poisoning?**
It reduces it. Datacenter IPs are poisoned most aggressively, residential IPs less so, and mobile IPs rarely. But the IP is only part of the picture — your fingerprint and behaviour matter too. A perfectly clean IP paired with a Python requests TLS fingerprint (the signature of the encryption handshake, which screams "script not browser") and robotic, evenly timed requests will still get poisoned by sophisticated targets.
**Q: What if I detect poisoning — how do I fix it?**
First, lower your bot-likelihood score: switch to residential or mobile IPs, use curl_cffi or Camoufox to fix your TLS and browser fingerprint, and add humanization to your timing and navigation. If the poisoning persists, route the affected URLs through a managed scraping API (Scrappey, Bright Data) for the validation passes — their aggregate scale makes it much harder for a site to single you out for poisoning.
---
## What Is Anubis (Anti-AI-Scraper Firewall)?
URL: https://scrappey.com/qa/anti-bot/what-is-anubis-firewall
**Anubis is a free, open-source, MIT-licensed "gatekeeper" that sits in front of a website (a reverse proxy - software that intercepts requests before they reach the real server) and forces each visitor's browser to solve a small math puzzle before any page is served. The puzzle is a SHA-256 proof-of-work challenge - a calculation that is hard to compute but easy to verify - designed to slow down AI scrapers that ignore robots.txt (the file where sites politely ask bots not to crawl them).** Released on January 19, 2025 by Xe Iaso (now maintained by Techaro), it has been adopted by GNOME GitLab, the Linux kernel mailing list archives, FFmpeg, Wine, UNESCO, FreeCAD, Duke University digital archives, and many other FOSS projects that do not sit behind Cloudflare. You can recognise it by its anime "weighing the soul" mascot illustration shown while the challenge runs.
### Quick facts
- **Released:** 19 January 2025 by Xe Iaso, now Techaro
- **License:** MIT
- **GitHub stars:** 19.6k+ (May 2026)
- **Algorithm:** Hashcash-style SHA-256 PoW (default: 5 leading zeros)
- **Notable deployments:** GNOME GitLab, Linux kernel archives, FFmpeg, Wine, UNESCO, FreeCAD, sourcehut
### How the challenge works
When a browser asks for a protected page, Anubis does not answer right away. Instead it hands back a puzzle: a random number plus a difficulty setting. The browser then has to keep trying different values (a *nonce* - a throwaway number) until it finds one where SHA-256(challenge || nonce) produces a hash that starts with a set number of zeros - five by default. There is no shortcut; you just try numbers until one works. This is the same Hashcash trick Bitcoin mining uses, shrunk down so a real browser solves it in about a second on a laptop.
Once the browser solves it, the answer is saved as a cookie (techaro.lol-anubis-auth) that lasts roughly a week, after which it must solve a fresh puzzle. The economics are the point: a real person who visits once a week pays a one-second tax and never notices. An AI scraper hitting 10,000 pages a day has to solve thousands of puzzles, and that CPU cost piles up until scraping becomes too expensive to be worth it.
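A minimal sketch of the Hashcash-style search described above, using only the standard library. Two details are assumptions made for illustration rather than Anubis's confirmed wire format: the nonce is appended to the challenge as a decimal string, and "leading zeros" is read as leading zero characters of the hex digest. The names `solve` and `verify` are hypothetical.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Try nonces in order until the SHA-256 hex digest starts with
    `difficulty` zero characters."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Checking an answer costs a single hash, however long solving took."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Difficulty 3 keeps the demo fast; the text above cites 5 as the default.
nonce = solve("example-challenge", difficulty=3)
print(nonce, verify("example-challenge", nonce, 3))
```

Under this reading each extra zero multiplies the expected work by 16, so five zeros average about 16^5 (roughly one million) hashes per solve, while verification stays at one hash.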
### Why FOSS projects deployed it
Anubis was built after Amazon's AI crawler hammered Xe's Git server while ignoring robots.txt. Within months, projects that had been losing bandwidth (and money) to ChatGPT, Claude, and Perplexity-style crawlers turned it on. The Linux kernel mailing list archive, sourcehut, FFmpeg, Wine, GNOME's GitLab, FreeCAD, and Duke's digital archives all run it. UNESCO digital repositories run it. They share the same problem: small hosting budgets up against industrial-scale crawling that ignores every opt-out signal.
### How effective is it in practice?
Its effectiveness has limits. **Codeberg reported in August 2025** that "many AI scraper bots had learned how to solve the Anubis challenges." Codeberg still found it useful - it had blocked most scraping for several months - but noted the bots had adapted.
Security researcher **Tavis Ormandy** documented that a proof-of-work solver written in fast native code (Go, Rust, C) computes Anubis challenges far quicker than the JavaScript version that runs in a normal visitor's browser, so the per-challenge cost is lower for native solvers than for ordinary browsers.
The practical takeaway: Anubis slows high-volume crawling down, raises the operating cost, and stops the cheapest scrapers entirely. It does not stop an operator willing to build a native solver, and a headless Chromium (a real browser engine running without a visible window, with JavaScript on) completes the challenge the same way any browser does - just slower than a purpose-built binary.
### Example
```python
# Anubis with headless Chromium solves naturally: JS runs, PoW completes.
# Persist the techaro.lol-anubis-auth cookie across requests to avoid re-solving.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ctx = browser.new_context()
    page = ctx.new_page()
    # Visit any protected URL once: Anubis serves the challenge, JS solves it
    page.goto("https://lkml.iu.edu/")
    page.wait_for_load_state("networkidle")
    # Save the auth cookie and reuse it for ~1 week
    cookies = ctx.cookies()
    anubis_cookie = next((c for c in cookies if "anubis-auth" in c["name"]), None)
    if anubis_cookie:
        print("Solved. Reuse this cookie for ~7 days:", anubis_cookie["value"][:40], "...")
    else:
        print("No Anubis auth cookie found; the page may not have served a challenge.")
    browser.close()
```
### FAQ
**Q: Is Anubis the same as Cloudflare Turnstile?**
Both can make the visitor's browser do proof-of-work, but they are run very differently. Anubis is software you host yourself, open-source under the MIT license, and proof-of-work is its whole mechanism. Turnstile is a service Cloudflare runs for you, and proof-of-work is only one of several non-interactive checks it mixes in. Anubis is what small FOSS projects without enterprise infrastructure deploy on their own servers; Turnstile is a hosted widget you embed on a page, whether or not the rest of the site sits behind Cloudflare. Similar puzzle idea, different operating model.
**Q: How long is the Anubis cookie valid?**
About one week by default. After a browser solves a challenge once, Anubis saves the techaro.lol-anubis-auth cookie, and any requests within that week go straight through without solving again. That keeps the cost for a real visitor near zero while still punishing high-volume scrapers, which solve puzzles constantly.
**Q: Does Anubis block search engines?**
By default it can. Crawlers from Google, Bing, and DuckDuckGo make plain HTTP requests without running JavaScript, so they never solve the puzzle and get blocked. To avoid that, Anubis includes a configurable allowlist of "known good" bots, identified by their User-Agent string and confirmed with reverse-DNS lookup (checking that the IP really belongs to the claimed crawler). The site operator decides which crawlers to wave through.
**Q: Will Anubis still be effective in 2027?**
The proof-of-work tax stays real no matter how advanced scrapers get - even with a fast native solver, crawling 10,000 protected pages still burns measurable CPU time. The next frontier is tuning difficulty: Anubis can crank up the required number of leading zeros for visitors it suspects are bots and ease off for likely humans. The arms race continues, but the tool keeps the cost lopsided in the site's favor.
---
## What Is Behavioural Bot Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-behavioural-detection
**Behavioural bot detection is the part of anti-bot scoring that asks "how does this client act?" instead of "what is this client?".** Instead of checking your identity, it watches what you do: how your mouse curves across the screen, how fast you scroll, your typing rhythm, click timing, how long you linger, and the tiny natural jitter that a human hand produces but a script does not. Leading behavioural-detection vendors analyse 35+ such signals per session in real time. This layer is what catches scrapers that pass every TLS (the encryption layer behind https), IP, and fingerprint check — because the giveaway is in *behaviour*, not identity.
### Quick facts
- **Signals analysed:** 35+ behavioural signals per session
- **Primary signals:** Mouse Bezier curves, scroll velocity, typing cadence, click timing, dwell
- **What it catches:** Linear mouse interpolation, constant sleeps, no-warm-up navigation
- **Related tooling:** Botasaurus Humancursor, Camoufox humanize=True, warm-up navigation
- **Practical note:** Scoring is probabilistic and considers the whole session, not single actions
### What gets measured
**Mouse movement.** A human hand moves in smooth, slightly wobbly arcs (often modelled as Bezier curves with random velocity). It slows down as it nears the target (Fitts's Law predicts that smaller and more distant targets take longer to hit), usually overshoots a little, then corrects. A scraper that jumps straight to a point with page.mouse.move(x, y) draws a perfectly straight line, which is statistically implausible for a real hand.
**Timing patterns.** How long between the page loading and your first action? How does your scrolling speed up and slow down? How evenly spaced are your keystrokes? How long do you stay on a page? Machine-learning models (software that learns patterns from examples), fitted to millions of sessions, measure these gaps at sub-millisecond precision, and WASM shared-buffer timers allow even finer resolution.
**Session shape.** Do you load the images and fonts a browser normally would? Do you visit the homepage first, or jump straight to a deep URL? Real users hesitate and load CSS and tracking pixels; bots and plain HTTP scrapers usually do not.
**Biometric micro-signals.** The faint tremor in a human's mouse path. Click pressure on touch devices. The rhythm of switching between mouse and keyboard. These are increasingly part of premium behavioural models.
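To make the mouse-movement signal concrete, here is a small sketch of one feature a detector could compute from a recorded pointer path: how straight it is. The `straightness` function and the sample coordinates are illustrative assumptions, not any vendor's actual model, which would combine many such features.

```python
import math


def straightness(points):
    """Ratio of straight-line distance to distance actually travelled.

    1.0 means a perfectly straight path; wobble and overshoot push it lower.
    """
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


scripted = [(0, 0), (50, 50), (100, 100)]                        # linear interpolation
curved = [(0, 0), (30, 55), (80, 95), (104, 101), (100, 100)]    # arc with a small overshoot
print(round(straightness(scripted), 3), round(straightness(curved), 3))
```

A single feature like this is easy to game on its own, which is one reason scoring looks at timing and session shape as well.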
### Why it catches "perfect" scrapers
A scraper can have a Chrome 148 JA4 fingerprint, a home-broadband (residential) IP, a genuine canvas hash, a perfectly matched timezone — and still fail behavioural scoring. The four identity layers all say "this is a real Chrome user." The behaviour layer replies: "this real Chrome user moves the mouse like nobody who has ever touched a computer."
That gap is what makes behaviour so hard to fake. Identity signals can be configured ahead of time (Camoufox C++ patches) or at request time (curl_cffi TLS). Behaviour is different: it cannot be configured statically because it has to be generated as the session runs, and accurately modelling how humans move and type is much harder than it looks. Tooling that approximated human input has historically been re-characterised within months by ML models retrained on the newer patterns.
### Inputs that behavioural models weigh
When working with sites you own or are authorized to automate, the inputs behavioural models weigh most heavily are well documented:
- **Input generation.** Tools such as Botasaurus with Humancursor (Bezier curves with random jitter and Fitts's Law deceleration), or Camoufox's humanize=True, generate pointer and scroll input that falls inside the statistical range of ordinary human input rather than the perfectly straight lines a naive script produces.
- **Navigation path.** Models score the shape of a whole session — landing on a homepage, dwelling, scrolling, and following internal links produces a different signal than jumping straight to a deep URL. This is one reason behavioural scoring looks across multiple requests.
- **Timing distribution.** A constant pause such as time.sleep(2) is itself a machine-like signal; varied timing like random.uniform(1.8, 4.3) sits closer to natural session timings.
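The timing bullet can be illustrated from the measuring side. This sketch computes the coefficient of variation of the gaps between actions; the function name and the sample timestamps are assumptions for illustration, and real models weigh far more than this one number.

```python
from statistics import mean, pstdev


def timing_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive actions.

    0.0 means perfectly even spacing, the pattern a fixed sleep produces.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return 0.0
    return pstdev(gaps) / mean(gaps)


fixed = [0.0, 2.0, 4.0, 6.0, 8.0]       # time.sleep(2) between actions
varied = [0.0, 2.4, 4.3, 8.1, 10.6]     # uneven gaps
print(timing_variation(fixed), round(timing_variation(varied), 2))
```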
The key takeaway: behavioural detection is probabilistic (a likelihood score, not a yes/no) and evaluates the session as a whole. At very high request rates from one IP, the session-level pattern stops resembling a single human regardless of input quality. Distributing authorized, device-like traffic across many home or mobile IPs — what a residential proxy pool provides — keeps each IP's rate within what a single real person could plausibly produce.
### Example
```python
# Pair humanized mouse with warm-up navigation and randomized delays.
# Note: humanize=True, scroll_human and click_human are kept as written here;
# confirm the exact names against the Botasaurus version you have installed.
from botasaurus.browser import browser
import random, time

@browser(proxy="http://user:pass@residential:port", humanize=True)
def scrape(driver, target_url):
    # Warm-up: visit homepage, dwell, scroll, click an internal link
    driver.get("https://target.com/")
    time.sleep(random.uniform(1.8, 4.3))
    driver.scroll_human(amount=500)
    driver.click_human("a[href*='/category']")
    time.sleep(random.uniform(2.1, 5.7))
    # Now navigate to the real target
    driver.get(target_url)
    time.sleep(random.uniform(1.5, 3.2))
    return driver.page_html

result = scrape("https://target.com/product/123")
```
### FAQ
**Q: How do anti-bot systems detect machine-like mouse movement?**
They record your mouse path at sub-millisecond resolution and compare it to patterns learned from real humans. Human paths follow smooth Bezier-like curves with naturally varying speed and a slight overshoot at the target. A scraper that draws a straight line between two coordinates produces a path no human would make, and a behavioural machine-learning model flags it within milliseconds.
**Q: Will random sleep solve behavioural detection?**
Partly — but only for timing, not movement. Using random.uniform(1.8, 4.3) for the pause between actions is better than a fixed time.sleep(2), but it does nothing about straight-line mouse paths or robotic scrolling. Behavioural scoring looks at many dimensions at once, so you have to humanize movement, scroll, and timing together.
**Q: Why does warm-up navigation help?**
Behavioural models score the shape of the whole session, not just single actions. Visiting the homepage, pausing 2–3 seconds, scrolling, clicking a category link, and only then reaching the data page matches how a real shopper browses. Landing straight on a deep product URL with no prior browsing does not — and that absence becomes a red flag, no matter how perfect each individual action looks.
**Q: How does request volume affect behavioural scoring?**
At sustained high request rates from a single IP, the session-level pattern stops resembling one human regardless of input quality. For authorized high-volume work the architectural answer is spreading load across a large residential or mobile proxy pool so each IP's request rate stays within what a single real person could plausibly produce. Input quality alone does not change the picture at scale.
---
## What Is a Session Cookie?
URL: https://scrappey.com/qa/anti-bot/what-is-a-session-cookie
**A session cookie is an HTTP cookie with no Max-Age or Expires attribute, so the browser keeps it only in memory and throws it away when the browsing session ends.** A cookie is just a small piece of data a server hands to the browser to remember it between requests; a session cookie is the most basic kind, defined in RFC 6265. The server sets it with a Set-Cookie header, and the browser sends it back on every later request in the same session. It is also called an *in-memory cookie*, *transient cookie*, or *non-persistent cookie*. Login state, shopping carts, and most anti-bot trust tokens (_abck, cf_clearance, datadome, reese84) are session cookies.
### Quick facts
- **Defining attribute:** Absence of Max-Age and Expires in Set-Cookie
- **Storage:** Browser RAM — not written to disk in most browsers
- **Lifetime:** Until the browsing session ends (browser-defined)
- **Standard:** RFC 6265 (HTTP State Management Mechanism, 2011)
- **Also known as:** In-memory cookie, transient cookie, non-persistent cookie
### How session cookies work
A server creates a session cookie by sending a Set-Cookie response header that has no Max-Age and no Expires attribute — those are the two things that would give it a lifespan:
HTTP/1.1 200 OK
Set-Cookie: sessionId=abc123; Path=/; HttpOnly; Secure; SameSite=Lax
The browser holds this cookie in memory (not on disk) and automatically attaches it to the Cookie header of every later request to the same origin during the same session:
GET /account HTTP/1.1
Cookie: sessionId=abc123
Close the browser and the cookie disappears. On the next visit the server sees no sessionId, so it treats you as a brand-new visitor: there is nothing left to recognise you by.
By default, per RFC 6265, the cookie goes back only to the exact origin server that set it (not its subdomains) unless the Domain attribute widens that scope. The Path attribute can do the opposite and limit which URLs receive the cookie.
### Session cookie vs persistent cookie
The only thing that separates the two is how long they live, and that is decided by two attributes on Set-Cookie:
| Cookie type | Has Expires or Max-Age? | Stored where | Survives browser restart? |
| --- | --- | --- | --- |
| **Session** | No | RAM | No (usually) |
| **Persistent** | Yes | Disk | Yes, until expiration |
A persistent cookie lives on disk through browser restarts and is removed only when its Max-Age runs out, its Expires date passes, or the user clears it by hand. The "remember me" checkbox on a login form simply turns a session cookie into a persistent one by attaching an expiry date weeks or months in the future.
**Caveat:** modern browsers smudge this line with *session restoration*. Chrome, Firefox, and Safari can keep session cookies across restarts when the user has "continue where you left off" turned on. To the server it is still a session cookie; to the user it may quietly survive a reboot. RFC 6265 deliberately leaves what counts as "session ends" up to the browser.
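The distinction can be checked mechanically from a Set-Cookie value. A minimal sketch using the standard library's `http.cookies`; the helper name `is_session_cookie` and the sample header values are illustrative.

```python
from http.cookies import SimpleCookie


def is_session_cookie(set_cookie_value: str) -> bool:
    """True when the Set-Cookie value carries neither Expires nor Max-Age."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    morsel = next(iter(jar.values()))  # the single cookie this header defines
    return not morsel["expires"] and not morsel["max-age"]


print(is_session_cookie("sessionId=abc123; Path=/; HttpOnly; Secure; SameSite=Lax"))
print(is_session_cookie("remember=abc123; Path=/; Max-Age=2592000; Secure"))
```

This only reflects what the server asked for; as the caveat above notes, a browser with session restoration may still keep the first cookie across a restart.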
### The five attributes that change behaviour
Even with no Expires/Max-Age, a Set-Cookie header can carry several other attributes that meaningfully change how the session cookie behaves:
- **HttpOnly** — JavaScript cannot read the cookie through document.cookie. This matters because it blocks XSS attacks (malicious scripts injected into a page) from stealing session IDs. A session cookie missing this flag is a security bug.
- **Secure** — the cookie is sent only over HTTPS (encrypted) connections, never plain HTTP. Every modern session cookie should have it.
- **SameSite=Lax | Strict | None** — controls whether the cookie rides along on requests coming from other sites. Lax (today's default) sends it when you click through to the site but not on cross-site sub-requests like <img> tags or XHR (background JavaScript fetches). Strict blocks every cross-site request. None allows full cross-site sending but requires Secure — used by embedded widgets and federated login. Most CSRF protection (defence against forged cross-site requests) comes from SameSite=Lax.
- **Domain** — widens the cookie from origin-only to a whole parent domain (e.g. Domain=example.com makes it visible to api.example.com, www.example.com, and so on). Without it, RFC 6265 keeps the cookie tied to the exact origin that set it.
- **Path** — restricts the cookie to certain URL paths (e.g. Path=/account sends it on /account/* but not /blog/*). Used less often.
A modern, well-configured session cookie looks like:
Set-Cookie: sessionId=abc123; Path=/; HttpOnly; Secure; SameSite=Lax
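On the server side, a header like this can be produced with the standard library alone. This is a sketch under assumptions: the function name is hypothetical, and 32 random bytes (256 bits) is simply one reasonable choice of session ID size.

```python
import secrets
from http.cookies import SimpleCookie


def new_session_cookie(name: str = "sessionId") -> str:
    """Build a Set-Cookie value for a fresh session cookie."""
    jar = SimpleCookie()
    jar[name] = secrets.token_urlsafe(32)  # 32 random bytes from a CSPRNG
    jar[name]["path"] = "/"
    jar[name]["httponly"] = True
    jar[name]["secure"] = True
    jar[name]["samesite"] = "Lax"
    # No "expires" or "max-age" is set, which is what keeps it a session cookie.
    return jar[name].OutputString()


print(new_session_cookie())
```

The attribute order in the output may differ from the example above; browsers do not care about the order.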
### Session cookies in anti-bot systems
Nearly every major anti-bot vendor stores its session-level trust in a session cookie. The cookie is the visible sign that a client has passed — or is currently being scored against — the vendor's detection model. Examples:
- **_abck** — Akamai Bot Manager. Its value starts at ~-1~ (untrusted) and flips to ~0~ only after sensor.js POSTs valid signals. Trust builds up over the session.
- **cf_clearance** — Cloudflare. Shows the client solved a JavaScript challenge or a Turnstile interaction.
- **datadome** — DataDome. Scored per request, but the cookie tags sessions already seen and lightly trusted.
- **reese84** — F5 Shape Security. A validated session token from its custom JS VM; expires in minutes.
- **_px3 / _pxde** — PerimeterX / HUMAN Security. Carries the 5-vector fingerprint score.
- **x-kpsdk-ct** — Kasada. A single-use proof-of-work token, not reusable across requests.
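All of these carry session-level trust. Strictly by RFC 6265's definition a session cookie has no Expires or Max-Age, and some vendors do set an explicit lifetime (cf_clearance, for instance, has a configurable expiry), but the trust each cookie represents stays bound to the session that earned it. That has a real consequence for scrapers: swapping proxies in the middle of a session throws away the trust those cookies built up, which is why ISP static residential IPs are preferred over rotating residential for Akamai targets.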
The takeaway: when an anti-bot vendor talks about "session-level scoring", they almost always mean a session cookie they control. Knowing which cookie they set, what state values to expect, and when the state flips is half the work of handling the vendor's verification flow.
### Security and privacy considerations
- **Session fixation.** If an attacker can plant a session cookie value (via a subdomain takeover, a network MITM — man-in-the-middle interception — or a too-permissive Domain attribute), they can hijack the victim's session after the victim logs in. Defence: regenerate the session ID at every privilege boundary (login, role change) and always set HttpOnly + Secure + SameSite=Lax.
- **Session hijacking via XSS.** Without HttpOnly, injected JavaScript can read document.cookie and ship the session ID off to an attacker-controlled URL. The HttpOnly flag is the single most important protection for session cookies; it should be on by default for every one of them.
- **GDPR and consent.** Strictly necessary session cookies (login, shopping cart, CSRF tokens) need no consent under GDPR. Analytics and tracking cookies do — even when they are technically session cookies in the RFC sense. The legal line is drawn by purpose, not by lifetime.
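- **Third-party session cookies.** Cross-site session cookie use cases (federated login, ad attribution) are increasingly unreliable: Safari blocks third-party cookies by default, Firefox partitions them per site, and Chrome, although it dropped its planned full phase-out in 2024, restricts them in some modes. First-party session cookies are untouched and remain core to web authentication.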
- **Session cookie length.** Common in 2026: 128–256 bits of entropy (entropy here meaning how hard the value is to guess), base64 or URL-safe encoded. Shorter is brute-forceable; longer is wasteful. Always generated server-side from a CSPRNG (a cryptographically secure random generator), never from time or user data.
### Example
```python
# How session cookies behave in scraping: preserve them across requests
from curl_cffi import requests

# 1. Use a Session so cookies persist across .get() / .post() calls
s = requests.Session(impersonate="chrome131")

# 2. Warm-up request: server sets the session cookie
r1 = s.get(
    "https://target.example.com/",
    proxies={"https": "http://user:pass@isp-residential:port"},
)
# Inspect what was set; anti-bot vendors usually mark their cookie here
for name in s.cookies.keys():
    if name in ("_abck", "cf_clearance", "datadome", "reese84", "_px3"):
        print(f"Anti-bot session cookie set: {name}")

# 3. Subsequent request: Session re-sends the cookie automatically
r2 = s.get(
    "https://target.example.com/api/protected",
    proxies={"https": "http://user:pass@isp-residential:port"},
)
# Same Session, same IP, same cookies: trust accumulates on multi-request
# scored vendors like Akamai.

# 4. To clear and start a new session: instantiate a new Session().
# Closing the Python process is the HTTP equivalent of closing the browser.
```
### FAQ
**Q: How long does a session cookie last?**
Until the browsing session ends, and the browser decides what that means. In practice it usually lasts until you close the browser — but browsers with "session restoration" turned on can keep session cookies across restarts. RFC 6265 intentionally leaves the meaning of "session ends" to the browser, so the answer depends on the browser, its configuration, and whether the user actually closed it or just shut the laptop lid.
**Q: What is the difference between a session cookie and a persistent cookie?**
A session cookie has no Max-Age or Expires attribute, so the browser keeps it only in memory and deletes it when the session ends. A persistent cookie has one of those attributes, so the browser writes it to disk and holds it until the set expiration date. Lifetime is the only difference — on the network the format, transmission, and behaviour are identical.
**Q: Are session cookies safe?**
A session cookie with the HttpOnly, Secure, and SameSite=Lax attributes is the safe modern default. HttpOnly blocks JavaScript from reading it (stopping XSS theft). Secure keeps it on HTTPS only. SameSite=Lax blocks most CSRF (forged cross-site request) attacks. Missing any of the three, a session cookie can be stolen via XSS, leaked over plain HTTP, or abused in cross-site requests.
**Q: Do I need cookie consent for session cookies under GDPR?**
Only if they are not strictly necessary. Login cookies, shopping cart cookies, and CSRF tokens count as strictly necessary and are exempt from consent under Article 5(3) of the ePrivacy Directive, the cookie rule that applies alongside GDPR. Session cookies used for analytics, tracking, or advertising still need consent: the legal line is the purpose, not the technical lifetime.
**Q: Why do anti-bot vendors use session cookies?**
Because the trust score they give a client grows over many requests. Keeping that score in a cookie tied to the session lets it travel with the client around the site without re-running the full fingerprint check on every request. _abck, cf_clearance, datadome, reese84, and _px3 all play this role, carrying trust that is meant to be bounded by the session that earned it.
**Q: Can I read a session cookie from JavaScript?**
Only if it lacks the HttpOnly flag. Cookies with HttpOnly are invisible to document.cookie and any client-side script — they live only at the HTTP protocol layer. That is by design, to protect session IDs from being stolen by XSS. If you control the server and need a session ID, set HttpOnly and read it server-side; never trust client-side JavaScript with authentication tokens.
**Q: What happens to my session cookie if I clear browser cookies?**
It is deleted right away, like any other cookie. Your next request to that site arrives with no session cookie, so the server treats you as a fresh visitor — you are logged out, your cart is empty, and any session-scoped state is gone. It is the user-side equivalent of restarting the server-side session, with the same result.
---
## How Do Websites Detect Web Scrapers?
URL: https://scrappey.com/qa/anti-bot/how-websites-detect-scrapers
**Websites spot scrapers by gathering hundreds of small clues about each visitor, then scoring how human the whole picture looks.** No single clue gets you blocked: anti-bot systems add up many signals (an "ensemble" score) and decide based on the total. The clues come from four layers (network, transport, browser, and behavioral), and these layers stay the same even as the exact checks change. In practice that means IP reputation, TLS fingerprint (TLS is the encryption behind https), HTTP/2 frame ordering, header consistency, JavaScript probes that run in the browser, canvas/WebGL/audio fingerprints (tiny rendering differences unique to your setup), and mouse/timing behavior.
### Quick facts
- **Network layer:** IP reputation, ASN, geolocation, connection reuse
- **Transport layer:** TLS JA3/JA4, HTTP/2 frame ordering, ALPN
- **Browser layer:** Canvas, WebGL, audio context, font enumeration, navigator probes
- **Behavioral layer:** Mouse movement, scroll velocity, dwell time, click timing
- **Decision model:** Ensemble score across all signals — no single tell
### Network signals (the first filter)
Before any JavaScript runs, the site already knows a lot from your IP address alone: its ASN (the network it belongs to — e.g. Amazon vs. a home ISP), its past reputation, and whether its location makes sense. Datacenter IPs (AWS, GCP, DigitalOcean) get almost no trust by default, because real users rarely browse from a server farm. Residential and mobile IPs start out neutral. IPs caught misbehaving before get blacklisted right at the edge. This one filter handles about 70% of low-effort scraping traffic before any fingerprinting is even needed.
### Transport signals (TLS and HTTP/2)
Every https connection starts with a TLS handshake — the step where client and server agree on encryption. That handshake exposes a JA3/JA4 fingerprint: the list of cipher suites, extensions, and elliptic curves your client offers, in the exact order it offers them. Python's requests library has a JA3 that instantly says "not a browser." HTTP/2 adds more tells, like the order of frame priorities and headers. Real Chrome sends headers in a particular order; curl sends them differently. Anti-bot vendors keep catalogs of known automation-tool fingerprints and block anything that matches.
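To make "in the exact order it offers them" concrete, here is a small sketch of how a JA3 hash is derived: the decimal values of five Client Hello fields are joined into one string and hashed with MD5. The numeric values below are illustrative only, not a capture from any real client, and GREASE filtering is left out for brevity.

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and return (ja3_string, md5_hex_digest)."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# Illustrative values only.
client_a = ja3_hash(771, [4865, 4866, 49195], [0, 23, 65281], [29, 23], [0])
# Same cipher suites, offered in a different order.
client_b = ja3_hash(771, [49195, 4865, 4866], [0, 23, 65281], [29, 23], [0])
```

The two clients support identical cipher suites, yet their hashes differ because the order differs. That order sensitivity is why a non-browser TLS stack stands out even when it offers modern ciphers.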
### Browser signals (JS-collected)
If you make it past the network and transport filters, the page runs JavaScript that quietly inspects your browser. It checks things like the canvas rendering hash (the exact pixels your machine draws), the WebGL renderer string (your graphics hardware), an audio fingerprint, installed fonts, screen size, timezone, languages, the navigator.webdriver flag, and dozens more. Faking any one of these is easy; the hard part is making them all agree with each other. A spoofed canvas paired with a real WebGL value is actually a stronger bot signal than either one alone, because the mismatch gives you away.
### Behavioral signals (the last layer)
Once the page loads, the site watches how you act: mouse movement, scrolling, how long you wait before clicking, and how fast you fill in forms. Real people move the mouse in jittery, curved paths, scroll in bursts, and pause at random. Scrapers either skip all of this (no mouse event ever fires) or fake it in patterns that machine-learning models recognize with high confidence. This is the layer that catches headless browsers — automated browsers with no visible window — that pass every static fingerprint check.
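The four layers above feed a single decision. A rough sketch of that ensemble idea follows, assuming each layer reports a suspicion value between 0 and 1. The weights and threshold are hypothetical, since real vendors do not publish theirs and typically use trained models rather than a fixed weighted sum.

```python
# Hypothetical weights and threshold, for illustration only.
LAYER_WEIGHTS = {"network": 0.35, "transport": 0.30, "browser": 0.20, "behavior": 0.15}
BLOCK_THRESHOLD = 0.5

def ensemble_score(suspicion: dict) -> float:
    """Weighted sum of per-layer suspicion values, each expected in [0, 1]."""
    return sum(weight * suspicion.get(layer, 0.0) for layer, weight in LAYER_WEIGHTS.items())

def verdict(suspicion: dict) -> str:
    """Block only when the combined score crosses the threshold."""
    return "block" if ensemble_score(suspicion) >= BLOCK_THRESHOLD else "allow"

# One bad signal in an otherwise clean picture.
one_bad_signal = {"network": 0.0, "transport": 0.0, "browser": 1.0, "behavior": 0.0}
# Several moderately bad signals across layers.
many_bad_signals = {"network": 0.9, "transport": 0.8, "browser": 0.2, "behavior": 0.5}
```

With these made-up numbers, the single bad browser signal stays under the threshold while the spread of moderately bad signals crosses it, which is the "no single tell" behavior described above.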
### A worked example — what a single request reveals
Take one GET request to an Akamai-protected site from a plain Python requests script. Here is what each layer sees:

| Layer | What's observed | Verdict |
|---|---|---|
| Network | JA4 hash matches Python urllib3, not Chrome | Bot |
| Transport | No HTTP/2; connection negotiates HTTP/1.1 | Bot |
| Headers | Accept-Encoding: gzip, no Accept-Language, User-Agent claims Chrome | Incoherent, bot |
| IP | AWS us-east-1 datacenter ASN | Bot |
| JavaScript | No script execution; sensor.js never ran | Bot or non-browser |

Every layer independently flags this as a bot. Akamai returns a 412 status with the *Pardon Our Interruption* page, the _abck cookie stays stuck at ~-1~ (its "not verified" state), and any protected XHR endpoints refuse to work because of that cookie. The bot was already caught at the TLS handshake — every layer after that just confirmed it.
Now run the same request with curl_cffi + Chrome impersonation + an ISP residential proxy: the JA4 matches a real Chrome, HTTP/2 works, the headers line up, and the IP looks residential. The same endpoint now returns 200. The scraping logic is unchanged; only what the client presents at the network level (TLS fingerprint, HTTP version, headers, and IP) is different.
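The "Incoherent" verdict on the headers in this example can be sketched as a simple consistency check. This is a hypothetical heuristic, not any vendor's actual rule set: it assumes an https request and flags a Chrome User-Agent that lacks headers real Chrome normally sends.

```python
def header_incoherences(headers: dict) -> list:
    """Return reasons a header set looks inconsistent with the browser it claims to be."""
    h = {name.lower(): value for name, value in headers.items()}
    problems = []
    claims_chrome = "chrome/" in h.get("user-agent", "").lower()
    if claims_chrome and "accept-language" not in h:
        problems.append("Chrome User-Agent without Accept-Language")
    if claims_chrome:
        encodings = {e.strip() for e in h.get("accept-encoding", "").split(",")}
        if "br" not in encodings:
            problems.append("Chrome User-Agent without brotli in Accept-Encoding")
    return problems

# Headers like the plain script in the worked example.
script_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip",
}
# Headers closer to what a desktop Chrome sends over https.
browser_like_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
}
```

Running the check on script_headers reports both problems; browser_like_headers reports none. Neither header is suspicious alone; the mismatch with the claimed browser is the signal.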
### How this is shifting in 2026
Three trends are reshaping how detection works:
- **JA4 has displaced JA3 as the primary TLS fingerprint** across major vendors. Matching only an old JA3 profile now produces a "wrong-shape Chrome" signal, because vendors check both. curl_cffi, utls, and tls-client all support JA4, so there is no reason to be stuck on JA3 in 2026.
- **WASM challenges are now standard at the enterprise tier.** WASM is compiled code that runs in the browser, harder to inspect or fake than plain JavaScript. DataDome's boring_challenge shipped in 2023; Akamai and PerimeterX added WASM probes through 2024. These can no longer be addressed at the JavaScript layer (see the WASM fingerprinting entry); handling them has moved down into the browser engine itself (Camoufox, CloakBrowser).
- **Behavioural signals are tracked per-session, not per-request.** Vendors now collect clicks, scrolls, and timing across a whole session and score the overall pattern. A single request with a flawless fingerprint can still get flagged by your behavior on request 50. The fix is realistic pacing and warm-up over time, not a perfect one-off fingerprint.
What hasn't changed: the cost ranking of fixes. Network-layer fixes are still the cheapest, behavioural fixes still the most expensive. Move up the layers only as the one below stops working.
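For authorized collection, "realistic pacing and warm-up" can be sketched as a delay schedule that starts slow and never settles into a fixed interval. The base delay, warm-up length, and jitter range below are arbitrary illustrative choices, not recommended values.

```python
import random

def paced_delays(n_requests, base_seconds=4.0, seed=None):
    """Return n jittered delays in seconds: slower during warm-up, never perfectly regular."""
    rng = random.Random(seed)
    delays = []
    for i in range(n_requests):
        warmup = 2.0 if i < 3 else 1.0      # the first three requests wait twice as long
        jitter = rng.uniform(0.5, 1.5)      # vary each interval by up to 50% either way
        delays.append(base_seconds * warmup * jitter)
    return delays

delays = paced_delays(10, base_seconds=4.0, seed=1)
```

A caller would sleep for each value between requests. Passing a seed makes the schedule reproducible for testing; leaving it as None gives a different schedule on every run.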
### FAQ
**Q: Can I scrape without triggering detection?**
Not with one trick. You can lower the chance of detection a lot by combining good residential IPs, browser-matching TLS, realistic fingerprints, and unhurried request pacing. Perfect invisibility isn't a realistic goal — the aim is to look enough like a real user that blocking you costs the site more than letting you through.
**Q: Which signal is most important to fix first?**
The IP. Datacenter IPs lose before any other signal is even collected. A residential or mobile IP is what gives every other signal a chance to matter.
**Q: Why does my scraper work in a browser but fail headless?**
Headless Chrome (a browser with no visible window) leaks a dozen tells: navigator.webdriver is set to true, chrome.runtime is missing, the permissions API behaves oddly, and more. Use a real browser with a consistent, full-stack configuration (Camoufox, PatchRight), or a managed scraping API that handles the whole fingerprint surface for you on sites you are permitted to access.
**Q: Is "block at the TLS handshake" really one signal or four?**
One. The TLS Client Hello — the opening message of an https connection — is hashed into a JA4 fingerprint in microseconds and compared against known browsers. If it doesn't match any browser baseline, the connection is dropped before the HTTP layer reads a single byte. No User-Agent, IP, or header can save you, because none of them ever get read.
**Q: Which signal is the most-checked across vendors?**
navigator.webdriver — a browser property that flags automation. Every automation framework sets it to true by default (Selenium, Playwright, Puppeteer), and nearly every anti-bot script tests for it. It is the cheapest possible check and catches all unmodified scrapers. Overriding the property is trivial, but the more interesting check — using Function.toString() to spot that the property has been tampered with — is what catches the modified ones.
---
## What Is an Anti-Scraping Mechanism?
URL: https://scrappey.com/qa/anti-bot/what-is-an-anti-scraping-mechanism
**An anti-scraping mechanism is any technical control a website uses to detect, slow down, or block automated requests (bots) instead of real people.** Modern sites don't rely on one trick — they stack several: rate limiting (capping how many requests you can send), IP reputation (judging your network address by its history), TLS fingerprinting (TLS is the encryption layer behind https; its handshake leaks clues about your tool), JavaScript challenges, CAPTCHAs, and behavioral analysis. Any single layer is cheap to handle on its own. The point is that the layers compound — and that combined depth is what makes most casual automated traffic uneconomical.
### Quick facts
- **Cheapest layer:** Rate limiting + IP blocklist
- **Middle layers:** TLS fingerprinting, header validation, JS challenges
- **Hardest layer:** Behavioral ML + custom JS VMs (Shape, Kasada, DataDome)
- **Best response:** Match the effort of handling each layer to the value of the data
- **Vendor examples:** Cloudflare, DataDome, Akamai, PerimeterX, Kasada, F5 Shape
### The layered model
Real anti-scraping is not one product but a stack of checks, like a building with security at the gate, the lobby, and every floor. At the edge (the first thing your request hits): WAF rules (a Web Application Firewall, which filters traffic by pattern), rate limits, and ASN blocklists (an ASN identifies the network your IP belongs to, so a whole hosting provider can be blocked at once). One layer in: TLS fingerprint validation, header consistency checks, and HTTP/2 frame analysis, all looking for tells that you are software, not a browser. Inside the page: JavaScript challenges (a small puzzle the browser must solve, such as proof-of-work, plus fingerprint collection) and CAPTCHAs. After the page loads: behavioral analysis on your mouse, scroll, and timing. A request that passes all four stages is treated as human. A request that fails any one is scored down, and repeated failures escalate the next request to a harder challenge.
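The cheapest edge control in this stack, rate limiting, is often implemented as a token bucket. The sketch below is a generic version of that technique seen from the site's side, not any vendor's implementation. The clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=3)
burst = [bucket.allow(now=0.0) for _ in range(5)]   # five requests at the same instant
later = bucket.allow(now=2.0)                        # one request two seconds later
```

The first three requests in the burst pass and the next two are refused; after two seconds the bucket has refilled enough to allow traffic again. In production one bucket is usually kept per IP, session, or API key.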
### How vendors compose
Anti-scraping is usually bought, not built. Cloudflare and Akamai handle the edge layers and JS challenges as a managed product you simply switch on. DataDome and Kasada specialize in the JS-VM and behavioral layers (a JS-VM is a sandbox that runs obfuscated detection code in your browser). Shape Security (F5) builds custom JS virtual machines that re-obfuscate — scramble themselves — on every deployment, so each release looks new. Many sites stack two vendors: Cloudflare at the edge plus DataDome for bot management is a common pairing. Satisfying one layer does not satisfy the other — each vendor scores requests independently.
### Matching response to the stack
For authorized data collection on sites you own or are permitted to access, the first question is not "how do I get through this?" but "is the data even worth the engineering effort?" A simple rate limit costs hours of work to handle correctly. A stacked Cloudflare + DataDome + behavioral ML (machine-learning) system can cost weeks of engineering plus a recurring proxy bill in the thousands per month. Managed scraping APIs spread that cost across all their customers, so above a certain volume they are usually cheaper than building and maintaining the same infrastructure in-house.
### FAQ
**Q: What is the difference between anti-bot and anti-scraping?**
Mostly they mean the same thing and are used interchangeably. "Anti-bot" stresses blocking any automation at all — including credential stuffing (trying stolen passwords), ad fraud, and account abuse. "Anti-scraping" narrows the focus to data extraction. The underlying defenses are the same either way.
**Q: Can a single tool handle every anti-scraping stack?**
No single tool fits every stack. For authorized collection, the cost-effective approach is to match the tool to the target: a managed scraping API for heavily defended sites, and a lightweight HTTP client for simple ones — sized to how each site is actually built.
**Q: Are anti-scraping mechanisms legal?**
Yes — sites are entitled to defend their own infrastructure. Accessing publicly visible data you are permitted to view is generally legal in most jurisdictions. Reaching non-public data, or circumventing explicit access controls (like a login) you have no authorization for, may not be.
---
## What Is Headless Browser Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-headless-browser-detection
**Headless browser detection is the set of probes anti-bot systems use to distinguish a headless or instrumented Chrome session from a real user's browser.** A "headless" browser is a real browser running with no visible window, usually driven by code; "instrumented" means it is being controlled by an automation tool. Plain Puppeteer or Playwright leaks at least a dozen detectable signals out of the box — navigator.webdriver set to true (a flag browsers raise when automation is in control), a missing chrome.runtime object, a predictable plugins list, and blank or always-identical canvas output (the image a page draws to fingerprint your graphics stack). Stealth patches close most of the easy tells; the hard ones survive into 2026.
### Quick facts
- **Easiest tell:** navigator.webdriver === true
- **Common tells:** Missing chrome.runtime, plugin array length, languages mismatch
- **Hard tells:** Canvas/WebGL fingerprint, Function.toString() on patched APIs
- **Stealth tools:** playwright-stealth, puppeteer-extra-plugin-stealth, Camoufox, PatchRight
- **2026 reality:** JS-layer stealth patches leave toString() trails, and Kasada catches them
### The easy tells
These are the giveaways a detector can catch with a single line of JavaScript. By default, Puppeteer and Playwright launch with navigator.webdriver === true (the automation flag is on), window.chrome missing or stripped out, an empty navigator.plugins list, and an HTTP language header that does not match what navigator.languages reports inside the page. Stealth plugins patch all of them — but the patches themselves are detectable (see below).
### The complete signal inventory
Twelve signals modern anti-bot scripts inspect for headless detection, grouped by how cheap they are to spoof (fake convincingly):

| Signal | What's checked | Spoof cost |
|---|---|---|
| navigator.webdriver | === true in unmodified Playwright/Puppeteer/Selenium | Trivial JS override (but see toString check below) |
| User-Agent "HeadlessChrome" | Default headless Chrome substring | Trivial: one line |
| navigator.plugins | Empty array in default headless | Trivial JS override |
| navigator.languages | Length 1 in default headless vs typical 2-3 | Trivial |
| WebGL renderer | SwiftShader / llvmpipe = no GPU | Medium: engine-level patch needed |
| AudioContext fingerprint | ~3 known headless-Linux hashes | Medium: virtual audio device or engine patch |
| Canvas fingerprint | Stable per-machine; headless on Linux produces a small cluster | Hard: PerfectCanvas replay required |
| CDP runtime artifacts | window.cdc_* keys, Runtime.evaluate timing | Hard: undetected-chromedriver patches, breaks on update |
| Function.toString() inspection | Every JS-patched method returns its source, not [native code] | Very hard: needs engine-level patch |
| Permissions API quirks | Notification.permission === 'default' in headless on a denied-notifications profile | Medium |
| Mouse + scroll absence | Zero mouse events before click | Medium: synthesize Bezier-curve movement |
| requestAnimationFrame cadence | Headless renders at fixed 60Hz with no vsync jitter | Hard: engine-level |

Here is the key idea. A stealth plugin patches 5–10 of these from inside JavaScript, but the Function.toString() check listed above defeats every JS-layer patch at once, because in JavaScript you can ask any function to print its own source code. A real browser API prints [native code]; a patched one prints the replacement, exposing the patch. **Patching below the page's JavaScript, either in the browser's own C++ engine (Camoufox, CloakBrowser) or in the automation driver's source (PatchRight), is the only durable answer in 2026.**
### Toolchain status in 2026
Where each stealth toolchain stands against the 12-signal inventory above:

| Tool | Approach | Defeats |
|---|---|---|
| Vanilla Playwright/Puppeteer | None | Nothing: block-grade on first request |
| puppeteer-extra-stealth | JS-layer patches (~17) | Easy tells; loses to toString inspection |
| undetected-chromedriver | Binary + JS patches | Easy + medium tells; loses to toString and Canvas |
| SeleniumBase UC mode | Wraps UC; adds Turnstile auto-click | Same as UC, friendlier API |
| PatchRight | Patches Playwright Python source; patches never exist as JS | Easy + medium + toString. Loses to deep Canvas/WebGL only at enterprise tier. |
| Camoufox | Firefox fork with C++ engine patches + real-machine profile DB | All 12 signals. Hyphenation-dictionary check can still expose it as Firefox. |
| CloakBrowser | Chromium fork with 49 C++ patches | All 12 signals. reCAPTCHA v3 ~0.9 score. |

The dividing line is simple: are the patches above or below the JavaScript engine? Above (Playwright, Puppeteer-extra, UC, SeleniumBase) means Function.toString() can read the patch and detect it. Below (PatchRight, Camoufox, CloakBrowser) means the patch is baked into the browser binary and is invisible to JavaScript inspection. Production scraping in 2026 picks tools from the bottom of this list.
### The medium tells
These take a little more work to catch than a one-line flag check. Headless Chrome ships with a slightly different default set of fonts than desktop Chrome. The HeadlessChrome string appears in the user agent unless you override it. Asking the graphics API for its hardware name, by calling getParameter(UNMASKED_RENDERER_WEBGL) on a WebGL context with the WEBGL_debug_renderer_info extension, returns "Google SwiftShader" (a software renderer, i.e. no real GPU) on headless instead of a genuine GPU name. The permissions API returns "denied" for notifications without ever prompting the user. Each of these is patched in modern stealth tools.
### The hard tells (2026)
The current frontier is meta-detection — catching the act of patching itself rather than any one fingerprint. Anti-bot systems call Function.prototype.toString() on patched native APIs to see whether they return function () { [native code] } (a genuine browser function) or the stealth tool's replacement code (a giveaway). playwright-stealth fails this check; Kasada catalogs the patch signatures and blocks on a match. The 2026 answer is to patch the browser source itself (Camoufox, PatchRight) so there is nothing in the JavaScript runtime for toString() to inspect.
### Example
```javascript
// Detect headless from the page side
const tells = {
  webdriver: navigator.webdriver,
  pluginCount: navigator.plugins.length,
  chromeRuntime: typeof window.chrome?.runtime,
  // Unmasked renderer string; null when WebGL or the debug extension is unavailable
  webglRenderer: (() => {
    const gl = document.createElement('canvas').getContext('webgl');
    const ext = gl?.getExtension('WEBGL_debug_renderer_info');
    return ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : null;
  })()
};
// Real Chrome: webdriver=false, plugins>0, chrome.runtime='object', GPU string
// Headless: webdriver=true, plugins=0, chrome.runtime='undefined', 'SwiftShader'
```
### FAQ
**Q: Does playwright-stealth actually work?**
For soft targets, yes. Against Cloudflare bot management, Kasada, or DataDome, no — its patch signatures are catalogued and detected through Function.toString() inspection (asking a function to print its source and noticing it is not native browser code).
**Q: Is Camoufox detectable?**
Less so than stealth-patched Chromium. It is a Firefox fork with anti-fingerprinting built into the source code, so there are no runtime patches for toString() to find. It can still be caught through behavioral signals — like the absence of mouse movement — if you do not emulate them.
**Q: Should I just use a real browser session via CDP?**
For low volume, yes — attach to a normal Chrome running a real user profile through CDP (Chrome DevTools Protocol, the wire interface tools use to drive Chrome). For high volume, the operational cost of managing real browsers outweighs the detection cost of a hardened headless setup or a managed scraping API.
**Q: Is there ever a case where vanilla Playwright is enough?**
Yes, on sites with no anti-bot protection — most internal tools, documentation sites, marketing landing pages, and small e-commerce. Past roughly 10% of the public sites you encounter, you hit detection of some kind; even Cloudflare's Bot Fight Mode blocks vanilla Playwright using the datacenter-IP heuristic (flagging traffic from server data centers rather than home connections). For anything advertised as production scraping, start with a patched variant.
---
## How Browser Fingerprinting Works
URL: https://scrappey.com/qa/anti-bot/what-is-browser-fingerprinting-evasion
**Browser fingerprinting is how a site combines signals — canvas, WebGL, audio, fonts, navigator probes, TLS (the encryption layer behind https, which has its own identifying pattern) — into a single identifier for a browser.** A fingerprint is the combined set of these signals, which together can identify one browser. For an automated browser used in authorized workflows on sites you own or are permitted to access, the practical concern is configuration consistency: an empty or contradictory fingerprint is itself unusual, while an internally consistent configuration that matches a real device behaves the way a normal browser would.
### Quick facts
- **Surfaces to handle:** Canvas, WebGL, audio, fonts, navigator, screen, timezone, TLS, HTTP/2
- **Hardest constraint:** Cross-surface consistency (Mac UA + Linux fonts = block)
- **Recommended tools:** Camoufox, PatchRight, Brave, undetected-chromedriver
- **Anti-pattern:** Randomizing each surface independently — produces impossible combinations
- **Rotation strategy:** Whole-profile rotation, not per-surface randomization
### Why per-surface randomization fails
The obvious-but-wrong approach is to randomize each fingerprint surface on its own — a random canvas hash here, a random WebGL string there, a random font list. The problem is that these signals are not independent in real life, so random combinations produce a device that could not exist: a macOS user-agent paired with a Linux font set, an NVIDIA GPU string on a Mac screen aspect ratio, an Asia/Tokyo timezone with US English. Anti-bot models, trained on millions of real users, know these pairings never occur and flag them instantly. The disguise becomes the giveaway.
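A toy version of that cross-surface check is sketched below. The font markers and the three-way OS split are hypothetical simplifications; a real model learns such pairings from observed traffic rather than from a hand-written table.

```python
# Hypothetical marker fonts per operating system, for illustration only.
OS_FONT_MARKERS = {
    "macos": {"Helvetica Neue", "Menlo"},
    "windows": {"Segoe UI"},
    "linux": {"DejaVu Sans", "Liberation Sans"},
}

def claimed_os(user_agent: str) -> str:
    """Very rough OS guess from the User-Agent string."""
    ua = user_agent.lower()
    if "mac os x" in ua:
        return "macos"
    if "windows nt" in ua:
        return "windows"
    return "linux"

def contradictions(profile: dict) -> list:
    """List pairings where the font set belongs to a different OS than the UA claims."""
    os_name = claimed_os(profile["user_agent"])
    fonts = set(profile["fonts"])
    return [
        f"{os_name} user-agent with {other} fonts"
        for other, markers in OS_FONT_MARKERS.items()
        if other != os_name and fonts & markers
    ]

# Each surface randomized on its own: a macOS UA ends up with a Linux font.
randomized = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "fonts": ["DejaVu Sans", "Arial"],
}
# Values that belong together.
coherent = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "fonts": ["Helvetica Neue", "Menlo", "Arial"],
}
```

The randomized profile is flagged for pairing a macOS user-agent with Linux fonts, while the coherent one passes. Each added surface (GPU, timezone, language) multiplies the ways independent randomization can contradict itself.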
### Runtime spoofing vs engine-level patching
The same fix can live at two different layers, and where it lives decides how well it holds up:

| | Runtime spoofing | Engine-level patching |
|---|---|---|
| **How it works** | JS injected at page load overrides properties / methods | C++ source of Chromium / Firefox is patched and rebuilt |
| **Examples** | puppeteer-extra-stealth, undetected-chromedriver, selenium-stealth | Camoufox, CloakBrowser, PatchRight (patches at Playwright source) |
| **Defeats toString check?** | No: the patch is a JS function, visible via Function.prototype.toString() | Yes: the override happens below the JS layer, so toString still returns "[native code]" |
| **Setup cost** | npm install | Binary download (Camoufox/CloakBrowser) or pip install (PatchRight) |
| **Maintenance** | Plugin updates as detections change | Tied to upstream Chromium/Firefox releases; weeks-to-months lag |

Runtime spoofing means injecting JavaScript when the page loads to override the values a site reads. It is the cheap starting point and works fine against simpler vendors (Cloudflare Bot Fight Mode, Imperva, AWS WAF Common). Engine-level patching means editing and recompiling the browser's own C++ source so the fix sits below JavaScript, where detection scripts cannot see it. That deeper approach is what you need for Kasada, recent Akamai, Cloudflare Bot Management Enterprise, PerimeterX, and F5 Shape — see the vendor cheatsheet for which deployments fall in which category.
### The real-profile-database approach
The hardest part of a consistent configuration isn't any individual signal — it's making them **coherent**, meaning they all fit together the way they would on one real machine. A browser claiming to be Chrome on Windows 11 with an NVIDIA renderer must also have the matching extension list, the matching AudioContext output for that OS, a timezone that matches the IP's location, and so on across dozens of signals. Spoofing each one by hand almost always produces a combination that doesn't add up.
The state-of-the-art fix is the **real-profile database**: collect tuples — bundles of values that belong together — of (UA, OS, GPU, audio, canvas, timezone, language, screen size, …) from real users at scale, then hand one whole tuple to each browser session. Camoufox bundles such a database (10k+ profiles); commercial anti-detect browsers like Multilogin and GoLogin maintain larger ones. Because each tuple was captured together from one real machine, every signal in it is automatically consistent.
The catch is novelty. Anti-bot vendors test against the same scraping tools and harvest their profile databases. A profile that's been published in Camoufox's corpus for six months may already be flagged. Refreshing the database is the real work — collecting profiles, rotating them out before they burn, and matching profile geography to proxy geography. This is why commercial anti-detect tools charge $50-200/month for the same idea Camoufox ships free: the operational cost of profile freshness, not the patching itself.
### Whole-profile rotation
The right thing to rotate is a complete device profile, not one value at a time: a coherent set of (UA, fonts, GPU, screen, timezone, languages, TLS) that matches a real class of device. Tools like Camoufox ship with ready-made profile pools. If you build your own rotation, the generator has to respect which values go together — for example, a Windows + Chrome profile always carries the same set of installed fonts, the same TLS ciphersuite order (the fixed sequence of encryption options the browser offers), and the same audio context hash range.
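A minimal sketch of whole-profile rotation follows, assuming a small hand-written pool. The profiles and field names are hypothetical; a real pool would hold complete captures from real devices. The point is only that a session maps to one complete profile and keeps it.

```python
import hashlib

# Hypothetical pool; each entry stands for a complete profile captured as one unit.
PROFILE_POOL = [
    {"os": "windows", "timezone": "America/New_York", "languages": ["en-US", "en"], "screen": (1920, 1080)},
    {"os": "macos", "timezone": "Europe/Berlin", "languages": ["de-DE", "de", "en"], "screen": (1440, 900)},
    {"os": "linux", "timezone": "Europe/Paris", "languages": ["fr-FR", "fr"], "screen": (1920, 1200)},
]

def profile_for_session(session_id: str) -> dict:
    """Pick one whole profile per session; the same session always gets the same one."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(PROFILE_POOL)
    return PROFILE_POOL[index]

first_call = profile_for_session("session-001")
second_call = profile_for_session("session-001")
```

Hashing the session ID keeps the choice stable for the life of the session without storing any state, and no individual value is ever swapped independently of the others.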
### What fingerprinting does not cover
A consistent static fingerprint is only one layer of how detection works. Behavioral signals — mouse movement, scroll velocity, how long you linger on a page — are judged separately. And IP reputation runs first: datacenter traffic is often handled differently before the page's JavaScript even loads. The static fingerprint is just one signal among several that systems weigh.
### Example
```python
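# Camoufox ships with whole-profile fingerprints, not per-surface randomization.
from camoufox.sync_api import Camoufox

with Camoufox(
    headless=False,
    humanize=True,
    # Camoufox is a Firefox fork, so the profile is chosen by OS; it generates
    # a matching Firefox fingerprint rather than a Chrome one.
    os='windows',
    # Placeholder proxy details in Playwright's proxy-dict form.
    proxy={'server': 'http://residential:port', 'username': 'user', 'password': 'pass'},
) as browser:
    page = browser.new_page()
    page.goto('https://target.com')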
```
### FAQ
**Q: Is consistent fingerprinting the same as stealth mode?**
Not quite — stealth tools are one approach. They adjust the known defaults in Puppeteer and Playwright. A consistent browser configuration is broader and also involves whole-profile rotation, behavioral realism, and matching transport-layer (network/TLS) fingerprints.
**Q: How often should I rotate fingerprints?**
Per session, not per request. A real user keeps the same fingerprint for an entire visit, so changing it mid-session is itself a tell.
**Q: Can I use a real Chrome profile instead of patching?**
Yes — driving real Chrome with a real user profile over CDP (the Chrome DevTools Protocol, the channel used to control the browser) avoids most patch-detection tells. The tradeoff is operational: managing real profiles at scale is hard, and you still need residential IPs and behavior emulation.
**Q: Should I always pick engine-level over runtime patching?**
No — runtime patching is cheaper to deploy and is enough against roughly 80% of targets. Decide by testing, not by reputation: start with runtime (undetected-chromedriver or puppeteer-stealth plus a residential IP), measure your block rate, and move up to engine-level only if blocks exceed your budget. Reaching for Camoufox or CloakBrowser on an unprotected site just burns extra compute.
**Q: Why don't the engine-level tools just ship every browser version?**
Forking Chromium or Firefox for every release is expensive. Camoufox tracks ESR Firefox; CloakBrowser tracks stable Chromium with a few weeks of lag. That lag is itself a fingerprint — a request claiming to be Chrome 134 from a tool actually running Chrome 131 has a mismatch between the User-Agent and the real engine, which sophisticated detection can catch.
---
## Anti-Bot Vendor Detection Cheatsheet
URL: https://scrappey.com/qa/anti-bot/anti-bot-vendor-detection-cheatsheet
**A useful first step when working with any protected site you are authorized to access is identifying which anti-bot vendor sits in front of it.** The vendor is the security product the site operator has deployed, and recognizing it explains a great deal about how the site behaves — which TLS profile a browser presents (TLS is the encryption layer behind https, and the "profile" is the fingerprint a client presents during the handshake), how trust accumulates across requests, and how each product structures its session state. This cheatsheet maps the six most common vendors to the cookie names, response headers, JavaScript file paths, and block signatures you can read off a single HTTP response. It is a reference for understanding what you are looking at, not an instruction set for circumventing any system.
### Quick facts
- **Vendors covered:** Akamai, Cloudflare, DataDome, PerimeterX, Kasada, F5 Shape
- **Detection time:** A single HTTP response is usually enough — sometimes the TLS handshake alone
- **Fastest signal:** Response Set-Cookie names (one regex)
- **Most ambiguous:** Cloudflare — present on ~20% of all sites, often with no Bot Management enabled
- **When this matters:** Understanding a site's architecture before integrating with it
### The cheatsheet
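Read this table top-to-bottom. The first row that matches the response usually identifies the primary bot-management vendor. Keep checking the remaining rows, though: a CDN-level product can sit in front of a second vendor on the same site, with Cloudflare at the edge plus DataDome for bot management being a common pairing.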

| Vendor | Cookies | Headers | JS / block signature |
|---|---|---|---|
| **Akamai Bot Manager** | _abck, bm_sz, ak_bmsc | Server: AkamaiGHost | Inlined ~512 KB sensor.js; block body *Pardon Our Interruption* on 412 |
| **Cloudflare Bot Management / Turnstile** | cf_clearance, __cf_bm | Server: cloudflare, cf-ray, cf-mitigated: challenge | /cdn-cgi/challenge-platform/ assets; Turnstile widget at challenges.cloudflare.com; Error 1015 on rate limit |
| **DataDome** | datadome, dd_cookie_test_* | x-datadome-cid, x-dd-b | JS at /js/datadome.js; WASM boring_challenge; CAPTCHA at geo.captcha-delivery.com |
| **PerimeterX (HUMAN)** | _px3, _pxhd, _pxde, _pxvid | x-px-* family | JS at /init.js served from client.px-cdn.net; Human Challenge press-and-hold widget |
| **Kasada** | x-kpsdk-ct, x-kpsdk-cd | x-kpsdk-* response headers | Polymorphic ips.js (renamed per deployment); silent 403 / 429 with no challenge UI |
| **F5 Shape Security** | reese84, TS* | Custom TS* set-cookies | Custom JS VM bytecode; $rsc= URL params; minute-cadence token rotation |

### Identification workflow on a single response
The cheapest reliable detector is a regex over the Set-Cookie headers (the headers where the server hands you cookies), with the Server header as a tiebreaker for Cloudflare. Work through these in order:
- **Cookies first.** Match the cookie names above against the response Set-Cookie headers. Akamai, DataDome, PerimeterX, Kasada, and F5 Shape all set distinctive names on the very first response, so this alone usually identifies the vendor.
- **Server header second.** Server: cloudflare plus cf-ray confirms Cloudflare is in front, but a Cloudflare site with Bot Management turned off looks identical to one with it on. Look for cf-mitigated or a Turnstile script tag to tell the two apart.
- **HTML body third.** If you got an HTML response, search the script src attributes: sensor.js (Akamai), /cdn-cgi/challenge-platform/ (Cloudflare), captcha-delivery.com (DataDome), px-cdn.net (PerimeterX), challenges.cloudflare.com (Turnstile).
- **Block body fourth.** Once you are blocked, the page itself is diagnostic — *Pardon Our Interruption* is Akamai, *Just a moment…* is Cloudflare, and the *captcha-delivery.com* iframe is DataDome.
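Steps three and four of the workflow above can be sketched as a plain substring search over the response body. The marker strings come from the list above; the function name, the label strings, and the first-match-wins ordering are illustrative choices.

```python
from typing import Optional

# Script-source markers (step three) and block-page markers (step four).
SCRIPT_MARKERS = [
    ("akamai", "sensor.js"),
    ("cloudflare", "/cdn-cgi/challenge-platform/"),
    ("cloudflare_turnstile", "challenges.cloudflare.com"),
    ("datadome", "captcha-delivery.com"),
    ("perimeterx", "px-cdn.net"),
]
BLOCK_PAGE_MARKERS = [
    ("akamai", "Pardon Our Interruption"),
    ("cloudflare", "Just a moment"),
]

def vendor_from_body(html: str) -> Optional[str]:
    """Return the first vendor whose marker appears in the HTML, or None."""
    for vendor, marker in SCRIPT_MARKERS + BLOCK_PAGE_MARKERS:
        if marker in html:
            return vendor
    return None

px_page = '<script src="https://client.px-cdn.net/init.js"></script>'
cf_block = "<title>Just a moment...</title>"
plain_page = "<p>hello</p>"
```

This is a fallback for responses where the cookie and header checks were inconclusive. A substring match on something as generic as sensor.js can produce false positives, so treat a body-only match as a hint rather than a confirmation.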
### Cookie state machines that matter
Four cookies do more than just mark a vendor's presence: they carry session state (a status the cookie tracks as your session progresses) that the site's background requests check on every call:
- **_abck (Akamai)** — its state field reads ~-1~ on first contact (you are untrusted) and flips to ~0~ (trusted) only after sensor.js POSTs valid signals to /_bm/data. While the cookie is still ~-1~, protected endpoints return **412 Pardon Our Interruption** no matter how good your TLS or IP is.
- **_px3 (PerimeterX)** — carries a risk score baked into the cookie. Pre-validated cookies are sold on grey-market resellers because a clean _px3 is worth more than a clean IP.
- **cf_clearance (Cloudflare)** — issued after you pass a challenge, and tied to your IP plus User-Agent. Change either one and it stops working.
- **reese84 (F5 Shape)** — rotates roughly every minute, so long-lived sessions require constant token re-acquisition — one reason in-house integrations against Shape are costly to maintain at scale.
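As a small illustration of reading that state, the sketch below classifies an _abck value by its state field. It assumes the state is the second tilde-delimited field, which is an assumption about the cookie layout rather than a documented format, and the sample values are placeholders, not real cookies.

```python
def abck_state(cookie_value: str) -> str:
    """Classify an _abck value by its second '~'-delimited field (assumed layout)."""
    fields = cookie_value.split("~")
    if len(fields) < 2:
        return "unknown"
    return {"-1": "untrusted", "0": "trusted"}.get(fields[1], "unknown")

# Placeholder values shaped like the states described above.
first_contact = "PLACEHOLDER~-1~REST"
after_sensor = "PLACEHOLDER~0~REST~-1~"
```

Checking this before calling a protected endpoint on a site you are permitted to access explains a 412 immediately: if the state is still untrusted, the request is refused no matter what else looks right.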
### How the vendors differ architecturally
Knowing the vendor often tells you more about how a site will behave than the site's own design does. Each product has a distinct architecture worth understanding when you integrate with a service you are permitted to access:
- **Akamai** — frequently deployed on the web front-end while a brand's mobile API uses a simpler architecture; the web tier leans heavily on TLS-handshake and behavioural signals.
- **Cloudflare** — a CDN with optional Bot Management and Turnstile layers; the same hostname can range from no bot product at all to full ML scoring, which is why the tiebreaker step matters.
- **DataDome** — scores every request independently rather than building trust across a session, so IP reputation weighs heavily in its model. Some sites also embed data in __NEXT_DATA__ in the initial HTML.
- **PerimeterX (HUMAN)** — reputation is shared across all of its customer sites, so a single fingerprint signal is evaluated network-wide rather than per-site.
- **Kasada** — inspects client code with Function.prototype.toString(), which is why runtime JS patching is detectable and source-level approaches behave differently.
- **F5 Shape** — a custom JS VM with minute-by-minute token rotation, the most engineering-intensive product to integrate against, which is why managed APIs are common for it.
### Example
```python
# Minimal vendor detector: point it at any URL and read off the result
import re

from curl_cffi import requests

VENDOR_COOKIES = {
    "akamai": re.compile(r"\b(_abck|bm_sz|ak_bmsc)="),
    "cloudflare": re.compile(r"\b(cf_clearance|__cf_bm)="),
    "datadome": re.compile(r"\bdatadome="),
    "perimeterx": re.compile(r"\b_px[a-z]*="),
    "kasada": re.compile(r"\bx-kpsdk-"),
    "f5_shape": re.compile(r"\b(reese84|TS[0-9a-f]+)="),
}


def detect_vendor(url: str) -> str:
    r = requests.get(url, impersonate="chrome131", allow_redirects=True)
    blob = "\n".join(r.headers.get_list("set-cookie")) if hasattr(r.headers, "get_list") else str(r.headers)
    for vendor, pat in VENDOR_COOKIES.items():
        if pat.search(blob):
            return vendor
    if r.headers.get("server", "").lower() == "cloudflare":
        return "cloudflare_no_bm"  # Cloudflare CDN, Bot Management not enabled
    return "none_detected"


print(detect_vendor("https://example.com/"))
```
### FAQ
**Q: Can two anti-bot vendors stack on the same hostname?**
Almost never on the same path. A site might run Cloudflare as its CDN while running DataDome on its API subdomain, but any single response will only carry one vendor's cookies. If a regex matches multiple rows of the cheatsheet, the request was probably redirected — follow the redirects and re-check the final response.
**Q: Why does Cloudflare need a tiebreaker step?**
Roughly 20% of the public internet sits behind Cloudflare's CDN, but only a fraction has Bot Management enabled. The cf_clearance cookie and cf-mitigated header only appear once a challenge has fired. A plain cf-ray header with no challenge assets in the HTML means the CDN is present but the bot product is not.
**Q: Is the cookie name enough, or do I need to inspect the body too?**
Cookie names alone are enough for routing decisions — which TLS profile and which proxy type to use. You only need to inspect the body when you want to tell bot-management-on from bot-management-off (Cloudflare), or when the response is already a block page and you need to know which kind of challenge to solve.
**Q: How often do these signatures change?**
The cookie names above have been stable for years across all six vendors — they are interfaces that the vendors' own customer-side code depends on, so they cannot be rotated cheaply. The JavaScript file names and block-page text change occasionally (Kasada's ips.js is the most aggressive, renamed per deployment), but the cookie surface is durable.
---
## What Is Cloudflare Bot Management?
URL: https://scrappey.com/qa/anti-bot/what-is-cloudflare-bot-management
**Cloudflare Bot Management is the enterprise-tier ML scoring system Cloudflare runs on every request to a protected zone.** In plain terms: it watches each incoming request and uses machine learning (ML) to guess whether a human or a script sent it. Unlike *Turnstile* — a friction-light CAPTCHA widget that fires on specific endpoints — Bot Management scores *every* request silently and emits a Bot Score from 1 (definitely bot) to 99 (definitely human) that customer-side Workers and rules can act on. It sits behind roughly 20% of all internet traffic and trains its model on the entire Cloudflare network.
### Quick facts
- **Tier:** Enterprise add-on (Bot Fight Mode is the free downgrade)
- **Detection cookie:** __cf_bm (rotates ~30 min), cf_clearance (after passed challenge)
- **Response header:** cf-ray on every response; cf-mitigated: challenge when blocked
- **Output:** Bot Score 1–99 + verified-bot label (e.g. Googlebot) on every request
- **Network advantage:** Trained on global Cloudflare traffic — fingerprint burns are network-wide
### How Bot Management scores a request
Cloudflare calculates the Bot Score at its edge — the network of servers between the visitor and the website — before the origin server (the customer's actual backend) ever sees the request. The inputs are the JA4 TLS fingerprint (a signature of how the client opens an encrypted https connection), the HTTP/2 SETTINGS frame (low-level connection settings that often give away automation tools), IP reputation, ASN type (the kind of network the IP belongs to, e.g. a datacenter vs. a home ISP), request rate patterns, and — when a JavaScript challenge has fired previously — the __cf_bm cookie carrying the result. The score is exposed to the customer via the cf.bot_management.score field in Workers and in firewall rules.
The customer decides what to do with the score. A common setup is *block under 30, challenge 30–60, allow above 60*, with allowlists for verified bots (Googlebot, Bingbot — Cloudflare maintains the list and labels them). Because the model is shared across the whole network, a scraper that fingerprints as Bot Score 12 will get blocked or challenged on every protected site at once.
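The threshold setup described above can be written out as a small decision function. This is a sketch in Python for readability only: real customers express the same logic in Cloudflare's rules or Workers against cf.bot_management.score, and the function name `decide` and its return labels are hypothetical.

```python
def decide(score: int, verified_bot: bool = False) -> str:
    """Apply the 'block under 30, challenge 30-60, allow above 60' policy."""
    if verified_bot:
        return "allow"  # allowlisted crawlers skip the score check
    if score < 30:
        return "block"
    if score <= 60:
        return "challenge"
    return "allow"
```

Under this policy a client scoring 12 is blocked outright, while one scoring 45 is shown a challenge.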
### Bot Management vs Turnstile vs Bot Fight Mode
Three Cloudflare products are easy to mix up, so here is how they differ:
| Product | Tier | What it does |
| --- | --- | --- |
| **Bot Fight Mode** | Free | Blunt heuristic block of known datacenter / cloud IPs. Easy to identify (blocks before JS runs) and the lightest of the three layers; a residential proxy changes the IP reputation it keys on. |
| **Bot Management** | Enterprise | Continuous ML scoring per request. Authorized automation needs a coherent, consistent browser configuration across all four detection layers to score as a normal client. |
| **Turnstile** | Free / managed | A widget you embed on a specific endpoint (login, signup). Issues cf_clearance on solve. Can be invoked by Bot Management as a challenge. |
A single protected site can run all three in layers: Bot Fight Mode catches the cheap traffic, Bot Management scores the rest, and Turnstile is shown when the score is borderline.
### How different clients score
**Scores as a bot:** Python requests with a Chrome User-Agent scores around 3, because the header claims Chrome but the connection fingerprint doesn't match. Playwright with default settings also scores low (around 15), because CDP (the Chrome DevTools Protocol it uses to drive the browser) leaks detectable signals. Datacenter proxies of any flavour, and residential proxies with a mismatched timezone, score poorly too.
**For authorized automation behind the free tier (Bot Fight Mode):** curl_cffi with Chrome impersonation (it copies Chrome's real TLS fingerprint) plus a residential proxy generally presents a consistent client.
**For authorized automation behind Bot Management:** a real browser such as Camoufox or CloakBrowser with a clean residential or ISP IP, a matched Accept-Language header, and patient request pacing keeps the configuration coherent. High-volume workflows on sites you are permitted to access often route through a managed API. The tell-tale sign that Bot Management (not the free tier) is in front of you is the cf-mitigated header on a block — Bot Fight Mode blocks return a plain 403 with no cf-mitigated.
### Example
```python
# Detecting which Cloudflare layer is in front of you (use on sites you are permitted to access)
from curl_cffi import requests
s = requests.Session(impersonate="chrome131")
proxies = {"https": "http://user:pass@residential:port"}
r = s.get("https://target.com/", proxies=proxies)
# Inspect to confirm what you're facing
cf_ray = r.headers.get("cf-ray")
cf_mitigated = r.headers.get("cf-mitigated")
print(f"cf-ray: {cf_ray}") # Present = behind Cloudflare
print(f"cf-mitigated: {cf_mitigated}") # Present = Bot Management active
print(f"status: {r.status_code}")
```
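The header checks can also be folded into a pure function, which is easier to unit-test than a live request. This is a sketch under the heuristics stated in this section (cf-ray means Cloudflare is present, cf-mitigated on a block means Bot Management, a plain 403 without it suggests Bot Fight Mode); the function name `classify_cloudflare` and its labels are hypothetical.

```python
def classify_cloudflare(status: int, headers: dict) -> str:
    """Guess which Cloudflare layer produced a response."""
    h = {k.lower(): v for k, v in headers.items()}
    if "cf-ray" not in h:
        return "not_cloudflare"
    if "cf-mitigated" in h:
        return "bot_management"
    if status == 403:
        # A plain 403 can also come from a custom firewall rule.
        return "bot_fight_mode_or_custom_rule"
    return "cloudflare_no_block_seen"
```

A 200 response with only cf-ray says nothing about whether Bot Management is enabled, so the last label is deliberately noncommittal.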
### FAQ
**Q: How is the Bot Score actually computed?**
Cloudflare doesn't publish the model, but the inputs are public: the JA4 TLS fingerprint, HTTP/2 framing, IP and ASN reputation, request cadence, the __cf_bm cookie if one was issued earlier, and prior interactions with the Cloudflare network. The score is recalculated on every request, so a long-lived session can drift up or down over time.
**Q: Can I see my own Bot Score on a Cloudflare-protected site?**
Only if the site chooses to expose it — some debug pages do, via the cf-bot-score response header when it's enabled. Otherwise, inspect the cf-mitigated header on blocked requests: if it's present, Bot Management decided to mitigate (block or challenge) you.
**Q: Is Bot Fight Mode enough of a signal to detect "Cloudflare without Bot Management"?**
Yes. Bot Fight Mode blocks datacenter ASNs (datacenter networks) at the edge with a generic 403 — no cf-mitigated header, no JS challenge, no Turnstile. If you can get through with just a clean residential IP and curl_cffi, you are facing Bot Fight Mode, not the heavier Bot Management.
**Q: Does the verified-bot list cover non-search engines?**
Yes. Cloudflare's verified-bot category includes major search engines, monitoring services (UptimeRobot, Pingdom), social media link previewers, and some AI training crawlers (GPTBot, Claude-Web). Cloudflare maintains this list itself, and customers cannot add to it directly.
---
## What Is Imperva Incapsula?
URL: https://scrappey.com/qa/anti-bot/what-is-imperva-incapsula
**Imperva Incapsula is the enterprise WAF and bot-protection product from Imperva** (acquired by Thales in 2023). A WAF (web application firewall) filters incoming web traffic to block attacks. Incapsula is heavily deployed across banking, healthcare, government, and B2B SaaS — sectors that adopted WAFs before the modern bot-management category existed. Its detection leans mostly on Layer 1 checks (TLS fingerprint plus IP reputation — TLS is the encryption layer behind https) and a lightweight Layer 2 JavaScript challenge. That older design is simpler than newer behavioural systems, which is relevant when integrating with a service you are authorized to access.
### Quick facts
- **Detection cookies:** incap_ses_*, visid_incap_*, nlbi_*
- **Response header:** X-Iinfo (4-segment debug info), X-CDN: Incapsula
- **Common sectors:** Banking, healthcare, government, enterprise SaaS
- **Challenge style:** iframe-loaded "Request unsuccessful" page with reload script
- **Architecture:** Older two-layer design — TLS/IP checks plus a lightweight JS challenge
### How Incapsula scores a request
Incapsula checks a visitor in two stages. **Layer 1** runs before any code executes: it inspects your TLS fingerprint (the signature pattern your encryption handshake produces), your IP and ASN reputation (how trustworthy your network is), how fast you are sending requests, and your static User-Agent against a known-bot blocklist. A datacenter IP or an obviously-scraper UA gets blocked right here, before any JavaScript runs. **Layer 2** is a lightweight JavaScript challenge served in an iframe with the message *"Request unsuccessful. Incapsula incident ID: …"* — the script sets the incap_ses_* cookie after running and then reloads the page. Once you hold that cookie, later requests pass.
The X-Iinfo response header carries a 4-segment debug code (e.g. 8-12345678-12345678 NNNN RT(...)) that reveals which security policy fired. It is handy for debugging, but it is also a dead giveaway that you are behind Incapsula — no other CDN emits this header.
### How the two layers behave
**Layer 1 (TLS and IP).** Because the first layer reads the client's TLS fingerprint and IP reputation, a datacenter IP or a default Python requests TLS profile (whose JA3 fingerprint — a hash of the TLS handshake — does not match a real browser) is filtered before any JavaScript runs. The incap_ses_* cookie is also bound to a single IP, so it is not portable across addresses.
**Layer 2 (JS challenge).** When a deployment serves the lightweight JavaScript challenge, a real browser session satisfies it; there is no behavioural ML (machine-learning scoring of mouse and timing patterns) involved, which is one way Incapsula differs from newer products.
Compared with DataDome or Akamai, Incapsula's infrastructure is older and its model is simpler. Many deployments pair the WAF with an aggressive request-rate rule, so when you are authorized to access a service, request pacing matters as much as a consistent browser configuration.
### Telling Incapsula apart from generic WAFs
The X-Iinfo header alone identifies Incapsula. Even when it isn't visible, an incap_ses_* or visid_incap_* name on a Set-Cookie is diagnostic — these cookie names are unique to Incapsula and have been stable for years. The *"Request unsuccessful. Incapsula incident ID"* block page is a third tell. Older deployments also expose X-CDN: Incapsula.
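The tells listed above can be collected in one pass. The sketch below is illustrative only: the function name `incapsula_tells` and the label strings are hypothetical, and the caller is assumed to have already extracted the cookie names from the Set-Cookie headers.

```python
def incapsula_tells(headers: dict, cookie_names: list, body: str = "") -> list:
    """Return the Incapsula indicators found in one response."""
    h = {k.lower(): v for k, v in headers.items()}
    tells = []
    if "x-iinfo" in h:
        tells.append("x-iinfo header")
    if h.get("x-cdn", "").lower() == "incapsula":
        tells.append("x-cdn header")
    if any(n.startswith(("incap_ses_", "visid_incap_")) for n in cookie_names):
        tells.append("incapsula cookie")
    if "incapsula incident id" in body.lower():
        tells.append("block page")
    return tells
```

Any single tell is enough to identify the vendor; returning the full list simply makes it easier to see which signals a given deployment exposes.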
### Example
```python
# Detecting Incapsula from a response (use only on sites you are permitted to access)
from curl_cffi import requests
s = requests.Session(impersonate="chrome131")
proxies = {"https": "http://user:pass@residential:port"}
r = s.get("https://target.com/api/data", proxies=proxies)
# X-Iinfo is the giveaway header
if "x-iinfo" in r.headers:
    print(f"Incapsula confirmed: {r.headers['x-iinfo']}")
print(f"status: {r.status_code}, bytes: {len(r.text)}")
```
### FAQ
**Q: Is Incapsula the same as Imperva WAF?**
Incapsula is the cloud-hosted product; Imperva WAF historically referred to the on-prem appliance (software running on the customer's own hardware). Since Imperva consolidated its branding around 2020, the names are used interchangeably, and the cookie signatures are identical.
**Q: Why is Incapsula's architecture simpler than Akamai or DataDome?**
It predates the modern bot-management category. There is no behavioural ML, no WASM challenge (a heavier check compiled to WebAssembly), and no multi-request trust accumulation (where trust is built up over several requests). Layer 1 — TLS plus IP — is the bulk of the detection, which is why its design feels closer to a traditional WAF than to newer scoring systems.
**Q: What does the X-Iinfo header tell me?**
It is a 4-segment debug code containing the request type, account ID, the policy that fired, and round-trip metrics. Site owners use it to debug false positives. For scrapers it is mainly useful as a vendor identifier and to confirm whether the policy that triggered is a rate-limit (shown as RT) or a fingerprint check (shown as NNNN).
---
## What Is AWS WAF Bot Control?
URL: https://scrappey.com/qa/anti-bot/what-is-aws-waf-bot-control
**AWS WAF Bot Control is a ready-made set of rules inside AWS WAF (Amazon's web application firewall, the security layer that filters traffic before it reaches a site) that detects and blocks bot traffic.** It comes in two tiers: *Common* (signature-based: it blocks known crawlers and bots that identify themselves) and *Targeted* (which adds a JavaScript / CAPTCHA challenge, groups requests by IP to spot abuse, and applies TGT_* labels). Many sites use it because it's a one-click enable on any AWS-fronted site, but its detection is considerably weaker than DataDome's or Akamai's.
### Quick facts
- **Tiers:** Common (signature) and Targeted (challenge + behaviour)
- **Detection cookie:** aws-waf-token (only on Targeted with challenge action)
- **Response header:** x-amzn-waf-action when challenge fires; x-amz-cf-id on CloudFront
- **Labels emitted:** awswaf:managed:aws:bot-control:* on classified requests
- **Detection strength:** Light (Common) to moderate (Targeted with challenge)
### Common vs Targeted — the two tiers
**Common** checks each request against a fixed list of tell-tale signs: known crawler User-Agents (the string a client uses to identify itself), a missing Accept-Language header, scripting-engine UAs, and datacenter ASNs (the network blocks that cloud servers live in, as opposed to home internet). It blocks roughly the same traffic as Cloudflare's Bot Fight Mode. curl_cffi with Chrome impersonation presents a consistent client to the Common tier, because its UA, TLS (the encryption layer behind https), and headers all look like a real browser.
**Targeted** adds a Silent Challenge (a small piece of JavaScript that hands out an aws-waf-token) and a CAPTCHA Challenge action. When it's set to *challenge* rather than *block*, a request with no token gets a 202 response carrying an x-amzn-waf-action: challenge header plus an HTML page that runs the WAF challenge script; the CAPTCHA action instead returns a 405 with x-amzn-waf-action: captcha. Targeted also counts requests per session token to catch ones coming in too fast.
### How AWS labels classified requests
Unlike Cloudflare, AWS WAF doesn't give each request a 1–99 score. Instead it attaches **labels** (short tags describing what it thinks the request is) such as awswaf:managed:aws:bot-control:bot:category:scraping_framework or awswaf:managed:aws:bot-control:signal:automated_browser. The site owner then writes rules that act on those labels (block, challenge, or just count). This makes Bot Control more lenient by default than other vendors: a labelled request is only blocked if the owner actually added a rule for it, so many AWS-protected sites let through traffic that Cloudflare or Akamai would reject.
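The label-then-rule model can be illustrated with a toy evaluator. This is a simplified sketch, not AWS WAF's actual rule engine: the rule list `OWNER_RULES`, the function `action_for`, and the first-match-wins ordering are all assumptions made for illustration, and only the two label names come from the text above.

```python
PREFIX = "awswaf:managed:aws:bot-control:"

# Hypothetical owner configuration: rules are checked in order, first match wins.
OWNER_RULES = [
    (PREFIX + "bot:category:scraping_framework", "block"),
    (PREFIX + "signal:automated_browser", "challenge"),
]


def action_for(labels: set) -> str:
    """Return the action of the first rule whose label is present, else 'allow'."""
    for label, action in OWNER_RULES:
        if label in labels:
            return action
    return "allow"  # labelled but no matching rule: the request passes
```

The final `return "allow"` is the point of the example: a request can carry a bot label and still pass, because no rule was written for that label.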
### How different clients are handled (on sites you are permitted to access)
**Common tier:** any modern impersonation library (curl_cffi, tls-client, hrequests) plus a non-datacenter IP presents a consistent client. The signature list is short and well-known.
**Targeted tier:** if the owner chose *challenge*, a real-browser session (Playwright, Camoufox) completes the challenge once and then reuses the aws-waf-token cookie for later requests — the token stays valid for a while (~5 min by default, configurable). If the owner chose *block*, there's no challenge to complete, so a coherent, consistent browser configuration matters — broadly the same considerations as Akamai, but against a much weaker scoring model.
### Example
```python
# AWS WAF Common: curl_cffi alone usually passes
from curl_cffi import requests
s = requests.Session(impersonate="chrome131")
r = s.get("https://target.com/api/items")
# Check for AWS WAF challenge
if r.headers.get("x-amzn-waf-action") == "challenge":
    print("Targeted tier with challenge — switch to a real browser")
elif r.status_code == 403:
    print("Common tier blocked — check IP and User-Agent")
else:
    print(f"OK: {r.status_code}")
```
### FAQ
**Q: Is AWS WAF Bot Control as strong as Cloudflare Bot Management?**
No. The signature-based Common tier is about as strong as Cloudflare's free Bot Fight Mode. The Targeted tier adds JS challenges but lacks the continuous machine-learning scoring and the global cross-site data that Cloudflare's enterprise product draws on. Most AWS-WAF-protected sites are not the hardest targets.
**Q: How do I tell Common from Targeted from a single response?**
Common blocks with a plain 403 and no extra headers. Targeted set to challenge returns a 202 plus an x-amzn-waf-action: challenge header. Targeted set to block returns a 403, but its Set-Cookie header usually contains an aws-waf-token left over from a previous interaction.
**Q: Do the awswaf:* labels appear in HTTP responses?**
No — the labels live inside AWS WAF and are only visible in CloudWatch logs to the site owner, so scrapers never see them. What you can observe is the action the owner's rule took based on those labels (block, challenge, or allow).
---
## What Is Forter?
URL: https://scrappey.com/qa/anti-bot/what-is-forter
**Forter is a fraud-and-trust platform that runs at e-commerce checkout — it is not a traditional anti-bot product.** Instead of blocking scrapers from reading pages, it scores each transaction for fraud risk (fake identities, hijacked accounts, payment abuse) and instantly tells the store whether to approve or decline the order. Most scrapers never meet it, because Forter only fires at *checkout*, not on product pages. If you are monitoring prices or pulling catalog data, you can ignore it. But if you automate account sign-ups, checkout, or refunds, you will hit Forter on the second request and be quietly turned away.
### Quick facts
- **Category:** Fraud / identity — not pure bot protection
- **Where it fires:** Checkout, account creation, payment, post-purchase
- **Detection cookies:** fortertoken, ftr_blst_* (device-identity blob)
- **Decision style:** Real-time approve / decline returned to merchant via API
- **Visibility to scrapers:** Silent — failed requests look like merchant-side payment failures
### What Forter sees and what it decides
Forter is called by the store's checkout backend, not by the CDN (the network of edge servers that usually sits in front of a site). When a shopper clicks "Place Order", the store sends Forter the cart, payment details, a device fingerprint, and a fortertoken. Within a few hundred milliseconds Forter answers: approve, decline, or review. The piece that matters for automation is the device fingerprint — a snapshot of the browser collected by a Forter JavaScript SDK (a script the checkout page loads) that includes canvas/WebGL fingerprints (tiny rendering quirks unique to your graphics hardware), your accept-language header, timezone, and a hardware-tied identity blob.
The key point: Forter scores **identity**, not just the **session**. A clean fingerprint is not enough — the identity also has to look real, meaning the IP location matches the billing address, the account has history, and the payment card has a good reputation. That is why scrapers that solve every CAPTCHA still get declined at checkout: the fingerprint says "human" but the identity says "made up".
### When scrapers actually encounter Forter
Scrapers doing pure data extraction — price monitoring, listings, reviews — never see Forter; those pages don't call it. You only run into it when you automate an action that touches money or identity:
- **Automated checkout flows:** Sneaker bots, ticketing bots, retail arbitrage
- **Account creation at scale:** Bulk sign-ups from the same fingerprint or IP range
- **Coupon & promo-code redemption:** One-per-customer offers claimed in volume
- **Returns & refunds automation:** Programmatic refund or chargeback flows
**The failure is silent.** Checkout just says *"Payment declined, please try a different card"*, identical to an ordinary card rejection. The real reason is that Forter declined the transaction on **identity** grounds, not because anything was wrong with the card.
### What works against Forter
Hardening your browser fingerprint alone won't help, because Forter scores identity, not fingerprints. The countermeasures are operational, not technical: match the billing address to the IP location, use payment cards with a clean track record, let accounts age and build up real activity before checking out, and avoid the velocity patterns (for example, 10 accounts in 10 minutes from the same IP block) that Forter watches for. Importantly, Forter shares decline signals across every store that uses it — an identity declined at one Forter customer is flagged at all the others.
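To make the velocity idea concrete, here is a toy sliding-window counter of the kind a fraud system might use. Forter's real logic is not public, so everything here (the class name `VelocityCounter`, the per-IP-block key, the default limit of 10 events in 600 seconds taken from the example above) is an illustrative assumption.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Flag a key once it reaches `limit` events within `window_seconds`."""

    def __init__(self, limit: int = 10, window_seconds: float = 600.0):
        self.limit = limit
        self.window = window_seconds
        self._events = defaultdict(deque)

    def record(self, key: str, ts: float) -> bool:
        """Record one event at time `ts`; return True if the key is now over the limit."""
        q = self._events[key]
        q.append(ts)
        # Drop events that have fallen out of the window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q) >= self.limit
```

Ten sign-ups a minute apart from one IP block trip the counter on the tenth call, while the same ten spread over several hours never do.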
### FAQ
**Q: Is Forter an anti-bot product?**
No — it's an anti-fraud / identity-trust product. It doesn't stop scrapers from reading pages. It only fires at checkout, where it decides whether to approve a transaction. Scrapers doing plain data extraction can ignore it.
**Q: Why does Forter show up in anti-bot articles then?**
Because the line between "bot" and "fraud" blurs at checkout. Sneaker-bot operators, ticket scalpers, and retail-arbitrage automation all hit Forter on the second request, and the usual page-level anti-bot tricks (fingerprint hardening, residential IPs) aren't enough — Forter scores identity, not session.
**Q: Can I detect Forter on a target site before checkout?**
Yes — check the checkout page for a script tag loading from cdn.forter.com, or look for a fortertoken cookie after you submit cart details. If either is present, expect identity scoring when you place the order.
---
## What Is Riskified?
URL: https://scrappey.com/qa/anti-bot/what-is-riskified
**Riskified is a chargeback-guarantee platform for e-commerce checkout.** A chargeback is the money a merchant loses when a customer disputes a charge. Merchants pay Riskified a per-transaction fee, and in return Riskified takes on the chargeback liability for any transaction it approves — if an order it approved turns out to be fraud, Riskified pays, not the store. Like *Forter*, it is an anti-fraud product, not an anti-bot product — but checkout-automation scrapers (sneakers, tickets, limited drops) hit it on every order. Its decision model leans heavily on behavioural and transactional signals: the device fingerprint (a profile of your browser and hardware), prior transaction history, whether billing and shipping addresses match, and velocity patterns (how fast orders pile up) seen across the whole Riskified network.
### Quick facts
- **Category:** Fraud / chargeback guarantee — not pure bot protection
- **Where it fires:** Order submission at checkout
- **Detection signals:** Device fingerprint, IP, billing/shipping match, network-wide transaction history
- **Decision style:** Approve / decline with chargeback guarantee on approve
- **Common merchants:** Limited-drop apparel, ticketing, electronics, luxury
### How Riskified differs from a typical fraud check
Most merchants run their own internal fraud rules and eat the chargeback losses themselves. Riskified flips this around: the merchant sends every order to Riskified for a decision, and Riskified pays the chargeback if any approved order turns out to be fraud. Because false approvals cost Riskified money directly, it has a strong incentive to be aggressive on borderline cases.
In practice, that means Riskified-protected merchants **decline orders at the slightest pattern anomaly**: a residential IP in one state with a billing address in another, a freshly-created account, or a payment instrument with no history on the Riskified network. Real customers get caught by this regularly, which is why "my order was declined" support tickets are so common at Riskified-merchant sites.
### When scrapers encounter Riskified
Same answer as Forter: pure data extraction is unaffected, but anything that automates checkout is in scope. Common scenarios:
- Sneaker / streetwear limited-drop automation
- Concert and sports ticket purchasing
- Limited electronics releases (consoles, GPUs)
- Luxury goods resale flipping
The failure looks the same as Forter: a silent decline at the payment step with a "please try again" message.
### What works against Riskified
What helps here is operational, not technical. Aged accounts with organic activity (orders, returns, browsing) score higher than fresh accounts, no matter how clean the fingerprint. Billing and shipping addresses must match the IP geolocation. Payment instruments with a clean history at any Riskified-network merchant carry that good reputation over. **The single biggest signal Riskified uses is the network effect**: a card or device that was declined at one Riskified merchant gets a worse score at every other Riskified merchant, so burning an identity is permanent.
### FAQ
**Q: Is Riskified easier or harder than Forter?**
They are different enough that the comparison isn't direct. Riskified leans more on transactional history (payment-instrument reputation, billing patterns), while Forter leans more on device identity. Sites running both — increasingly common — combine the strengths of each.
**Q: Can I tell from outside whether a site uses Riskified?**
Sometimes. Riskified ships a beacon JS file (a small tracking script) that loads on checkout pages, so look for script tags from beacon.riskified.com. Some merchants also expose a Riskified order ID in the order-confirmation page source.
**Q: Does Riskified score every request like Bot Management does?**
No. Riskified is called synchronously when an order is submitted — meaning the checkout waits for its answer — returns a decision in a few hundred milliseconds, and stays silent otherwise. A merchant can configure pre-checkout calls (scoring at the cart stage), but most don't.
---
## What Is WebGL Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-webgl-fingerprinting
**WebGL fingerprinting reads identifying information directly from the GPU.** WebGL is the browser feature that lets web pages draw 3D graphics using your graphics card, and a fingerprinting script asks it questions that real hardware answers in revealing ways. The browser exposes the graphics card vendor and renderer string (via WEBGL_debug_renderer_info), the list of supported WebGL extensions, the maximum texture size, the precision of shader operations, and - most distinctively - the pixel output of a known render operation (drawing a fixed image and comparing the exact pixels that come out). The combination is roughly tied to your hardware and survives most user-agent spoofing (faking the browser's identity string), so it keeps pointing at the same machine.
### Quick facts
- **Reads:** GPU vendor + renderer string, ~70 extension flags, render output hash
- **Distinct from Canvas:** WebGL probes the GPU directly; Canvas probes the 2D drawing pipeline
- **Common bot tell:** SwiftShader renderer (Google Inc. SwiftShader) = headless Chrome with no GPU
- **Block-grade signal:** Renderer mismatched to platform — e.g. NVIDIA on macOS, llvmpipe on Windows
- **Defeats:** Vanilla headless Chrome, default Playwright/Puppeteer
### What WebGL exposes
The browser API exposes four classes of identifying data:
- **Renderer string** — the GPU's name, read via gl.getParameter(ext.UNMASKED_RENDERER_WEBGL), where ext is the WEBGL_debug_renderer_info extension (plain gl.RENDERER returns a masked, generic value). Returns strings like "ANGLE (NVIDIA, NVIDIA GeForce RTX 4070 Direct3D11 vs_5_0 ps_5_0, D3D11)" on Windows or "Apple GPU" on macOS.
- **Extension list** — about 70 named extensions (EXT_color_buffer_float, OES_texture_float_linear, etc.), the optional features your GPU and driver support. The exact set varies by GPU model and driver version.
- **Parameter values** — limits the hardware reports, such as max texture size (4K/8K/16K), max viewport dimensions, max vertex attributes, and fragment shader precision (how accurate the GPU's per-pixel maths is). Each varies by hardware tier.
- **Rendered output** — draw a known shape with a known shader (a small GPU program) and hash the pixels read back from the canvas into a short ID. Different GPUs produce subtly different floating-point rounding, anti-aliasing (edge smoothing), and gradient interpolation, so the ID is steady but machine-specific.
### Why headless browsers fail WebGL the hardest
A headless browser runs with no visible window, the usual setup for automation. Headless Chrome without a GPU falls back to **SwiftShader**, Google's software rasterizer - a stand-in that draws graphics on the CPU instead of a real GPU. The renderer string is then "Google Inc. SwiftShader" or "Google SwiftShader" — anti-bot vendors block this string unconditionally because no real desktop user has SwiftShader as their primary GPU. The fallback chain is even worse on Linux: llvmpipe (Mesa's software rasterizer) is an instant tell.
Running headless Chrome with --use-gl=angle --use-angle=swiftshader-webgl still produces a SwiftShader renderer. The mitigations are: (1) run with xvfb (a virtual display) plus a real GPU passed through, (2) spoof WEBGL_debug_renderer_info at the CDP level (Chrome's remote-control protocol) with a believable renderer string, or (3) use a tool like Camoufox or CloakBrowser that patches the renderer at the C++ level (inside the browser engine itself).
### The renderer-platform coherence trap
Spoofing the renderer string alone is insufficient. Anti-bot vendors cross-check the claimed renderer against the platform (navigator.platform, navigator.userAgent) and the GPU's expected extension list - in other words, do all the clues agree with each other? A request claiming *Windows + Chrome 131* with renderer string *"Apple GPU"*, or *macOS* with an NVIDIA RTX renderer, gets blocked. The renderer extension set must also match the claimed hardware — a budget integrated GPU advertising the extensions only found on enthusiast cards is a clear tell.
Camoufox's approach is to maintain a database of real (platform, GPU, renderer, extensions, parameters) tuples harvested from real users, and serve a coherent one per session, so every value matches. This is more expensive than spoofing individual values, which is why DIY hardening of WebGL almost always trips at least one cross-check.
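The cross-check described above can be sketched as a small pure function. This is a minimal illustration, not any vendor's actual logic: the name checkRendererCoherence, the platform labels, and the pattern table are all assumptions made for the example, and a real detector would use a much larger, data-driven table.

```javascript
// Renderer substrings that indicate a CPU-based software rasterizer.
const SOFTWARE_RENDERERS = /swiftshader|llvmpipe/i;

// Hypothetical rule table: renderer patterns that are implausible for a claimed platform.
const IMPLAUSIBLE = {
  windows: [/apple gpu/i, /llvmpipe/i],
  macos: [/nvidia geforce rtx/i, /direct3d/i, /llvmpipe/i],
};

// Returns a list of findings; an empty list means no contradiction was found.
function checkRendererCoherence(claimedPlatform, renderer) {
  const findings = [];
  if (SOFTWARE_RENDERERS.test(renderer)) {
    findings.push('software rasterizer');
  }
  const rules = IMPLAUSIBLE[claimedPlatform.toLowerCase()] || [];
  if (rules.some((pattern) => pattern.test(renderer))) {
    findings.push(`renderer does not fit ${claimedPlatform}`);
  }
  return findings;
}
```

Calling checkRendererCoherence('Windows', 'Apple GPU') reports one finding, while the same renderer with a claimed platform of 'macOS' reports none. Returning a list rather than a boolean mirrors how a score-based detector would weigh each contradiction separately.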
### Example
```javascript
// What an anti-bot script reads to fingerprint WebGL
function webglFingerprint() {
  const canvas = document.createElement('canvas');
  const gl = canvas.getContext('webgl');
  if (!gl) return 'no-webgl'; // already suspicious
  const debug = gl.getExtension('WEBGL_debug_renderer_info');
  const renderer = debug
    ? gl.getParameter(debug.UNMASKED_RENDERER_WEBGL)
    : gl.getParameter(gl.RENDERER);
  return {
    renderer, // 'ANGLE (NVIDIA RTX 4070 ...)' or 'SwiftShader'
    vendor: gl.getParameter(gl.VENDOR),
    extensions: gl.getSupportedExtensions().sort().join(','), // ~70 entries
    maxTexture: gl.getParameter(gl.MAX_TEXTURE_SIZE), // 4096 / 8192 / 16384
    maxViewport: gl.getParameter(gl.MAX_VIEWPORT_DIMS).join('x')
  };
}
// SwiftShader anywhere in the renderer string is a hard block at major vendors.
```
### FAQ
**Q: Is WebGL fingerprinting more or less reliable than Canvas fingerprinting?**
More reliable for catching headless browsers — SwiftShader is an instant tell that canvas fingerprinting alone cannot catch. But it is less reliable for telling real users apart, because most people have one of a small set of common GPUs (integrated Intel, or Apple GPU on Mac), so many genuine visitors share the same value (high collision rates).
**Q: Can I just disable WebGL to avoid the fingerprint?**
No — disabling WebGL is itself a strong signal. On a modern browser, calling getContext("webgl") is expected to return a working context. Returning null or throwing an error is what scrapers and Tor Browser do, so anti-bot vendors block that behavior.
**Q: What renderer string should a spoofed browser use?**
A common real one that matches your claimed platform. The most-used renderers in production traffic are "ANGLE (Intel, Intel UHD Graphics ...)" on Windows, "Apple GPU" on macOS/iOS, and "Mali-G78" on mid-range Android. Pick one consistent with the User-Agent and don't change it within a session.
---
## What Is AudioContext Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-audiocontext-fingerprinting
**AudioContext fingerprinting renders a silent waveform through the Web Audio API, then reads back the resulting floating-point samples and hashes them.** In plain terms: a website tells your browser to process a sound it never plays, measures the exact numbers that come out, and squeezes them into a hash (a short fixed-length string). The output is determined by the exact audio subsystem: the operating system audio mixer, the DSP library (the code that does the audio math), CPU floating-point behaviour, and the audio driver. Two devices whose other fingerprint signals match can still produce different audio hashes, and headless browsers (browsers running with no visible window, common in automation) without a real audio stack produce a small set of giveaway values.
### Quick facts
- **Reads:** Floating-point output of a known oscillator + biquad filter chain
- **Why it works:** Audio subsystem produces hardware/OS-dependent rounding differences
- **Headless tell:** OfflineAudioContext returns one of ~3 known hashes when no real audio stack exists
- **Used by:** PerimeterX, DataDome, Akamai (as one of dozens of signals)
- **Flags:** Vanilla headless Chrome, most stealth plugins
### How the probe works
The standard probe uses OfflineAudioContext - a Web Audio API that renders samples without actually playing them, so the user hears nothing. The script creates an oscillator (a tone generator) at a known frequency (typically 1000 Hz), routes it through a DynamicsCompressorNode or BiquadFilterNode (audio effects, configured with known parameters), renders ~5000 samples, and hashes them with SHA-256.
The hash is stable per (browser, OS, CPU architecture, audio driver) and varies between them. In other words, the same setup always gives the same number, but a different setup gives a different one. Two iPhones produce the same hash; an iPhone and an Android produce different hashes; a real macOS user and a headless Chrome on the same machine produce different hashes because the audio fallback path is different.
### The headless tell
Headless Chrome on a server with no audio device falls back to a stub audio backend (a placeholder with no real sound hardware behind it). The stub produces a small set of known hashes - **roughly three** distinct values across the entire population of headless Chrome instances on Linux servers. Anti-bot vendors maintain a blocklist of these specific hashes and flag any request matching them.
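A blocklist lookup of this kind might look like the sketch below. It is illustrative only: the function names are invented for the example, FNV-1a stands in for SHA-256 so the code stays synchronous, and the blocklist entries are placeholders, because the real headless hashes are not given here.

```javascript
// FNV-1a over the raw bytes of the rendered samples (a stand-in for SHA-256 in this sketch).
function hashSamples(samples) {
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  let hash = 0x811c9dc5;
  for (const byte of bytes) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the state an unsigned 32-bit value
  }
  return hash.toString(16).padStart(8, '0');
}

// Placeholder values, not real headless hashes.
const KNOWN_HEADLESS_HASHES = new Set(['00000000', 'deadbeef', 'cafebabe']);

// Hashes a Float32Array of samples and reports whether the hash is on the blocklist.
function classifyAudioHash(samples, blocklist = KNOWN_HEADLESS_HASHES) {
  const hash = hashSamples(samples);
  return { hash, headlessSuspect: blocklist.has(hash) };
}
```

In a browser, the samples argument would be the Float32Array returned by getChannelData(0) on the rendered buffer. Because the hash covers the exact bytes, a difference in a single low-order bit of one sample is enough to change the result.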
Chrome flags such as --use-fake-device-for-media-stream do not help, because the OfflineAudioContext path is independent of media-device flags. The fix is one of: (1) run on a machine with a real audio device passed through to the browser, (2) install a virtual audio driver (PulseAudio dummy sink, BlackHole on macOS) that produces real-machine-like output, or (3) use a tool that patches AudioContext at the engine level (Camoufox, CloakBrowser).
### Why naive spoofing fails
Spoofing AudioContext in JavaScript is detectable in two ways. First, Function.prototype.toString() on the patched method reveals the patch - real Chrome's AudioContext.prototype.createOscillator.toString() returns "function createOscillator() { [native code] }", but a JS replacement returns the patch source instead, exposing the tampering. Second, the timing of OfflineAudioContext.startRendering() on a real audio stack vs a JS-stubbed one differs by orders of magnitude, and that timing is itself fingerprinted.
The only reliable mitigation is patching at the browser-engine level (below the JS layer) so toString() still returns [native code] and the render timing matches a real machine. This is what differentiates Camoufox / PatchRight from playwright-extra-plugin-stealth.
### Example
```javascript
// The standard AudioContext probe used by major anti-bot vendors
async function audioFingerprint() {
  const ctx = new OfflineAudioContext(1, 5000, 44100);
  const osc = ctx.createOscillator();
  osc.type = 'triangle';
  osc.frequency.value = 1000;
  const compressor = ctx.createDynamicsCompressor();
  compressor.threshold.value = -50;
  compressor.knee.value = 40;
  compressor.ratio.value = 12;
  compressor.attack.value = 0;
  compressor.release.value = 0.25;
  osc.connect(compressor);
  compressor.connect(ctx.destination);
  osc.start(0);
  const buf = await ctx.startRendering();
  // sum a stable slice; anti-bot vendors usually hash samples 4500-5000
  const samples = buf.getChannelData(0); // fetch the channel once, not per iteration
  let sum = 0;
  for (let i = 4500; i < 5000; i++) sum += Math.abs(samples[i]);
  return sum.toString();
}
// Headless Chrome on Linux servers returns ~3 distinct values, which are easily recognized.
```
### FAQ
**Q: Why don't scrapers just disable the Web Audio API?**
Disabling Web Audio means OfflineAudioContext is undefined or its constructor throws, which is a stronger signal than any specific hash. Real browsers always have a working Web Audio API, so its absence stands out immediately. Detection scripts check whether the API exists before they look at what it produces, so disabling it is worse than failing the probe.
**Q: How distinctive is the audio hash compared to canvas?**
Less distinctive between real users (collision rates around 1-in-1000 for popular configurations, meaning many real users share the same hash) but more distinctive between real and headless (the headless cluster is tiny and well-known). It is most useful as a headless-detection signal rather than a unique identifier.
**Q: Can I run a virtual audio device to fix this?**
Yes. On Linux, PulseAudio with a dummy sink produces real-machine-like output, and on macOS BlackHole works similarly. The catch is that the hash needs to be coherent with the rest of the fingerprint - a Linux PulseAudio hash on a request claiming to be Windows is its own tell.
---
## What Is Function.toString() Inspection?
URL: https://scrappey.com/qa/anti-bot/what-is-function-tostring-inspection
**Function.prototype.toString() inspection is a technique anti-bot scripts use to identify JavaScript functions that have been modified at runtime.** Every JS function has a .toString() method that returns its source code as text. For functions built into the browser, it returns a fixed placeholder: "function name() { [native code] }". But if a plugin has replaced that built-in function with its own version, .toString() returns the replacement's actual source code instead. A two-line check spots the difference, which is why playwright-extra-plugin-stealth and similar runtime patchers are identifiable to vendors that use this technique — most notably Kasada.
### Quick facts
- **The core check:** Function.prototype.toString.call(fn).includes("[native code]"), where fn is a built-in method or getter (for example the navigator.webdriver getter)
- **Vendor flagship:** Kasada — built the entire detection model around toString inspection
- **Identifies:** playwright-extra-plugin-stealth, undetected-chromedriver, puppeteer-stealth
- **Durable alternative:** Source-level patching (PatchRight, CloakBrowser, Camoufox) below the JS layer
- **Why it works:** toString() runs in the engine, not as JS — cannot be cheaply spoofed
### The two-line detection
The whole technique fits in two lines:
```javascript
const looksLikeNative = (fn) =>
  Function.prototype.toString.call(fn).includes('[native code]');
```
The text [native code] is the browser engine's way of saying "this function is built in, I'm hiding its real source." A genuine browser's navigator.webdriver getter, HTMLCanvasElement.prototype.toDataURL, and WebGLRenderingContext.prototype.getParameter all pass this check. A stealth plugin that replaces any of those with its own JavaScript function fails instantly, because the replacement's source code shows up word-for-word instead of the placeholder.
Anti-bot scripts run this check across dozens of methods, and even one failure is enough to score block-grade (high enough to get you blocked). Once you know to look, playwright-extra-plugin-stealth patches more than 20 detectable surfaces — each one a clear bot signal.
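As a rough sketch of how such a pass over many methods could be organised, the helper below applies the same check to a labelled set of functions. The names getterOf and findPatched are hypothetical, chosen for this example. One detail matters: an accessor property such as navigator.webdriver has no function value of its own, so the function to inspect is the getter on the prototype.

```javascript
const looksLikeNative = (fn) =>
  Function.prototype.toString.call(fn).includes('[native code]');

// Returns the getter function behind an accessor property, or undefined if there is none.
function getterOf(proto, prop) {
  const desc = Object.getOwnPropertyDescriptor(proto, prop);
  return desc && desc.get;
}

// Returns the labels of all entries whose function no longer prints as native.
// Entries that are not functions (missing in this environment) are skipped.
function findPatched(targets) {
  return Object.entries(targets)
    .filter(([, fn]) => typeof fn === 'function' && !looksLikeNative(fn))
    .map(([label]) => label);
}
```

In a browser, a call might look like findPatched({ webdriver: getterOf(Navigator.prototype, 'webdriver'), toDataURL: HTMLCanvasElement.prototype.toDataURL }). An empty array means every inspected function still prints as native.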
### Why naive workarounds fail
The obvious fix is to also patch Function.prototype.toString itself so it returns "[native code]" for any function a stealth plugin replaced. This fails for two reasons. First, the patched toString is itself a function whose .toString() can be inspected — it's turtles all the way down. Second, anti-bot scripts call Function.prototype.toString.toString() and check that it matches the real native signature, which it no longer does once patched.
You also cannot wrap functions in a Proxy (a JavaScript object that intercepts calls), because a Proxy gives itself away: fn.toString on a proxied function returns a different value than the original (and adds a new own-property), and that difference is itself detectable.
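Both failure modes can be reproduced in a few lines. The sketch below assumes the detector saved a reference to the native toString before any patch ran (in a browser it could also take one from a fresh iframe); that assumption, and the names isNativeSource, naivePatchIsVisible and proxyIsVisible, belong to the example only. It uses a stricter whole-string match than includes, because a replacement whose source merely contains the text [native code] would otherwise slip through.

```javascript
// Reference to the built-in, captured before any patch runs (an assumption of this sketch).
const pristineToString = Function.prototype.toString;

// True only if the whole string has the shape of a native function placeholder.
const isNativeSource = (src) =>
  /^function [\w$ ]*\(\)\s*\{\s*\[native code\]\s*\}$/.test(src);

// Applies the "obvious fix" and reports whether the saved reference still exposes it.
function naivePatchIsVisible() {
  const original = Function.prototype.toString;
  Function.prototype.toString = function toString() {
    return 'function toString() { [native code] }'; // always claim to be native
  };
  try {
    // Inspected through the saved reference, the replacement prints its real source.
    return !isNativeSource(pristineToString.call(Function.prototype.toString));
  } finally {
    Function.prototype.toString = original; // restore the built-in
  }
}

// Reports whether wrapping a native function in a Proxy changes how it prints.
function proxyIsVisible(nativeFn) {
  const proxied = new Proxy(nativeFn, {});
  return pristineToString.call(proxied) !== pristineToString.call(nativeFn);
}
```

In V8, a proxied function prints without its name, so proxyIsVisible(Math.sin) reports a difference even though the Proxy has no traps at all.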
### What actually works
The only durable fix is to patch **below the JavaScript engine**, in the C++ source code of Chromium or Firefox itself. Tools that do this:
- **Camoufox** — a Firefox fork with C++-level patches to the engine's function-source mapping
- **CloakBrowser** — a Chromium fork with 49 documented C++ patches, including toString behaviour
- **PatchRight** — patches the Playwright Python source so the injected scripts never exist as JS in the first place
These tools change the engine so that calling Function.prototype.toString on a method returns the original [native code] string from the function's metadata, no matter how the underlying behaviour was changed. Because the patch lives in C++, there is no JavaScript function left to inspect.
### Example
```javascript
// What Kasada's ips.js runs (paraphrased; the real version is obfuscated)
// navigator.webdriver is a boolean, so the function to inspect is its getter.
const webdriverGetter =
  Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver')?.get;
const SUSPECT_METHODS = [
  webdriverGetter, // most-checked
  HTMLCanvasElement.prototype.toDataURL,
  HTMLCanvasElement.prototype.getContext,
  WebGLRenderingContext.prototype.getParameter,
  CanvasRenderingContext2D.prototype.getImageData,
  Permissions.prototype.query,
  Notification.requestPermission,
  Function.prototype.toString // the meta-check
];
const patched = SUSPECT_METHODS.filter(fn => {
  if (typeof fn !== 'function') return false; // skip anything missing in this browser
  return !Function.prototype.toString.call(fn).includes('[native code]');
});
if (patched.length > 0) {
  // Send block-grade signal back to the anti-bot endpoint.
  // playwright-extra-plugin-stealth fails roughly 8 of these on default settings.
}
```
### FAQ
**Q: Does every anti-bot vendor use toString inspection?**
No. Kasada makes it the centerpiece. Akamai's sensor.js runs the check on a smaller set of methods. DataDome and PerimeterX inspect a handful of methods as part of their broader fingerprint. Cloudflare Bot Management leans more on TLS (the encryption layer behind https) and behaviour than on toString. Imperva and AWS WAF Bot Control don't use it heavily.
**Q: Why is playwright-extra-plugin-stealth still recommended in tutorials?**
Because it addresses older or simpler bot checks — the navigator.webdriver flag, a missing chrome runtime object, an unusual plugin-list length. But on any site protected by Kasada or recent Akamai it is counterproductive: each patch it adds is individually identifiable via toString, and they stack on top of each other.
**Q: Is there a way to detect whether the current site uses toString inspection?**
Yes. Open DevTools, set a breakpoint on Function.prototype.toString, then reload the page. If the breakpoint fires hundreds of times before the page becomes interactive, the site is running an aggressive inspection pass.
---
## What Is Font Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-font-fingerprinting
**Font fingerprinting identifies a device by working out which fonts are installed on it and measuring how that device draws text.** The idea is simple: a script draws a word in the font it wants to test, then draws it again in a known fallback font, and compares the width and height. If the two differ, the test font must be installed (otherwise the browser would have just used the fallback). The complete list of installed fonts - plus the small differences in how text is rendered - is high-entropy and stable, meaning it carries a lot of identifying detail and rarely changes. A headless server (a browser with no visible window, used for automation) ships with only a handful of fonts, so it stands out immediately.
### Quick facts
- **Two flavours:** Enumeration (which fonts exist) + metrics (how they render)
- **Measured via:** measureText() width/height, getBoundingClientRect, OffscreenCanvas
- **High signal:** Default Linux server font set is a known bot tell
- **Used by:** DataDome, PerimeterX, FingerprintJS, Akamai
- **Hardened by:** FontConfig profiles matching a real OS, Camoufox font patches
### How font enumeration works
Enumeration just means listing out what's there. The classic technique draws some text using three generic fallback fonts (monospace, sans-serif, serif) and records the rendered width and height. It then redraws the same text asking for a candidate font, with the generic as a backup (e.g. "Calibri", monospace). If the size changes, the candidate font exists and was used; if it matches the fallback size, the font is not installed. Run this across a list of a few hundred fonts and you get the full installed set.
Newer variants skip the page's HTML entirely and measure with CanvasRenderingContext2D.measureText() on an offscreen canvas - a hidden drawing surface. That is faster and harder to tamper with. The result is hashed together with the other signals into the overall browser fingerprint.
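The comparison loop can be sketched as follows. This is a simplified illustration rather than any vendor's script: the sample string, the 72px size, and the names canvasMeasure and detectFonts are assumptions, and the measuring function is injectable so the logic can be exercised outside a browser.

```javascript
const BASELINES = ['monospace', 'sans-serif', 'serif'];
const SAMPLE_TEXT = 'mmmmmmmmmmlli'; // arbitrary sample string chosen for this sketch

// Default measurer: text width on a canvas that is never attached to the page (browser only).
function canvasMeasure(fontFamily) {
  const ctx = document.createElement('canvas').getContext('2d');
  ctx.font = `72px ${fontFamily}`;
  return ctx.measureText(SAMPLE_TEXT).width;
}

// A candidate counts as installed if requesting it changes the width
// against at least one of the generic fallbacks.
function detectFonts(candidates, measure = canvasMeasure) {
  const baselineWidths = new Map(BASELINES.map((g) => [g, measure(g)]));
  return candidates.filter((font) =>
    BASELINES.some((g) => measure(`"${font}", ${g}`) !== baselineWidths.get(g))
  );
}
```

Testing against three fallbacks instead of one guards against a candidate font that happens to have the same width as a single generic family.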
### Why scrapers get caught
A fresh Linux container running headless Chrome ships with a tiny, well-known font set (often just DejaVu and a couple of Noto families). Real Windows and macOS machines have dozens of system fonts plus whatever each installed application adds. Anti-bots keep a catalogue of these “default server” font sets and flag them on sight. Worse, a mismatch between the OS claimed in the User-Agent (the browser's self-reported identity string) and the actual font set - UA says Windows, fonts say Linux - is a flat contradiction that a lie detector will catch.
Faking the font list in JavaScript is fragile: the measurements still come from the real renderer, so claiming a font exists while measureText returns the fallback size is itself a giveaway. The durable fixes are to install a real font set at the OS/FontConfig level (FontConfig is the Linux system that manages fonts), or to use a patched browser like Camoufox.
### Passing font checks without faking it
Because the measurements come from the real font rasteriser (the part of the system that turns a font into pixels), faking measureText or the font list in JavaScript is brittle - the list you claim and the glyph widths you actually measure drift apart, and that contradiction is the tell. The reliable fix is to make the font set genuine: install a font package at the OS / FontConfig level that matches the operating system your User-Agent claims, so a "Windows" identity really does carry Segoe UI, Tahoma and the rest of the expected Windows fonts.
This is exactly the kind of signal a patched browser like Camoufox handles deep in the engine rather than with a script injected into the page, keeping the rendered measurements and the reported font list in agreement. If you scrape through a managed web scraping API, fonts are part of the device profile the provider maintains, so you inherit a consistent set instead of the bare DejaVu-only fingerprint of a fresh Linux container.
### FAQ
**Q: How many fonts does it take to be unique?**
There's no fixed number. The pattern of which fonts are present or absent across about 200 common fonts is usually enough to single out most users. Combined with canvas and WebGL signals, the installed-font set pushes uniqueness to near-certainty.
**Q: Can I just disable font fingerprinting?**
Firefox's resistFingerprinting (RFP) setting limits the visible font set to a standard list, which helps against tracking but makes you look identical to every other person using that setting. For scraping you want a realistic, OS-consistent font set, not a restricted one.
**Q: Why is a Linux font set a problem if my UA says Windows?**
Because it is a contradiction. Anti-bots cross-check the measured font metrics against the platform you claim to be; a Windows User-Agent paired with a DejaVu-only font set is a textbook headless-server signature.
---
## What Is Math & JS Engine Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-math-fingerprinting
**Math fingerprinting identifies a browser by running math functions (sin, cos, tan, exp, log, pow) on fixed inputs and reading the very last bits of the answers.** Those final bits depend on the CPU's floating-point unit (the chip's math hardware), the system math library (libm - the C code that computes these functions), and the JavaScript engine's own shortcuts. The results are the same every time on a given machine, so they make a stable, hard-to-fake signal - and they reveal the real engine even when the User-Agent (the browser's self-reported identity) lies about it.
### Quick facts
- **Reads:** Last bits of Math.acos, atanh, expm1, sinh, pow, etc.
- **Varies by:** CPU FPU, libm version, V8 vs SpiderMonkey vs JavaScriptCore
- **Cost:** Microseconds — runs on every page load
- **Reveals:** A "Safari" UA running on V8 (i.e. a faker)
- **Related:** WASM timing, hardware concurrency, device memory
### Why the same formula gives different bits
The IEEE-754 standard (the rules for how computers store decimals) only promises exact results for basic operations: +, −, ×, ÷, and sqrt. Functions like Math.tan or Math.expm1 are left *implementation-defined* in their last bit or two - meaning each engine is free to round them slightly differently. Chrome's V8, Firefox's SpiderMonkey, and Safari's JavaScriptCore each ship their own math routines, and those routines may hand the work off to the platform's libm. The result: Math.tan(1e300) or Math.sinh(1) ends in hex digits that effectively name the engine + OS combination.
Because the answer is identical on every run of a given machine, it drops cleanly into a composite browser fingerprint with no random noise to filter out.
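A probe of this kind can be sketched in a few lines. The inputs below are illustrative choices, not a list taken from any particular detection script, and the names bitsOf and mathSignature are made up for the example.

```javascript
// Returns the raw IEEE-754 bits of a double as 16 hex digits.
function bitsOf(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return view.getBigUint64(0).toString(16).padStart(16, '0');
}

// Evaluates a fixed set of math calls and joins their bit patterns into one string.
function mathSignature() {
  const probes = {
    tan: Math.tan(1e300),
    sinh: Math.sinh(1),
    expm1: Math.expm1(1),
    acos: Math.acos(0.123),
    atanh: Math.atanh(0.5),
    pow: Math.pow(Math.PI, -100),
  };
  return Object.entries(probes)
    .map(([name, value]) => `${name}:${bitsOf(value)}`)
    .join('|');
}
```

Comparing bit patterns rather than printed decimals keeps the last-bit differences from being hidden by number formatting. The resulting string can be hashed alongside the other signals.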
### How it exposes spoofed browsers
This is the signal that catches lazy User-Agent spoofing. If a scraper claims a Safari UA but is actually running headless Chrome (Chrome with no visible window, usually on a server), the math probes return V8's values, not JavaScriptCore's. A lie detector compares the math signature against the engine you claim to be and flags the mismatch. The same trick exposes Chrome-on-Linux pretending to be Chrome-on-Windows when combined with other OS signals.
There is no JavaScript-level fix: you cannot reimplement libm to match a different platform without reimplementing the whole engine. Real consistency only comes from running the actual browser + OS you are claiming — which is why anti-detect stacks lock down the entire environment, not just the UA string.
### Why you cannot spoof your way out
The differences exploited here come from the JavaScript engine and the CPU's floating-point unit: the last bits of Math.tan(-1e300) or Math.sinh(1) differ between V8 (Chrome), SpiderMonkey (Firefox) and JavaScriptCore (Safari), and again across hardware. You cannot fake these results convincingly from inside a content script (the JavaScript a site can run in the page) without re-implementing the math, and any wrapper you add is itself detectable. So the engine signature has to genuinely match the browser you claim to be.
That makes engine fingerprinting a coherence test more than a value test: a tool running the SpiderMonkey engine must present a Firefox identity, not a Chrome one, or the math bits and the User-Agent contradict each other. This is why Camoufox is built on Firefox and reports as Firefox — and why bolting a Chrome User-Agent onto a non-V8 runtime is caught instantly by a lie detector.
### FAQ
**Q: Is math fingerprinting high-entropy on its own?**
No — on its own it mostly identifies the engine + OS family, not a specific device (low entropy: it narrows down what kind of machine you are, not which one). Its real value is consistency and serving as a lie-detector cross-check, not standalone uniqueness.
**Q: Can I randomise the math results?**
Patching the Math functions in JavaScript is detectable - the patched function fails the native-code toString check that reveals it is no longer the browser's built-in version - and randomising breaks the stable-per-machine behavior detectors expect. It does more harm than good.
**Q: Does WebAssembly have the same issue?**
WebAssembly (WASM - a low-level binary format browsers run for speed) specifies math more strictly, but its SIMD timing and rounding still leak engine details, which is why anti-bots increasingly pair Math probes with WASM ones.
---
## What Is Fingerprint Lie Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-fingerprint-lie-detection
**Fingerprint lie detection is the practice of verifying that the signals a browser reports are internally consistent and untampered, rather than trusting them at face value.** (A browser exposes hundreds of signals - the User-Agent string, the list of fonts, screen size, and so on - that together form its fingerprint.) Popularised by the open-source CreepJS project, it flips the problem: a spoofer can change any single value, but making *all* values agree with each other - and survive native-code integrity checks (proof a value still comes from the real browser, not a script) - is extremely hard. A detected lie is a stronger bot signal than any single fingerprint.
### Quick facts
- **Popularised by:** CreepJS (abrahamjuliot)
- **Checks:** Native-code integrity, prototype tampering, cross-property contradictions
- **Key trick:** Compare main thread vs Web Worker navigator
- **Beats:** UA spoofing, canvas noise, property overrides
- **Try it:** /tools/browser-fingerprint-checker
### The three classes of lie
**1. Tampering lies.** When a script replaces a built-in function - say the navigator.webdriver getter or HTMLCanvasElement.prototype.toDataURL - the replacement no longer prints as [native code] the way a genuine browser function does. Asking Function.prototype.toString.call(fn) what the function looks like - and checking that toString itself has not been tampered with - exposes the patch. See function toString inspection.
**2. Contradiction lies.** Two reported values cannot both be true: a Windows User-Agent paired with a Linux font set, a navigator.platform of Win32 but a math signature (tiny rounding differences unique to each JS engine) from a different engine, userAgentData.mobile = true alongside maxTouchPoints = 0 (a touchscreen with zero touch points), or a screen availWidth larger than its width.
**3. Scope lies.** The most elegant: spawn a Web Worker (a background JavaScript thread) and read navigator from inside it. Many spoofing tools only patch the main-thread navigator and forget the worker scope, so the two disagree. CreepJS leans heavily on this.
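The contradiction and scope checks lend themselves to a short sketch. The code below is not CreepJS's implementation; it is a minimal illustration that works on plain snapshot objects, and the function names, the field names on the snapshot, and the default key list are assumptions made for the example.

```javascript
// `signals` is a plain snapshot whose field names mirror the browser properties named above.
function findContradictions(signals) {
  const lies = [];
  if (signals.mobile === true && signals.maxTouchPoints === 0) {
    lies.push('mobile device with zero touch points');
  }
  if (signals.availWidth > signals.width) {
    lies.push('availWidth larger than screen width');
  }
  if (/Windows/.test(signals.userAgent) && !/^Win/.test(signals.platform)) {
    lies.push('Windows User-Agent with non-Windows platform');
  }
  return lies;
}

// Scope check: returns the keys whose values differ between the main thread
// and a snapshot taken inside a Web Worker.
function diffScopes(
  mainNav,
  workerNav,
  keys = ['userAgent', 'platform', 'hardwareConcurrency', 'language']
) {
  return keys.filter((key) => mainNav[key] !== workerNav[key]);
}
```

In a browser, the worker snapshot would be collected by a small Worker that reads its own navigator and posts the values back to the page. Any key returned by diffScopes is a scope lie.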
### Why lie detection beats spoofing
Single-value spoofing assumes the vendor reads each signal on its own. Lie detection assumes nothing and instead measures *coherence* - whether everything fits together. To pass, a scraper must present a fingerprint where every signal - UA, platform, fonts, canvas, WebGL renderer, math, timezone, languages, worker scope - matches one real, existing device. That is why the durable approach is to run a genuine browser on genuine hardware (or a deeply patched build like Camoufox / CloakBrowser) rather than overriding properties at runtime.
You can see exactly which lies your own browser exposes - and the trust score they add up to - in the Browser Fingerprint Checker.
### Why coherence is the unit of measurement
The lesson from lie detection is that detectors measure the whole identity, not any single field. Once a detector cross-checks the User-Agent against the JS engine math, the font set against the OS, the GPU renderer string against the platform, and the timezone against the IP geolocation, a value changed in isolation only creates a new contradiction. Any field you change has to stay consistent with every field you did not change.
This is why tools built around a real, internally consistent device profile — the approach managed scraping APIs and patched browsers such as Camoufox take — behave differently from runtime property overrides. A coherent stack (engine, fonts, canvas, WebGL, headers, network) has no seam for cross-checks to catch, whereas JavaScript overrides layered on top of a headless Chrome still surface contradictions the detector can read.
### FAQ
**Q: What is the single most common lie that gets scrapers caught?**
The mismatch between the main-thread navigator and the one inside a Web Worker (a background thread). Many automation tweaks patch window.navigator but not the worker scope, and CreepJS-style checks read both and compare them.
**Q: Does a detected lie always mean a block?**
Not necessarily - vendors fold it into a score rather than blocking outright. But a tampering lie (a patched built-in function) is high-confidence, so it usually pushes the session into a challenge or a block.
**Q: How do I see my own lies?**
Run the Browser Fingerprint Checker. It performs native-code integrity checks, cross-property contradiction checks, and a worker-scope comparison, then reports each finding with a trust score.
---
## What Is Favicon Fingerprinting (Supercookies)?
URL: https://scrappey.com/qa/anti-bot/what-is-favicon-fingerprinting
**Favicon fingerprinting (the "Supercookie" technique) abuses the browser's separate, long-lived favicon cache to store a persistent identifier that ordinary cookie controls do not clear.** A favicon is the small icon shown in a browser tab; browsers keep a special store for these icons (the F-cache) so they do not download them twice. That store is keyed per URL path and survives incognito sessions and normal cache clears. A site can write a unique ID into it by controlling which favicon paths your browser is forced to request, then read the ID back later by watching which icons your browser re-downloads (a cache miss) versus which it already has (a cache hit).
### Quick facts
- **Disclosed:** 2021, "Supercookie" research by Jonas Strehle
- **Abuses:** The dedicated, persistent favicon cache (F-cache)
- **Survives:** Incognito, cookie clears, normal cache flush
- **Encodes ID via:** Hit/miss pattern across N favicon paths (N bits)
- **Status:** Browsers partially mitigated; concept still instructive
### How the favicon supercookie works
Browsers cache favicons in a store separate from the normal HTTP cache, optimised for speed and kept for a long time. The Supercookie attack sets up a set of redirect paths, each with its own favicon. Think of each path as one bit, and the cache as a row of those bits spelling out a number.
To *write* an ID, the server bounces a first-time visitor through a sequence of paths, serving (or withholding) a favicon at each one so the browser ends up caching a specific subset. To *read* the ID back, the server watches which favicon requests the browser makes on return: a cached favicon is not re-requested, so the pattern of which icons are present versus missing reconstructs the stored bits. With about 32 paths you get a 32-bit identifier — enough to tell billions of visitors apart.
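The bit encoding can be illustrated with two small functions. This sketch covers only the mapping between an ID and a hit/miss pattern; the path scheme /f/0 to /f/31 and the function names are hypothetical, and it ignores how a server tells a first visit from a return visit.

```javascript
const BITS = 32; // one favicon path per bit, as described above

// Hypothetical path scheme for this sketch.
const pathForBit = (i) => `/f/${i}`;

// Write: the paths whose favicon the server lets the browser cache (bits set to 1).
function pathsToCache(id, bits = BITS) {
  const paths = [];
  for (let i = 0; i < bits; i++) {
    if (Math.floor(id / 2 ** i) % 2 === 1) paths.push(pathForBit(i));
  }
  return paths;
}

// Read: a favicon request is a cache miss (bit 0); no request means a cache hit (bit 1).
function idFromRequests(requestedPaths, bits = BITS) {
  const requested = new Set(requestedPaths);
  let id = 0;
  for (let i = 0; i < bits; i++) {
    if (!requested.has(pathForBit(i))) id += 2 ** i;
  }
  return id;
}
```

The functions use plain arithmetic instead of bitwise operators because JavaScript's bitwise operators work on signed 32-bit integers, which would turn IDs above 2^31 negative.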
### Why it matters for scraping and privacy
Favicon fingerprinting is a reminder that stateful tracking — recognising you by data left behind on your machine — is not limited to cookies and localStorage (a browser key/value store websites can write to). A scraper that rotates cookies but reuses the same browser profile can still be linked across sessions through the favicon cache and similar leftovers (HSTS pins, HTTP/2 connection coalescing, the disk cache itself — see cache-timing fingerprinting). The practical defence for automation is a genuinely fresh, isolated profile per identity — not just cleared cookies. Major browsers added partitioning and mitigations after the 2021 disclosure, but the broader class of attack (persistent side caches used as supercookies) remains relevant.
### Defending against it when scraping
Because the favicon supercookie lives in a cache that survives normal cookie clearing, the defence is isolation: give every scraping session its own browser profile or container, so the favicon cache (like cookies, localStorage and the HTTP cache) starts empty and is thrown away afterward. Reusing one long-lived profile across thousands of requests lets the cache fill up into a stable identifier that ties all your sessions together.
There is a subtler point, though: an *empty* favicon cache on every single visit is itself slightly odd for a "returning" user. For crawls that need to look like a real return visit, you want the cache to persist within one session but reset between identities. Managed browser pools and a web scraping API handle this rotation for you, pairing each fresh profile with a matching proxy so the favicon cache, cookies and IP all turn over together.
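A minimal sketch of per-identity isolation, assuming a Chromium-based browser launched from Python. The helper names and the proxy address are placeholders; --user-data-dir and --proxy-server are standard Chromium command-line switches.

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def fresh_profile(prefix: str = "identity-"):
    """Yield an empty profile directory and delete it when the identity is done."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


def chrome_args(profile_dir: str, proxy: str) -> list[str]:
    # Pair the throwaway profile with the proxy used for this identity.
    return [f"--user-data-dir={profile_dir}", f"--proxy-server={proxy}"]


with fresh_profile() as profile:
    args = chrome_args(profile, "http://proxy.example:8080")  # placeholder proxy
    # Launch the browser with these args and run the whole session here;
    # the favicon cache, cookies and HTTP cache vanish when the block exits.
```

Keeping the whole session inside one with-block is what lets the cache persist during the visit and still reset between identities.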
### FAQ
**Q: Does clearing cookies stop favicon tracking?**
No. The favicon cache is stored separately from cookies, and in many browser versions a normal cache clear does not touch it either — which is exactly what made the Supercookie technique powerful.
**Q: Is this still exploitable today?**
Browsers shipped partial fixes and cache partitioning after 2021, but using persistent side-caches as supercookies remains a live research area. Treat any reused profile as potentially linkable across sessions.
**Q: How should a scraper defend against it?**
Use a fully isolated, fresh browser profile per identity rather than reusing one profile with rotated cookies. Disposable containers or per-session profiles stop a site from correlating your separate sessions.
---
## What Is Browser Extension Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-browser-extension-detection
**Browser extension detection infers which extensions are installed by probing for the resources and side effects they expose to web pages.** Extensions are the add-ons you install in your browser - ad blockers, password managers, and so on. Each one ships its own images, scripts, and stylesheets as "web-accessible resources" sitting at predictable URLs, and many also change the page: ad blockers hide elements, others inject globals (variables the extension adds to the page). By requesting those URLs or watching for those changes, a website can build a list of which extensions you have. That list is a useful fingerprinting signal - and, oddly, having no extensions at all is itself a clue that you might be an automated browser.
### Quick facts
- **Probes:** web_accessible_resources URLs (chrome-extension://<id>/...)
- **Also via:** DOM mutations, injected globals, behaviour timing
- **Reveals:** Ad blockers, password managers, automation helpers
- **Bot tell:** A profile with zero extensions + zero history
- **Privacy:** Extension set can be near-unique across users
### Resource probing and behavioural detection
The most direct method targets those shipped files. An extension lists files as web_accessible_resources in its manifest (its config file), which makes them reachable at a fixed address: chrome-extension://<extension-id>/path. Every extension has a stable ID, so a page can try to fetch() or load an <img> at a known ID and path - if it loads, the extension is installed. Manifest V3 (the newer extension format) made this harder: an extension can restrict which sites may load each resource and can opt into a dynamic, per-session URL (use_dynamic_url) instead of the fixed ID. Firefox goes further and gives each installation a random UUID. Even so, small differences in timing and error messages still leak an extension's presence in many cases.
Indirect methods watch for what an extension *does* rather than what it ships: ad blockers remove elements with bait class names, password managers inject icons into form fields, and grammar checkers add overlays. A site plants bait and watches whether it gets altered, the way you might leave a marked item out to see if someone touches it.
### Why it matters for bot detection
For anti-bot purposes the signal cuts both ways. A real human profile usually carries a handful of common extensions (uBlock Origin, a password manager). A freshly spun headless profile - a browser launched with no visible window and a blank slate - carries none, which, combined with empty history and a default font set, paints a clear automation picture. Conversely, some automation frameworks inject their own helper extensions whose resources are detectable directly. The realistic profile for scraping mirrors a believable human: a small, plausible extension set rather than a sterile blank slate.
### Avoiding extension tells in automation
Two things give automation away here. First, automation-specific extensions and helpers (old Selenium IDE artifacts, injected helper scripts) expose web_accessible_resources that a page can probe for with a simple image or fetch load. Second, the *absence* of any extension at all - no ad blocker, no password manager, none of the resource-blocking behaviour a real user's browser exhibits - is itself a weak signal that you are a clean automation profile.
The fix is to drive the browser through the DevTools Protocol (Chrome's built-in remote-control interface) rather than injected extensions, so there are no chrome-extension:// resources to fingerprint, and to let the profile look ordinarily "lived-in" rather than pristine. Tools like Camoufox and managed scraping backends aim for this middle ground: no automation-specific extensions to detect, but a realistic, coherent profile rather than an obviously empty one.
### FAQ
**Q: Did Manifest V3 kill extension detection?**
No. It made resource-URL probing harder: extensions can limit which sites may load their resources and can serve them from a dynamic per-session URL, so a page can no longer count on requesting a fixed ID. But behavioural detection (watching for the DOM changes an extension makes) and timing side channels still work, so the signal did not disappear.
**Q: Is having no extensions suspicious?**
On its own, no. But most real browsers carry at least one extension, so a completely empty set stands out. Combined with other freshly-provisioned signals - empty history, default fonts, default screen size - a totally sterile profile contributes to a bot score.
**Q: Can extension detection identify me personally?**
Yes, it can contribute. The specific combination of extensions you have installed is often near-unique, so it adds meaningful entropy (identifying detail) to your fingerprint and helps track you across sites.
---
## What Is Sensor Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-sensor-fingerprinting
**Sensor fingerprinting identifies a mobile device from the minute calibration errors in its motion and environment sensors.** Every accelerometer (the chip that senses movement) and gyroscope (the chip that senses rotation) is built with microscopic manufacturing flaws, so two phones held in the exact same position report slightly different numbers. Web pages can read these values through the DeviceMotion/DeviceOrientation and Generic Sensor APIs - the browser interfaces that expose a device's motion. For bot detection the giveaway is usually the reverse: an emulator or headless browser reports no sensor data at all, or values that are impossibly clean and unchanging.
### Quick facts
- **Sensors:** Accelerometer, gyroscope, magnetometer, ambient light
- **Web APIs:** DeviceMotionEvent, DeviceOrientationEvent, Generic Sensor API
- **Identifies via:** Per-device calibration error in raw readings
- **Bot tell:** Missing sensors or perfectly static/zero values
- **Context:** Mostly relevant to mobile / mobile-emulation scraping
### Calibration noise as a unique ID
Research (notably the “fingerprinting via accelerometer calibration” work) showed something simple but powerful: the fixed offset and gain errors stamped into a phone’s MEMS sensors (the tiny mechanical chips inside) at the factory are stable and can be measured straight from JavaScript. Record a short burst of readings, fit the calibration curve, and you get an identifier that survives app reinstalls and browser resets - because it comes from the silicon itself, not from any software setting you could clear. Ambient light and magnetometer readings pile on extra entropy (distinguishing power), so the combined fingerprint narrows down to one specific unit.
Browsers pushed back by gating these APIs behind HTTPS (the encrypted connection behind https://), explicit user permission, and slower sampling. But on permissive setups the readings are still a usable fingerprinting signal.
### Why it matters for mobile scraping
When you emulate a mobile device - faking a phone User-Agent and touch support - anti-bots on mobile-heavy sites may ask for motion data. A desktop headless browser claiming to be an iPhone returns no DeviceMotionEvent stream, or returns constant zeros. That contradicts the device it claims to be, and a fingerprint lie-detection check flags it. So convincing mobile emulation needs plausible, slightly-noisy sensor playback - not silence. That is one reason real-device farms (banks of actual phones) stay popular for the hardest mobile targets.
### What it means for mobile and app scraping
Sensor fingerprinting is mostly a mobile problem, and it is brutal for emulators. A headless Android emulator, or a desktop browser pretending to be a phone, reports either no DeviceMotion/DeviceOrientation data at all, or perfectly flat, noise-free values that no real accelerometer ever produces. A genuine device leaks tiny per-unit calibration offsets and a constant trickle of jitter even sitting still on a table - and detectors look for exactly that texture.
This is why real-device farms and residential mobile setups exist for scraping mobile sites and apps at scale: faking a believable sensor stream is far harder than faking a User-Agent. If you only need the mobile *content* and do not have to pass a hardware check, requesting the mobile version through a web scraping API avoids the sensor surface entirely instead of trying to forge it.
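The missing, flat and jittery cases described above can be told apart with a very small check. This is a sketch of what a detector might do with one axis of accelerometer samples collected by the page, not any vendor's actual logic; the function name and the min_jitter threshold are assumptions.

```python
from statistics import pstdev


def sensor_verdict(samples: list[float], min_jitter: float = 1e-4) -> str:
    """Classify one axis of accelerometer readings (m/s^2)."""
    if not samples:
        return "missing"      # no DeviceMotion data arrived at all
    if all(value == 0 for value in samples):
        return "all-zero"     # constant zeros from a stubbed API
    if pstdev(samples) < min_jitter:
        return "too-clean"    # perfectly flat, noise-free values
    return "plausible"        # small, continuous jitter like a real chip


emulator = sensor_verdict([9.81, 9.81, 9.81, 9.81])
real_phone = sensor_verdict([9.79, 9.83, 9.80, 9.82])
```

A production detector would look at all axes, sampling rate and longer windows, but the principle is the same: silence and perfection are both suspicious.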
### FAQ
**Q: Does sensor fingerprinting affect desktop scraping?**
Rarely. Most desktops and laptops have no motion sensors, so sites do not expect any. It only matters when you emulate a mobile device - there, the missing sensor data contradicts the phone you claim to be.
**Q: Are these APIs still available to web pages?**
Yes, but modern browsers now gate them behind HTTPS, a user permission prompt, and rate limits. On permissive setups they are still a working fingerprinting and bot-detection signal.
**Q: How do real-device farms help?**
They run on genuine phones with genuine sensors, so the motion data looks real and noisy, just as a detector expects - something an emulator cannot easily fake at scale.
---
## What Is Battery Status API Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-battery-api-fingerprinting
**Battery Status API fingerprinting used the precise charge level and charging/discharging times exposed by navigator.getBattery() as a short-lived device identifier.** In plain terms: a website could read exactly how full your battery was — *fingerprinting* means using such details as a clue to tell one visitor apart from another. Because that readout (level to two decimals, plus seconds-to-full or seconds-to-empty) was nearly identical across two pages loaded moments apart, two sites could tell they were seeing the same visitor. It is the textbook case of a helpful browser API that turned into a tracking and bot-detection vector, and most browsers have since restricted or removed it.
### Quick facts
- **API:** navigator.getBattery() (Battery Status API)
- **Leaked:** level, charging, chargingTime, dischargingTime
- **Disclosed:** 2015 "battery privacy leak" research
- **Status:** Removed in Firefox; restricted/secure-context in Chrome
- **Bot tell:** Always-100%/charging, or API missing where expected
### How a battery became an identifier
The 2015 research showed the readout combined the battery level (for example 0.57, meaning 57% full) with chargingTime and dischargingTime in seconds. Together those values produced roughly 14 million possible combinations at any instant — enough to pick out a single device. Two sites reading the API within a few seconds saw the same set of numbers, letting them re-identify and link a user even right after they cleared their cookies, for that short window. Firefox removed the API; Chrome restricted it to secure contexts (pages served over https) and rounded the values to be less precise. The episode is now a standard teaching example of an unintended fingerprinting surface — a feature that quietly leaks identifying detail.
### The bot-detection angle
For anti-bot systems the battery readout is a coherence check — a test of whether a browser's story holds together. A server-hosted headless browser (one running on a server with no screen) has no battery, so the API is missing or reports a permanently full, charging state. A laptop user-agent (UA) that claims no battery, or a battery frozen at exactly 1.0 and charging forever, looks mildly suspicious and feeds a lie-detection score alongside stronger signals. It is rarely decisive on its own, but it is cheap to read and adds to the overall picture.
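As a rough illustration of that coherence check, the sketch below scores a battery reading that the page has already sent to the server. The function name, the expects_battery flag and the flag strings are hypothetical; the field names follow the Battery Status API (level, charging, chargingTime, dischargingTime).

```python
import math
from typing import Optional


def battery_flags(expects_battery: bool, reading: Optional[dict]) -> list[str]:
    """Return weak bot signals; an empty list means nothing looked odd."""
    flags = []
    if reading is None:
        # API absent: only notable if the claimed device should expose it.
        if expects_battery:
            flags.append("battery API missing where expected")
        return flags
    frozen_full = (
        reading["level"] == 1.0
        and reading["charging"]
        and reading["chargingTime"] == 0
        and math.isinf(reading["dischargingTime"])
    )
    if frozen_full:
        flags.append("battery frozen at 100% and charging")
    return flags
```

A real plugged-in laptop can legitimately report the frozen-full state, which is why such a flag should only nudge a score rather than block on its own.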
### Current status and how to handle it
The Battery Status API has been walked back since its 2012 debut: Firefox removed it outright, and Chrome restricts it (notably gating it behind secure contexts and reducing its precision). That history matters for scraping because the API's *presence or absence* is now itself a fingerprint surface — a browser claiming to be current Firefox while still exposing navigator.getBattery() is contradicting itself, and a Chrome profile returning suspiciously round values (exactly 100%, charging, infinite time) looks synthetic.
The right move is not to invent battery values but to match the API surface to the browser you claim to be: expose it only where that browser version really does, and let the values reflect a plausible device. Patched browsers such as Camoufox keep these capability surfaces aligned with the reported identity, which is more robust than bolting a fake getBattery() onto a stock headless build.
### FAQ
**Q: Is the Battery Status API still usable?**
Firefox removed it entirely; Chromium keeps a restricted version behind a secure context (https) with coarser values. You should not rely on it, and its absence is itself a signal that anti-bot systems can read.
**Q: Was the battery a strong identifier?**
Only briefly — it could link a visitor across a window of seconds, not permanently. Its lasting importance is as a cautionary tale about how a convenience API leaks entropy (identifying detail that narrows down who you are).
**Q: Does it matter for scraping today?**
Marginally. It is a coherence and lie-detection input: a laptop UA with no battery, or a frozen full-charge value, adds a small amount to a bot score rather than triggering a block on its own.
---
## What Is Timing & Cache Side-Channel Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-timing-attack-fingerprinting
**Timing-based fingerprinting uses high-resolution clocks to measure how long operations take, turning microarchitectural and rendering behaviour into a hardware signature.** In plain terms: a script times tiny jobs your browser does - running math, drawing graphics - and the exact durations reveal what kind of hardware you have. Using performance.now() (a built-in microsecond stopwatch), SharedArrayBuffer counters, or WebGL/WASM workloads, a script measures CPU cache latency, GPU draw time, and memory speed. Those timings identify a class of hardware and, crucially, reveal when a "browser" is actually rendering in software - a hallmark of headless automation (a browser running with no screen, usually on a server).
### Quick facts
- **Clocks:** performance.now(), SharedArrayBuffer timer, requestAnimationFrame
- **Measures:** CPU cache (Prime+Probe), GPU draw time, WASM/SIMD speed
- **Reveals:** Hardware class + software-rendering fallbacks
- **Mitigated by:** Clock coarsening, cross-origin isolation gating
- **Related:** WASM fingerprinting, WebGL fingerprinting
### From clocks to hardware signatures
Math tests and pixel hashes tell you about the *software* stack; timing tells you about the *hardware* underneath it. A script runs a fixed GPU draw or a WASM SIMD kernel (a CPU-heavy compute task) and measures how long it actually takes on the clock. A real discrete GPU finishes a shader far faster than Chrome's SwiftShader software renderer (which fakes graphics work on the CPU when no GPU is present), and a real CPU's cache hierarchy produces a Prime+Probe latency curve - a known timing pattern from cache attacks - that an emulated environment does not. Academic side-channel work (cache attacks, GPU timing) showed these measurements can even infer what other tabs or processes are doing - which is why browsers coarsened performance.now() and gated SharedArrayBuffer behind cross-origin isolation after Spectre (a 2018 CPU vulnerability that abused precise timers).
### Why headless browsers fail timing checks
Headless browsers on GPU-less servers have no graphics chip, so they fall back to software rendering - drawing on the CPU instead. A draw that takes ~2 ms on a real GPU might take 50–200 ms in SwiftShader - a giant, obvious tell. Anti-bots that pair a canvas probe with a timing probe can catch a replayed canvas hash (a saved-and-reused image fingerprint): the pixels look real but the render took software-renderer time. The same logic flags WASM workloads that run at emulated speed. There is no JavaScript fix; you need real (or GPU-accelerated) hardware to produce real timings.
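The canvas-plus-timing pairing can be sketched as a single server-side decision. The 50 ms cut-off is an assumption taken from the low end of the SwiftShader range quoted above, and the function name and labels are illustrative.

```python
SOFTWARE_RENDER_MS = 50.0  # assumed cut-off between GPU and software rendering


def canvas_timing_verdict(hash_matches_real_gpu: bool, render_ms: float) -> str:
    """Combine a canvas-hash lookup with the measured render time."""
    if not hash_matches_real_gpu:
        return "unknown-hash"     # pixels do not match any known GPU output
    if render_ms >= SOFTWARE_RENDER_MS:
        return "replayed-hash"    # GPU-looking pixels, software-renderer timing
    return "consistent"           # pixels and timing tell the same story
```

A real system would calibrate the cut-off per workload and per device class instead of using one fixed number.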
### Why timing is hard to fake — and what works
Timing side-channels are powerful precisely because they read the real hardware underneath the browser. Even with performance.now() resolution deliberately coarsened to defend against Spectre, the relative cost of operations - cache hits versus misses, JIT warm-up (the moment the engine compiles hot JavaScript to fast machine code), GPU draw timing - reflects the actual CPU and memory you are running on. You cannot convincingly fake those ratios from JavaScript, because the code you add to fake them also takes measurable time.
So the practical answer is to run on hardware whose timing profile matches the identity you present, rather than to spoof clocks. A datacentre VM running headless Chrome behind a residential User-Agent has a timing signature that does not match a real consumer laptop. Residential and real-device infrastructure - the kind a managed web scraping API runs behind - produces genuine timing characteristics instead of trying to forge them.
### FAQ
**Q: Why did browsers reduce timer precision?**
After Spectre/Meltdown (CPU flaws disclosed in 2018), high-resolution timers let attackers run cache side-channel attacks - reading secrets by timing memory access. To blunt this, browsers coarsened performance.now() (made it less precise) and gated SharedArrayBuffer behind cross-origin isolation, which also limits some timing fingerprinting as a side effect.
**Q: Can timing detect a replayed canvas?**
Yes. If the canvas pixels match a real GPU but the render took software-renderer time (much slower), that contradiction gives it away - it reveals a headless browser replaying a harvested hash rather than drawing the image live.
**Q: Is there a software workaround for slow headless rendering?**
Not a reliable one. Real or GPU-accelerated hardware is what produces real timings, so there is no pure-code fix. This is a key reason serious operations run browsers on machines with actual GPUs.
---
## What Is Fingerprint Clustering?
URL: https://scrappey.com/qa/anti-bot/what-is-fingerprint-clustering
**Fingerprint clustering is the practice of grouping fingerprints from millions of real visitors by similarity, then rejecting any new visitor whose fingerprint does not fall inside a known cluster.** A browser fingerprint is the bundle of traits a site can read from your browser (GPU, screen size, fonts, and more). Clustering judges the *combination* of those traits rather than each one in isolation — closely related to fingerprint lie detection, but driven by population statistics instead of internal contradictions. The catch for a spoofer: you can make every field individually valid yet still produce a combination no real device has ever sent.
### Quick facts
- **Decision basis:** Distance to the nearest cluster of real fingerprints
- **Common algorithms:** K-Means / DBSCAN, Isolation Forest, co-occurrence tables
- **Catches:** Field combinations that never occur on real hardware
- **Key inputs:** Canvas/WebGL hash, GPU, fonts, cores, memory, screen, timezone
- **Beats:** Randomised and hand-picked "valid-looking" fingerprints
### Why per-field validation is not enough
A simple anti-bot check validates each signal on its own: the User-Agent is a real Chrome string, the screen is 1920x1080, the GPU is an NVIDIA GTX 1080 Ti, the canvas hash (a fingerprint built from a tiny image the browser draws, which varies by hardware) looks like real pixel data, and the timezone is Europe/Amsterdam. Every field passes on its own — yet that exact combination may never have appeared in reality. Clustering exists to catch this kind of fake: signals that are believable individually but collectively impossible.
### How clustering works
**1. Collect.** Every genuine visit is stored as a full fingerprint vector — a list of values for canvas hash, GPU, screen, font count, CPU cores, device memory, timezone, and platform.
**2. Cluster.** Real hardware and software setups repeat across many people, so the stored data forms natural groups. NVIDIA + Chrome on Windows lands in one cluster with a few hundred known canvas hashes; Apple Silicon + Safari forms another; Intel laptops a third.
**3. Score.** A new fingerprint is measured by its distance to the nearest cluster, that is, how far it sits from any known group. A canvas hash never seen for that GPU, a font count of zero, or a high-end GPU paired with only 2 GB of reported device memory (browsers cap the reported value at 8 GB, so 8 is normal for a strong machine) pushes that distance past a rejection threshold, and the visitor is blocked.
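The scoring step can be sketched with plain Euclidean distance to a handful of cluster centres. Everything concrete here is invented for illustration: the three centroids, the per-feature scale and the rejection threshold are assumptions, and a real system would learn them from its own traffic.

```python
import math

# Hypothetical cluster centres: (CPU cores, device memory in GB, font count).
CENTROIDS = {
    "windows-nvidia-chrome": (8.0, 8.0, 120.0),
    "mac-apple-safari": (8.0, 8.0, 90.0),
    "windows-intel-laptop": (4.0, 4.0, 110.0),
}
SCALE = (2.0, 2.0, 15.0)   # rough spread per feature so none dominates
REJECT_DISTANCE = 3.0      # assumed rejection threshold


def distance(vec, centre) -> float:
    return math.sqrt(sum(((v - c) / s) ** 2 for v, c, s in zip(vec, centre, SCALE)))


def score(vec):
    """Return (verdict, nearest cluster name, distance to it)."""
    name = min(CENTROIDS, key=lambda n: distance(vec, CENTROIDS[n]))
    dist = distance(vec, CENTROIDS[name])
    return ("reject" if dist > REJECT_DISTANCE else "accept"), name, dist


ordinary = score((8, 8, 118))   # sits close to a known cluster
no_fonts = score((8, 8, 0))     # font count of zero is far from every cluster
```

With these made-up numbers the first fingerprint is accepted and the zero-font one is rejected, which mirrors the font-count example in the text.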
### What makes fingerprints cluster naturally
**Deterministic rendering.** The same GPU + driver + browser version always produces the same canvas and WebGL output (WebGL exposes details about your graphics hardware). A synthetic or solid-colour canvas yields a hash no real GPU has ever produced, so it sits far from every cluster.
**OS-bound font sets.** Different operating systems ship different fonts, and the exact pixel measurements differ slightly per platform. A real population shows this natural variation; a single hardcoded value repeated on every request does not.
**Hardware correlations.** Real devices obey constraints that fake data ignores: a high-end discrete GPU rarely pairs with only 2 GB of memory, and an Apple GPU never reports Win32 as its platform. These joint patterns (which values realistically go together) are exactly what clustering measures.
### Why clustering is hard to beat
**Replaying a real fingerprint fails** — the same fingerprint coming from many IPs is an obvious farm, and it must still match the request's TLS/JA3 fingerprint (TLS is the encryption layer behind https; JA3 is a signature of how the client negotiates it), which most HTTP clients cannot reproduce. **Generating random valid fields fails** — each field constrains the others, so a coherent profile requires a database of real devices. **Enumerating every combination fails** — the theoretical space runs to millions, but only a few thousand combinations actually appear regularly, so the rest stand out. The durable approach is to present one internally-consistent fingerprint from a real browser on real hardware — for example a deeply patched build like Camoufox — rather than assembling fields at runtime.
### Example
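```python
# How a server might model real fingerprints and score new ones.
# vectorize(), fingerprint_db and THRESHOLD are illustrative placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one real visitor's fingerprint, vectorised in this order:
FIELDS = ["canvas_id", "gpu_vendor", "screen_w", "screen_h",
          "cores", "memory_gb", "font_count", "tz_offset"]
THRESHOLD = 0.0  # decision_function() goes negative for outliers

def vectorize(fingerprint):
    # Assumes every field has already been encoded as a number.
    return [float(fingerprint[name]) for name in FIELDS]

def train(fingerprint_db):
    # fingerprint_db: an iterable of genuine visitors' fingerprints.
    X_real = np.array([vectorize(fp) for fp in fingerprint_db])
    return IsolationForest(contamination=0.01).fit(X_real)

def verify(model, fingerprint):
    score = model.decision_function([vectorize(fingerprint)])[0]
    # Low / negative score = far from the real population = likely fake
    return "accept" if score > THRESHOLD else "reject"
```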
### FAQ
**Q: How is fingerprint clustering different from lie detection?**
Lie detection checks a single fingerprint for internal contradictions and signs of tampering (for example a patched native function, or a Windows User-Agent paired with Linux fonts). Clustering instead compares the fingerprint against the statistical pattern of millions of real devices and rejects it if it falls outside every known cluster. The two are complementary: a fingerprint can be perfectly self-consistent yet still be a combination no real device has ever produced.
**Q: Can I beat clustering by copying a real fingerprint?**
Not reliably. The same fingerprint arriving from many IPs, or at impossible request rates, is an obvious bot farm — and it must still match the TLS/JA3 fingerprint of the request. Replaying one captured profile gets the device coherence right but fails the cross-signal and rate checks.
**Q: What inputs feed a clustering model?**
Typically the canvas/WebGL hash, GPU vendor and tier, screen dimensions and colour depth, CPU cores, device memory, installed font count, timezone offset, platform, and touch points — turned into a numeric vector so the distance to known clusters can be measured.
---
## How to Build an Anti-Bot Challenge
URL: https://scrappey.com/qa/anti-bot/build-anti-bot-challenge
**An anti-bot challenge is a small test a server makes your browser run — like proof-of-work (forcing the browser to burn some CPU on a puzzle), collecting a fingerprint (a profile of your browser and device), or watching how you behave — to tell real browsers apart from automated scripts before letting them in.** Building a good one is less about clever cryptography and more about four design rules: check every signal on the server (never trust the client's word for it), tie each proof to one session, IP, and time window, assume the challenge runs on a machine the attacker controls, and base the work on something only a genuine browser can produce. Most home-built challenges fail on day one because they trust the client and verify almost nothing.
### Quick facts
- **Challenge types:** Proof-of-work, fingerprint hash, behavioural timing
- **Core principle:** Every signal must reach and influence a server-side decision
- **Most common failure:** Client-collected fingerprint data the server never checks
- **Realistic goal:** Raise automation cost 10–100×, not absolute prevention
### Why most homegrown challenges fail
Here is a real, working example: a custom Proof-of-Work (PoW) system — a puzzle that costs the browser CPU time to solve — guarding a sign-up flow. On paper it looks solid: three endpoints and a heavily obfuscated Web Worker (a background script the page runs off the main thread):
- GET /api/pow/worker — returns a 178 KB obfuscated JavaScript Web Worker.
- GET /api/pow/challenge — returns a base64 challenge blob, fetched by the worker internally.
- POST /api/pow/verify — accepts {challengeId, solution} and returns {"success":true}.
The worker talks to the page using typed postMessage messages (the browser's way to pass data between a page and its worker): the page sends {t:"start", o: origin}, the worker asks for browser fingerprints (canvas frames, DOM text-measurement rects, WebGL info, performance values, a navigator string), and returns a 12-byte solution. The obfuscation is real — a Function() wrapper, randomised identifiers, a scrambled lookup table, and control-flow flattening that hides the order code runs in.
It still fell to a script that used no browser at all. The attacker downloaded the worker, ran it under Node.js with eval(), faked the self, postMessage, onmessage and fetch objects the worker expected, and fed it made-up data shaped to look like real fingerprints. The worker fetched the challenge through the faked fetch, computed the solution, and the script POSTed it back using an HTTP client that copied a Chrome TLS fingerprint (TLS is the encryption layer behind https, and its handshake leaves a recognisable signature) — start to finish in under a second. The obfuscation was irrelevant; the *design* had nothing underneath it. Every principle below comes from why that worked.
### Principle 1: never trust data the server never validates
Here is the fatal flaw: the verify endpoint only checked {challengeId, solution}. All that carefully collected fingerprint data was never sent to the server, so it was never checked. That meant the solver could send solid-colour rectangles instead of real canvas renders, a hardcoded GPU string, and fixed screen dimensions. The server can't object to data it never sees.
**If a signal does not reach the server and change the decision, it does not exist.** So make the fingerprint part of the proof itself:
```text
solution = solve(challenge, sha256(canvasFrames + domRects + webglInfo + perfValues + browserMeta))
```
The server records the expected fingerprint hash when it hands out the challenge, then recomputes it at verify time, so fake inputs simply produce a wrong solution. Then cross-check the pieces against each other: if the metadata claims an NVIDIA GPU, the WebGL renderer must also say NVIDIA; if it reports 4 cores, the solve time should match what a 4-core machine would do. This ties directly into fingerprint clustering: fields that contradict each other are free bot signals.
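One way to make the fingerprint part of the proof is a hash-based proof-of-work whose input includes the fingerprint hash. This is a sketch under stated assumptions, not the scheme from the example site: the leading-zero-bits puzzle, the 16-bit difficulty and every function name are choices made for illustration.

```python
import hashlib

DIFFICULTY_BITS = 16  # assumed difficulty; real systems tune this


def fingerprint_hash(parts: list[bytes]) -> bytes:
    """Hash the collected signals; length prefixes keep the parts unambiguous."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


def meets_target(challenge: bytes, fp_hash: bytes, nonce: int) -> bool:
    digest = hashlib.sha256(challenge + fp_hash + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0


def solve(challenge: bytes, fp_hash: bytes) -> int:
    """Client side: search for a nonce that satisfies the puzzle."""
    nonce = 0
    while not meets_target(challenge, fp_hash, nonce):
        nonce += 1
    return nonce


def verify(challenge: bytes, fp_hash: bytes, nonce: int) -> bool:
    """Server side: recompute fp_hash from the signals it received, then check."""
    return meets_target(challenge, fp_hash, nonce)
```

Because the fingerprint hash is an input to the puzzle, the server has to see the signals to verify the answer, so they can no longer be collected and then silently dropped.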
### Principle 2: bind the challenge to session, IP, and time
In the example, the solution was just a standalone {challengeId, solution} pair, tied to nothing — no session, no IP, no TLS fingerprint, no expiry. The script even fetched the challenge and submitted the verify from totally different places. That opens three easy attacks: solve once and replay it, run a solve farm that hands answers to many clients, and submit today's solution tomorrow.
- Issue the challenge against a session cookie set when the page loads, and require that same session for /challenge, /verify and the protected action.
- Sign an HMAC (a tamper-proof checksum keyed with a server secret) over (challengeId, sessionId, IP, timestamp) and check all four on verify.
- Expire challenges in 30–60 seconds and rate-limit how many each IP/session can request.
- Require the JA3/JA4 TLS fingerprint of the challenge request to match the verify request.
A solution should prove that *this client, on this connection, right now* did the work — not be a portable token anyone can pick up and reuse.
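A minimal sketch of the HMAC binding using Python's standard library. The secret, the field separator and the function names are placeholders, and the sketch assumes none of the fields contains the | character.

```python
import hashlib
import hmac
import time

SERVER_SECRET = b"replace-with-a-random-server-side-key"  # placeholder
MAX_AGE_SECONDS = 60


def _tag(challenge_id: str, session_id: str, ip: str, issued_at: int) -> str:
    message = "|".join([challenge_id, session_id, ip, str(issued_at)]).encode()
    return hmac.new(SERVER_SECRET, message, hashlib.sha256).hexdigest()


def issue(challenge_id: str, session_id: str, ip: str, now=None) -> dict:
    issued_at = int(time.time() if now is None else now)
    return {"challengeId": challenge_id, "issuedAt": issued_at,
            "tag": _tag(challenge_id, session_id, ip, issued_at)}


def check(token: dict, session_id: str, ip: str, now=None) -> bool:
    now = time.time() if now is None else now
    expected = _tag(token["challengeId"], session_id, ip, token["issuedAt"])
    if not hmac.compare_digest(expected, token["tag"]):
        return False  # wrong session, wrong IP, or tampered fields
    return 0 <= now - token["issuedAt"] <= MAX_AGE_SECONDS  # reject expired tokens
```

The session ID and IP are taken from the verifying request itself, never from the token, so a solution lifted from one client fails when another client submits it.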
### Principle 3: assume the challenge runs in a hostile environment
The worker had no idea whether it was actually inside a real browser. Run under Node.js eval(), it had full access to global, require, process and module, and the attacker just swapped in a fake fetch. Obfuscation slowed down *reading* the code; it did nothing to stop *running* it. So probe the environment from inside the obfuscated worker and fold the result into the math (don't simply throw an error — a thrown error is trivial to patch out):
```javascript
// Present in real Workers, absent or different under Node.js eval
if (typeof WorkerGlobalScope === 'undefined') corrupt();
if (typeof importScripts !== 'function') corrupt();
if (typeof process !== 'undefined') corrupt();
if (typeof require !== 'undefined') corrupt();
```
Then make browser-only APIs essential to the calculation (crypto.subtle.digest(), high-resolution performance.now(), SharedArrayBuffer/Atomics) so an attacker can't skip emulating them. Finally, keep the code moving: embed a per-session nonce (a one-time random value) when you serve it, and regenerate the variable names and control flow on every request, so a cached eval()-based solver breaks the instant the script changes shape.
### Principle 4: root the proof in real-browser work
Every fingerprint input in the example could be forged with plain arithmetic. The canvas drawing was a pure formula, so the solver produced identical bytes without ever touching the Canvas API. The DOM rects measured a fixed string in a fixed font — always the same numbers. WebGL info was just a string. To force real hardware into the loop, lean on things that genuinely vary by physical device, and always send the server a *hash*, never the raw values:
- Seed canvas rendering with a per-challenge random value the server knows; use globalCompositeOperation, shadowBlur and system fonts to amplify the tiny GPU anti-aliasing differences between devices, then check the output hash against known-good GPU families.
- Prefer WebGL shader output — its hardware floating-point precision is hard to fake without the real GPU.
- Randomise font family and size per challenge, and use proportional fonts so the text widths can't be precomputed.
- Chain the frames: frame N's input depends on the hash of frame N-1's output, so the work can't be split across CPUs or shortcut.
A good challenge doesn't just collect a fingerprint — it forces one that has to land inside a real device cluster.
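The frame-chaining idea from the list above can be sketched as a hash chain. This is a minimal Python sketch under stated assumptions, not a real renderer: `render_frame` is a hypothetical stand-in that only hashes its input, where a real challenge would draw to a canvas or run a WebGL shader, and the function names are invented for illustration.

```python
import hashlib

def render_frame(frame_input: bytes) -> bytes:
    """Hypothetical stand-in for real canvas/WebGL work: returns 'pixel' bytes."""
    return hashlib.sha256(b"render:" + frame_input).digest()

def chained_proof(seed: bytes, frames: int) -> str:
    """Frame N's input is derived from the hash of frame N-1's output,
    so the frames have to be produced strictly in order."""
    state = hashlib.sha256(seed).digest()
    for index in range(frames):
        pixels = render_frame(state + index.to_bytes(4, "big"))
        state = hashlib.sha256(pixels).digest()
    return state.hex()  # only this hash is sent, never the raw pixels

proof_a = chained_proof(b"server-seed-1", frames=8)
proof_b = chained_proof(b"server-seed-2", frames=8)
```

Because each step consumes the previous step's hash, frame 8 cannot start before frame 7 has finished, and a different server seed yields an unrelated final hash.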
### Principle 5: detect automation where it can’t be read, and use timing
The example's automation checks (navigator.webdriver, plus a User-Agent blocklist for selenium/puppeteer/playwright) lived in the readable page code, not the worker, so the solver never ran them. Two fixes matter:
- Move detection into the obfuscated worker and make its result feed the calculation, so it can't be skipped by running the worker on its own.
- Use behavioural timing: a real browser takes 5–50 ms to render canvas, read DOM rects and query WebGL; the solver answered in under 1 ms. Reject impossibly fast responses, capture performance.now() before and after the PoW, and add a server-side fence — too fast (< 200 ms after issuance) is a bot, too slow (> 60 s) is a replay.
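The server-side timing fence described above might look like the following sketch. The two thresholds come from the text; the function name and the verdict strings are assumptions made for illustration, and both timestamps are assumed to be taken from the server's own clock.

```python
MIN_SOLVE_SECONDS = 0.2    # from the text: under 200 ms after issuance looks automated
MAX_SOLVE_SECONDS = 60.0   # from the text: over 60 s looks like a replay

def timing_verdict(issued_at: float, received_at: float) -> str:
    """Classify a solve by the time elapsed between issuance and submission."""
    elapsed = received_at - issued_at
    if elapsed < MIN_SOLVE_SECONDS:
        return "too_fast"
    if elapsed > MAX_SOLVE_SECONDS:
        return "too_slow"
    return "ok"
```

Measuring on the server matters here: any timestamp the client reports, including its own performance.now() readings, can be forged, so those are best treated as one more signal to cross-check rather than as the fence itself.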
### Principle 6: cross-validate every signal
On its own each signal can be spoofed; combined with cross-checks, faking them all consistently gets expensive:
- The User-Agent inside the fingerprint must match the User-Agent HTTP header on verify, and sec-ch-ua-platform must match the reported platform.
- The timezone must be plausible for the location of the client's IP.
- If the same session reports different screen sizes or languages from one request to the next, flag it.
- Mix in genuine per-request randomness (crypto.getRandomValues(), the collection-time Date.now()) so no two submissions are byte-for-byte identical.
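A rough sketch of the first and third cross-checks above. The field names (`userAgent`, `platform`, `screen`, `language`) are hypothetical, header keys are assumed to be lower-cased already, and the fingerprint's `platform` is assumed to be normalised to the same vocabulary as the client hint (for example "Windows"). The timezone check is left out because it needs a GeoIP database.

```python
def cross_check(fingerprint, headers, previous=None):
    """Return a list of contradictions between a fingerprint, the HTTP
    headers on the verify request, and the session's previous fingerprint."""
    problems = []

    if fingerprint.get("userAgent") != headers.get("user-agent"):
        problems.append("user-agent mismatch")

    # sec-ch-ua-platform arrives as a quoted string, e.g. "Windows"
    hinted = headers.get("sec-ch-ua-platform", "").strip('"')
    if hinted and hinted.lower() != fingerprint.get("platform", "").lower():
        problems.append("platform mismatch")

    if previous is not None:
        for field in ("screen", "language"):
            if previous.get(field) != fingerprint.get(field):
                problems.append(f"{field} changed within session")

    return problems
```

Returning a list rather than a boolean lets the caller weight each contradiction instead of blocking on the first one.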
### The architecture worth building
The robust version pulls all the principles into a single flow:
- **Issue** returns a signed (challengeId, seed, timestamp, sessionId) bundle, tied to a session cookie.
- **The worker** uses the seed to drive non-deterministic canvas/WebGL/font operations and probes its environment.
- **The worker** hashes all the fingerprint data and folds that hash into the PoW calculation.
- **The solution** is (challengeId, solution, fingerprintHash).
- **Verify** checks that the challenge is valid and unexpired, the session matches, the fingerprint hash fits known-good hardware, the solve time matches the reported core count, the IP and TLS fingerprint match the ones at issuance, and the solution is correct for challenge + fingerprintHash.
If you're building one today, this is the order that buys the most security per unit of effort:
| Priority | Move | Why it matters |
| --- | --- | --- |
| P0 | Embed the fingerprint hash in the solution | Breaks every fake-input solver instantly |
| P0 | Bind to session + IP + time | Kills replay, farming and cross-context solving |
| P1 | Environment probes inside the worker | Detects eval() outside a browser |
| P1 | Per-challenge canvas/font seed | Ends deterministic, precomputed fingerprints |
| P2 | Move detection into the worker | Can’t be skipped by running the worker directly |
| P2 | Server-side timing fence | Catches sub-millisecond “instant” solves |
| P3 | WebGL shader-output verification | Forces a real GPU into the loop |
### What good looks like
A well-built anti-bot challenge has a few non-negotiable properties: every signal it collects feeds a server-side decision; the proof is locked to one client, one connection and one short time window; the challenge code assumes a hostile runtime and makes running it outside a browser expensive; and the work depends on physical-device behaviour rather than arithmetic anyone can reproduce. Obfuscation is the last and least important layer — it buys time, not security.
The custom PoW in the example wasn't beaten by clever cryptanalysis. It was beaten because it trusted the client, validated almost nothing, ran in an environment the attacker controlled, and built its proof out of forgeable arithmetic. Fix those four things and you have a challenge worth the bytes it ships. To see how production vendors put these ideas to work at scale, see Cloudflare Bot Management and anti-bot detection.
### FAQ
**Q: Is proof-of-work enough to stop bots?**
No. Proof-of-work only shows that a client spent some CPU time, and a server or solve farm can do that cheaply. It's useful as one layer, but on its own it proves neither a real browser nor a real user. Tie it to a session, validate fingerprints on the server, and add timing checks.
**Q: Should fingerprint data be validated on the client or the server?**
Always the server. Any check that runs only on the client can be skipped by running the challenge code outside a browser. The classic mistake is collecting rich fingerprint data in the browser and then never sending it to the server to validate.
**Q: Why did obfuscating the worker not protect it?**
Obfuscation only slows down reading the code. An attacker doesn't need to read it — they run it with eval() and fake browser globals, watching the messages to learn the protocol. Security has to come from the design (server-side validation, session binding, real-hardware work), not from making the code hard to read.
**Q: Can a custom challenge ever fully stop automation?**
No design is unbeatable — a real browser farm can solve almost anything. The realistic goal is to raise the cost 10–100×, pushing attackers from a cheap script into running full browsers with real GPUs on rotating residential IPs, which is slow, expensive, and itself easier to detect.
---
## What Is JA4 Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-ja4-fingerprinting
**JA4 is a way to identify a browser by the fingerprint of its TLS handshake — TLS being the encryption layer behind https. It replaced the older JA3 method after Chrome started randomising the order of its TLS extensions.** When a browser opens a secure connection it sends a list of supported settings (ciphers, extensions, and so on). JA3 turned that list into a single hash using the exact order the browser sent it. The problem: in Chrome 110, Chrome began shuffling that order on every connection, so one browser produced billions of different JA3 hashes — making the fingerprint useless for blocking. JA4 fixes this by *sorting* the ciphers and extensions before hashing, so every connection from the same client produces one stable value. It is the lead member of the JA4+ suite and, by 2026, the de-facto standard at Cloudflare, Akamai, and AWS.
### Quick facts
- **Replaces:** JA3 — broken by Chrome’s randomised extension order since Chrome 110
- **Key fix:** Sorts ciphers + extensions before hashing → one stable hash per client
- **Format:** Human-readable prefix + two truncated SHA-256 hashes (e.g. t13d1516h2_8daaf6152771_b186095e22b6)
- **JA4+ suite:** JA4 (TLS), JA4S (server), JA4H (HTTP), JA4L (latency), JA4X (cert), JA4SSH
- **Adopted by:** Cloudflare, Akamai, AWS, VirusTotal (industry standard in 2026)
### Why JA3 broke and JA4 was needed
Every TLS connection starts with a Client Hello — the opening message where the browser lists what it supports. JA3 built its hash by taking five things from that message (the TLS version, cipher suites, extensions, elliptic curves, and curve formats) and stringing them together *in the exact order the client sent them*, then running MD5 over the result (MD5 is just a function that turns any input into a fixed-length code). For years a given browser always sent these in the same order, so JA3 made an excellent key for blocklisting.
That broke in Chrome 110 (early 2023), when Google shipped **TLS extension-order randomisation**. The goal was anti-ossification: stopping middleboxes (network devices that inspect traffic in transit) from hard-coding assumptions about Chrome's traffic. Chrome therefore shuffles the order of its extensions on every connection. Overnight, a single Chrome install began producing a different JA3 hash on nearly every request, effectively billions of values. Blocklisting a JA3 became pointless. Worse, the tables turned: a scraper that always sent a *fixed* extension order now stuck out, because real Chrome was constantly shuffling.
JA4 solves this by **sorting the cipher and extension lists before hashing**. Once everything is sorted, order no longer matters, so Chrome's randomisation collapses back to a single stable JA4. The cost of throwing away order information is paid back elsewhere in the JA4+ suite.
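The effect of sorting is easy to demonstrate. The sketch below is a simplification, not the real JA3 or JA4 algorithm: it hashes only an illustrative list of extension IDs, whereas the real fingerprints cover more fields and use specific encodings. The list and the function names are assumptions made for the example.

```python
import hashlib
import random

# Illustrative extension IDs; a real Client Hello carries its own set.
EXTENSIONS = [0, 5, 10, 11, 13, 16, 18, 23, 27, 35, 43, 45, 51, 65281]

def order_sensitive_hash(extensions):
    """JA3-style: hash the values in the order they were sent."""
    return hashlib.md5("-".join(map(str, extensions)).encode()).hexdigest()

def order_insensitive_hash(extensions):
    """JA4-style: sort first, then keep 12 hex characters of SHA-256."""
    ordered = sorted(extensions)
    return hashlib.sha256(",".join(map(str, ordered)).encode()).hexdigest()[:12]

rng = random.Random(7)  # fixed seed so the demo is repeatable
ja3_like, ja4_like = set(), set()
for _ in range(50):
    shuffled = EXTENSIONS[:]
    rng.shuffle(shuffled)  # simulate Chrome's per-connection shuffle
    ja3_like.add(order_sensitive_hash(shuffled))
    ja4_like.add(order_insensitive_hash(shuffled))

print(len(ja3_like), "order-sensitive hashes vs", len(ja4_like), "sorted hash")
```

Fifty simulated connections produce many distinct order-sensitive hashes but a single sorted one, which is the property that makes the sorted value usable as a match key again.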
### How a JA4 fingerprint is built
JA3 was one opaque MD5 string you could not read. JA4 is deliberately **human-readable in three parts**, joined by underscores:
- **Prefix (part a).** A readable summary: protocol (t for TLS over TCP, q for QUIC), TLS version (13 = 1.3), whether a hostname was sent (d = domain, i = IP), a two-digit count of ciphers, a two-digit count of extensions, and the first ALPN value abbreviated to its first and last characters (ALPN is the protocol the client wants to speak; h2 means HTTP/2). Example: t13d1516h2.
- **Hash B.** The first 12 hex characters of a SHA-256 hash over the *sorted* cipher-suite list.
- **Hash C.** The first 12 hex characters of a SHA-256 hash over the *sorted* extension list plus the signature algorithms.
The full value looks like t13d1516h2_8daaf6152771_b186095e22b6. An analyst can read the prefix at a glance (TLS 1.3, 15 ciphers, 16 extensions, HTTP/2) without decoding anything, while the two hashes act as the precise match key.
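Because the prefix has fixed-width fields, splitting a JA4 string into its parts takes only a few lines. The parser below is a minimal sketch that follows the field layout described above; the function name and the dictionary keys are assumptions made for illustration.

```python
def parse_ja4(fingerprint: str) -> dict:
    """Split a JA4 string into its readable prefix fields and two hashes."""
    prefix, cipher_hash, extension_hash = fingerprint.split("_")
    return {
        "transport": {"t": "TCP", "q": "QUIC"}.get(prefix[0], "unknown"),
        "tls_version": prefix[1:3],              # "13" means TLS 1.3
        "sni": {"d": "domain", "i": "ip"}.get(prefix[3], "unknown"),
        "cipher_count": int(prefix[4:6]),
        "extension_count": int(prefix[6:8]),
        "alpn": prefix[8:10],                    # "h2" means HTTP/2
        "cipher_hash": cipher_hash,              # SHA-256 of sorted ciphers, 12 hex
        "extension_hash": extension_hash,        # sorted extensions + sig-algs, 12 hex
    }

parsed = parse_ja4("t13d1516h2_8daaf6152771_b186095e22b6")
print(parsed["cipher_count"], parsed["extension_count"], parsed["alpn"])
```

Parsing the prefix separately is useful in practice: two fingerprints with different hashes but the same prefix often point at close variants of the same client.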
### The JA4+ suite — fingerprinting the whole connection
JA4 on its own only fingerprints the TLS Client Hello. The real strength comes from the **+ suite**, a set of related fingerprints for other layers of the connection that are cross-checked against each other for consistency:
- **JA4S** — the server's response (which cipher it picked, which extensions).
- **JA4H** — the HTTP layer: request method, version, the order of headers, whether a cookie and referer are present, accept-language. This is what catches a client that gets the TLS JA4 right but sends Python-shaped HTTP headers.
- **JA4L** — latency, meaning round-trip timing, used to estimate physical distance and spot proxy hops in between.
- **JA4X** — a fingerprint of the X.509 certificate (the document that proves a server's identity).
- **JA4SSH** — a fingerprint of an SSH session.
A scraper is scored on all the relevant members at once. Matching Chrome's JA4 while failing JA4H is the single most common giveaway — it happens whenever a library wraps a Chrome-impersonating TLS stack around its own, non-Chrome HTTP implementation.
### What it means for scrapers
The default Python ssl / requests stack produces a JA4 that no browser ever sends, which means an instant block at any JA4-aware vendor. The fix is a TLS library that **copies a real browser's Client Hello byte-for-byte**: curl_cffi (libcurl + BoringSSL with Chrome presets), tls-client, or rustls-based impersonators. These reproduce the exact cipher list, extensions, supported groups, and ALPN so that, once sorted, the JA4 matches a current Chrome.
Two traps remain. First, **cross-layer coherence**: the JA4H has to match too, and most HTTP clients get this wrong (header order, capitalisation, the order of pseudo-headers). Second, **version drift**: a JA4 frozen to Chrome 120 while live Chrome is on 133 becomes an anomaly of its own, so impersonation presets need regular updating. The most reliable approach is to drive a real browser engine, or use a tool that keeps its presets current, rather than hand-building a Client Hello. It is the same coherence problem clustering exploits, just pushed down to the network layer.
### Example
```text
# A JA4 fingerprint, decoded.
t13d1516h2_8daaf6152771_b186095e22b6
││ ││ │ │  │            └─ SHA-256 of sorted extensions + sig-algs (12 hex)
││ ││ │ │  └────────────── SHA-256 of sorted cipher suites (12 hex)
││ ││ │ └─ ALPN : h2 (HTTP/2)
││ ││ └─ ext count : 16
││ │└─ cipher count : 15
││ └─ SNI : d (server name = domain)
│└─ TLS version : 13 (TLS 1.3)
└─ transport : t (TCP; q = QUIC)
# JA3 hashed the extensions in send-order, so Chrome's randomisation
# produced a new MD5 every connection. JA4 sorts first -> one stable value.
# Python's default ssl stack yields a JA4 no browser sends == instant block.
```
### FAQ
**Q: Is JA4 better than JA3 at catching bots?**
For modern Chrome traffic, yes. JA3 is effectively dead because Chrome randomises extension order, so a single browser yields billions of JA3 hashes. JA4 sorts the list before hashing, so it stays stable — which makes it usable again as a blocklist key. JA4 is also more informative: its readable prefix spells out the TLS version, the cipher and extension counts, and the ALPN protocol, and the JA4+ suite extends fingerprinting to the HTTP, QUIC, latency, and certificate layers.
**Q: Can curl_cffi or tls-client beat JA4?**
They can match the TLS-layer JA4, because they copy a real Chrome Client Hello byte-for-byte. What they frequently fail is the HTTP-layer JA4H — header order, capitalisation, and pseudo-header order rarely match the browser they claim to be. JA4 and JA4H are scored together, so matching one while failing the other is itself the signal that gives you away.
**Q: Does randomising my own TLS fingerprint help?**
No. Real Chrome does randomise its extension order, but JA4 sorts that away — so after sorting, every real Chrome lands on the exact same JA4. A scraper that randomises its ciphers or supported groups just produces JA4 values that no real client ever sends, which makes it stand out more, not less. The goal is to match one current browser exactly, not to vary.
---
## What Is Residential Proxy Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-residential-proxy-detection
**Residential proxy detection is how anti-bot systems spot traffic that is being routed through a residential proxy pool — a network of IP addresses that belong to real home internet connections.** The problem for defenders is that the visible IP looks completely legitimate, so a simple blocklist of "bad" IPs does not catch it. Instead, detection focuses on the things a proxy pool cannot hide: too many sessions coming from one IP at once, the tell-tale statistical pattern of thousands of home IPs hitting the same page together, the extra delay added by the proxy hop, and the IP showing up in commercial lists of known proxy exit points.
### Quick facts
- **Problem it solves:** Residential IPs borrow real-user reputation, so IP blocklists don’t work
- **Velocity tell:** A home IP with hundreds of concurrent sessions = proxy gateway
- **Pool signature:** Many residential ASNs hitting one endpoint at once = proxy/botnet pattern
- **Latency tell:** Extra round-trip hops; RTT inconsistent with the claimed geolocation
- **Data sources:** IP reputation DBs, proxy-exit feeds (Spur, IPQS), ASN classification, ML
### Why a residential IP is not enough cover
The whole point of a residential proxy is that the exit IP — the address the website actually sees — belongs to a real ISP customer. It inherits that customer's clean reputation and a residential ASN (the network block an internet provider owns). Against a system that only asks *"is this a datacenter IP?"*, that works perfectly — which is why scrapers pay more for residential than for datacenter proxies.
But modern detection does not stop at the IP type. It asks a harder question: *"does this connection behave like one person on one home connection, or like an exit node serving many sessions at once?"* The IP's reputation is genuinely real. The **usage pattern around it** is not — and that is what gives the proxy away.
### The signals that expose a proxy pool
**Per-IP velocity and concurrency.** A real home connection runs a handful of sessions at a time. A proxy exit relaying traffic for many customers shows hundreds of connections at once — many separate cookie jars, or many distinct browser fingerprints — all from one IP in a short window. That fan-out is the single strongest tell, and rotating IPs cannot fix it, because it is a property of the gateway (the shared entry point all that traffic flows through), not of any one IP.
**Pool-level statistics.** Anti-bot vendors that see traffic across the whole internet (Cloudflare, Akamai) watch for the signature of a *pool*: a sudden spike of requests to one sensitive page arriving from a large, varied set of residential ASNs all at once. No natural event produces that pattern; a rotating proxy campaign does. ML models are trained directly on this cross-IP pattern.
**Latency and round-trip analysis.** Routing through a proxy adds extra network hops and delay. **JA4L** and similar timing probes measure the round-trip time (RTT — how long a packet takes to go out and come back) and compare it against what the IP's claimed location should produce. An IP that geolocates to Berlin but answers with São Paulo-level delay is clearly being relayed through somewhere else.
**Reputation and exit feeds.** Commercial services (Spur, IPQS, IP2Proxy) continuously map out proxy-exit IPs by buying access to the pools themselves and recording which IPs serve their traffic. Many exit nodes also leave a proxy port open, which active scanning can detect.
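The per-IP concurrency signal is the simplest of these to sketch. The example below assumes a log of `(timestamp, ip, session_id)` tuples and uses fixed time buckets; the 60-second window and the threshold of 5 sessions are hypothetical values, and the IP addresses come from documentation ranges.

```python
from collections import defaultdict

def flag_gateway_ips(events, window_seconds=60, max_sessions=5):
    """Return IPs that show more than max_sessions distinct sessions
    inside any single fixed time window."""
    sessions = defaultdict(set)
    for timestamp, ip, session_id in events:
        bucket = int(timestamp // window_seconds)
        sessions[(ip, bucket)].add(session_id)
    return sorted({ip for (ip, _), seen in sessions.items() if len(seen) > max_sessions})

# A home connection with two sessions, and an exit node fanning out 300.
events = [(t, "203.0.113.7", f"home-{t % 2}") for t in range(30)]
events += [(t % 60, "198.51.100.9", f"proxy-{t}") for t in range(300)]
print(flag_gateway_ips(events))
```

A real system would likely use sliding windows and count distinct browser fingerprints as well as cookie jars, but the shape of the check is the same: fan-out per address, not the reputation of the address.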
### Coherence checks layered on top
Once an IP is suspected of being a proxy exit, vendors cross-check it against the rest of the request — the same way lie detection checks a fingerprint for contradictions. The IP's location should match the browser's timezone, the Accept-Language header (the languages the browser says it prefers), and the locale that JavaScript reports. A residential IP in Japan paired with Europe/London and en-US does not add up — a sign the proxy was bolted onto a browser profile that was generated somewhere else.
This is why buying clean residential IPs is necessary but not sufficient. The whole identity around the IP — timezone, language, fingerprint, and how often requests arrive — has to look like one real person living behind that connection.
### How to avoid tripping it
**Keep per-IP velocity low.** Treat each exit IP like a single human: limit how many sessions run at once, space requests out at a human pace, and use sticky sessions so one identity stays on one IP instead of jumping to a new one mid-task.
**Match geo to identity.** Make the proxy's location agree with the browser's timezone, language, and locale. Mobile proxies make detection even harder, because carrier-grade NAT (where the carrier puts thousands of real phone users behind one shared IP) means high concurrency from a single IP is normal — so velocity heuristics carry much less weight.
**Use clean, well-sourced pools.** Cheap pools recycle IPs that already sit in every exit feed. A managed scraping API ties proxy quality, sensible rotation, and fingerprint coherence together, so the proxy and the browser identity match each other by design rather than being hand-assembled and hoping they line up.
### Example
```text
# What separates a real home connection from a residential proxy exit.
real home IP              residential proxy exit
────────────              ───────────────────────
1-3 sessions              300+ concurrent sessions
1 browser fingerprint     many distinct fingerprints / IP
RTT matches geo           RTT inflated by proxy hop (JA4L)
tz + lang match IP geo    tz/lang often mismatched to IP
not in any exit feed      listed by Spur / IPQS / IP2Proxy
# Rotating IPs hides none of this: velocity, pool statistics, and the
# exit-feed listing are properties of the gateway, not the address.
# Lowering per-IP concurrency and matching geo->timezone->language
# defeats more checks than buying "better" IPs alone.
```
### FAQ
**Q: If the IP is a real residential address, how can it be detected as a proxy?**
The IP's reputation is real, but the behaviour around it is not. A real home connection runs a few sessions; a proxy exit relays hundreds of sessions at once, each with a different fingerprint. Detection also uses network-wide pool statistics (many residential ASNs hitting one page at the same moment), the extra delay added by the proxy hop, and commercial exit-node feeds that record which residential IPs are serving proxy traffic.
**Q: Does rotating IPs more aggressively change how these signals appear?**
Usually not. The session volume, the pool-wide traffic pattern, and the exit-feed listing are all properties of the gateway, not of any single IP, so swapping addresses does not change them. Rotating mid-session also breaks the consistency a real user would have. For authorized scraping, what keeps these signals coherent is a low per-IP session count and one identity staying on one IP.
**Q: Are mobile proxies harder to detect than residential?**
Generally yes. Carrier-grade NAT puts thousands of real users behind a single mobile IP, so heavy traffic and frequent IP changes from one address are normal and expected. That weakens the velocity and pool checks that catch residential proxies — though geolocation, timezone, and language still have to be coherent.
---
## What Is Fingerprint Entropy?
URL: https://scrappey.com/qa/anti-bot/what-is-fingerprint-entropy
**Fingerprint entropy is a way to measure how much a browser attribute gives away about who you are, counted in bits.** Think of entropy as "how much this value narrows down the crowd." A signal that splits everyone into two equal halves is worth one bit; a signal that picks out a single browser from ~16 million carries about 24 bits. When you combine several attributes, their entropy roughly adds up — which is why just a handful of revealing signals (a canvas hash, your font list, your WebGL renderer) is enough to pin down one device. For scrapers the takeaway is counter-intuitive: a fingerprint that is *too unique* gives you away just as much as one that is internally inconsistent.
### Quick facts
- **Unit:** Bits of information: bits = -log₂(probability of the value)
- **Uniqueness threshold:** log₂(population) bits identifies one browser — ~24 bits for ~16M users
- **High-entropy signals:** Canvas/WebGL hash, font list, plugin set, screen + UA combination
- **Low-entropy signals:** Timezone, language, OS family — common, so little information each
- **Scraper trap:** Both too-unique and seen-on-many-IPs fingerprints are suspicious
### Entropy as bits of identifying information
The concept comes from information theory and was popularised for browsers by the EFF’s *Panopticlick* study. The key term is **self-information**: how surprising a particular value is. Its formula is -log₂(p), where p is the share of people who have that same value. The rarer the value, the higher the number. A value that half the population shares is worth 1 bit; a value only 1 in 1,000 people have is worth about 10 bits.
To single out one browser from a group of size *N*, you need log₂(N) bits. For ~16 million visitors that works out to about 24 bits; for the entire web, around 33. Panopticlick found the average browser already leaked roughly 18 bits — enough to make most browsers unique inside a large site’s traffic. And because the bits from independent attributes **add together**, combining even a few medium-strength signals crosses the uniqueness line quickly.
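The threshold arithmetic can be checked directly. The sketch below assumes an idealised, evenly spread population, which real traffic is not; the helper names are invented for the example.

```python
import math

def bits_to_identify(population: int) -> float:
    """Bits needed to single out one member of a population."""
    return math.log2(population)

def anonymity_set(population: int, leaked_bits: float) -> float:
    """Expected number of browsers sharing a fingerprint that leaks
    leaked_bits, under the idealised even-spread assumption."""
    return population / (2 ** leaked_bits)

print(bits_to_identify(2 ** 24))          # a site with ~16.8 million visitors
print(bits_to_identify(8_000_000_000))    # roughly the whole web
print(anonymity_set(2 ** 24, 18))         # crowd left after 18 leaked bits
```

Read the last line as the size of the crowd you still hide in: under these assumptions, 18 leaked bits shrink roughly 16.8 million visitors to a group of 64, and each additional independent bit halves that group again.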
### Which signals carry the most bits
Not all attributes reveal the same amount. **High-entropy** signals carry many bits because they differ a lot from one device to the next:
- Canvas and WebGL render hashes — these depend on your GPU, graphics driver, and OS, so tiny differences produce different results.
- Your installed font list — varies a lot from machine to machine.
- The exact combination of User-Agent, screen resolution, and plugins.
**Low-entropy** signals carry few bits because almost everyone shares them: timezone, primary language, OS family, colour depth. On their own they reveal little, but they still *add* to the running total and, more importantly, they are used in consistency checks. A low-entropy value that contradicts a high-entropy one (say, a reported macOS platform paired with a Windows-only font list) is itself a giveaway.
This is why clustering works: real devices only ever land on a limited set of high-entropy combinations, and an entropy budget tells the anti-bot vendor how confidently a given fingerprint maps to a single device.
### Why being too unique is a problem
A common but naive approach is to *randomise* the high-entropy signals — a fresh canvas hash, a shuffled font list, random WebGL noise on every request. This **maximises** entropy, and that is exactly the wrong move. Two problems follow:
**1. Impossible uniqueness.** A canvas hash that has never appeared on any real device sits outside every known cluster. It carries very high self-information, but it lives in a region of the space that no real hardware can produce — and that is precisely the kind of combination clustering rejects.
**2. Wrong spread across IP addresses.** A real fingerprint normally shows up on one device behind one IP. A scraper that *reuses* a single fingerprint across thousands of IPs makes one "device" appear behind thousands of addresses at once, a clear sign of a bot farm. A scraper that *randomises* on every request makes the population-wide entropy spike in a way real traffic never does. Either way, the distribution does not match real users.
### The blend-in target
The winning move is neither maximum nor minimum entropy — it is to land in a **high-population, low-self-information bucket**. In plain terms: present a common configuration that millions of real users already share (a popular Intel/Chrome/Windows profile, the default font set, a mainstream screen size) and keep it *stable for the whole session*. Low self-information means your fingerprint blends into the crowd; stability means it behaves like one consistent real device over time.
Doing this by hand is hard, because the attributes have to agree with each other — the same consistency problem behind lie detection and clustering. Tools that serve a coherent, common, real-hardware profile — such as Camoufox — aim for exactly this low-self-information, high-stability target instead of stitching together random values at runtime. Browser vendors are moving the same way for privacy reasons: Chrome’s User-Agent reduction and privacy-budget work deliberately shrink the entropy a page is allowed to read.
### Example
```python
import math

# Self-information of an attribute value: rarer value -> more bits.
def bits(p):  # p = fraction of population with this value
    return -math.log2(p)

# A few illustrative per-attribute shares and their information content:
attrs = {
    "timezone = Europe/Amsterdam": 0.02,    # ~5.6 bits (common-ish)
    "language = en-US": 0.30,               # ~1.7 bits (very common)
    "font list (this machine)": 0.001,      # ~10 bits (high entropy)
    "canvas hash (this machine)": 0.0005,   # ~11 bits (high entropy)
}

total = sum(bits(p) for p in attrs.values())
for name, p in attrs.items():
    print(f"{name:32} {bits(p):5.1f} bits")
print(f"{'COMBINED':32} {total:5.1f} bits")

# ~24 bits uniquely identifies one browser among ~16 million.
# Independent signals ADD, so a few high-entropy values cross that line.
# Randomising them doesn't hide you -- it pushes you OUTSIDE every real
# cluster, which is the anomaly clustering is built to catch.
```
### FAQ
**Q: How many bits does it take to uniquely identify a browser?**
Roughly log₂ of the population size. For a site with ~16 million distinct visitors, that is about 24 bits; across the whole web it is around 33. The EFF’s Panopticlick study found the average browser already leaks about 18 bits — enough to be unique within most large sites. And because independent signals add together, a few high-entropy values (canvas, fonts, WebGL) push you well past the threshold.
**Q: If high entropy identifies me, should I minimise my fingerprint entropy?**
You should minimise your *self-information* — present common values that millions of others also have — but not by stripping out or randomising signals. A missing or randomised canvas is high-entropy in the wrong way: it lands outside every real-device cluster and gets flagged. The goal is a common, coherent, stable profile that blends into the crowd, not a unique or empty one.
**Q: How does entropy relate to fingerprint clustering?**
They are two views of the same statistics. Entropy measures how much a single attribute narrows down the population; clustering measures whether the *combination* of attributes falls inside the region real devices actually occupy. A fingerprint can look ordinary on every individual field yet still sit outside every cluster if those fields are combined in a way no real hardware produces.
---
## What Is WebGPU Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-webgpu-fingerprinting
**WebGPU fingerprinting reads identifying data from the modern navigator.gpu API.** WebGPU is the newest browser standard for talking to your GPU (the graphics chip), and it gives away a lot about your hardware. Where the older WebGL standard mostly exposes a single renderer string (a short text label naming the GPU), WebGPU exposes a structured GPUAdapterInfo object plus dozens of numeric limits - the GPU's maximum capacities, such as max buffer size, max compute workgroup counts, and max texture dimensions - and a list of supported features. WebGPU can also run real compute shaders (small programs that do math on the GPU), so a probe can hash the floating-point output of a known calculation. That result reflects the actual silicon, which makes it a stronger, harder-to-fake signal than WebGL.
### Quick facts
- **Reads:** GPUAdapterInfo (vendor/architecture), ~40 numeric limits, feature list, compute output hash
- **Distinct from WebGL:** WebGPU exposes structured adapter limits + compute shaders, not just a renderer string
- **Headless tell:** No adapter at all (requestAdapter returns null) on GPU-less servers
- **Coherence trap:** Adapter limits must match the WebGL renderer and the claimed platform
- **Status:** Shipped in Chrome 113+; anti-bot vendors began scoring it in 2024-2025
### What WebGPU exposes that WebGL does not
WebGPU was designed for heavy compute work, so it reveals far more structured hardware detail than WebGL. A fingerprinting probe collects three layers:
- **Adapter info** - adapter.info returns vendor ("nvidia", "apple", "intel"), architecture ("turing", "apple-m", "gen-12lp"), and sometimes device and description strings.
- **Limits** - adapter.limits is an object of roughly 40 numeric ceilings: maxTextureDimension2D, maxBufferSize, maxComputeWorkgroupSizeX, maxStorageBufferBindingSize, and more. The exact set of numbers is tightly tied to the GPU model and its driver.
- **Compute output** - run a known compute shader over known inputs, read the result back, and hash it. The tiny floating-point rounding differences in the shader cores vary between vendors and architectures.
The combination carries more entropy (more identifying power) than WebGL, because the limits object alone encodes the GPU tier as about 40 correlated numbers rather than one string.
### Why headless servers fail WebGPU
On a server with no GPU, navigator.gpu.requestAdapter() resolves to null - there is simply no graphics card to describe. Real desktop Chrome on consumer hardware almost always returns a working adapter, so a null adapter on a request that claims to be a normal desktop user is a strong anomaly. Forcing software rendering instead (Chrome's Dawn/SwiftShader WebGPU backend, which imitates a GPU in plain code) returns an adapter whose vendor is reported as a software fallback and whose limits match no real GPU - an equally clear tell.
The fixes mirror the ones for WebGL: pass a real GPU through to the browser (xvfb plus a physical or virtual GPU), or patch the adapter at the engine level so the info, limits, and compute output all come from one coherent real-device profile. Spoofing adapter.limits from JavaScript is detectable, because the patched getter functions fail Function.prototype.toString() inspection (a check that prints a function's source to see whether it is still native code or has been tampered with).
### The WebGL/WebGPU coherence check
The decisive technique is cross-checking WebGPU against WebGL. Both APIs describe the same physical GPU, so their stories must agree. If WebGL reports *"Apple GPU"* but the WebGPU adapter vendor is *"nvidia"*, or WebGL reports an RTX 4070 while the WebGPU maxComputeWorkgroupStorageSize matches an integrated Intel chip, the request contradicts itself and gets blocked. A request that spoofs one API but leaves the other at its headless default is worse off than spoofing neither.
This is the same lesson as everywhere else in fingerprinting: a believable identity is a coherent cluster of values harvested from one real machine, not a pile of individually plausible spoofs. Engine-level tools (Camoufox, Chromium forks) that serve matched (WebGL, WebGPU) pairs per session are the only reliable defence.
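The cross-check described above can be sketched as a small server-side function. This is a minimal illustration, not any vendor's actual logic: the helper names webglVendorFamily and gpuCoherent and the substring rules are assumptions made for the example, and a production check would compare far more than the vendor family.
```javascript
// Hypothetical helper: reduce a WebGL unmasked-renderer string to a coarse vendor family.
function webglVendorFamily(renderer) {
  const r = String(renderer || '').toLowerCase();
  if (r.includes('nvidia') || r.includes('geforce') || r.includes('rtx')) return 'nvidia';
  if (r.includes('apple')) return 'apple';
  if (r.includes('intel')) return 'intel';
  if (r.includes('amd') || r.includes('radeon')) return 'amd';
  return 'unknown';
}

// Compare the WebGL story with the WebGPU adapter vendor.
// Returns 'coherent', 'mismatch', or 'unknown' (not enough data to judge).
function gpuCoherent(webglRenderer, webgpuVendor) {
  const family = webglVendorFamily(webglRenderer);
  const vendor = String(webgpuVendor || '').toLowerCase();
  if (family === 'unknown' || vendor === '') return 'unknown';
  return family === vendor ? 'coherent' : 'mismatch';
}
```
Called as gpuCoherent('Apple GPU', 'nvidia'), the sketch reports a mismatch, which is the contradiction the section describes.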
### Example
```javascript
// What an anti-bot script reads to fingerprint WebGPU
async function webgpuFingerprint() {
  if (!navigator.gpu) return 'no-webgpu'; // suspicious on modern Chrome
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return 'null-adapter'; // GPU-less headless server tell
  const info = adapter.info || {};
  const limits = {};
  for (const k in adapter.limits) limits[k] = adapter.limits[k]; // ~40 numbers
  // The result is cross-checked against the WebGL renderer - a mismatch is block-grade.
  return {
    vendor: info.vendor, // 'nvidia' | 'apple' | 'intel' | software-fallback
    architecture: info.architecture, // 'turing' | 'apple-m' | 'gen-12lp'
    features: [...adapter.features].sort().join(','),
    maxBufferSize: limits.maxBufferSize, // tightly bound to GPU model
    maxTexture2D: limits.maxTextureDimension2D
  };
}
```
### FAQ
**Q: Is WebGPU fingerprinting replacing WebGL fingerprinting?**
It is adding to WebGL, not replacing it. WebGL still ships in every browser and is checked first. WebGPU is used as a second, higher-entropy GPU signal and - more importantly - as a way to cross-check WebGL for consistency. Vendors that read both can catch tools that only harden one of them.
**Q: Can I disable WebGPU to avoid the fingerprint?**
On Chrome 113+ a missing navigator.gpu is increasingly odd for something claiming to be desktop Chrome, though it is still common enough today to be a soft hint rather than an outright block. Returning a null adapter looks more normal than removing the API entirely, but the safest setup is a coherent, real adapter.
**Q: Why is the limits object such a strong signal?**
Because it is about 40 numbers that all come from the same GPU and driver, so they are highly correlated. You cannot change one without making the whole set inconsistent with any real device. Spoofing a believable limits object means copying a complete one from a real GPU, not editing individual values.
---
## What Is Client Hints Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-client-hints-fingerprinting
**User-Agent Client Hints (UA-CH) are a set of structured HTTP headers plus a matching JavaScript API that report the same browser and operating-system facts the old User-Agent text string used to carry.** On every request Chrome sends Sec-CH-UA, Sec-CH-UA-Platform, and Sec-CH-UA-Mobile; it exposes the same data to JavaScript through navigator.userAgentData; and it answers requests for more detailed "high-entropy" hints (Sec-CH-UA-Platform-Version, Sec-CH-UA-Arch, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model) - "entropy" here meaning how much a value narrows down who you are. Anti-bot systems fingerprint scrapers by checking whether all three sources - the legacy UA string, the Sec-CH-UA headers, and the JS API - tell exactly the same story.
### Quick facts
- **Headers:** Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Mobile (low-entropy, always sent)
- **On request:** Sec-CH-UA-Arch, -Bitness, -Model, -Platform-Version, -Full-Version-List
- **JS mirror:** navigator.userAgentData.getHighEntropyValues()
- **Core check:** UA string == Sec-CH-UA headers == userAgentData (all must agree)
- **GREASE:** Sec-CH-UA includes a deliberately random brand entry browsers must tolerate
### The three sources that must agree
Since Chrome 89 the browser reports its identity in three places, all generated from one internal source - so on a real browser they always match:
- **The UA string** - navigator.userAgent and the User-Agent header, now frozen/reduced on Chrome (deliberately trimmed and locked down).
- **Sec-CH-UA headers** - Sec-CH-UA: "Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131" plus platform and mobile flags on every request.
- **navigator.userAgentData** - the JS API, including getHighEntropyValues(["platform","platformVersion","architecture","model","fullVersionList"]).
A scraper that edits the User-Agent header but leaves Sec-CH-UA and navigator.userAgentData at their real (or missing) values is instantly incoherent - the three sources contradict each other. Python HTTP clients that send a Chrome UA string but no Sec-CH-UA headers at all are an obvious tell, because real Chrome sends them on every HTTPS request (Client Hints are only sent over secure connections, so plain-http traffic is the one place they are legitimately absent).
### High-entropy hints and the GREASE trap
Low-entropy hints (brand, mobile, platform) ship on every request. High-entropy hints (full version list, architecture, bitness, model, platform version) are sent only when the server asks for them via the Accept-CH response header - so an anti-bot endpoint can request them and watch how the client answers. The values must agree with each other: Sec-CH-UA-Mobile: ?1 (claiming a mobile device) paired with a desktop platform, or Sec-CH-UA-Arch: "arm" with Sec-CH-UA-Bitness: "32" on a claimed Apple Silicon Mac, are contradictions.
The Sec-CH-UA header also contains a **GREASE** entry - a deliberately fake brand like "Not_A Brand";v="24" that Chrome adds so servers cannot hardcode the brand list, and whose exact text and punctuation vary by Chrome version. Vendors know the real GREASE patterns per version; a hand-built header with the wrong GREASE string, or with the brands in the wrong order, fails the check. navigator.userAgentData.brands must contain the same GREASE entry as the header.
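The header-versus-JS comparison above can be sketched as follows. This is a minimal illustration under stated assumptions: parseSecChUa and brandsAgree are hypothetical names, the parser handles only the simple "brand";v="version" list shape shown in this section, and the comparison ignores ordering, so it checks that the same entries (including the GREASE one) appear in both places but not that the order is right for a given Chrome version.
```javascript
// Parse a Sec-CH-UA header value into [{ brand, version }] in wire order.
function parseSecChUa(header) {
  const brands = [];
  // A quoted brand (which may contain spaces, ';' or ')'), then ;v="<version>"
  const re = /"((?:[^"\\]|\\.)*)"\s*;\s*v="([^"]*)"/g;
  let m;
  while ((m = re.exec(header || '')) !== null) {
    brands.push({ brand: m[1].replace(/\\(.)/g, '$1'), version: m[2] });
  }
  return brands;
}

// True when the header and a navigator.userAgentData.brands-style array
// contain exactly the same brand/version entries (order ignored).
function brandsAgree(header, jsBrands) {
  const key = (b) => b.brand + '/' + b.version;
  const a = parseSecChUa(header).map(key).sort();
  const b = (jsBrands || []).map(key).sort();
  return a.length > 0 && a.length === b.length && a.every((v, i) => v === b[i]);
}
```
A spoof that copies the real brands into the header but forgets the GREASE entry in the JS object (or the reverse) fails brandsAgree.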
### Why this is hard to spoof by hand
Getting UA-CH right means generating one complete, version-accurate identity across all three surfaces at once: the reduced UA string, every Sec-CH-UA header with the correct GREASE and ordering, and a navigator.userAgentData object whose getHighEntropyValues() returns matching platform/arch/model. Change the Chrome major version and all of them have to move together.
This is why brand-switching and UA spoofing are gated behind engine-level tooling in serious anti-detect browsers - the engine regenerates all three from one config so they cannot drift apart. A managed scraping API solves it the same way: it impersonates a real Chrome build end to end rather than editing one header. Patching just the UA string with a Python requests override is the single most common reason a scraper that "looks like Chrome" still gets blocked.
### Example
```javascript
// Server-side coherence check an anti-bot endpoint runs
// 1) ask for high-entropy hints
res.setHeader('Accept-CH',
  'Sec-CH-UA-Platform-Version, Sec-CH-UA-Arch, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model');
// 2) on the next request, compare all sources
function coherent(req) {
  const ua = req.headers['user-agent'] || '';
  const chUA = req.headers['sec-ch-ua'] || ''; // '"Chromium";v="131", "Not_A Brand";v="24"...'
  const plat = req.headers['sec-ch-ua-platform'] || ''; // '"Windows"'
  const mob = req.headers['sec-ch-ua-mobile'] || ''; // '?0'
  if (ua.includes('Chrome') && !chUA) return false; // Chrome sends Sec-CH-UA on every HTTPS request
  if (ua.includes('Windows') && !plat.includes('Windows')) return false;
  if (mob === '?1' && plat.includes('Windows')) return false; // mobile flag vs desktop OS
  const major = (ua.match(/Chrome\/(\d+)/) || [])[1];
  if (major && !chUA.includes('v="' + major + '"')) return false; // version drift
  return true;
}
```
### FAQ
**Q: What is the difference between the User-Agent string and Client Hints?**
The UA string is one freeform line of text that Chrome is freezing and reducing to limit passive fingerprinting. Client Hints carry the same facts in a structured, opt-in form - as headers plus a JS API. The key detection point is that they must agree: the UA string, the Sec-CH-UA headers, and navigator.userAgentData all derive from one source on a real browser, so any mismatch stands out.
**Q: Do I need to send Sec-CH-UA headers from a Python scraper?**
If you send a Chrome User-Agent, yes - real Chrome always sends Sec-CH-UA, Sec-CH-UA-Platform, and Sec-CH-UA-Mobile, so their absence flags you. Better still, impersonate a real Chrome build end to end (a TLS-impersonating client or a real browser) instead of hand-assembling headers, because the GREASE token and brand ordering are easy to get wrong.
**Q: What is GREASE in Sec-CH-UA?**
GREASE is a deliberately fake brand entry (e.g. "Not_A Brand";v="24") that browsers include so servers cannot hardcode the brand list. Its exact text, version, and position vary by Chrome version. Anti-bot vendors know the real patterns, so a wrong or missing GREASE entry is a clear spoofing tell.
---
## What Is a Timezone / IP Mismatch?
URL: https://scrappey.com/qa/anti-bot/what-is-timezone-ip-mismatch
**A timezone/IP mismatch is when the location a browser claims and the location of its IP address disagree.** Anti-bot systems (the software sites use to block scrapers) read the browser timezone (via Intl.DateTimeFormat().resolvedOptions().timeZone and new Date().getTimezoneOffset()), the Accept-Language header, and any GPS/geolocation claims, then compare all of them against the IP geolocation - where the connecting IP physically maps to. A request from a German residential proxy that reports America/New_York and en-US is geographically incoherent and scores block-grade. It is one of the most common reasons proxied scrapers get caught.
### Quick facts
- **Signals compared:** Intl timezone, getTimezoneOffset, Accept-Language, navigator.language(s), IP geo
- **Why it is reliable:** Real users almost always have a timezone matching their IP region
- **Common cause:** Rotating proxy IP changes country but the browser keeps a fixed timezone
- **Fix:** Derive timezone + locale + language from the exit IP, per session
- **Bonus check:** WebRTC / DNS geo and Accept-Language must agree too
### Everything a site can read about your location
Your location leaks from several independent places at once, and they are all supposed to agree:
- **Browser timezone** - Intl.DateTimeFormat().resolvedOptions().timeZone returns a named zone like "Europe/Berlin"; new Date().getTimezoneOffset() returns the numeric offset from UTC. Both come from the operating system.
- **Language** - navigator.language, navigator.languages, and the Accept-Language header.
- **IP geolocation** - the server maps the connecting IP to a country/region/city using a GeoIP database (MaxMind and similar).
- **Geolocation API** - if you grant permission, exact GPS coordinates.
- **Locale-derived signals** - currency, number/date formatting, and even the set of bundled fonts can hint at your region.
On a real device these all line up, because they trace back to the one place where the person installed and configured their OS. A scraper instead assembles them from different sources - a US server timezone, a default en-US locale, and a proxy IP in Brazil - and the seams show.
### Why rotating proxies make it worse
The mismatch is most damaging with rotating proxies (proxies that change exit IP from request to request). The browser process keeps one fixed timezone for its whole lifetime, but the exit IP may hop from Germany to Japan to Brazil across requests. Within a single session the timezone is now wrong for two of the three countries. Worse, an IP that changes country mid-session while the timezone stays fixed is itself behaviour no real user shows - so the inconsistency is not just static but temporal (it shows up over time).
Anti-bot vendors specifically watch for: a timezone offset that does not match the IP country, an Accept-Language that does not match the IP country, and an IP whose country changes while the client-side geo signals stay constant. Any one of these is a soft signal; two together is usually a block.
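The temporal check (IP country changes while the client-side timezone stays constant) can be sketched in a few lines. The function name and the observation shape { tz, ipCountry } are assumptions for illustration; a real system would key observations by session cookie or device ID and weigh this alongside other signals.
```javascript
// observations: one { tz, ipCountry } entry per request in a single session.
// Flags sessions whose IP country hops while the reported timezone never moves.
function countryHopWithFixedTimezone(observations) {
  const zones = new Set(observations.map((o) => o.tz));
  const countries = new Set(observations.map((o) => o.ipCountry));
  return zones.size === 1 && countries.size > 1;
}
```
A session that reports Europe/Berlin on every request while its IP moves from DE to JP to BR is flagged; a session pinned to one country is not.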
### Keeping geo coherent
The fix is to derive every location signal from the exit IP, per session, instead of leaving OS defaults in place. That means: set the browser timezone to the IP's zone, set navigator.languages and Accept-Language to a locale plausible for that country, and if geolocation is granted, return coordinates inside the IP's region. Pin the proxy to one country for the lifetime of a session so the timezone never contradicts a later request.
Serious anti-detect browsers and managed scraping APIs automate this: they look up the proxy exit IP, then configure timezone, locale, language, and geolocation to match before the first page load. This is exactly why frameworks warn against using a separate proxy-auth path the fingerprint layer does not know about - if the browser does not know the exit IP, it cannot align the timezone to it. Keeping geo coherent is a clustering problem, not a single setting: see fingerprint clustering.
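Deriving the location signals from the exit IP might look like the sketch below. Everything here is illustrative: GEO_PROFILES is a deliberately tiny, hypothetical table (a real setup would use a GeoIP lookup plus a maintained country-to-zone mapping, and countries that span several timezones need city-level data), and profileForExitCountry is not a function of any particular tool.
```javascript
// Hypothetical lookup table: exit-IP country -> plausible timezone and languages.
const GEO_PROFILES = {
  DE: { timezone: 'Europe/Berlin', languages: ['de-DE', 'de', 'en'] },
  FR: { timezone: 'Europe/Paris', languages: ['fr-FR', 'fr', 'en'] },
  JP: { timezone: 'Asia/Tokyo', languages: ['ja-JP', 'ja', 'en'] },
};

// Build one coherent set of location signals for a session pinned to a country.
function profileForExitCountry(countryCode) {
  const p = GEO_PROFILES[String(countryCode || '').toUpperCase()];
  if (!p) return null; // unknown country: refuse rather than guess
  // Accept-Language with descending q-values: 'de-DE,de;q=0.9,en;q=0.8'
  const acceptLanguage = p.languages
    .map((l, i) => (i === 0 ? l : `${l};q=${(1 - i * 0.1).toFixed(1)}`))
    .join(',');
  return { timezone: p.timezone, languages: [...p.languages], acceptLanguage };
}
```
The returned object would then feed whatever the browser tooling exposes for timezone, navigator.languages, and the Accept-Language header, once per session and before the first navigation.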
### Example
```javascript
// Client signals an anti-bot script collects
const geo = {
  tz: Intl.DateTimeFormat().resolvedOptions().timeZone, // 'Europe/Berlin'
  offset: new Date().getTimezoneOffset(), // -120 (minutes, DST-aware)
  langs: navigator.languages, // ['en-US','en']
  lang: navigator.language
};
// Server compares against IP geo (pseudo). zoneToCountry is a hypothetical
// helper that maps an IANA zone to a country code: 'Europe/Berlin' -> 'DE'.
function geoCoherent(geo, ipCountry /* from MaxMind */) {
  const tzMismatch = zoneToCountry(geo.tz) !== ipCountry; // timezone vs IP
  const langCountry = (geo.lang.split('-')[1] || '').toUpperCase();
  const langMismatch = langCountry !== '' && langCountry !== ipCountry;
  // en-US on a DE IP is common for expats, so one mismatch alone is a soft
  // signal, but tz mismatch + lang mismatch together is block-grade.
  if (tzMismatch && langMismatch) return 'block';
  if (tzMismatch || langMismatch) return 'soft';
  return 'ok';
}
// Rule of thumb: pin one country per session and derive tz+locale+lang from the exit IP.
```
### FAQ
**Q: How strong a signal is a timezone/IP mismatch on its own?**
Strong. Real users overwhelmingly have a browser timezone that matches their IP region. A clear mismatch (timezone country != IP country) is enough for a soft block at most vendors, and a hard block when combined with any other anomaly. It is cheap to compute on the server, so nearly every anti-bot system checks it.
**Q: What about expats and travellers whose language does not match their IP?**
That is exactly why an Accept-Language mismatch on its own is treated as a soft signal - en-US on a German IP is plausible. But the timezone almost always still matches the IP for those users, because the OS timezone follows wherever they physically are. Timezone mismatch is the harder signal; language mismatch only corroborates it.
**Q: How do I keep timezone and IP consistent with rotating proxies?**
Pin a session to one proxy exit country instead of rotating mid-session, look up the exit IP, and set the browser timezone, languages, and (if used) geolocation to match that country before navigating. Tools that handle this automatically read the exit IP first, then configure the fingerprint - which is why a proxy the browser is unaware of breaks the alignment.
---
## What Is navigator.webdriver?
URL: https://scrappey.com/qa/anti-bot/what-is-navigator-webdriver
**navigator.webdriver is a standardized boolean that returns true when the browser is being controlled by automation.** Think of it as a built-in honesty flag: any browser driven by software, instead of a person, is supposed to admit it. It is defined by the W3C WebDriver spec and set by Selenium, Playwright, Puppeteer, and any CDP-driven browser (one controlled through Chrome's DevTools Protocol) launched in automation mode. Because reading it is a single property access, it is the very first and cheapest check almost every anti-bot script runs. Hiding it is table stakes - necessary to not fail instantly, but far from sufficient, because dozens of harder signals remain.
### Quick facts
- **Spec:** W3C WebDriver - navigator.webdriver is true under automation
- **Set by:** Selenium, Playwright, Puppeteer, ChromeDriver, any --enable-automation launch
- **Cost to check:** One property read - the cheapest bot signal that exists
- **Spoof tell:** A JS-defined getter fails Function.toString() inspection
- **Reality:** Hiding it is necessary but never sufficient
### Where the flag comes from
The WebDriver standard - the W3C rulebook for how software remote-controls a browser - requires a conforming automation session to set navigator.webdriver to true, so that pages can tell they are being driven. Chrome sets it when launched with the --enable-automation switch or controlled via the DevTools Protocol; Firefox sets it under Marionette (Firefox's automation engine); every mainstream automation framework triggers it by default. On a normal human browser session it is false.
That makes it a perfect first-pass filter: one line, if (navigator.webdriver) flag(), catches every scraper that has not specifically dealt with it - which is a surprising number, because many tutorials never mention it.
### Why naive hiding gets caught anyway
The obvious fix is to overwrite the property: Object.defineProperty(navigator, "webdriver", { get: () => false }). This makes the value read false, but it introduces new, detectable artifacts (telltale leftovers of the tampering):
- The getter - the small function that runs when the value is read - is now ordinary JavaScript. Object.getOwnPropertyDescriptor(navigator, "webdriver").get.toString() returns the patch source instead of "[native code]" (the marker browsers show for built-in functions) - caught by Function.toString() inspection.
- On real Chrome, webdriver lives on Navigator.prototype (the shared blueprint all navigator objects inherit from), not as an own-property of the navigator instance. Defining it on the instance changes where it appears in the prototype chain - itself a tell.
- If the patch runs after page scripts, there is a race where the real value is briefly observable.
So spoofing the value with JavaScript trades one obvious signal for a subtler one. The clean fix is to launch the browser so the flag is never set - excluding the enable-automation switch - or to patch at the engine level so the property reads false natively.
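The prototype-chain tell from the list above can be checked without reading the value at all. A minimal sketch, assuming the function name webdriverPlacementTell (hypothetical) and taking the navigator object as a parameter so it can be exercised outside a browser:
```javascript
// Reports WHERE the webdriver property is defined, not what it returns.
// 'prototype'    -> defined on the object's prototype, as in real Chrome
// 'own-property' -> defined directly on the instance (typical of a JS patch)
// 'missing'      -> not defined on the instance or its immediate prototype
function webdriverPlacementTell(nav) {
  if (Object.getOwnPropertyDescriptor(nav, 'webdriver')) return 'own-property';
  const proto = Object.getPrototypeOf(nav);
  if (proto && Object.getOwnPropertyDescriptor(proto, 'webdriver')) return 'prototype';
  return 'missing';
}
```
In a page this would be called as webdriverPlacementTell(navigator); after the naive Object.defineProperty patch on the instance it returns 'own-property' even though the value reads false.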
### Necessary but not sufficient
The most important thing to understand about navigator.webdriver is its place in the detection stack: it is the floor, not the ceiling. Passing it means you are not in the bottom tier of trivially-detectable bots. It tells a vendor nothing reassuring - real users pass it too - so failing it is fatal but passing it earns you nothing. Serious anti-bot systems (Kasada, DataDome, Akamai) treat it as a checkbox and move on to TLS fingerprints (clues from the https handshake), canvas/WebGL/audio coherence (whether those graphics and sound APIs all describe the same machine), behaviour, and the harder probes.
This is why tooling that *only* hides navigator.webdriver (some minimal patching scripts) still gets blocked everywhere that matters. The flag is the first gate; the real contest is the coherent fingerprint behind it. Engine-level browsers launch with the flag genuinely absent and harden the rest.
### Example
```javascript
// The cheapest bot check in existence
if (navigator.webdriver) {
  // Block-grade for the bottom tier of scrapers.
}
// Naive hide - works on the VALUE but adds a new tell
Object.defineProperty(navigator, 'webdriver', { get: () => false });
// How a vendor catches the naive hide:
const d = Object.getOwnPropertyDescriptor(navigator, 'webdriver');
const patched = d && d.get &&
  !Function.prototype.toString.call(d.get).includes('[native code]');
// patched === true -> the getter is JS, not native -> flagged
// Real fix: launch without the automation switch so the flag is never set,
//   e.g. Puppeteer: ignoreDefaultArgs: ['--enable-automation'],
//   or use an engine that reports webdriver=false natively.
```
### FAQ
**Q: Does the navigator.webdriver flag tell an anti-bot system everything?**
No. Hiding the flag clears the single cheapest check, which is necessary, but real users pass it too, so passing earns no trust. Every serious anti-bot system then evaluates TLS fingerprints, canvas/WebGL/audio coherence, behaviour, and harder probes. The flag is the floor of what detection looks at, not the ceiling.
**Q: Why does Object.defineProperty on navigator.webdriver still get caught?**
Because the replacement getter is a JavaScript function whose toString() returns the patch source instead of [native code] (the marker for built-in functions), and because the property ends up on the navigator instance rather than on Navigator.prototype where it lives in real Chrome. Both are detectable. Launching without the automation switch avoids creating the flag at all.
**Q: How do I launch a browser so navigator.webdriver is false from the start?**
Exclude the automation switch - for example Puppeteer with ignoreDefaultArgs: ["--enable-automation"], or use a build/engine that does not set the flag. Engine-level anti-detect browsers report it false natively, so there is no JS getter to inspect.
---
## What Is JA3 Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-ja3-fingerprinting
**JA3 is a method for fingerprinting a TLS client by hashing the fields of its Client Hello.** TLS is the encryption layer behind https, and the Client Hello is the first message a client sends when opening a connection - it lists the encryption settings the client supports. Created at Salesforce, JA3 concatenates the TLS version, the ordered list of cipher suites, the ordered list of extensions, the elliptic curves, and the curve point formats, then MD5-hashes that string into a 32-character fingerprint (MD5 turns any input into a fixed-length code). Because the hash comes from the very first packet of the handshake - before any HTTP request, User-Agent, or cookie - JA3 lets a server identify the client library (real Chrome vs Python vs Go) no matter what the application layer claims.
### Quick facts
- **Hashes:** TLS version + ciphers + extensions + curves + point formats (MD5)
- **Output:** 32-char hex string, computed from the Client Hello
- **Catches:** Python/requests, Go, Node, curl - their TLS stacks differ from browsers
- **Weakness:** Chrome's TLS extension-order randomization makes raw JA3 unstable
- **Successor:** JA4 - sorts fields and adds structure to fix JA3 instability
### What goes into the hash
JA3 builds a comma-and-dash-delimited string from five Client Hello fields, always in this order:
TLSVersion,Ciphers,Extensions,EllipticCurves,PointFormats
A real example looks like 771,4865-4866-4867-49195-...,0-23-65281-10-11-...,29-23-24,0, and the MD5 of that string is the JA3 fingerprint. The key idea is that this captures the *TLS library*, not the application. Real Chrome offers a specific set of ciphers and extensions in a specific order baked into BoringSSL (the encryption library Chrome ships with). Python's requests (which uses OpenSSL), Go's crypto/tls, and Node each produce a different, recognizable Client Hello. So a scraper that sends a flawless Chrome User-Agent over Python's TLS stack ends up with a JA3 that screams "Python" - a contradiction the server spots before reading a single header.
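The string-then-MD5 construction can be sketched in Node.js as below. The input shape (an object of already-parsed numeric fields) and the names ja3 and isGrease are assumptions for the example; parsing a Client Hello off the wire is out of scope. Dropping GREASE placeholder values before joining follows the usual JA3 convention.
```javascript
const crypto = require('crypto');

// GREASE placeholder values (RFC 8701) are 0x0A0A, 0x1A1A, ... 0xFAFA.
function isGrease(v) {
  return (v & 0x0f0f) === 0x0a0a && (v >> 8) === (v & 0xff);
}

// hello: { version, ciphers, extensions, curves, pointFormats } as decimal numbers,
// each list in the order it appeared in the Client Hello.
function ja3(hello) {
  const join = (list) => list.filter((v) => !isGrease(v)).join('-');
  const str = [
    hello.version,
    join(hello.ciphers),
    join(hello.extensions),
    join(hello.curves),
    hello.pointFormats.join('-'),
  ].join(',');
  const hash = crypto.createHash('md5').update(str).digest('hex');
  return { str, hash }; // hash is the 32-char hex JA3 fingerprint
}
```
Because the lists are joined in wire order, two clients that offer the same extensions in a different order get different hashes.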
### JA3's weakness and why JA4 replaced it
JA3 has a real flaw: it depends on the **order** of the fields, and modern Chrome (since version 110) deliberately shuffles the order of its TLS extensions on every connection. This extension permutation is separate from GREASE (the random placeholder values Chrome also inserts, which JA3 implementations typically ignore), but it serves the same goal: keeping servers from hard-coding assumptions about clients. Because of that shuffling, real Chrome produces a *different* JA3 hash on nearly every connection, so you cannot match it against one known-Chrome JA3 - the raw hash is unstable for exactly the clients you most want to allow.
The fix is JA4, which sorts the cipher and extension lists before hashing (so reshuffling no longer changes the result), splits the hash into readable segments, and adds a transport/version prefix. Many vendors now compute JA4 (and the wider JA4+ suite, including JA4H for HTTP request headers), but JA3 is still widely deployed and remains the term most engineers recognize, so the two coexist.
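Why sorting helps can be shown with a toy comparison. This is not the real JA4 format (which has its own layout and hashing scheme); wireOrderHash and sortedHash are hypothetical names that isolate just the "sort before hashing" idea.
```javascript
const { createHash } = require('crypto');

const md5 = (s) => createHash('md5').update(s).digest('hex');

// Order-sensitive, JA3-style: extensions are hashed in wire order.
function wireOrderHash(extensions) {
  return md5(extensions.join('-'));
}

// Order-insensitive: sort numerically first, then hash.
function sortedHash(extensions) {
  return md5([...extensions].sort((a, b) => a - b).join('-'));
}
```
Two connections from the same browser that list the same extensions in a different order produce different wireOrderHash values but the same sortedHash value.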
### Why a JA3 match depends on the TLS stack
A JA3 value is determined below the HTTP layer, so it reflects the underlying TLS client rather than anything set in the application. A client's JA3 matches a given browser only when it sends the same Client Hello that browser's TLS library produces. In practice this is why certain tools share a browser's JA3:
- **curl-impersonate / curl_cffi** - curl built against BoringSSL with Chrome's cipher/extension configuration.
- **TLS-aware libraries** - tls-client (Go), rnet/noble-tls, which ship per-browser Client Hello profiles.
- **A real browser** - Playwright/Puppeteer driving real Chrome naturally produces Chrome's JA3/JA4.
Coherence is the broader point: a JA3 that matches a browser is only one signal. Vendors cross-check the TLS fingerprint against the HTTP/2 preface and the Client Hints, so a request whose TLS layer looks like Chrome but whose HTTP/2 layer looks like a library is internally inconsistent. The signals only agree end-to-end on a genuine browser stack.
### Example
```text
JA3 string = TLSVersion,Ciphers,Extensions,Curves,PointFormats
= 771,4865-4866-4867-49195-49199-...,0-23-65281-10-11-35-16-5-13-...,29-23-24-25,0
JA3 hash = MD5(JA3 string) -> e.g. cd08e31494f9531f560d64c695473da9
Same site, three clients hitting the same URL:
real Chrome 131 JA3 ~ 'cd08e31494...' (varies per connection: extension order is randomized)
python requests JA3 'a0e9f5d64b...' (OpenSSL cipher set + order)
go net/http JA3 '1f3a8d...' (crypto/tls cipher set)
The Chrome User-Agent does not matter - the Client Hello identifies the stack.
A client shares Chrome's JA3 only when its TLS library emits Chrome's Client Hello
(curl_cffi / tls-client / a real browser); HTTP/2 + headers must also be coherent.
```
### FAQ
**Q: What is the difference between JA3 and JA4?**
JA3 MD5-hashes the TLS Client Hello fields in the order they appear on the wire, which makes it unstable against Chrome's extension-order shuffling. JA4 sorts the cipher and extension lists before hashing and splits the result into readable segments, so it stays stable per client and tells you more. JA4 is the modern successor, but JA3 is still widely deployed.
**Q: Why does my Python scraper get blocked even with a perfect Chrome User-Agent?**
Because the User-Agent lives at the application layer, while JA3 is computed from the TLS handshake underneath it. Python's OpenSSL stack produces a different Client Hello than Chrome's BoringSSL, so the JA3 reveals Python no matter what headers you set. You need a TLS-impersonating client like curl_cffi or tls-client - and the HTTP/2 fingerprint has to match too.
**Q: Is JA3 still used if JA4 exists?**
Yes. Many anti-bot systems and detection tools compute both, and a lot of existing rules and threat-intelligence are written in terms of JA3. JA4 is the better signal, but JA3 has not gone away - treat them as complementary TLS fingerprints.
---
## What Is HTTP/3 / QUIC Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-http3-quic-fingerprinting
**HTTP/3 / QUIC fingerprinting identifies a client from the QUIC transport layer that HTTP/3 runs on.** QUIC is the modern transport beneath HTTP/3: instead of TCP, it runs on UDP (a fast, connectionless way to send packets) and carries its own TLS 1.3 handshake - TLS is the encryption layer behind https. Bundled in is a set of transport parameters (idle timeout, max packet size, active connection-id limit, flow-control windows) sent in the very first "Initial" packet. The exact values and their order, combined with the QUIC version and the embedded TLS Client Hello, form a fingerprint distinct from the TCP-based JA3/JA4 and HTTP/2 fingerprints. It also creates a leak risk, because UDP often slips past proxies that only carry TCP.
### Quick facts
- **Transport:** QUIC over UDP - carries its own TLS 1.3 handshake
- **Fingerprint inputs:** QUIC version, transport parameters + order, Initial packet, embedded Client Hello
- **Distinct from:** TCP JA3/JA4 and HTTP/2 - a separate signal vendors cross-check
- **Leak risk:** UDP escapes TCP-only proxies, exposing the real IP/path
- **Common defence:** Disable QUIC, or tunnel UDP (SOCKS5 UDP ASSOCIATE)
### What QUIC exposes
When a client uses HTTP/3, it opens a QUIC connection, and the first few packets give a lot away. Three things vary depending on which software sent them: the **QUIC version**, the set and order of **transport parameters** (max_idle_timeout, max_udp_payload_size, initial_max_data, active_connection_id_limit, and others), and the structure of the **Initial packet**. Chrome's QUIC stack, Firefox's, and a Go or Rust QUIC library each produce a recognizable signature. On top of that, QUIC embeds a TLS 1.3 Client Hello, so the same JA3/JA4-style hash that identifies TLS applies inside QUIC too.
The result is a transport fingerprint that exists before the client has sent a single HTTP/3 request. A client whose QUIC signature does not look like a real browser's is as identifiable as one with a non-browser JA3 - and because HTTP/3 is newer, these mismatches are common in DIY stacks that bolt HTTP/3 onto a library.
### The UDP leak problem
QUIC runs on UDP, and this is where many proxied scrapers fail in a way that has nothing to do with the fingerprint hash. Most proxy setups only carry TCP. If the browser is allowed to use HTTP/3, its QUIC/UDP packets may travel **around** the proxy straight to the destination - exposing the real IP and network path even though all the TCP traffic is cleanly proxied. The same UDP-escape problem affects WebRTC STUN (the mechanism browsers use to discover their own IP for peer-to-peer calls); the two are sibling transport-level leaks.
So HTTP/3 is a double exposure for scrapers: a transport fingerprint that can mismatch the claimed browser, and a UDP path that can bypass the proxy entirely.
### How scrapers handle it
There are two strategies. The conservative one is to **disable QUIC/HTTP3** so the browser falls back to HTTP/2 over TCP. This keeps all traffic inside the proxy and avoids the QUIC fingerprint surface entirely. It is common and safe - though a site that offers HTTP/3 and sees a "Chrome" client that never attempts it can treat that as a (weak) signal, since real Chrome usually tries HTTP/3.
The thorough one is to **tunnel UDP through the proxy** (using SOCKS5 UDP ASSOCIATE, a proxy feature that forwards UDP as well as TCP) so QUIC/STUN stay inside the tunnel and the QUIC fingerprint comes from a real browser stack. Few proxy providers support UDP ASSOCIATE, which is why high-end anti-detect tooling that implements UDP-over-SOCKS5 natively is relatively rare. For most scraping, disabling QUIC is the pragmatic default; serious operators who need HTTP/3 parity invest in UDP-capable proxies or managed APIs that handle the transport for them.
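For the conservative strategy, a launch configuration might be assembled like this. A minimal sketch: chromeArgsForProxy is a hypothetical helper, and the proxy URL is a placeholder. The two switches it emits, --proxy-server and --disable-quic, are standard Chrome command-line switches.
```javascript
// Build Chrome launch arguments that route traffic through a proxy
// and turn QUIC off so the browser stays on HTTP/2 over TCP.
function chromeArgsForProxy(proxyUrl) {
  if (!proxyUrl) throw new Error('proxyUrl is required');
  return [
    `--proxy-server=${proxyUrl}`, // all TCP traffic goes through the proxy
    '--disable-quic',             // no HTTP/3, so no UDP path around the proxy
  ];
}

// Usage with an automation library that accepts Chrome args, e.g.
//   puppeteer.launch({ args: chromeArgsForProxy('socks5://127.0.0.1:1080') })
```
This does not address WebRTC STUN, which needs its own handling.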
### Example
```text
Three layers a modern anti-bot stack fingerprints, all before HTTP content:
TCP path: TLS Client Hello -> JA3 / JA4
HTTP/2 SETTINGS -> Akamai-style h2 fingerprint
UDP path: QUIC Initial -> QUIC version + transport params + embedded CH
QUIC transport parameters (order + values are implementation-specific):
max_idle_timeout, max_udp_payload_size, initial_max_data,
initial_max_stream_data_*, active_connection_id_limit, ...
Two failure modes for a proxied scraper using HTTP/3:
1) QUIC signature != claimed Chrome -> transport fingerprint mismatch
2) UDP escapes the TCP-only proxy -> real IP leak
Pragmatic default: disable QUIC so everything stays HTTP/2 over the proxy.
Thorough: tunnel UDP (SOCKS5 UDP ASSOCIATE) with a real browser QUIC stack.
```
### FAQ
**Q: Should I disable HTTP/3 in my scraper?**
For most proxied scraping, yes - disabling QUIC keeps all traffic on TCP inside the proxy, avoids the UDP-leak risk, and removes the QUIC fingerprint surface. The minor downside is that a site offering HTTP/3 may notice a Chrome client that never attempts it, which is a weak signal. If you need true HTTP/3 parity, you must tunnel UDP through the proxy.
**Q: Is the QUIC fingerprint the same as the TLS fingerprint?**
No, but it contains one. QUIC carries its own TLS 1.3 Client Hello (so a JA3/JA4-style hash applies inside it), but it also adds the QUIC version and transport parameters, which are a separate signal. Vendors can cross-check the QUIC signature against the TCP-based TLS and HTTP/2 fingerprints to see whether they all agree - a mismatch is a red flag.
**Q: Why does HTTP/3 leak my real IP through a proxy?**
QUIC runs on UDP, and most proxies only carry TCP. If the browser uses HTTP/3, its UDP packets can travel directly to the destination, bypassing the proxy and exposing your real IP - even though your TCP traffic is fully proxied. The fix is to disable QUIC or use a proxy that supports UDP ASSOCIATE (which forwards UDP too).
---
## What Is Hardware Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-hardware-fingerprinting
**Hardware fingerprinting reads device capability signals - CPU cores, RAM, and screen metrics - that JavaScript exposes directly.** These are values any website can read from the browser without permission. The main ones are navigator.hardwareConcurrency (the number of logical CPU cores), navigator.deviceMemory (RAM, reported only as one of 0.25/0.5/1/2/4/8), and the screen object (resolution, color depth, available area, and device pixel ratio). On their own each value is low-entropy - meaning it barely narrows down who you are - but they have to be coherent with each other and with the platform the browser claims to be. Certain combinations - especially server-grade core counts paired with mobile user-agents (the text a browser sends to identify itself) - are reliable signs of a bot.
### Quick facts
- **CPU:** navigator.hardwareConcurrency - logical core count
- **RAM:** navigator.deviceMemory - bucketed to 0.25 / 0.5 / 1 / 2 / 4 / 8 GB
- **Display:** screen.width/height, colorDepth, devicePixelRatio, availWidth/Height
- **Bot tell:** 64+ cores with a phone UA; or hardwareConcurrency = 1
- **Coherence:** Cores, RAM, GPU, and screen must describe one plausible device
### What the hardware APIs report
A handful of properties summarise the device:
- **navigator.hardwareConcurrency** - the number of logical cores, which sites use to decide how many background workers to spin up. Real consumer devices cluster at 4, 8, 10, 12, 16. A value of 1 or 2 is unusual on modern hardware; 32, 64, or 96 indicates a server.
- **navigator.deviceMemory** - approximate RAM, deliberately rounded for privacy to one of 0.25, 0.5, 1, 2, 4, or 8 (capped at 8). A phone reporting 8 with a desktop screen, or a desktop reporting 0.5, is suspicious.
- **The screen object** - width/height, availWidth/availHeight (the screen minus OS taskbars), colorDepth (almost always 24), and devicePixelRatio (1 on standard displays, 2 on Retina, 1.5/1.25 on scaled Windows).
None is unique, but together with the GPU and platform they describe a class of device, and that class has to be internally consistent.
### The cloud-server signature
The most actionable hardware tell is the cloud instance that is either too small or too big to be a real device. Scrapers run on VPS and container hosts whose hardwareConcurrency reflects the VM size - often 1 or 2 on small instances, or 32/64 on big boxes - and whose headless browser (a browser running with no visible window) reports a default or zero-size screen. Combinations that no real user produces:
- hardwareConcurrency: 1 with a desktop Chrome UA (real desktops are multi-core).
- hardwareConcurrency: 64 with an Android UA (no phone has 64 cores).
- screen.width: 800, height: 600 or 0x0 with a flagship-phone UA.
- deviceMemory: 8 (the cap) on every request from a fleet, while real traffic spreads across buckets.
These are cheap server-side checks that catch entire scraping fleets sharing one VM profile.
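The fleet-level check can be sketched as a count of how many requests share one hardware profile. The helper names and the 0.5 share threshold are illustrative assumptions; a real system would tune the threshold against its own traffic.

```javascript
// Sketch: find hardware profiles shared by a suspiciously large share of requests.
// profileKey, dominantProfiles, and the 0.5 threshold are illustrative assumptions.
function profileKey(hw) {
  return [hw.cores, hw.memory, hw.screen, hw.dpr].join('|');
}

function dominantProfiles(requests, maxShare = 0.5) {
  const counts = new Map();
  for (const hw of requests) {
    const key = profileKey(hw);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const flagged = [];
  for (const [key, n] of counts) {
    const share = n / requests.length;
    if (share > maxShare) flagged.push({ key, share });
  }
  return flagged;
}

const sample = [
  { cores: 2, memory: 8, screen: '800x600', dpr: 1 },
  { cores: 2, memory: 8, screen: '800x600', dpr: 1 },
  { cores: 2, memory: 8, screen: '800x600', dpr: 1 },
  { cores: 8, memory: 4, screen: '1920x1080', dpr: 1 },
];
console.log(dominantProfiles(sample));
```

Real traffic spreads across many profiles, so one key holding most of a traffic slice points at a shared VM image rather than at individual visitors.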
### Coherence and stability
Spoofing the values is easy; spoofing them *coherently* is the hard part. The core count, RAM bucket, GPU tier (read from WebGL/WebGPU), screen resolution, and device pixel ratio must all describe one believable machine that also matches the User-Agent and Client Hints (extra device-detail headers Chrome sends). A request claiming an iPhone should report the core count and screen metrics of that specific iPhone, not generic desktop values.
Two further traps: the values must be **stable within a session** (real hardware does not gain or lose cores mid-visit), and they must match what timing-based probes infer - a timing attack can estimate the true core count by loading up all the workers and measuring how they run, catching a browser that claims 8 cores but schedules like 2. This is why hardware spoofing works best as part of a complete device profile rather than field-by-field edits.
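The stability requirement can be illustrated with a check over several samples taken during one session. This is a minimal sketch: `unstableFields` is a hypothetical helper, and the `cores` / `memory` field names are assumed. Screen size and device pixel ratio are left out of the defaults because they can legitimately change (rotation, zoom, moving a window between monitors).

```javascript
// Sketch: report which hardware fields changed between samples of one session.
// unstableFields is a hypothetical helper; field names are assumptions.
function unstableFields(samples, fields = ['cores', 'memory']) {
  if (samples.length < 2) return [];
  const first = samples[0];
  return fields.filter(f => samples.some(s => s[f] !== first[f]));
}

const sessionSamples = [
  { cores: 8, memory: 8 },
  { cores: 8, memory: 8 },
  { cores: 2, memory: 8 }, // core count changed mid-visit
];
console.log(unstableFields(sessionSamples));
```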
### Example
```javascript
// Cheap hardware signals an anti-bot script collects
const hw = {
  cores: navigator.hardwareConcurrency,   // 4/8/12/16 real; 1 or 64 suspicious
  memory: navigator.deviceMemory,         // 0.25..8 (bucketed); 8 is the cap; Chromium only
  screen: [screen.width, screen.height].join('x'),
  avail: [screen.availWidth, screen.availHeight].join('x'),
  depth: screen.colorDepth,               // ~always 24
  dpr: window.devicePixelRatio            // 1 / 2 / 1.5
};
// Server-side incoherence checks (pseudo)
function hardwareSuspicious(hw, ua) {
  if (/Android|iPhone/.test(ua) && hw.cores > 16) return true;     // phone with server cores
  if (/Windows|Macintosh/.test(ua) && hw.cores <= 1) return true;  // desktop with 1 core
  if (/iPhone/.test(ua) && hw.dpr < 2) return true;                // iPhone is always >=2 dpr
  if (hw.screen === '0x0' || hw.screen === '800x600') return true; // headless default
  return false;
}
```
### FAQ
**Q: How much entropy is in hardwareConcurrency and deviceMemory?**
Little on their own - real devices cluster at a few core counts, and deviceMemory is rounded to just six possible values, so neither narrows you down much. Their value is coherence and anomaly detection: catching impossible combinations (a phone UA with 64 cores) and fleets of cloud scrapers that all share one VM profile. They are corroborating signals, not unique identifiers.
**Q: Can I just set hardwareConcurrency to 8 to look normal?**
You can, but it has to be coherent with everything else - the GPU tier, screen size, device pixel ratio, User-Agent, and Client Hints all have to describe the same machine - and it has to survive a timing attack that estimates the real core count by loading up all the workers. Spoofing one field at a time tends to create a contradiction somewhere.
**Q: What screen size should a headless scraper use?**
A common real resolution for the claimed device, with matching availWidth/availHeight and device pixel ratio - never the headless default of 0x0 or 800x600. For a desktop, 1920x1080 at dpr 1 is the safest common choice; for a specific phone, use that phone's real metrics.
---
## What Is CDP Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-cdp-detection
**CDP detection is the family of techniques anti-bot scripts use to tell that a browser is being driven through the Chrome DevTools Protocol (CDP).** CDP is the remote-control channel that automation tools use to drive Chrome - they send it commands like "navigate" or "run this JavaScript." Playwright, Puppeteer, and Selenium 4 all control Chrome via CDP, and switching on certain CDP features - especially the Runtime domain (the part that handles JavaScript and the console) - changes behaviour that a page can observe in JavaScript. The best-known tell is the Runtime.enable serialization side effect, probed via a getter on a logged object (explained below). Because these signals come from the automation channel itself, they catch CDP-driven scrapers even when every static fingerprint looks perfect.
### Quick facts
- **Target:** Browsers driven via Chrome DevTools Protocol (Playwright/Puppeteer/Selenium 4)
- **Flagship tell:** Runtime.enable causes the page to serialize console-logged objects
- **The probe:** Log an object with a getter; if the getter fires, CDP is listening
- **Also leaks:** Console.enable, and framework binding artifacts in page scope
- **Mitigation:** Avoid enabling Runtime/Console; drive below CDP or patch the engine
### The Runtime.enable serialization tell
When an automation client sends the Runtime.enable command, Chrome starts forwarding console activity back to the controller. To do that, it has to **serialize** the arguments passed to console.log and similar calls — that is, convert them into data that can cross the protocol boundary (the gap between the page and the tool driving it). That conversion is observable: if you log an object that has a getter (a function that runs when a property is read) on one of its properties, the act of serializing the object *invokes the getter* — even though no human ever opened DevTools.
So the probe is tiny: create an object whose id (or toString, or a numeric coercion) is a getter, console.log it, and check whether the getter ran. On a normal browser nothing reads the property and the getter never fires. Under CDP with Runtime.enable active, the getter fires - revealing the automation channel. A widely used variant of the same trap logs an Error object whose stack property is a getter, and it catches frameworks that keep Runtime enabled by default.
### Other CDP leaks
Beyond Runtime.enable, CDP-driven automation leaks in several other ways:
- **Console.enable** - similar to Runtime; turning it on changes how console objects are handled, which a page can probe for.
- **Framework bindings** - the tools leave their own fingerprints in the page. Playwright injects window.__playwright__binding__ / __pwInitScripts; older Puppeteer and ChromeDriver leave their own global variables (cdc_ properties). These are leftover artifacts of the CDP control channel exposing functions to the page.
- **Timing and event anomalies** - mouse and keyboard events dispatched through CDP can lack the trusted-event properties and natural timing of input from a real person.
- **toString inspection** - any JavaScript a CDP driver injects to hide itself can itself be inspected and exposed.
The common thread: the control channel, and the scripts it injects, live in or affect the page's own scope — so a sufficiently paranoid anti-bot script can find them.
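The binding checks from the list above can be sketched as a scan over a window-like and a document-like object. `findAutomationArtifacts` is a hypothetical helper; the property names are the ones named in the list, plus the standard navigator.webdriver flag. The fake objects only make the sketch runnable outside a browser.

```javascript
// Sketch: scan window-like and document-like objects for automation leftovers.
// findAutomationArtifacts is hypothetical; in a page it would receive window and document.
function findAutomationArtifacts(win, doc) {
  const hits = [];
  for (const name of ['__playwright__binding__', '__pwInitScripts']) {
    if (name in win) hits.push(name);
  }
  // ChromeDriver leftovers are properties whose names contain "cdc_".
  for (const key of Object.keys(doc)) {
    if (/\$?cdc_/.test(key)) hits.push(key);
  }
  if (win.navigator && win.navigator.webdriver === true) hits.push('navigator.webdriver');
  return hits;
}

// Stand-ins for window and document so the sketch runs under Node.
const fakeWin = { __pwInitScripts: {}, navigator: { webdriver: true } };
const fakeDoc = { $cdc_exampleSuffix_: {} };
console.log(findAutomationArtifacts(fakeWin, fakeDoc));
```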
### Reducing the CDP surface
Ways to shrink these tells, roughly from easiest to most robust:
- **Do not enable Runtime/Console** unless you actually need them, and delete the framework bindings in an init script that runs before the page loads (delete window.__playwright__binding__; delete window.__pwInitScripts;). This removes the cheapest tells.
- **Hook earlier / lower** - run your automation logic in a privileged context before the page navigates, rather than injecting code into the page's scope, so there is less for the page to observe.
- **Patch the engine** - anti-detect browsers suppress the console serialization side effect and hide the bindings down in the C++ source, so Runtime.enable no longer changes anything the page can see.
The reason CDP detection matters so much is the same theme that runs through all fingerprinting: a perfect static fingerprint does not help if the *way you are driving the browser* betrays you. Real-browser drivers and managed APIs that minimise the CDP footprint exist precisely because the control channel is its own detection surface.
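The first mitigation in the list, removing the framework bindings before page scripts run, can be sketched as follows. `stripBindings` is a hypothetical helper, and whether removing these properties interferes with a given driver version is an open assumption that needs testing.

```javascript
// Sketch: remove the cheapest framework tells from a window-like object.
// stripBindings is hypothetical; the property names are the ones named in the text.
function stripBindings(win) {
  const removed = [];
  for (const name of ['__playwright__binding__', '__pwInitScripts']) {
    // Reflect.deleteProperty returns false instead of throwing on non-configurable properties.
    if (name in win && Reflect.deleteProperty(win, name)) removed.push(name);
  }
  return removed;
}

// Stand-in for window so the sketch runs under Node.
const pageWindow = { __playwright__binding__: () => {}, __pwInitScripts: {}, other: 1 };
console.log(stripBindings(pageWindow));
```

In a real run the same logic would execute against window from an init script (for example one registered through Playwright's page.addInitScript), so it runs before the page's own code. As the section notes, this removes only the cheapest tells.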
### Example
```javascript
// The Runtime.enable serialization trap (paraphrased)
let cdpDetected = false;
const trap = {};
Object.defineProperty(trap, 'id', {
  get() { cdpDetected = true; return 'x'; } // fires only if something serializes us
});
console.log(trap); // human browser: nobody reads .id, getter never runs
// CDP + Runtime.enable: Chrome serializes the object -> getter fires
setTimeout(() => {
  if (cdpDetected) {
    // Automation channel is listening - block-grade even if the fingerprint is clean.
  }
}, 0);
// Related cheap checks the same script runs:
//   '__playwright__binding__' in window          -> Playwright
//   /\$?cdc_/.test(Object.keys(document).join()) -> ChromeDriver
```
### FAQ
**Q: How do anti-bots detect Playwright or Puppeteer specifically?**
Through the CDP control channel these tools use to drive Chrome. The main tell is the Runtime.enable serialization side effect — logging an object with a getter and watching the getter fire. Anti-bots also look for framework bindings like window.__playwright__binding__ and ChromeDriver cdc_ globals, and for CDP-dispatched mouse/keyboard events that lack the timing and trusted-event flags of real user input.
**Q: Does deleting window.__playwright__binding__ hide all of Playwright's CDP signals?**
No — it removes one cheap tell, not all of them. The Runtime.enable serialization trap, console handling changes, input-event anomalies, and any injected patches (exposed via toString inspection) all remain. Reducing the CDP surface means not enabling Runtime/Console, hooking earlier, or using an engine that suppresses the side effects at the C++ level.
**Q: Is Selenium detected the same way as Playwright?**
Largely yes. Selenium 4 drives Chrome over CDP too, so it shares the Runtime.enable family of tells. Classic Selenium also leaves ChromeDriver artifacts (cdc_ properties) and sets navigator.webdriver. The control-channel signals are broadly the same across all CDP-based tools.
---
## What Is Incognito Detection?
URL: https://scrappey.com/qa/anti-bot/what-is-incognito-detection
**Incognito detection is the set of techniques that reveal whether a browser is in private / incognito mode.** Private mode is the browser feature that opens a clean, temporary window - history and cookies do not stick around after you close it. The catch is that private mode also changes how the browser stores data: the storage allowance (quota) shrinks, some data-saving APIs are turned off or act differently, and nothing survives the session. Websites probe these differences (in the past through the FileSystem API, now mainly through the storage quota reported by navigator.storage.estimate() and how IndexedDB behaves) to gate content or to score how risky a visitor looks - because automated and abusive traffic shows up in private sessions far more often than in normal ones.
### Quick facts
- **Probes:** Storage quota (navigator.storage.estimate), IndexedDB, persistence APIs
- **Old method:** window.webkitRequestFileSystem failed in Chrome incognito (tell since closed)
- **Current tell:** Reduced/quantized storage quota relative to disk size
- **Why sites care:** Bots and abuse skew toward private sessions; used for risk scoring + paywalls
- **Arms race:** Browsers actively close these tells; detection shifts each release
### How private mode leaks
Private mode isolates and shrinks the browser's storage, and that is exactly what gives it away. The classic Chrome tell was window.webkitRequestFileSystem, an API that returned an error in incognito but worked normally otherwise - a single function call was enough to tell the difference. Google closed that gap, so detection moved to the **storage quota**: navigator.storage.estimate() (an API that tells a page how much space it may use) reports a much smaller quota in private mode - capped to a small fraction of disk, or to a fixed ceiling - than a normal session would on the same machine. Compare that reported quota against how big the device looks, and the mode shows through.
Other differences have been used over the years: IndexedDB (the browser's built-in database) throwing errors or acting differently, the Quota API's temporary-versus-persistent storage limits, and service-worker or cache data not surviving the session. Each browser release tends to patch whichever tell is currently popular, so the exact probe in use keeps shifting - but the underlying fact, that private mode constrains storage, keeps producing new ones.
### Why sites detect it at all
There are two motivations. **Risk scoring**: abusive and automated traffic is disproportionately private, because scrapers and fraud tools default to fresh, stateless sessions - no saved cookies or history - to avoid carrying identifiers between visits. Incognito alone is not proof of a bot, since plenty of real people browse privately, so it is used as a soft signal folded into a broader risk score, not a standalone block.
The other motivation is **business logic**: metered paywalls (news sites) and free-trial limits historically used incognito detection to stop readers from resetting their free-article count by opening a private window. This is the same plumbing the anti-bot use case relies on, which is why incognito detection turns up both in paywall scripts and in fraud/bot stacks.
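The risk-scoring use, where private mode is a soft signal rather than a standalone block, can be sketched as a weighted sum. The signal names, weights, and block threshold are illustrative assumptions, not values any vendor is known to use.

```javascript
// Sketch: fold private mode into a wider risk score as a soft signal.
// Signal names, weights, and the 0.7 threshold are illustrative assumptions.
const WEIGHTS = {
  privateMode: 0.15,        // soft: plenty of real people browse privately
  emptyDeviceList: 0.3,
  hardwareIncoherent: 0.35,
  cdpDetected: 0.6,
};

function riskScore(signals) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    if (signals[name]) score += weight;
  }
  return Math.min(1, score);
}

function decision(signals, blockAt = 0.7) {
  return riskScore(signals) >= blockAt ? 'block' : 'allow';
}

console.log(decision({ privateMode: true }));
console.log(decision({ privateMode: true, cdpDetected: true }));
```

With these placeholder weights, private mode alone never reaches the threshold; it only tips the decision when stronger signals are already present.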
### Implications for scrapers
For scraping, the practical lesson is that running headless does not put you in incognito by default - a headless browser with a normal user-data directory (the on-disk folder that holds a browser profile) reports normal storage. But several stealth setups *do* use private contexts or throwaway profiles, and if the site scores incognito as risk, that choice is working against you. If a target gates on private mode, the fix is to run with a persistent profile and real storage quota so navigator.storage.estimate() looks like an ordinary install.
As always, coherence matters more than any single value: the storage quota should match the kind of device you are claiming to be, and the persistence behaviour should look like a returning user if you are presenting cookies and history. Treat incognito detection as one more consistency probe in the stack rather than a special case.
### Example
```javascript
// Modern incognito probe: storage quota is constrained in private mode
async function looksPrivate() {
  if (!navigator.storage || !navigator.storage.estimate) return null;
  const { quota } = await navigator.storage.estimate();
  // Normal session: quota is a large fraction of free disk (often GBs).
  // Private mode: quota is capped much lower (historically ~120MB on some builds,
  // or a small fraction of disk), so a small quota on a large device is a tell.
  return quota < 300 * 1024 * 1024; // heuristic threshold, tuned per browser
}
// Legacy tell (closed in modern Chrome, still seen in old scripts):
// window.webkitRequestFileSystem(0, 1, () => normal, () => incognito);
```
### FAQ
**Q: Does running a headless scraper mean I am in incognito?**
Not by default. A headless browser with a normal, persistent user-data directory (its on-disk profile folder) reports normal storage quotas and is not in private mode. But some stealth setups deliberately use throwaway or private contexts, and if a target scores incognito as risk, that choice hurts you - run a persistent profile with real storage instead.
**Q: Is incognito detection a reliable bot signal?**
On its own, no - many real users browse privately, so it is a soft signal folded into a broader risk score rather than a hard block. It became popular partly for metered paywalls (stopping people from resetting their free-article count). For bot detection it is corroborating evidence, strongest when combined with other anomalies.
**Q: How do sites detect private mode now that the FileSystem trick is gone?**
Mainly via the storage quota: navigator.storage.estimate() returns a much smaller quota in private mode than in a normal session on the same device, so a small quota on a large machine is a tell. IndexedDB and persistence-API behaviour have also been used. Browsers keep closing specific tells, so the exact probe shifts between releases.
---
## What Is Media Devices Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-media-devices-fingerprinting
**Media devices fingerprinting reads the list of cameras, microphones, and speakers a browser reports via navigator.mediaDevices.enumerateDevices().** This is a built-in browser function that lists the audio and video hardware attached to your machine. It returns one entry per audio input (mic), audio output (speaker), and video input (camera), each with a kind, a stable hashed deviceId, a groupId (which links devices that belong to the same physical unit), and - once you grant permission - a human-readable label. The *number* and *shape* of those devices is a fingerprint signal, and the common headless failure - reporting zero devices, or a single suspicious default - is a reliable bot tell (headless means a browser running with no visible screen, typically on a server).
### Quick facts
- **API:** navigator.mediaDevices.enumerateDevices()
- **Reads:** Counts of audioinput / audiooutput / videoinput, deviceId, groupId, labels
- **Headless tell:** Empty list, or zero audiooutput devices, on a claimed desktop
- **Permission gate:** Labels are blank until camera/mic permission is granted
- **Coherence:** Device counts should match the claimed platform (Mac always has output)
### What the device list reveals
Even before you grant permission, enumerateDevices() returns one object per device with its kind (audioinput, audiooutput, or videoinput) and a non-empty deviceId/groupId; only the label stays blank until the user grants camera or microphone access. Browsers that follow the current Media Capture and Streams spec are stricter: before permission they may expose at most one entry per kind, with blank IDs as well. Either way, a site learns which kinds of device exist, and often *how many* there are and how they are grouped - enough to tell what class of machine it is. A typical laptop reports at least one microphone, one camera, and one or more audio outputs; a desktop without a webcam reports a microphone and output but no video input.
The numbers cluster by device type and stay stable per machine. That makes them a low-entropy but coherent signal - on its own it does not narrow you down to one person (entropy = how much a value pins down who you are), but it has to agree with the rest of the fingerprint.
### The headless tell
Headless browsers on servers usually have no real media hardware, so enumerateDevices() returns an **empty array** (an empty list) or a single placeholder. On a request claiming to be a normal desktop or laptop, zero devices - particularly zero audiooutput - is anomalous, because real consumer machines almost always have at least a default audio output. macOS in particular always exposes audio output devices, so a "MacBook" with none is incoherent.
Chrome flags like --use-fake-device-for-media-stream add synthetic devices, but those fakes have recognisable default labels and group structure that differ from real hardware. As with audio and WebGL fingerprinting, the believable fix is a device list copied from a real machine of the claimed class, served consistently for the whole session, rather than a generic fake.
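The headless checks described above can be sketched as a server-side function over a reported device list. `deviceListFlags` is a hypothetical helper, and the input is assumed to mirror the objects enumerateDevices() returns (kind, deviceId, groupId, label).

```javascript
// Sketch: server-side checks on a reported media-device list.
// deviceListFlags is hypothetical; devs is assumed to mirror enumerateDevices() output.
function deviceListFlags(devs, ua, permissionGranted = false) {
  const flags = [];
  const outputs = devs.filter(d => d.kind === 'audiooutput').length;
  if (devs.length === 0) flags.push('empty-list');
  if (/Macintosh/.test(ua) && outputs === 0) flags.push('mac-without-audio-output');
  // Labels should be filled in once permission is granted.
  if (permissionGranted && devs.length > 0 && devs.every(d => !d.label)) {
    flags.push('labels-blank-after-permission');
  }
  return flags;
}

const macUA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)';
console.log(deviceListFlags([], macUA));
```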
### Example
```javascript
// Device enumeration an anti-bot script reads (no permission needed for counts)
async function deviceFingerprint() {
  const devs = await navigator.mediaDevices.enumerateDevices();
  const count = { audioinput: 0, audiooutput: 0, videoinput: 0 };
  for (const d of devs) count[d.kind] = (count[d.kind] || 0) + 1;
  return count; // e.g. { audioinput:1, audiooutput:2, videoinput:1 }
}
// Server-side suspicion (pseudo):
//   devs.length === 0                          -> headless / no hardware
//   /Macintosh/.test(ua) && audiooutput===0    -> incoherent (Macs always have output)
//   all labels blank AND permission granted    -> spoof artifact
```
### FAQ
**Q: Can a site read my camera and microphone names without permission?**
It can read how many input and output devices exist, plus their kind and grouping, without permission - but the human-readable names (labels) stay blank until you grant camera or microphone access. The counts alone are enough to characterise the machine and to catch a headless browser reporting zero devices.
**Q: Why is an empty device list a bot signal?**
Because real consumer desktops and laptops almost always have at least a default audio output, and usually a microphone. A server-hosted headless browser typically has no media hardware, so enumerateDevices() returns an empty array - which, on a request claiming to be a normal computer, is anomalous, especially the absence of any audiooutput.
**Q: Do Chrome fake-device flags fix this?**
Partly - they make the list non-empty, but the synthetic devices have default labels and group structure that differ from real hardware and can be recognised. A device list copied from a real machine of the claimed class and served consistently is more convincing than the generic fakes.
---
## What Is Speech Synthesis Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-speech-synthesis-fingerprinting
**Speech synthesis fingerprinting reads the list of text-to-speech voices exposed by window.speechSynthesis.getVoices().** "Text-to-speech" means the built-in feature that reads text aloud, and the voices it offers are installed by your operating system, not your browser. That makes the list a strong clue about which OS and version you are running: Windows ships specific Microsoft voices, macOS/iOS ship Apple voices, and Android ships Google voices, each with characteristic names, languages, and voiceURI values (a unique ID string per voice). A headless Linux server - a browser running on a server with no screen or speech engine - typically returns an **empty** voice list, which both flags automation and contradicts any Windows or macOS User-Agent (the text a browser uses to say what it is).
### Quick facts
- **API:** window.speechSynthesis.getVoices()
- **Reveals:** OS + version via bundled TTS voice names, langs, voiceURIs, localService flag
- **Headless tell:** Empty array on a typical headless Linux server
- **Coherence trap:** Voice set must match the claimed OS (Windows voices on a Windows UA)
- **Quirk:** Often returns [] on first call until the voiceschanged event fires
### Why the voice list is an OS fingerprint
Text-to-speech voices come from the platform, not the browser, so the set is tightly tied to the operating system. Windows 11 exposes voices like *Microsoft David*, *Microsoft Zira*, and language-specific additions; macOS exposes Apple voices such as *Samantha* and *Daniel* with com.apple.* voiceURIs; Android exposes Google voices. Each voice carries a name, a lang (its language), a voiceURI (its ID), a localService boolean (true if the voice runs on the device rather than in the cloud), and a default flag. The full list - which voices, in what order, with which languages - forms a recognisable per-OS, often per-version signature.
That makes it a useful way to double-check what platform a browser claims to be, and a relatively high-entropy signal - meaning it narrows down who you are quite a lot - because the exact voice set changes with OS version and installed language packs.
### The empty list and the coherence trap
Headless browsers on Linux servers usually have no speech engine installed, so getVoices() returns [] (an empty list). An empty voice list is unusual for a real desktop user and, worse, it contradicts a Windows or macOS User-Agent, which should come with a known set of voices. So this one check does two jobs: an empty list hints at a headless bot, and a *mismatched* list (macOS voices under a Windows User-Agent, or the reverse) hints at platform spoofing - lying about which OS you run.
There is a well-known timing quirk: in Chrome the first call to getVoices() often returns [] because the engine loads the voices in the background and only signals it is ready by firing a voiceschanged event. Detection scripts know this and wait for that event, so a spoofing tool that just hands back a fixed list - without the realistic background loading - can itself look suspicious. The reliable fix, as with fonts and audio fingerprinting, is to expose a coherent, real voice list for the OS you are claiming, rather than a generic or empty one.
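The two jobs this check does, flagging an empty list and flagging a mismatched one, can be sketched as a server-side function. `voiceFlags` is a hypothetical helper; the "Microsoft " name prefix and "com.apple." voiceURI prefix follow the examples in this section and are assumptions about real voice lists. Remote voices are skipped on the assumption that they come from the browser rather than the OS.

```javascript
// Sketch: cross-check a reported voice list against the claimed OS.
// voiceFlags is hypothetical; the prefixes are assumptions based on the examples above.
function voiceFlags(voices, ua) {
  const flags = [];
  if (voices.length === 0) flags.push('empty-voice-list');
  const hasApple = voices.some(v => (v.voiceURI || '').startsWith('com.apple.'));
  const hasMicrosoft = voices.some(
    v => v.localService && (v.name || '').startsWith('Microsoft ')
  );
  if (/Windows/.test(ua) && hasApple) flags.push('apple-voices-on-windows-ua');
  if (/Macintosh/.test(ua) && hasMicrosoft) flags.push('microsoft-voices-on-mac-ua');
  return flags;
}

const winUA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)';
const appleVoice = { name: 'Samantha', lang: 'en-US', voiceURI: 'com.apple.voice.Samantha', localService: true };
console.log(voiceFlags([appleVoice], winUA));
```

The list passed in should be the one collected after the voiceschanged wait, otherwise the first-call empty array would be flagged on real browsers too.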
### Example
```javascript
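// Voice-list fingerprint (handle the async population quirk)
function getVoiceFingerprint(timeoutMs = 2000) { // 2s is an arbitrary choice
  // Summarise each voice as name|lang|L(ocal) or R(emote), sorted for stability.
  const summ = voices =>
    voices.map(x => x.name + '|' + x.lang + '|' + (x.localService ? 'L' : 'R'))
      .sort().join(',');
  return new Promise(resolve => {
    const v = speechSynthesis.getVoices();
    if (v.length) return resolve(summ(v));
    // The first call is often empty: wait for voiceschanged, but give up after a
    // timeout so a list that never fills in still resolves (to an empty string).
    const timer = setTimeout(() => resolve(summ(speechSynthesis.getVoices())), timeoutMs);
    speechSynthesis.onvoiceschanged = () => {
      clearTimeout(timer);
      resolve(summ(speechSynthesis.getVoices()));
    };
  });
}
// Empty result on headless Linux. Windows UA with macOS 'com.apple.*' voices = spoof tell.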
```
### FAQ
**Q: Why does speechSynthesis.getVoices() reveal my operating system?**
Because the voices are installed by the OS, not the browser. Windows, macOS, Android, and Linux each ship a characteristic set of voice names, languages, and voiceURIs (the per-voice ID strings). The list is therefore a strong clue about your OS and version, and it must agree with your User-Agent - voices that do not match the claimed OS point to platform spoofing.
**Q: Why did getVoices() return an empty array even on a real browser?**
Chrome loads the voice list in the background, so the very first call often returns [] until the voiceschanged event fires to say the list is ready. Real detection scripts wait for that event. The actual bot giveaway is an empty result that never fills in - common on headless Linux with no speech engine installed.
**Q: Is speech synthesis a major detection vector?**
It is a secondary, supporting check rather than a primary blocker, but it carries a lot of signal: empty lists catch headless servers, and mismatched lists catch OS spoofing. It is most useful as one part of a broader consistency check across fonts, voices, and the platform a browser claims to be.
---
## What Is Stack Depth Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-stack-depth-fingerprinting
**Stack depth fingerprinting measures the maximum JavaScript recursion depth a browser allows before throwing a RangeError: Maximum call stack size exceeded.** In plain terms: it counts how many times a function can call itself before the engine runs out of "stack" space - the limited area where JavaScript tracks calls that are still in progress, like a stack of plates that can only get so tall. That ceiling is decided by the engine (V8, SpiderMonkey, JavaScriptCore), the platform, the CPU architecture, and how much space the test function uses per call. Because the number comes from the real engine and OS - not from the User-Agent (UA), the browser-and-OS label a session claims - it is a cheap way to tell which engine is actually running, and it exposes tools that claim to be one browser while running another underneath.
### Quick facts
- **Measures:** Max recursion depth before the stack-overflow error (RangeError: Maximum call stack size exceeded in Chrome and Safari; InternalError: too much recursion in Firefox)
- **Varies by:** JS engine, OS, CPU architecture, and the test function frame size
- **Reveals:** The real engine - V8 vs SpiderMonkey vs JavaScriptCore
- **Catches:** Firefox-based tools (Camoufox) presenting a Chrome User-Agent
- **Variant:** Measured separately on main thread, Workers, and WASM
### How the limit is measured
The test is short: write a function that bumps a counter and then calls itself, let it recurse until it crashes, and record the counter. That final number is surprisingly specific. V8 (Chrome), SpiderMonkey (Firefox), and JavaScriptCore (Safari) each set aside stack space differently and enforce different ceilings, so the depth falls into a distinct range for each engine. The exact value also shifts with CPU architecture (x86-64 vs ARM64), the operating system (default stack sizes differ), and how much space each call uses - so probes fix the function's shape (the number of arguments and local variables) to keep results comparable across runs.
Because the number is produced by the engine actually executing real instructions, you cannot move it from JavaScript - there is no property to overwrite; the limit is simply where execution genuinely runs out of room.
### Why it catches engine spoofing
The most powerful use is catching tools whose *engine* does not match their *claimed browser*. Camoufox is Firefox under the hood; if it presents a Chrome User-Agent, its recursion depth lands in the SpiderMonkey range rather than the V8 range - an immediate contradiction. The same goes for any "Chrome" that is really something else, or a JavaScript runtime pretending to be a browser. Measuring the depth in several places - the main thread, a Web Worker (a background JavaScript thread), and WebAssembly (WASM, a lower-level code format browsers can run) - adds even more resolution, because the ratios between those contexts are also engine-specific.
This is one of a family of engine-level probes (alongside Math precision and WASM measurements) that read the real implementation instead of the advertised one. The limit is enforced in the engine's native (C++) code, so a session's depth is coherent only when the engine actually running matches the browser named in its User-Agent - for example, a Chromium build sending a Chrome User-Agent.
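On the receiving side, the contradiction check described above could be sketched in Python roughly as follows. This is a minimal sketch, not a known vendor implementation: `TOY_BANDS`, `claimed_engine`, and `depth_contradicts_ua` are names invented here, and the band numbers are placeholders rather than measured values.

```python
import re

# Toy bands for illustration only - NOT measured values. Calibrate real bands
# from observed traffic per engine, OS, architecture, and browser version.
TOY_BANDS = {
    "v8": (10_000, 15_000),
    "spidermonkey": (20_000, 30_000),
    "javascriptcore": (35_000, 45_000),
}

def claimed_engine(user_agent):
    """Map a User-Agent string to the engine that browser would normally run."""
    if "Firefox/" in user_agent:
        return "spidermonkey"
    # Chrome UAs also contain "Safari/", so Chrome must be tested first.
    if re.search(r"Chrome/|Chromium/|Edg/", user_agent):
        return "v8"
    if "Safari/" in user_agent:
        return "javascriptcore"
    return None

def depth_contradicts_ua(user_agent, depth, bands=TOY_BANDS):
    """True when the measured depth falls outside the band of the claimed engine."""
    engine = claimed_engine(user_agent)
    if engine is None:
        return False  # unknown browser: no band to compare against
    low, high = bands[engine]
    return not (low <= depth <= high)
```

The check only compares one number against one band; a production system would more likely treat it as one signal among several rather than a verdict on its own.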
### Example
```javascript
// Measure max recursion depth (fix the frame shape for comparability)
function maxStackDepth() {
  let depth = 0;
  function recurse() { depth++; recurse(); }
  try { recurse(); } catch (e) { /* RangeError: Maximum call stack size exceeded */ }
  return depth;
}
// Indicative ranges (vary by OS/arch/version - the engine clustering is the point):
//   V8 / Chrome             ~ 10k-15k
//   SpiderMonkey / Firefox  ~ different band
//   JavaScriptCore / Safari ~ different band
//
// A Chrome User-Agent whose depth lands in the SpiderMonkey band
// is a Firefox-based tool (e.g. Camoufox) wearing a Chrome UA.
```
### FAQ
**Q: Why does maximum recursion depth differ between browsers?**
Because it is set by the JavaScript engine and the operating system, not by the browser brand. V8, SpiderMonkey, and JavaScriptCore reserve stack space and enforce limits differently, and the value also shifts with CPU architecture and how much space each call uses. As a result, the depth clusters by the real engine running - which is exactly what makes it a fingerprint.
**Q: Can I spoof my stack depth from JavaScript?**
No. The limit is a consequence of how the engine actually executes code - there is no property to overwrite, and no way to make recursion fail at a different point without changing the engine itself. That is precisely why it catches tools whose underlying engine differs from the User-Agent they advertise.
**Q: How does stack depth catch Camoufox?**
Camoufox is a Firefox fork, so its recursion depth falls in the SpiderMonkey range. If it presents a Chrome User-Agent, that depth contradicts the range V8 (Chrome's engine) would produce, revealing a Firefox engine hiding behind a Chrome identity. Measuring depth separately on the main thread, in Web Workers, and in WASM adds further engine-specific detail.
---
## What Is CSS Media Query Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-css-media-query-fingerprinting
**CSS media query fingerprinting reads operating-system and device preferences through window.matchMedia().** A media query is a yes/no question about the environment, such as "is dark mode on?" Modern CSS lets a page ask about the user's color-scheme preference, reduced-motion and reduced-transparency settings, contrast level, forced-colors mode, the primary pointer type (mouse vs touch), hover capability, and display metrics like screen size. Each answer is low-entropy on its own - meaning it only slightly narrows down who you are - but together they describe a device class and a user configuration that must be coherent with the platform. Headless browsers (browsers run by automation, with no visible window) return revealing defaults that real users rarely all share.
### Quick facts
- **API:** window.matchMedia(query).matches
- **Reads:** prefers-color-scheme, prefers-reduced-motion, contrast, forced-colors, pointer, hover
- **Pointer/hover:** pointer:coarse + hover:none = touch device; pointer:fine = mouse
- **Coherence trap:** A phone UA must report coarse pointer and no hover
- **Headless default:** Uniform light scheme, fine pointer, hover - same across a fleet
### What media queries expose
matchMedia() answers yes/no to environment questions that come from the OS and the device:
- **prefers-color-scheme** - light or dark, set by the OS theme.
- **prefers-reduced-motion** and **prefers-reduced-transparency** - accessibility settings.
- **prefers-contrast** and **forced-colors** - high-contrast / Windows high-contrast mode.
- **pointer** and **any-pointer** - fine (mouse/stylus) vs coarse (finger).
- **hover** and **any-hover** - whether the primary input can hover.
- **display metrics** - resolution, aspect ratio, orientation.
Each answer is one bit or a small set of choices, so any single feature reveals little (its *entropy* - how much it narrows down the device - is low). The real value is in the combination, and in whether it stays consistent with the rest of the device's story.
### The coherence and uniformity tells
This catches scrapers in two ways. First, **device coherence** - the values have to agree with the device the browser claims to be. A request with an Android or iPhone User-Agent (the browser/device string sent with each request) must report pointer: coarse and hover: none, because phones have no fine pointer and cannot hover. A "phone" that reports pointer: fine and hover: hover is really a desktop headless browser wearing a mobile UA. Likewise a touch tablet and a mouse-driven desktop have different pointer/hover signatures, and these must match the claimed device and the touch/hardware signals.
Second, **fleet uniformity** - bots tend to look identical to each other. Headless browsers default to a light color scheme, no reduced motion, a fine pointer, and hover enabled. Real human traffic is a mix: a meaningful fraction use dark mode, some enable reduced motion, mobile users report coarse pointers. A population of visitors that *all* report the same headless-default media profile stands out against that natural spread. Anti-bot systems look at the distribution across many visitors, not just the values from one.
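Both tells can be sketched on the server side in a few lines of Python. This is a minimal sketch under assumptions: the profile is a dict with the hypothetical keys `dark`, `coarse`, and `noHover`, and the function names are invented for illustration.

```python
import re
from collections import Counter

MOBILE_UA = re.compile(r"iPhone|Android")

def mobile_ua_incoherent(user_agent, profile):
    """True when a phone User-Agent reports a desktop pointer/hover profile."""
    if not MOBILE_UA.search(user_agent):
        return False
    return not (profile["coarse"] and profile["noHover"])

def top_profile_share(profiles):
    """Fraction of visitors that share the single most common media profile."""
    if not profiles:
        return 0.0
    counts = Counter(tuple(sorted(p.items())) for p in profiles)
    return counts.most_common(1)[0][1] / len(profiles)
```

A share close to 1.0 across a large population suggests a fleet running identical defaults; what threshold counts as suspicious is a tuning decision this sketch does not make.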
### Example
```javascript
// OS/device preferences read via matchMedia
const mq = q => window.matchMedia(q).matches;
const cssProfile = {
  dark:         mq('(prefers-color-scheme: dark)'),
  reduceMotion: mq('(prefers-reduced-motion: reduce)'),
  forcedColors: mq('(forced-colors: active)'),
  coarse:       mq('(pointer: coarse)'),
  noHover:      mq('(hover: none)')
};
// Coherence check (pseudo):
//   /iPhone|Android/.test(ua) && (!cssProfile.coarse || !cssProfile.noHover)
//     -> mobile UA but desktop pointer/hover = headless wearing a phone UA
// Distribution check:
//   entire visitor fleet reporting {dark:false, coarse:false, noHover:false}
//     -> headless default profile, anomalous vs real-user spread
```
### FAQ
**Q: How much does a single media query reveal?**
Very little on its own - most are just one bit (dark mode on or off, hover yes or no). The signal comes from combining them and checking they are coherent with the device the browser claims to be, plus comparing them across a whole population of visitors. A phone that reports a fine pointer, or an entire fleet sharing the identical headless-default profile, is what stands out.
**Q: What pointer and hover values should a mobile scraper report?**
A real phone reports pointer: coarse and hover: none (and any-pointer: coarse, any-hover: none). If you present a mobile User-Agent, these values must match it, along with touch support and screen metrics. Reporting a fine pointer or hover capability under a phone UA is a direct contradiction that gives the bot away.
**Q: Should I randomize prefers-color-scheme to look human?**
Setting a realistic mix instead of always-light helps you avoid fleet uniformity, but the value must stay coherent and stable within a session and match any theme actually rendered on the page. The goal is to resemble the natural distribution of real users, not to flip the value at random - random flipping can itself look anomalous.
---
## What Is Screen Resolution Fingerprinting?
URL: https://scrappey.com/qa/anti-bot/what-is-screen-resolution-fingerprinting
**Screen resolution fingerprinting reads the display measurements a browser reports - screen.width/height, availWidth/availHeight, colorDepth, devicePixelRatio, and the inner/outer window sizes.** It is a way to identify a device by its display. On their own these values are only moderately identifying (lots of people share common resolutions, but the available-area and pixel-ratio mix adds entropy - extra detail that narrows down who you are). What matters more is that they must fit together: the window cannot be larger than the screen, the available area must be the screen minus realistic OS bars like a taskbar, and the device pixel ratio (physical pixels per CSS pixel) must match the device being claimed. Headless browsers (browsers driven by code, with no visible window) report default sizes and impossible combinations that give them away.
### Quick facts
- **Reads:** screen.width/height, availWidth/availHeight, colorDepth, devicePixelRatio
- **Window:** innerWidth/Height, outerWidth/Height, screenX/Y
- **Coherence rules:** window <= screen; avail = screen - OS chrome; dpr matches device
- **Headless tell:** 800x600, 0x0, or inner == outer (no browser chrome)
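- **Retina rule:** iPhones always report devicePixelRatio >= 2; Macs report 2 on built-in Retina panels but can report 1 on an external standard-density monitor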
### The display geometry surface
A browser exposes two related sets of measurements. First the **screen** itself: screen.width/height (the full display), availWidth/availHeight (the display minus the taskbar, dock, or menu bar), colorDepth (bits of color per pixel, almost always 24), and devicePixelRatio (physical pixels per CSS pixel - 1 on a standard screen, 2 on Retina, 1.25 or 1.5 on scaled Windows). Then the **window**: innerWidth/innerHeight (the viewport, meaning the visible page area), outerWidth/outerHeight (including the browser's own toolbars, known as chrome), and screenX/screenY (where the window sits on the screen). Common resolutions (1920x1080, 1440x900, 390x844) are shared by many users, so resolution alone is low-entropy. But the full set together - resolution plus available area plus pixel ratio plus window size - is meaningfully identifying and, just as important, it is structured: the values have to make sense relative to each other.
### Coherence rules and headless defaults
That structure is what catches bots, because the values cannot be set independently:
- **Window within screen** - outerWidth/Height cannot exceed screen.availWidth/Height. A window bigger than the screen is impossible on real hardware.
- **Available area** - availHeight should be screen.height minus a plausible taskbar/dock; if the two are equal (avail == full) it implies no OS bars at all, which is common in headless.
- **Chrome height** - outerHeight - innerHeight is the height of the browser's toolbars; zero means no chrome (headless), and the same fixed, unrealistic value across many machines is a tell.
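- **Pixel ratio coherence** - an iPhone User-Agent paired with devicePixelRatio: 1 is a contradiction, because every iPhone reports >= 2. A Mac User-Agent is weaker evidence: built-in Retina panels report 2, but a Mac driving an external standard-density monitor can legitimately report 1.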
Headless browsers ship with telltale default geometry - 800x600, 1280x720 with inner == outer, or 0x0 - and because those defaults are reused across an entire scraping fleet, they stand out against the natural spread of real displays.
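The rules above can be sketched as a server-side check in Python. This is a minimal sketch: the dict keys (`screen_w`, `avail_h`, `outer_h`, and so on) and the rule names are invented for illustration, and the `slack` parameter is an assumption added to tolerate small window-border overshoot. The pixel-ratio rule is applied only to iPhone User-Agents, since a Mac on an external monitor can report 1.

```python
HEADLESS_SIZES = {(800, 600), (0, 0)}

def display_contradictions(d, slack=0):
    """Return the names of the coherence rules a reported display profile breaks."""
    issues = []
    # Window must fit inside the available screen area.
    if d["outer_w"] > d["avail_w"] + slack or d["outer_h"] > d["avail_h"] + slack:
        issues.append("window_larger_than_screen")
    # outer - inner is the toolbar height; zero or less means no browser chrome.
    if d["outer_h"] - d["inner_h"] <= 0:
        issues.append("no_browser_chrome")
    if (d["screen_w"], d["screen_h"]) in HEADLESS_SIZES:
        issues.append("headless_default_size")
    if "iPhone" in d["ua"] and d["dpr"] < 2:
        issues.append("iphone_low_dpr")
    return issues
```

Returning rule names instead of a single boolean keeps each contradiction available as its own signal for whatever scoring sits downstream.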
### Setting believable display metrics
The fix is to present one complete, internally consistent display profile that matches the device you claim to be: a common real resolution, an available area reduced by a realistic amount of OS chrome, a window smaller than the screen with a non-zero toolbar height, and a device pixel ratio that fits the platform. For a desktop, 1920x1080 with availHeight around 1040, dpr 1, and a windowed viewport is a plausible baseline; for a specific phone, copy that model's exact metrics and pixel ratio.
As with the rest of the fingerprint, the display values must also agree with the User-Agent, Client Hints, hardware signals, and media queries (orientation, pointer type). The display is just one facet of a single coherent device identity - editing screen.width on its own, while leaving the window sizes and pixel ratio at their headless defaults, creates exactly the contradictions anti-bot systems look for.
### Example
```javascript
// Display tuple an anti-bot script reads
const display = {
  screen: [screen.width, screen.height].join('x'),
  avail:  [screen.availWidth, screen.availHeight].join('x'),
  depth:  screen.colorDepth,        // ~24
  dpr:    window.devicePixelRatio,  // 1 / 2 / 1.5
  inner:  [innerWidth, innerHeight].join('x'),
  outer:  [outerWidth, outerHeight].join('x')
};
// Coherence checks (pseudo):
//   outerWidth > screen.availWidth        -> window bigger than screen (impossible)
//   outerHeight - innerHeight === 0       -> no browser chrome (headless)
//   /iPhone/.test(ua) && display.dpr < 2  -> iPhone with dpr 1 (contradiction);
//                                            a Mac on an external monitor may report 1
//   display.screen === '800x600' || display.screen === '0x0' -> headless default
```
### FAQ
**Q: Is screen resolution a strong fingerprint on its own?**
Only moderately. Common resolutions are shared by many users, so resolution by itself is low-entropy, but the full set (resolution, available area, pixel ratio, window sizes) is more identifying. Its bigger value is coherence: the window cannot exceed the screen, the pixel ratio must match the device, and headless defaults get reused across whole fleets - so the combination, not the resolution alone, is what flags a bot.
**Q: What gives away a headless browser in the screen metrics?**
Telltale default sizes (800x600, 1280x720, 0x0), inner dimensions equal to outer (meaning no browser toolbar height), available area equal to the full screen (no taskbar or dock), and these same values repeated across an entire fleet. A window larger than the screen, or an iPhone User-Agent reporting devicePixelRatio 1, is also a direct contradiction.
**Q: How should I set window and screen sizes for a scraper?**
Use one complete, coherent profile for the device you claim to be: a common real resolution, an available area reduced by a realistic amount of OS chrome, a windowed viewport smaller than the screen with a non-zero toolbar height, and a matching device pixel ratio. Keep it consistent with the User-Agent, Client Hints, and media-query signals - do not edit one value in isolation.
---
# Crawling
Discovering and fetching pages at scale — crawl scope, politeness, sitemaps, and how scrapers traverse links without getting blocked or wasting budget.
## What Is a Web Crawler?
URL: https://scrappey.com/qa/crawling/what-is-a-web-crawler
**A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed URLs), reads the links on them, visits those, and keeps going, collecting URLs and their contents along the way.** Think of a crawler as the part that *discovers* pages, and a scraper as the part that *extracts* data from them. Google's indexer is the most famous crawler. In a scraping setup, the crawler walks a target site to find URLs worth scraping, while the scraper pulls structured data out of each URL the crawler finds.
### Quick facts
- **Job:** Discover URLs by following links from seeds
- **Output:** A set of (URL, fetched content) pairs
- **vs Scraper:** Crawler discovers; scraper extracts
- **Politeness:** Respect robots.txt, rate limits, crawl-delay
- **Budget:** Bounded by depth, page count, or domain whitelist
### Crawling vs scraping
People mix up these two words. A crawler works in *breadth* - its job is to find URLs by following links. A scraper works in *depth* - its job is to extract specific fields (a price, a title) from a URL you already have. A full pipeline usually does both: crawl to list the URLs you care about, then scrape each one. If you already have the list of URLs - from an export, a sitemap, or an API - there's nothing to discover, so you skip crawling and go straight to scraping.
### How crawlers work
You start with seed URLs in a queue (this to-do list of pages-to-visit is called the frontier). The crawler then repeats one loop: take a URL off the queue, fetch the page, pull out all its links, tidy them up and drop duplicates, keep only the ones that fit your rules (same domain, allowed paths), and add the new ones back to the queue. It repeats until the queue is empty or it hits the limit you set. Real crawlers add a few things on top: respecting robots.txt, limiting how fast they hit each host, removing duplicate URLs by canonicalizing them (reducing different-looking URLs that point to the same page down to one standard form), and an incremental mode (re-crawling only URLs whose content might have changed).
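The "tidy them up, drop duplicates, keep only the ones that fit your rules" step can be sketched with the standard library. This is a minimal sketch: `normalize`, `in_scope`, and `new_links` are names invented here, and real canonicalization rules (trailing slashes, tracking parameters) vary by site.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base_url, href):
    """Resolve href against base_url and reduce it to one standard form."""
    parts = urlsplit(urljoin(base_url, href))
    path = parts.path or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def in_scope(url, allowed_host, allowed_prefix="/"):
    """Keep only http(s) URLs on the allowed host and under the allowed path."""
    parts = urlsplit(url)
    return (parts.scheme in ("http", "https")
            and parts.netloc == allowed_host
            and parts.path.startswith(allowed_prefix))

def new_links(base_url, hrefs, seen, allowed_host):
    """Normalized, in-scope links not seen before, in first-seen order."""
    out = []
    for href in hrefs:
        url = normalize(base_url, href)
        if in_scope(url, allowed_host) and url not in seen and url not in out:
            out.append(url)
    return out
```

Normalizing before the duplicate check is what lets `/a` and `/a#section` collapse into a single frontier entry.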
### Politeness and limits
A crawler that ignores robots.txt or hammers a host at 100 requests/second is hostile and gets blocked at the first opportunity. Polite crawling means: respect Disallow directives (the robots.txt rules that say which paths are off-limits), honor Crawl-delay if present (a requested wait between requests), cap per-host concurrency (1-5 connections at a time), back off when you get 429/503 responses (the server telling you to slow down or that it's overloaded), and identify yourself with a real User-Agent and a contact URL so site owners can reach you. Polite crawlers get a lot further than aggressive ones.
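The back-off rule can be sketched as a small pure function. This is a minimal sketch: the function name, the `base` and `cap` defaults, and the 10% jitter are assumptions, and only a numeric Retry-After value is handled (the header can also carry an HTTP date, which is ignored here).

```python
import random

def backoff_delay(status, attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying, or 0.0 when no back-off is needed.

    attempt counts from 0. A numeric Retry-After header value wins when present.
    """
    if status not in (429, 503):
        return 0.0
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Exponential back-off, plus a little jitter so workers do not retry in lockstep.
    delay = min(base * (2 ** attempt), cap)
    return min(cap, delay + random.uniform(0, delay * 0.1))
```

A crawler would call this after each response and sleep for the returned time before sending the next request to that host.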
### Example
```python
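from collections import deque

def crawl(seed, fetch_links, max_pages=1000):
    """Breadth-first crawl; fetch_links(url) returns the links found on that page."""
    frontier = deque([seed])
    seen = set()
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        # Fetch the page, extract its links, and queue the ones not yet visited.
        for link in fetch_links(url):
            if link not in seen:
                frontier.append(link)
    return seen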
```
### FAQ
**Q: When do I need a crawler vs a scraper?**
Use a crawler when you do not have the list of URLs yet and need to discover them. Use a scraper when you already have the URLs and just need to pull data out of them. Most projects need both.
**Q: Should I respect robots.txt?**
For public-facing crawls, yes - it is the industry norm, and ignoring it invites IP blocks and legal pushback. For internal use against sites you own, it is your call.
**Q: What is the best crawler library?**
Scrapy for Python, Crawlee (the successor to the Apify SDK's crawling tools) for Node, or a managed crawl endpoint if you do not want to operate it yourself. For small one-off crawls, a few hundred lines of custom code is faster than learning a framework.
---
## What Is Crawl Budget?
URL: https://scrappey.com/qa/crawling/what-is-crawl-budget
**Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time.** In plain terms, it is a cap on how much crawling you allow before stopping. The term started in SEO (Google's crawl budget for a given site) but the idea applies to any custom crawler you write. Without a budget, a crawler can run forever on a large site; with a budget that is too small, you miss the pages you actually wanted. The real skill is spending the budget on the URLs that matter.
### Quick facts
- **Units:** Pages, requests, wall time, or all three
- **Per-host caps:** Avoid being abusive; per-host limit is its own budget
- **Spend on:** Content URLs, not pagination/sort/filter combinations
- **Common waste:** Faceted nav, search results, infinite calendar URLs
- **SEO equivalent:** Google's crawl budget — same concept, server-controlled
### Why budgets exist
Real sites have a near-endless supply of URLs: pagination, sorting, filtering, search results, calendar pages. A naive crawl follows every combination and hits millions of low-value pages before it ever reaches the content you came for. A budget forces you to set priorities: which URL patterns are worth crawling, in what order, and where to stop.
### Spending the budget well
The standard playbook: grab the sitemap first to get the canonical list of content URLs, then crawl section by section in priority order. Limit depth to 3-5 hops (link clicks) from each seed URL, and skip patterns that explode into endless combinations - faceted filters, sort variants, and session IDs (per-visit identifiers stuck in the URL). When you hit the budget, log what you reached and what you missed, so the next run can pick up where this one stopped.
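Skipping the exploding patterns can be sketched as a filter on query-parameter names. This is a minimal sketch: `WASTE_PARAMS` holds hypothetical names, and the real ones have to be read off the target site's URLs.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical parameter names; inspect the target site to find the real ones.
WASTE_PARAMS = frozenset({"sort", "order", "filter", "color", "size", "sessionid", "sid"})

def worth_crawling(url, waste_params=WASTE_PARAMS):
    """False for URLs whose query string marks them as sort/filter/session variants."""
    query = urlsplit(url).query
    names = {name.lower() for name, _ in parse_qsl(query, keep_blank_values=True)}
    return not (names & waste_params)
```

Plain pagination (`?page=2`) passes this filter on purpose, since listing pages are usually how content URLs are discovered.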
### SEO crawl budget
In SEO, "crawl budget" means how often Googlebot will fetch your site - a limit Google sets based on your site speed, how fresh your content is, and your domain authority. You spend it wisely by exposing fast, canonical (single official version) URLs and not wasting it on duplicate content. The principle matches a custom crawler exactly: spend the budget on URLs that matter, and prevent waste on URLs that do not.
### Example
```python
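import time
from urllib.parse import urlparse

class BudgetedCrawler:
    def __init__(self, max_pages, max_seconds, max_per_host):
        self.max_pages = max_pages
        self.deadline = time.time() + max_seconds
        self.per_host_cap = max_per_host
        self.host_counts = {}
        self.seen = set()

    def can_fetch(self, url):
        """True while the page, time, and per-host budgets all have room."""
        if len(self.seen) >= self.max_pages:
            return False
        if time.time() > self.deadline:
            return False
        host = urlparse(url).netloc
        return self.host_counts.get(host, 0) < self.per_host_cap

    def record(self, url):
        """Charge a fetched URL against the budgets (call after each fetch)."""
        host = urlparse(url).netloc
        self.seen.add(url)
        self.host_counts[host] = self.host_counts.get(host, 0) + 1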
```
### FAQ
**Q: How big should my crawl budget be?**
Start with enough pages to cover the content you care about, plus a 20% buffer. For a site you do not know well, run a small recon crawl first (around 500 pages) to estimate its shape and size, then set the real crawl's budget accordingly.
**Q: Should I budget per host or globally?**
Both. A global cap stops a single runaway job from spiraling out of control, while a per-host cap stops you from hammering any one site even when your overall crawl is reasonable.
**Q: What if my budget runs out mid-crawl?**
Save the frontier (the queue of URLs still to visit) and the seen-set (URLs already visited) to disk. The next run loads them and resumes where you left off. Starting over from scratch wastes both your budget and the target site's bandwidth.
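Saving and resuming can be sketched with a single JSON file. This is a minimal sketch: the function names and the file layout are assumptions, and a long-running crawler would more likely checkpoint periodically to a database.

```python
import json
from collections import deque

def save_state(path, frontier, seen):
    """Write the URLs still to visit and the URLs already visited to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"frontier": list(frontier), "seen": sorted(seen)}, f)

def load_state(path):
    """Restore the frontier (as a deque) and the seen-set from a saved file."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["frontier"]), set(state["seen"])
```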
---
## What Is Crawl Depth Limit?
URL: https://scrappey.com/qa/crawling/what-is-crawl-depth-limit
**Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL.** A "hop" is one click along a link. The page you start on (the *seed*) is depth 0; depth 1 is the seed plus everything linked from it; depth 2 follows the links on those pages, and so on. Combined with a budget (a cap on total pages), depth shapes which parts of a site get reached. Most content lives within 2-4 hops of the homepage; beyond that you mostly find pagination, filters, and tag pages.
### Quick facts
- **Depth 0:** Seed page only
- **Depth 1:** Seed + direct links — usually category pages
- **Depth 2-3:** Most content on well-structured sites
- **Depth 4+:** Diminishing returns; mostly filters and tags
- **Combined with:** Crawl budget (pages cap) and scope (domain/path filter)
### Where content lives
Most sites follow a simple layering. The homepage links to category pages (depth 1), category pages link to listings (depth 2), and listings link to the actual detail pages you want (depth 3). Going deeper rarely reveals anything new — you start hitting pagination, sort variants, and tag clouds instead. Setting depth at 3-4 captures the bulk of meaningful URLs without spending your budget on this combinatorial junk (the explosion of near-duplicate URLs created by filters and sort options).
### Depth vs budget interaction
Depth and budget pull on each other. A high depth limit with a small budget runs out partway through and stops mid-traversal; a low depth limit with a large budget leaves capacity unused. The rule of thumb: set depth to match the natural shape of the site (3-4 hops for most), then size the budget to the number of pages you expect at each level, added together, plus a safety margin. Page counts grow by multiplying fanout (how many new links a typical page has) from one level to the next, so they rise much faster than depth times fanout. For example, a site with 20 categories and 200 items each has about 4,000 item pages, so it fits in about 5,000 pages at depth 3.
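The sizing arithmetic can be worked through in a few lines. This is a minimal sketch under assumptions: `estimate_pages` is an invented name, the 20% margin is a placeholder, and the example assumes each of the 20 categories is split into 10 listing pages of 20 items (the split is hypothetical; only the 20 categories and 200 items come from the example above).

```python
def estimate_pages(fanouts, margin=0.2):
    """Estimate reachable pages level by level.

    fanouts[i] is the average number of new in-scope links per page at depth i.
    Returns (pages_without_margin, budget_with_margin).
    """
    total = 1   # the seed page at depth 0
    level = 1   # number of pages at the current depth
    for fanout in fanouts:
        level *= fanout
        total += level
    return total, int(total * (1 + margin))

# 1 homepage + 20 categories + 200 listing pages + 4,000 item pages
pages, budget = estimate_pages([20, 10, 20])
```

Here `pages` comes out at 4,221 and `budget` at 5,065, in line with the "about 5,000 pages at depth 3" figure.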
### Per-pattern depth
Advanced crawlers set depth differently depending on the type of URL, rather than using one number everywhere. Detail pages get depth 0 (fetch them, but do not follow their links). Category pages get a high depth, since that is where you discover items. Pagination links (the "next page" links) get capped at 50-100 to avoid infinite-calendar traps — pages like a calendar's "next month" link that go on forever. This takes more setup than a single global limit, but it dramatically improves how efficiently you spend your budget on large sites.
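Per-pattern rules can be sketched as two small predicates. This is a minimal sketch: the URL shapes (`/item/<id>` for detail pages, a `page=` query parameter for pagination) are hypothetical and need to be replaced with the target site's real patterns.

```python
import re

# Hypothetical URL shapes; real patterns depend on the target site.
DETAIL = re.compile(r"/item/\d+$")
PAGINATION = re.compile(r"[?&]page=(\d+)")

def should_follow_links(url):
    """Detail pages are fetched, but their links are not followed."""
    return DETAIL.search(url) is None

def should_enqueue(url, max_page=100):
    """Drop pagination links beyond max_page; allow everything else."""
    match = PAGINATION.search(url)
    return match is None or int(match.group(1)) <= max_page
```

A crawl loop would call `should_enqueue` before adding a link to the frontier and `should_follow_links` after fetching a page, before extracting its links.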
### Example
```python
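from collections import deque

def crawl_with_depth(seed, extract_links, max_depth=3):
    """Breadth-first crawl that stops following links past max_depth hops."""
    frontier = deque([(seed, 0)])
    seen = set()
    while frontier:
        url, depth = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        if depth == max_depth:
            continue  # visit this page, but do not follow its links
        # extract_links(url) is a caller-supplied function returning the page's links
        for link in extract_links(url):
            if link not in seen:
                frontier.append((link, depth + 1))
    return seen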
```
### FAQ
**Q: What depth should I start with?**
Start with 3 for content sites and 2 for e-commerce listings (homepage → category → item is exactly 2 hops). Then adjust after looking at what you actually reached in the first run.
**Q: Does depth limit help with infinite-link traps?**
Partially. A depth limit caps the worst case, but a single bad pattern — for instance calendar URLs that link to next-month forever — can still burn through your budget at depth 1. Combine depth limits with URL pattern exclusions (rules that skip URLs matching a pattern) for real protection.
**Q: What is the difference between depth and crawl budget?**
Depth limits how far you walk from each seed; budget limits the total amount of crawling overall. They are separate controls, and you need both.
---
## What Is the robots.txt Protocol?
URL: https://scrappey.com/qa/crawling/what-is-the-robots-txt-protocol
**robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch.** Think of it as a "please don't go in here" note posted at a site's front door - it states the owner's wishes but locks nothing. It is a voluntary convention: there is no enforcement, no authentication (no login check), and no penalty in the protocol itself for ignoring it. Reputable crawlers (Googlebot, Bingbot, Common Crawl) and well-behaved custom crawlers respect it anyway. Ignoring it for public-facing crawling is a fast way to get IP-blocked and, in some jurisdictions, sued.
### Quick facts
- **Location:** /robots.txt at the root of every domain
- **Format:** Plain text, User-agent + Allow/Disallow rules
- **Enforcement:** Voluntary — convention only, no protocol enforcement
- **Common directives:** Disallow, Allow, Crawl-delay, Sitemap
- **Does NOT cover:** Authentication, content restrictions, rate limiting
### The basic format
A robots.txt file is a list of User-agent blocks. A user-agent is the name a crawler sends to identify itself, so each block names which crawlers its rules apply to (or * for all of them). Inside a block, Disallow lists paths to skip and Allow lists exceptions that are okay to fetch. At the top level, Sitemap directives point to sitemap XML files (machine-readable lists of a site's URLs). Crawl-delay (where honored) requests a minimum gap between requests so you don't hammer the server. To opt out of AI training, sites typically add Disallow rules for specific AI crawler user-agents (such as GPTBot or Google-Extended); proposals like ai.txt aim to extend the same convention but are not a standard.
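To make the format concrete, here is a small made-up file parsed offline with Python's standard `urllib.robotparser`. This is a minimal sketch: the `SAMPLE` rules and the `parse_robots` helper are invented for illustration. Python's parser applies the first matching rule in file order, so the Allow exception is listed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 2

Sitemap: https://example.com/sitemap.xml
"""

def parse_robots(text):
    """Build a parser from robots.txt text that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

rp = parse_robots(SAMPLE)
blog_ok = rp.can_fetch("MyCrawler/1.0", "https://example.com/blog/")
private_ok = rp.can_fetch("MyCrawler/1.0", "https://example.com/private/data")
press_ok = rp.can_fetch("MyCrawler/1.0", "https://example.com/private/press/kit.html")
```

With these rules the blog and press URLs are allowed and the other `/private/` URL is not; `rp.crawl_delay(...)` and `rp.site_maps()` expose the Crawl-delay and Sitemap lines.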
### What it does NOT do
robots.txt is not authentication, not access control, and not rate limiting. A disallowed path is still publicly reachable — anyone can type the URL and fetch it directly. If you want to actually prevent access to content, put it behind a login (authentication). If you only want to keep a page out of search results, use X-Robots-Tag headers or <meta name="robots"> tags on the page itself. robots.txt only tells well-behaved crawlers what to skip — it changes behavior, not permissions.
### How custom crawlers should handle it
The basic etiquette: fetch /robots.txt once per host at the start of a crawl, parse it (Python's urllib.robotparser or a third-party library does this for you), and check every candidate URL against the rules before fetching it. Cache the parsed rules for the duration of the crawl so you don't re-download them. Treat Crawl-delay as a minimum gap between requests to that host. If the file is missing (a 404), default to "allow all", which is the agreed-on convention. If it can't be fetched because of a server error (5xx), RFC 9309 says to assume everything is disallowed until the file is reachable again.
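The fetch-once, cache, and check-every-URL routine can be sketched as a small class. This is a minimal sketch: `RobotsCache` and the `fetch_text` callable (which takes a robots.txt URL and returns its text) are hypothetical, and download-error handling is left to the caller.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Parse each host's robots.txt once and reuse the rules for the whole crawl."""

    def __init__(self, fetch_text):
        # fetch_text(robots_url) -> str is supplied by the caller (hypothetical).
        self.fetch_text = fetch_text
        self.parsers = {}

    def _parser_for(self, url):
        parts = urlsplit(url)
        root = f"{parts.scheme}://{parts.netloc}"
        if root not in self.parsers:
            rp = RobotFileParser()
            rp.parse(self.fetch_text(root + "/robots.txt").splitlines())
            self.parsers[root] = rp
        return self.parsers[root]

    def allowed(self, agent, url):
        """Check a candidate URL against the cached rules for its host."""
        return self._parser_for(url).can_fetch(agent, url)

    def delay(self, agent, url):
        """Crawl-delay in seconds for this agent, or None if none applies."""
        return self._parser_for(url).crawl_delay(agent)
```

Keying the cache on scheme plus host keeps rules for `http://` and `https://`, or for different subdomains, from being mixed up.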
### Example
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # downloads and parses the file
if rp.can_fetch('MyCrawler/1.0', 'https://example.com/private/'):
    pass  # allowed
delay = rp.crawl_delay('MyCrawler/1.0')  # None when no Crawl-delay applies
```
### FAQ
**Q: Is robots.txt legally binding?**
In most jurisdictions, no — but ignoring it has been argued in court as evidence of bad faith. Reputable scraping projects respect it for both ethical and risk-management reasons.
**Q: What if robots.txt blocks everything I want?**
Reach out to the site owner. Most are open to negotiated access — a license, an API key, or a polite scrape schedule — for legitimate use cases.
**Q: Does robots.txt block AI training?**
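robots.txt has no dedicated AI-training directive, but sites increasingly use it to disallow specific AI crawler user-agents (for example GPTBot or Google-Extended). Separate proposals such as ai.txt and the noai meta directive (a signal that says "don't use this for AI") exist but are not standardized or universally honored. Respect all of these signals if your pipeline feeds an LLM.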
---
## What Is a Sitemap?
URL: https://scrappey.com/qa/crawling/what-is-a-sitemap
**A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority.** XML is a tag-based text format, and a canonical URL is the single "official" address a site picks for a page. Sitemaps were designed for search engines, but they are gold for custom crawlers — pulling the sitemap gives you the site's preferred URL list without crawling at all (without following links page by page). For most content sites this is faster, cheaper, and more complete than link-following.
### Quick facts
- **Locations:** /sitemap.xml, /sitemap_index.xml, or listed in robots.txt
- **Format:** XML following the sitemaps.org schema
- **Size limit:** 50,000 URLs / 50MB per file — large sites use index files
- **Per-URL metadata:** loc, lastmod, changefreq, priority
- **Crawl benefit:** Skip link traversal entirely — go straight to known URLs
### Why crawlers should check the sitemap first
If your goal is to grab content pages (not literally every link on a site), the sitemap is usually a more complete and more efficient source than link-following. Picture a site with 100,000 articles buried behind faceted navigation (filter menus by date, category, tag) — walking link by link is a nightmare. The sitemap lists every article flat, in one file. Fetch /sitemap.xml, parse it, and you have the URL list — then scrape each URL directly. This can cut crawl time by 10-100x.
### Sitemap index files
Large sites split their sitemap into several files tied together by an index — a sitemap whose only job is to point to other sitemaps. For example, /sitemap_index.xml points to /sitemap-articles-1.xml, /sitemap-articles-2.xml, and so on. Your crawler should handle this: fetch the index, fetch each child sitemap it lists, then join the URL lists together. Site owners often split files by content type (articles, products, categories), so you can target just the section you care about.
### When the sitemap is missing or stale
Many smaller sites either have no sitemap or one that has not been regenerated in months (a stale sitemap). When that happens, fall back to link-following or use the site's news/RSS feeds (auto-updating lists of recent posts). If a site has a sitemap but it is stale, combine the two: use the sitemap for the bulk of the URLs, and add a quick recent-changes crawl (the homepage plus the first few category pages) to catch new pages the sitemap missed.
### Example
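```python
import requests
from xml.etree import ElementTree as ET

NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap_urls(url):
    """Return every page URL in a sitemap, following index files recursively."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    # Parse the raw bytes so the parser applies the encoding declared in the XML.
    root = ET.fromstring(r.content)
    if root.tag.endswith('sitemapindex'):
        # An index lists child sitemaps rather than pages.
        urls = []
        for sm in root.findall('s:sitemap/s:loc', NS):
            urls.extend(fetch_sitemap_urls(sm.text.strip()))
        return urls
    return [loc.text.strip() for loc in root.findall('s:url/s:loc', NS)]
```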
### FAQ
**Q: Where do I find the sitemap?**
/sitemap.xml is the conventional location, but the authoritative answer is in /robots.txt — the text file that tells crawlers what they may access. Look for a Sitemap: directive (line) there. Large sites have multiple sitemaps, so robots.txt is where you'll find the full list.
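As a rough sketch of that lookup, the function below pulls every Sitemap: line out of a robots.txt body that you have already downloaded. The function name and the sample file are illustrative assumptions, not part of any library.

```python
def sitemaps_from_robots(robots_txt):
    """Return every URL named in a 'Sitemap:' line of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split('#', 1)[0].strip()
        key, sep, value = line.partition(':')
        if sep and key.strip().lower() == 'sitemap' and value.strip():
            found.append(value.strip())
    return found

# Hypothetical robots.txt content for illustration.
sample = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap_index.xml
sitemap: https://example.com/news-sitemap.xml  # field name matched case-insensitively
"""
print(sitemaps_from_robots(sample))
```

Each URL it returns can be fed to a sitemap parser in turn; if the list is empty, trying /sitemap.xml directly is the usual fallback.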
**Q: Can I trust the sitemap to be complete?**
Mostly — it represents the site owner's canonical URL set, meaning the pages they want crawlers to find. Pages they do not want indexed (showing up in search results) won't appear. For a truly complete crawl, combine the sitemap with link-following.
**Q: Does the lastmod field actually update?**
Sometimes. lastmod is supposed to show when a page last changed, but many sites don't keep it accurate — treat it as a hint, not ground truth. To reliably detect changes, hash the page content yourself (compute a short fingerprint and compare it over time) rather than relying on the sitemap.
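One way to do that hashing, sketched under the assumption that you keep a simple in-memory dict of fingerprints (a real crawler would persist it). The function names are hypothetical, and the whitespace normalization is a design choice rather than a requirement.

```python
import hashlib

def content_fingerprint(html):
    # Collapse whitespace so purely cosmetic reformatting does not change the hash.
    normalized = ' '.join(html.split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

def has_changed(url, html, known):
    """Compare against the stored fingerprint for url, then update the store."""
    digest = content_fingerprint(html)
    changed = known.get(url) != digest
    known[url] = digest
    return changed

store = {}
print(has_changed('https://example.com/a', '<p>hello</p>', store))    # first sighting
print(has_changed('https://example.com/a', '<p>hello</p>\n', store))  # whitespace only
print(has_changed('https://example.com/a', '<p>hello!</p>', store))   # real edit
```

Pages with rotating ads, timestamps, or CSRF tokens will hash differently on every fetch, so in practice you would likely hash only the extracted main content.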
---
## What Is Polite Crawling?
URL: https://scrappey.com/qa/crawling/what-is-polite-crawling
**Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits. In practice that means obeying robots.txt (the file where a site lists which pages bots may touch), opening only a few connections per site at a time, honoring any rate limits, saying honestly who your crawler is, and slowing down when the server returns errors.** It is partly good manners and partly self-interest: polite crawlers get blocked less, get more cooperation when they reach out, and avoid the legal and reputational risk of being seen as abusive.
### Quick facts
- **Concurrency per host:** 1-5 connections; not 100
- **Request rate:** 1-10 requests/second/host, lower for small sites
- **Identification:** Real User-Agent + contact URL or email
- **Backoff:** 429 and 503 → exponential backoff, not retry hammer
- **Honor:** robots.txt, Crawl-delay, Retry-After
### Why polite wins
Sites notice aggressive crawlers within minutes — too many simultaneous connections, ignoring robots.txt, hammering on through repeated 5xx server errors without pausing. The response is a block at the IP, the ASN (the network block an IP belongs to), or the fingerprint level. A polite crawler stays quiet enough to slip under that radar and can run for hours or days without being stopped. The math favors politeness: 10 requests per second sustained over an hour gets you 36,000 pages; 100 requests per second that gets blocked after five minutes gets you 30,000 — and you have burned the IP.
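The arithmetic behind that comparison is simple enough to check directly. The helper below is only an illustration; the rates and block time are the figures quoted above, not measurements.

```python
def pages_fetched(requests_per_second, seconds_before_block):
    """Pages collected before the crawl stops, at a constant request rate."""
    return int(requests_per_second * seconds_before_block)

polite = pages_fetched(10, 60 * 60)      # runs the full hour
aggressive = pages_fetched(100, 5 * 60)  # blocked after five minutes
print(polite, aggressive)
```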
### The practical recipe
Per host: keep no more than 1-5 connections open at once. Insert a 100-1000ms gap between requests. Treat the Crawl-delay value in robots.txt as a minimum wait. When you hit a 429 ("too many requests") or 503 ("service unavailable"), back off exponentially — wait 1s, then 2s, 4s, 8s, 16s — and give up after 5 attempts. If the server sends a Retry-After header telling you exactly how long to wait, honor it. Set a User-Agent (the line every request sends to identify itself) that names your crawler and includes a URL or email so the site owner can reach you. Rotate IPs between crawls, not in the middle of a single session.
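The backoff part of the recipe could look roughly like this. It is a sketch, not a drop-in client: `get` is assumed to be any callable that returns an object with `status_code` and `headers` (for example a thin wrapper around requests.get that sets your User-Agent), and only the seconds form of Retry-After is handled.

```python
import time

def wait_seconds(attempt, retry_after=None, base=1.0):
    """Delay before the next retry; attempt is 0-based, giving 1s, 2s, 4s, 8s, 16s."""
    if retry_after and retry_after.strip().isdigit():
        return float(retry_after.strip())  # the server's explicit instruction wins
    # Retry-After can also be an HTTP date; this sketch falls back to
    # exponential backoff in that case.
    return base * (2 ** attempt)

def fetch_with_backoff(get, url, max_retries=5, sleep=time.sleep):
    """Call get(url); on 429/503 wait and retry, giving up after max_retries."""
    for attempt in range(max_retries + 1):
        response = get(url)
        if response.status_code not in (429, 503):
            return response
        if attempt == max_retries:
            break
        sleep(wait_seconds(attempt, response.headers.get('Retry-After')))
    return None  # caller decides what giving up means
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production you would leave it at the default.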
### Why identification matters
A User-Agent like "MyCrawler/1.0 (+https://example.com/crawler)" signals good faith. A site owner who spots it in their logs and has a concern can reach out to you instead of simply blocking. An anonymous crawler wearing a faked browser User-Agent looks like an attack and gets treated like one. Being honest costs you nothing; the goodwill it buys when something goes wrong is significant.
### Example
```python
import time
import requests

class PoliteCrawler:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, user_agent='MyCrawler/1.0 (+https://example.com)'):
        self.ua = user_agent
        self.last_request = {}  # host -> time of the most recent request

    def fetch(self, url, host, delay=1.0):
        # Sleep off whatever remains of the per-host delay.
        gap = time.time() - self.last_request.get(host, 0)
        if gap < delay:
            time.sleep(delay - gap)
        self.last_request[host] = time.time()
        return requests.get(url, headers={'User-Agent': self.ua}, timeout=30)
```
### FAQ
**Q: How slow is polite enough?**
1-2 requests per second per host is safe for almost any site. Large content sites can absorb 10 per second without blinking; small ones may not. When in doubt, slower is better.
**Q: Should I use a real browser UA for polite crawling?**
No — that is dishonest. Use a User-Agent that clearly identifies your crawler. Real-browser User-Agents are for scraping past anti-bot defenses, which is a different problem entirely.
**Q: Is polite crawling slower in total?**
Per host, yes. But polite crawlers keep running without getting blocked, so total throughput often ends up higher — and you keep the IP usable for the next run.
---
## Breadth-First vs Depth-First Crawling
URL: https://scrappey.com/qa/crawling/what-is-breadth-first-vs-depth-first-crawling
**Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of links as far as it goes, then backtracks.** A crawler is a program that follows links from page to page to collect data, and these two strategies just decide the *order* it visits them. BFS surfaces the broad shape of a site quickly and is the default for general-purpose crawlers. DFS reaches deep content faster but risks getting lost in one section of a large site. Most real crawlers use BFS with a depth limit — the simplicity wins.
### Quick facts
- **BFS shape:** Wide and shallow first; depth grows over time
- **DFS shape:** Narrow and deep; one section explored completely before the next
- **Data structure:** BFS = queue; DFS = stack
- **Memory:** BFS uses more memory for the frontier on wide sites
- **Most crawlers use:** BFS with depth cap — the safe default
### Why BFS is the default
BFS works level by level: it gives you the homepage, then every category page, then every listing page, before drilling into individual items. So for a site whose content sits three clicks deep (depth-3), BFS surfaces a meaningful map of the site within the first few hundred requests. DFS, by contrast, might burn the first thousand requests inside one tag's pagination before it ever touches another category. Most crawl goals — "get a sample of every section" — are a better fit for BFS.
### When DFS makes sense
DFS wins when you have one specific deep target and broad coverage does not matter: scraping every product in a single category, say, or every page in one documentation section. It also uses less memory on very wide sites. The crawler holds a waiting list of links it has found but not yet visited; DFS keeps that list as a **stack** (it always takes the newest link next), so the list holds only the unvisited siblings along the current path, roughly fanout × depth links, which stays small. BFS keeps it as a **queue** (oldest link next), so it has to hold an entire level at once, and the size of a level grows roughly as fanout to the power of depth, which can get huge.
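To get a feel for the memory difference, the sketch below crawls a synthetic tree (every page links to the same number of new pages) and records the largest waiting list each strategy ever holds. The tree shape and the function name are assumptions made purely for illustration.

```python
from collections import deque

def peak_frontier(fanout, max_depth, breadth_first):
    """Crawl a synthetic tree and return the largest frontier size seen."""
    frontier = deque([0])  # only each page's depth is stored
    peak = 1
    while frontier:
        # Queue (oldest first) for BFS, stack (newest first) for DFS.
        depth = frontier.popleft() if breadth_first else frontier.pop()
        if depth < max_depth:
            frontier.extend([depth + 1] * fanout)
        peak = max(peak, len(frontier))
    return peak

print(peak_frontier(10, 3, breadth_first=True))   # 1000
print(peak_frontier(10, 3, breadth_first=False))  # 28
```

With 10 links per page and three levels, BFS ends up holding the whole deepest level, while DFS holds only the leftover siblings along one path.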
### The hybrid that wins in practice
The strategy most production crawlers actually use is a mix: BFS with a depth limit, then targeted DFS for extraction. The first pass discovers the site's structure (BFS, depth 3). The second pass dives into the specific subtrees you identified (DFS, with no depth limit but kept within that scope). This gives you both a broad lay of the land and the deep coverage your pipeline needs.
### Example
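```python
from collections import deque

def bfs_crawl(seed, max_depth, get_links):
    # get_links(url) -> iterable of URLs found on that page (you supply it)
    frontier = deque([(seed, 0)])  # queue → BFS
    seen = set()
    while frontier:
        url, d = frontier.popleft()  # oldest link first
        if url in seen or d > max_depth:
            continue
        seen.add(url)
        frontier.extend((link, d + 1) for link in get_links(url))
    return seen

def dfs_crawl(seed, max_depth, get_links):
    stack = [(seed, 0)]  # stack → DFS
    seen = set()
    while stack:
        url, d = stack.pop()  # newest link first
        if url in seen or d > max_depth:
            continue
        seen.add(url)
        stack.extend((link, d + 1) for link in get_links(url))
    return seen
```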
### FAQ
**Q: Which is faster?**
Neither is faster by nature — total wall-clock time depends on the network, not the strategy. The two simply reach pages in a different order. That ordering matters if you plan to stop early once you have enough data, but not if you intend to crawl every page anyway.
**Q: What about priority queues?**
Both BFS and DFS can be upgraded to a priority queue: instead of strict oldest-first (FIFO) or newest-first (LIFO) ordering, you visit pages in order of an importance score such as sitemap priority, link count, or freshness. This is called "best-first" crawling, and large search-engine crawlers prioritize their frontiers in broadly this way.
**Q: Does it matter for a single-section crawl?**
Less so — inside a tight scope the two strategies end up visiting roughly the same pages in roughly the same time. The choice matters most for general-purpose crawls across an unfamiliar site, where the visiting order shapes how quickly you understand its structure.
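The best-first idea from the priority-queue answer can be sketched with Python's heapq. Everything site-specific here is an assumption: `get_links` and `score` are callables you would supply, and the toy site and scores exist only to show the ordering.

```python
import heapq
from itertools import count

def best_first_order(seed, get_links, score, max_pages=100):
    """Visit pages highest-score first; returns the visiting order."""
    tie = count()  # keeps equal scores in discovery order
    heap = [(-score(seed), next(tie), seed)]  # negate: heapq is a min-heap
    seen, order = {seed}, []
    while heap and len(order) < max_pages:
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score(link), next(tie), link))
    return order

# Hypothetical site graph and importance scores.
site = {'/': ['/blog', '/shop'], '/blog': ['/blog/1'], '/shop': ['/shop/a'],
        '/blog/1': [], '/shop/a': []}
priority = {'/': 1.0, '/shop': 0.9, '/shop/a': 0.8, '/blog': 0.5, '/blog/1': 0.3}
print(best_first_order('/', lambda u: site[u], lambda u: priority[u]))
```

Here the crawler finishes the higher-scored /shop branch before touching /blog, which neither plain BFS nor plain DFS would guarantee.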
---
## What Is Link Extraction?
URL: https://scrappey.com/qa/crawling/what-is-link-extraction
**Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit next.** The result is a clean list of full (absolute) URLs with duplicates removed. It sounds trivial, but the awkward cases - relative URLs (shorthand paths like /about that omit the domain), anchors that only jump within the page, links that exist only after JavaScript runs, data-href attributes, and links buried in event handlers - make it the number-one cause of "missing pages" bugs in crawlers.
### Quick facts
- **Source:** <a href> primarily; also <link>, <area>, <iframe src>
- **Steps:** Parse → resolve relative URLs → strip fragments → normalize → dedupe
- **Easy to miss:** JS-rendered links, data-* attributes, onclick handlers
- **Normalize:** Lowercase host, sort query params, strip trailing slashes consistently
- **Output:** Set of absolute, normalized URLs for the frontier
### The basic algorithm
Parse the HTML (turn the raw text into a structure you can search). Select every <a> tag that has an href attribute. Resolve each href against the page's base URL - that is, combine a relative path like /about with the page's address to get a full URL (and respect the <base> tag if the page has one). Strip the fragment (the #section part, which only scrolls within a page). Drop javascript:, mailto:, tel:, and empty hrefs, since none of those are pages to crawl. Normalize so equivalent URLs look identical: lowercase the host, normalize percent-encoding (decode escapes that stand for plain unreserved characters, such as %7E for ~, but leave escapes like %20 or %2F encoded, because decoding those changes or breaks the URL), and sort query parameters alphabetically. Finally, dedupe against the seen-set - the running list of URLs you have already collected.
### JS-rendered links
Modern sites load many links via JavaScript: lazy-rendered cards, "load more" buttons, infinite scroll. A static HTML parser - one that only reads the originally downloaded HTML and never runs scripts - misses them entirely. You have two options. Either render the page in a real browser before extracting, so the JavaScript runs and the links appear; or find the underlying XHR endpoint (the background request the page makes to fetch its data) and crawl from its JSON response directly. Both are valid; the XHR path is usually cheaper if the endpoint is accessible.
### Edge cases that cause missing pages
Links in data-href, data-url, or other custom attributes — most parsers ignore them because they only look at standard href. Links inside JSON-LD structured data (machine-readable metadata embedded in the page) — same problem. Links built dynamically from React state — only visible after the page renders. PDF/document URLs in <embed> and <object> tags — easy to skip. For a thorough crawl, audit one rendered page by hand and compare against your extractor's output to see what it is dropping.
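A wider extractor for some of these cases might look like the sketch below, which uses only the standard library's HTMLParser. The attribute table is an assumption to extend after auditing a real page, and JSON-LD and JavaScript-built links are deliberately out of scope here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes to inspect per tag; '*' applies to every tag. Illustrative only.
LINK_ATTRS = {'a': {'href'}, 'embed': {'src'}, 'object': {'data'},
              '*': {'data-href', 'data-url'}}

class WideLinkParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        wanted = LINK_ATTRS.get(tag, set()) | LINK_ATTRS['*']
        for name, value in attrs:
            if name in wanted and value:
                self.links.add(urljoin(self.base_url, value))

def extract_wide(html, base_url):
    parser = WideLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Comparing this set against a standard <a href> extractor on the same page is a quick way to see what the narrower one is dropping.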
### Example
```python
from urllib.parse import urljoin, urldefrag, urlparse, parse_qsl, urlencode
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    # A <base href> tag overrides the page URL for resolving relative links.
    base = soup.find('base', href=True)
    base_url = urljoin(base_url, base['href']) if base else base_url
    links = set()
    for a in soup.select('a[href]'):
        href = a['href'].strip()
        if not href or href.startswith(('javascript:', 'mailto:', 'tel:')):
            continue
        # Resolve to an absolute URL, then drop the #fragment.
        clean = urldefrag(urljoin(base_url, href)).url
        p = urlparse(clean)
        # Lowercase the host and sort query parameters so equivalent URLs match.
        sorted_q = urlencode(sorted(parse_qsl(p.query)))
        links.add(p._replace(netloc=p.netloc.lower(), query=sorted_q).geturl())
    return links
```
### FAQ
**Q: Should I normalize trailing slashes?**
Yes, but pick one rule and apply it consistently. A path with a trailing slash and the same path without one usually point to the exact same page; if you treat them as two different URLs, you crawl that page twice and waste crawl budget for nothing.
**Q: Do I extract links from PDFs?**
Only if your scope requires it. Pulling links out of PDFs is a separate problem with its own tools (pdfplumber or similar), and most crawls never need it.
**Q: What about links in noindex pages?**
A noindex page is one that asks search engines not to list it in results. For SEO crawls, still follow its links but tag them as "from-noindex", because search engines do not give those links ranking credit. For data crawls, just follow them like any other link.
---
## What Is Throttling?
URL: https://scrappey.com/qa/crawling/what-is-throttling
**Throttling means deliberately slowing down how fast requests are sent or handled.** A website throttles incoming traffic so it doesn't get overwhelmed or abused; a scraper throttles its own outgoing requests so it stays under those limits and avoids getting blocked. Think of it like easing off the gas pedal. It's the sibling of rate limiting - throttling is the act of slowing down, and a 429 Too Many Requests is the error you get when you don't.
### Quick facts
- **What it is:** Limiting request rate (inbound or outbound)
- **Server-side:** Protects the origin; enforced via rate limits / WAF
- **Client-side:** Self-imposed delays to avoid blocks
- **Related status:** 429 Too Many Requests
- **Right approach:** Concurrency caps + delays + backoff + proxy rotation
### Server-side vs client-side throttling
Throttling happens on both ends. The server throttles *you*: it sets rate-limit rules - caps on how many requests it will accept in a given time window, counted per IP address, per URL, or per account - and once you cross the line it replies with a 429 (too many requests) or 503 (service unavailable). Your scraper throttles *itself*: it limits how many requests it sends at once and spaces them out, so it stays below those limits before the server ever has to push back. Good scraping is mostly the second kind - you pace yourself so the server never needs to.
### Why throttling matters for scraping
Blasting a site with rapid-fire requests is one of the loudest bot signals there is. It triggers soft blocks (429s), and if you keep pushing those escalate into hard bans on your IP address. Respecting the limits - obeying the Retry-After header (the server's hint for how long to wait before trying again) and slowing down when you see 429s - keeps your access steady and your IPs in good standing. Throttling is the difference between a scraper that runs for months and one that's banned in an hour.
### How to throttle a scraper correctly
Pick a sensible limit on how many requests run at the same time, add small random delays between requests (jitter, so your timing doesn't look robotically uniform), and use exponential backoff when you hit a 429 - wait a bit, then double the wait each time it happens again. Then spread the load across rotating proxies so each individual IP stays at a human pace even as your total volume goes up. If you'd rather not tune all of this by hand, a web scraping API handles request pacing, proxy rotation, and retries for you.
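The concurrency cap and jittered delay can be combined in one small helper. This is a minimal sketch with hypothetical names and default values; backoff and proxy rotation are left out, and the delay range would need tuning per target.

```python
import random
import threading
import time

class Throttle:
    """Caps in-flight requests and delays each one by a jittered interval."""

    def __init__(self, max_concurrent=3, base_delay=1.0, jitter=0.5, rng=None):
        self._slots = threading.Semaphore(max_concurrent)
        self.base_delay = base_delay
        self.jitter = jitter
        self._rng = rng or random.Random()

    def next_delay(self):
        # Uniform in [base - jitter, base + jitter], never negative.
        offset = self._rng.uniform(-self.jitter, self.jitter)
        return max(0.0, self.base_delay + offset)

    def run(self, task, sleep=time.sleep):
        with self._slots:  # blocks while max_concurrent tasks are running
            sleep(self.next_delay())
            return task()
```

Worker threads would call `throttle.run(lambda: fetch(url))`; the semaphore limits how many requests run at once, and the random offset keeps the timing from looking perfectly regular.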
### FAQ
**Q: What's the difference between throttling and rate limiting?**
Rate limiting is the rule (for example, 60 requests per minute); throttling is the act of slowing down to stay within that rule. Servers rate-limit; clients throttle.
**Q: Does throttling prevent bans?**
It cuts them down dramatically. Combined with proxy rotation and honoring the Retry-After header, well-paced requests keep you off the radar far better than raw speed does.
**Q: How slow should I scrape?**
It depends on the target. Start slow - say one request every few seconds per IP - obey any Retry-After header, and only speed up if you keep seeing no 429s.
**Q: What's the difference between throttling and a 429?**
A 429 is the error the server sends when you go over its limit; throttling is what you do to avoid ever getting one.
---
## List Crawling in Web Scraping
URL: https://scrappey.com/qa/crawling/list-crawling-in-web-scraping
**List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a second phase.** Instead of guessing item URLs, you walk the site the way a human browses a catalog: open a list page, read every item link on it, advance to the next page, and repeat until you have collected the full set of item URLs. Once enumeration is complete, a separate detail phase visits each URL and extracts structured fields. This two-phase split - crawling list pages to find what exists, then scraping detail pages to get the data - keeps each phase simple, resumable, and easy to rate-limit.
### Quick facts
- **Phase 1:** Crawl list pages to enumerate item URLs
- **Phase 2:** Fetch each detail page, extract fields
- **Pagination:** Page params, cursor/API, or infinite scroll
- **Dedup:** Canonicalize URLs into a seen set
- **Budget:** Cap page count, depth, and per-host rate
### The two-phase architecture
**List crawling separates discovery from extraction into two phases that run independently.** Phase one crawls list pages - a category index, search results, or an archive - and pulls out the link for every item shown, collecting them into a deduplicated set of detail URLs. Phase two takes that set and fetches each detail page on its own, parsing the fields you actually want. Keeping the phases apart has real payoffs: you can checkpoint the URL list to disk and resume detail fetching after a crash, you can rate-limit each phase differently, and you can re-run extraction without re-crawling the lists. This is the same discover-then-extract split that separates a web crawler from a scraper - the list phase is the crawl, the detail phase is the scrape. Extracting item links from a list page is just link extraction scoped to the item-card selector rather than every anchor on the page.
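The checkpoint-and-resume payoff can be sketched as follows. The JSON file layout and function names are assumptions, and `fetch_detail` stands in for whatever function fetches and parses one detail page.

```python
import json
import os

def save_checkpoint(path, pending, done):
    """Write the detail-phase state via a temp file so a crash cannot corrupt it."""
    state = {'pending': sorted(pending), 'done': sorted(done)}
    tmp = path + '.tmp'
    with open(tmp, 'w', encoding='utf-8') as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic swap on the same filesystem

def load_checkpoint(path):
    if not os.path.exists(path):
        return set(), set()
    with open(path, encoding='utf-8') as f:
        state = json.load(f)
    return set(state['pending']), set(state['done'])

def run_detail_phase(path, urls, fetch_detail):
    """Fetch every URL not already marked done, checkpointing after each one."""
    pending, done = load_checkpoint(path)
    pending |= set(urls) - done  # never re-queue finished URLs
    results = {}
    for url in sorted(pending):
        results[url] = fetch_detail(url)
        done.add(url)
        save_checkpoint(path, pending - done, done)
    return results
```

If the process dies mid-run, calling `run_detail_phase` again with the same path skips everything already fetched. Saving after every URL is the simplest policy; batching the writes would be cheaper on large crawls.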
### Pagination patterns you will meet
**Crawling list pages comes down to recognizing how the site advances to the next page, and there are three common patterns.**
- **Page parameters.** The URL carries the page number or offset, e.g. ?page=2 or ?offset=40. You loop, incrementing the parameter, and stop when a page returns no item links or repeats the previous page.
- **Cursor / API pagination.** The page (or an XHR call behind it) returns a nextCursor or next token. You pass that token to the next request and stop when it is null. This is the cleanest pattern - see how REST APIs work for the request shape.
- **Infinite scroll.** New items load over JavaScript as you scroll. The list page has no static next link, so you either drive a real browser to scroll and render (handled for you when you fetch with a full-browser request) or call the underlying JSON endpoint the page itself uses. See dynamic content scraping for why this matters.
Always set a hard ceiling on pages crawled so a broken stop-condition cannot loop forever.
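For the cursor pattern, the loop and its ceiling might be sketched like this. `fetch_page` is a hypothetical callable you would write around the site's API; it takes the current cursor and returns the item URLs on that page plus the next cursor (or None).

```python
def crawl_cursor(fetch_page, max_pages=50):
    """Follow cursor pagination; fetch_page(cursor) -> (item_urls, next_cursor)."""
    items, cursor = [], None        # None requests the first page
    for _ in range(max_pages):      # hard ceiling on pages crawled
        page_items, cursor = fetch_page(cursor)
        items.extend(page_items)
        if cursor is None:          # null cursor: no more pages
            break
    return items

# Hypothetical three-page API for illustration.
pages = {None: (['/item/1', '/item/2'], 'c2'),
         'c2': (['/item/3'], 'c3'),
         'c3': (['/item/4'], None)}
print(crawl_cursor(lambda cursor: pages[cursor]))
```

Because the loop is a bounded `for` rather than a `while True`, an API that never returns a null cursor simply stops at `max_pages`.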
### Dedup, polite crawling, and budget
**Robust list crawling needs deduplication, polite pacing, and an explicit budget, or it either wastes work or overloads the target.** The same item often appears on multiple list pages (sorting changes, overlapping filters), so canonicalize each detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set so you fetch each detail page exactly once. For pacing, polite crawling means capping per-host concurrency, adding a small delay between requests, and backing off on 429/503 responses; respecting the robots.txt protocol keeps you on the right side of site rules. For budget, bound the crawl by total list pages, by crawl depth, and by a per-domain page cap so crawl budget stays predictable. A web scraping API that rotates residential proxies and handles browser verification lets the list and detail phases run at steady concurrency without each phase managing its own proxy pool.
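A canonicalization helper along those lines could look like the sketch below. The tracking-parameter lists are assumptions to tune per site, and "no trailing slash" is just one consistent rule; the opposite choice works equally well if applied everywhere.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ('utm_',)          # assumption: adjust per target site
TRACKING_NAMES = {'fbclid', 'gclid'}   # assumption: adjust per target site

def canonicalize(url):
    """Return one stable form of a detail URL for the seen set."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_NAMES and not k.startswith(TRACKING_PREFIXES)]
    path = parts.path.rstrip('/') or '/'   # one rule: no trailing slash
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(sorted(query)), ''))  # '' drops the fragment

print(canonicalize('https://Shop.Example.com/item/42/?utm_source=x&color=red#reviews'))
```

Two list pages that link to the same item with different tracking parameters now produce the same key, so the seen set catches the duplicate.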
### Example
```python
import requests

API = "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY"

def fetch(url, browser=False):
    # The command is the same either way; browser=True only adds a scroll
    # action so JS-rendered list pages load their items.
    body = {"cmd": "request.get", "url": url, "proxyCountry": "UnitedStates",
            "session": "list-crawl", "autoparse": True}
    if browser:
        body["browser"] = [{"action": "scroll"}]
    r = requests.post(API, json=body, timeout=180)
    return r.json()["solution"]["response"]

def enumerate_items(base, max_pages=50):
    # phase 1: crawl paginated list pages, collect detail URLs
    seen, page = set(), 1
    while page <= max_pages:
        html = fetch(f"{base}?page={page}")
        links = extract_item_links(html)  # your CSS/regex selector
        # canonicalize() is likewise a helper you supply
        new = [u for u in (canonicalize(l) for l in links) if u not in seen]
        if not new:
            break  # no new items: stop
        seen.update(new)
        page += 1
    return seen

def crawl_list(base):
    # phase 2: fetch each detail page once
    return {url: fetch(url) for url in enumerate_items(base)}
```
### FAQ
**Q: What is list crawling in web scraping?**
List crawling is crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a separate phase to extract its data. It splits discovery from extraction so each phase stays simple and resumable.
**Q: How do I handle pagination when crawling list pages?**
Match the pattern: increment a page or offset parameter until a page returns no new item links, follow a nextCursor token until it is null, or drive a full browser to scroll infinite-scroll lists. Always set a maximum page count so a broken stop-condition cannot loop forever.
**Q: How do I avoid fetching the same item twice?**
Canonicalize every detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set. Items often repeat across list pages when sorts or filters overlap, so dedup before queuing the detail phase.
---
# Reverse Engineering
Reading code that was built to resist reading — how obfuscation works, why every layer is reversible, and the techniques used to recover the original logic.
## How Does Deobfuscation Work?
URL: https://scrappey.com/qa/reverse-engineering/how-deobfuscation-works
**Deobfuscation is the process of turning deliberately unreadable code back into something a human can read and reason about.** Obfuscators scramble how code looks, but never change what it does — so every scrambling step they apply can be undone. Deobfuscation just means applying those reversals, one by one, until the original structure reappears. And because the code still has to run, every secret inside it can be recovered: the engine must un-scramble it to execute it, so you can too.
### Quick facts
- **Goal:** Recover readable, original-equivalent source from scrambled code
- **Key principle:** Obfuscation is semantics-preserving, therefore reversible
- **Common layers:** String arrays, control-flow flattening, dead code, bytecode VMs
- **Primary technique:** Constant folding + AST rewriting
- **Core tools:** Babel/acorn ASTs, beautifiers, webcrack, custom disassemblers
### Why code gets obfuscated
People obfuscate code to protect intellectual property, hide license checks, slow down tampering, and make anti-analysis scripts harder to study. The trade-off is always the same: obfuscated code is bigger and slower, and because it still has to *run*, every secret it contains is recoverable. This page focuses on JavaScript — where most client-side obfuscation lives — but the same ideas apply to any language.
### The golden rule: semantics are preserved
The single most important idea in deobfuscation is that an obfuscator may only apply **semantics-preserving** transformations — changes that keep the behaviour identical. If f(2) returned "WordArray" in the original, it still returns "WordArray" after obfuscation.
That means you are never *guessing*. You can always recover the original behaviour by running the obfuscated pieces, because they are guaranteed to produce the same values. Most deobfuscation is just "run the parts that are constant, and simplify."
### Layer 1 — String array encoding
The most common layer. Every string and name in the code is pulled out into one big array, and each place that used it is replaced by a call to a decoder function that fetches it back by index.
```javascript
// After obfuscation
const _0x2f9f = ["u7d1", "WordArray", "update", "secret", /* ...hundreds more */];
function _0x3b01(i) { return _0x2f9f[i - 336]; }
const key = _0x3b01(337);
obj[_0x3b01(429)](_0x3b01(412));
```

Often the array is **rotated** when the script loads: a self-running function shifts the elements around until a checksum matches. This means reading the array in source order shows the wrong values:

```javascript
(function (arr, target) {
  while (true) {
    const sum = parseInt(decode(347)) / 1 + parseInt(decode(541)) / 2 /* ... */;
    if (sum === target) break;   // correct rotation found
    arr.push(arr.shift());       // rotate by one and retry
  }
})(_0x2f9f, 458958);
```

**How to reverse it:** grab the array and the decoder function as text, run the rotation loop yourself (it always produces the same result), then replace every decoder call with the value it returns. This is constant folding: _0x3b01(429) always returns the same thing for a given number, so evaluate it and substitute the answer:

```javascript
output = source.replace(/_0x3b01\((\d+)\)/g, (_, n) => JSON.stringify(decode(+n)));
```

After this pass `obj[_0x3b01(429)](_0x3b01(412))` becomes `obj["update"]("secret")`, which is already far more readable.
### Layer 2 — Decoder aliasing
Obfuscators often copy the decoder into a local variable, so a naive find-and-replace looking for the original name misses it:
```javascript
function handler() {
  const d = _0x3b01;        // alias
  return d(512) + d(530);   // won't match a /_0x3b01\(\d+\)/ regex
}
```

Reverse it by scanning for assignments like x = _0x3b01, collecting those alias names, then resolving x(NNN) calls within their scope using the same decoder. Limit this to short, single-purpose names to avoid false matches.
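A text-level version of that alias pass, written in Python and operating on the JavaScript source as a string, might look like the sketch below. It is deliberately naive: it ignores scope entirely (so a reused alias name elsewhere in the file would be folded too), and `decode` is assumed to be a lookup you built from the recovered string array.

```python
import json
import re

def find_aliases(source, decoder='_0x3b01'):
    """Short names assigned directly from the decoder, e.g. `const d = _0x3b01;`."""
    pattern = re.compile(
        r'\b(?:const|let|var)\s+([A-Za-z_$][\w$]{0,2})\s*=\s*'
        + re.escape(decoder) + r'\s*;')
    return set(pattern.findall(source))

def fold_calls(source, decode, decoder='_0x3b01'):
    """Replace decoder(N) and alias(N) calls with their constant string values."""
    names = {decoder} | find_aliases(source, decoder)
    call = re.compile(
        r'(?<![\w$.])(?:' + '|'.join(map(re.escape, sorted(names))) + r')\((\d+)\)')

    def substitute(match):
        value = decode(int(match.group(1)))
        # Leave the call untouched when the index is unknown.
        return match.group(0) if value is None else json.dumps(value)

    return call.sub(substitute, source)

# Hypothetical decoded values for the two indices used in the snippet above.
table = {512: 'Word', 530: 'Array'}
js = 'function handler() {\n  const d = _0x3b01;\n  return d(512) + d(530);\n}'
print(fold_calls(js, table.get))
```

The `{0,2}` in the alias pattern limits matches to names of at most three characters, which is one way to apply the "short, single-purpose names" advice; a scope-aware AST pass is the more reliable route.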
### Layer 3 — Member-access and literal disguising
With the strings restored, you can undo the cosmetic disguises:
- **Bracket → dot notation:** obj["update"] becomes obj.update (skip reserved words and keys that aren't valid names).
- **Numeric obfuscation:** 0x1a4, 1e3, 0b1010, and arithmetic like 0x1 << 0x4 are all constants — work them out to 420, 1000, 10, 16.
- **String concatenation:** "up" + "da" + "te" folds to "update".
```javascript
output = output
  .replace(/\["([a-zA-Z_$][\w$]*)"\]/g, '.$1')                // bracket -> dot
  .replace(/0x([0-9a-fA-F]+)/g, (_, h) => parseInt(h, 16));   // hex -> decimal
```
### Layer 4 — Control-flow flattening
This is the layer that most resists regex. Normal top-to-bottom code is rewritten into a while loop driven by a state variable and a switch, so the *order* things run no longer matches the order they appear in the file:
```javascript
let state = 0;
while (true) {
  switch (state) {
    case 0: a = init(); state = 2; continue;
    case 1: return a + b;               // exit
    case 2: b = step(a); state = 1; continue;
  }
}
```

The real flow is 0 → 2 → 1, but it is written 0, 1, 2. To undo it you build a small **control-flow graph** (a map of which block leads to which): each case is a block of code, and the state = N assignments are the arrows between them. Re-thread the blocks into execution order and the original linear code falls out. This is where an AST (Abstract Syntax Tree, a structured tree of the code) becomes essential, because plain text replacement cannot track state cleanly.
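The re-threading step itself is small once the blocks have been extracted. The Python sketch below assumes that extraction is already done and that each case sets exactly one next state; the dictionary layout is a hypothetical intermediate form, and real flattened code with branches or loops needs a proper graph, not a single chain.

```python
def unflatten(blocks, start):
    """blocks maps state -> (statement, next_state); next_state None means exit."""
    ordered, state, visited = [], start, set()
    while state is not None:
        if state in visited:
            raise ValueError(f'cycle at state {state}: needs real loop recovery')
        visited.add(state)
        statement, state = blocks[state]
        ordered.append(statement)
    return ordered

# The three cases from the switch above, keyed by their case label.
blocks = {0: ('a = init();', 2),
          1: ('return a + b;', None),
          2: ('b = step(a);', 1)}
print('\n'.join(unflatten(blocks, start=0)))
```

Following the arrows from state 0 yields the statements in their true execution order, which is the linear code the obfuscator started from.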
### Layer 5 — Dead code and opaque predicates
Obfuscators add branches that *look* conditional but always go the same way ("opaque predicates"), plus unreachable junk to bulk up the file:
```javascript
if ((function () { return !![]; })()) { realWork(); } else { garbage(); }
```

!![] is always true, so the else branch can never run. Once you work out that the condition is constant, you can delete the dead branch entirely. Removing decoder definitions, rotation functions, and unused helpers shrinks the file dramatically on this final pass.
### From regex to ASTs
Text-based replacement gets you surprisingly far on string arrays, but it is fragile: it cannot respect scope (which variable means what, and where), track variable values, or safely reorder code. Serious deobfuscation works on the **Abstract Syntax Tree** (AST) instead — a structured tree representing the code's grammar. The workflow with a toolchain like Babel is: **parse** the source into an AST, **traverse** and transform its nodes (fold constants, inline the decoder, evaluate fake conditions, rebuild control flow), then **regenerate** clean source. Babel's path.evaluate() even tells you when a piece of code is a fixed constant.
```javascript
import * as parser from '@babel/parser';
import traverse from '@babel/traverse';
import generate from '@babel/generator';

const ast = parser.parse(source);
traverse(ast, {
  CallExpression(path) {
    const { confident, value } = path.evaluate();
    if (confident) path.replaceWithSourceString(JSON.stringify(value));
  },
});
const clean = generate(ast).code;
```

Because the AST understands scope and structure, a visitor (a function that runs on each matching node) can do things regex never could, like "replace every call to this function with its constant return value, but only within the scope where it is bound."
### The hardest layer — bytecode VMs
The strongest obfuscators do not just hide the code — they **replace it with a custom virtual machine** (a mini interpreter built into the script). The original logic is compiled down to a private bytecode (a stream of low-level instructions, often shipped as a base64 blob), and what you actually see is an interpreter that walks that bytecode step by step. There is no JavaScript left to tidy up.
```
; decoded bytecode, disassembled
LOAD_STRING    r3, "update"
PROPACCESS     r4 = r2[r3]
LOAD_STRING    r5, "secret"
FUNC_CALL      r6 = r4.call(r2, [r5])
JUMP_COND_NEG  if(!r6) goto @1487
```

Reversing a VM is a different discipline: **(1)** recover the bytecode blob the interpreter reads; **(2)** figure out what each opcode (instruction byte) means by reading the interpreter's dispatch loop: which byte is LOAD_STRING, FUNC_CALL, JUMP… and how the arguments are arranged; **(3)** write a two-pass disassembler, where pass one finds all the jump targets and pass two prints labelled, human-readable instructions; **(4)** optionally translate the disassembly back into equivalent source. It is labour-intensive, but doable: the interpreter *is* the spec, and it is sitting right there in the file.
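Step (3) can be illustrated with a toy disassembler in Python. The instruction set below is entirely made up for the example (opcode numbers, operand counts, and absolute jump offsets are all assumptions); for a real VM that table has to be read out of the interpreter's dispatch loop.

```python
# Hypothetical instruction set: opcode byte -> (mnemonic, operand count).
OPCODES = {0x01: ('LOAD_CONST', 2), 0x02: ('CALL', 1),
           0x03: ('JUMP_IF_FALSE', 2), 0x04: ('RETURN', 1)}
JUMPS = {'JUMP_IF_FALSE'}  # last operand is an absolute byte offset

def instructions(code):
    """Yield (offset, mnemonic, operands) for each instruction in the stream."""
    pc = 0
    while pc < len(code):
        name, argc = OPCODES[code[pc]]
        yield pc, name, list(code[pc + 1:pc + 1 + argc])
        pc += 1 + argc

def disassemble(code):
    # Pass one: collect every jump target so labels are known in advance.
    targets = {ops[-1] for _, name, ops in instructions(code) if name in JUMPS}
    # Pass two: emit instructions, labelling the offsets that are jumped to.
    lines = []
    for offset, name, ops in instructions(code):
        if offset in targets:
            lines.append(f'L{offset}:')
        shown = [str(o) for o in ops]
        if name in JUMPS:
            shown[-1] = f'L{ops[-1]}'
        lines.append(f'    {name} ' + ', '.join(shown))
    return lines

program = bytes([0x01, 3, 0,  0x02, 3,  0x03, 3, 11,  0x01, 4, 1,  0x04, 3])
print('\n'.join(disassemble(program)))
```

The first pass is what makes forward jumps readable: by the time pass two reaches offset 11, it already knows a label belongs there.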
### A practical order of operations
When you sit down with an obfuscated file, peel the layers from the outside in. Each pass makes the next one easier, because every layer you remove exposes more constants for the following pass to simplify:
- **Beautify** — run it through a formatter so you can see the structure.
- **Decode strings** — extract the array, run the rotation, fold every decoder call and its aliases.
- **Simplify literals** — bracket→dot, hex→decimal, concatenations.
- **Restore control flow** — un-flatten the switch/state loops via an AST.
- **Prune** — evaluate the fake conditions, delete dead branches and obfuscator scaffolding.
- **Rename** — give _0x3b01-style names meaningful ones based on what they now obviously do.
- **If a VM remains** — recover the bytecode, map the opcodes, disassemble.
### Tooling cheat sheet
- **Beautifiers:** Prettier, js-beautify — always step one.
- **AST toolkits:** Babel (@babel/parser/traverse/generator), acorn, esprima, recast (preserves formatting).
- **Purpose-built:** webcrack, synchrony, and the REStringer family handle common obfuscator output (notably obfuscator.io) out of the box.
- **Analysis:** AST Explorer (astexplorer.net) for prototyping visitors; a debugger for stepping through a VM interpreter live.
### Example
```javascript
// Folding a rotated string-array decoder — the core of most deobfuscation.
// 1. Extract the literal array and the decoder's offset.
const stringArray = ["u7d1", "WordArray", "update", "secret" /* ... */];
const OFFSET = 336;
const decode = (i) => stringArray[i - OFFSET];

// 2. Replay the rotation loop until the checksum matches (deterministic).
function rotateUntilValid(target) {
  for (let i = 0; i < stringArray.length; i++) {
    const sum = parseInt(decode(347)) / 1 + parseInt(decode(541)) / 2; // sample
    if (sum === target) return;
    stringArray.push(stringArray.shift());
  }
}

// 3. Constant-fold every decoder call back into the source text.
function foldDecoderCalls(source) {
  return source.replace(/_0x3b01\((\d+)\)/g, (m, n) => {
    const value = decode(Number(n));
    return value === undefined ? m : JSON.stringify(value);
  });
}
```
### FAQ
**Q: Is deobfuscation always possible?**
In principle, yes. Obfuscation only applies semantics-preserving transformations — changes that keep the behaviour identical — and the runtime must undo them to execute the code. Anything the engine can resolve, you can resolve too. It is a question of effort, not possibility.
**Q: Why use an AST instead of regular expressions?**
Regex cannot track scope, evaluate expressions, or reorder code safely. An AST understands the program's structure, so it can inline a decoder only within the scope where it is defined, work out constants reliably, and rebuild scrambled control flow — things plain text replacement cannot do correctly.
**Q: What makes a bytecode VM harder than normal obfuscation?**
There is no JavaScript left to tidy up — the logic lives in a custom bytecode that a built-in interpreter runs. You have to recover that bytecode, reverse-engineer what each instruction means from the interpreter, and write a disassembler before you can even read the logic.
**Q: What is the first step on a fresh obfuscated file?**
Beautify it. A formatter alone reveals the overall structure — the string array, the decoder, the rotation function, and whether a VM interpreter is present — which tells you which layers you are facing before you write any transforms.
---
## How Do You Devirtualize an Obfuscated JavaScript VM?
URL: https://scrappey.com/qa/reverse-engineering/how-to-devirtualize-a-javascript-vm
**Devirtualization is the process of recovering a readable program from JavaScript that has been compiled into a tiny interpreter — a virtual machine — bundled inside the script itself.** Instead of leaving the original functions on the page, the build emits an array of opaque “opcodes” plus a blob of bytecode, and a small loop that walks the bytecode and drives the opcodes. The logic you actually care about no longer lives in any function body; it lives in the *composition* the loop performs at runtime. Plain deobfuscation — renaming variables, decoding string arrays, unflattening control flow — gets you nowhere here, because there is no normal control flow to clean up. Devirtualization means reconstructing the VM’s instruction set, replaying its program statically, and collapsing the result back into JavaScript you can read top to bottom.
### Quick facts
- **Technique:** Static devirtualization of a curried-thunk bytecode VM
- **Input shape:** A wall of Base64 → a Uint16Array of bytecode + an array of curried "opcode" thunks
- **Execution model:** A triplet loop: Q[dest] = Q[op](Q[operand])
- **Core trick:** Opcodes are curried — nothing runs until a thunk is fully saturated with arguments
- **Constants:** No literal pool; synthesized via type coercion plus encoded, byte-transformed blob pools
- **Goal:** A semantically faithful IR you can read top-to-bottom, derived without executing the sample
- **Real-world example:** Imperva / Incapsula’s reese84 interrogator (cited here purely as an educational reference)
### A wall of Base64 isn’t always the easy kind
A giant Base64 string at the top of a script is usually one of two things: a trivial “decode-and-read” challenge, or a virtual machine wearing that disguise. A well-known example of the second kind is the reese84 interrogator shipped by Imperva / Incapsula — we reference it here only as a concrete, public example of the pattern, not as a target. The skeleton looks deceptively simple:
```javascript
(function () {
  var A = window.atob("b64_string");
  var E = new window.Uint8Array(A.length);
  for (var B = 0; B < A.length; B++) E[B] = A.charCodeAt(B);
  E = new window.Uint16Array(E.buffer); // the bytecode
  var Q = [ null, null, [], /* ... ~120 curried functions ... */ ];
  Q[0] = Q;
  var B = 0;
  while (B < E.length) {
    Q[E[B++]] = Q[E[B++]](Q[E[B++]]); // the triplet loop
  }
})();
```
Decode some Base64, reinterpret the bytes as 16-bit values, and run a loop three slots at a time. The catch is in Q: **none of those functions do anything on their own**. Each takes one argument and returns another function. Nothing fires until a function has been handed enough arguments to be fully *saturated*. They are curried thunks, and the real behaviour lives in how the triplet loop composes them, not in any single body. So you cannot just read the opcodes; you have to reconstruct what their composition computes.
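To make the "nothing fires until saturated" point concrete, here is a toy version of the skeleton above that runs anywhere. The two opcodes (`add` and `force`) and the slot layout are invented for illustration and are not taken from any real sample; only the triplet loop itself mirrors the skeleton.

```javascript
const Q = [
  null,                          // slot 0: the register file itself
  null,                          // slot 1: scratch register
  (a) => (b) => () => a() + b(), // slot 2: "add", two curried args plus a final force
  (t) => t(),                    // slot 3: "force", calls the thunk it is handed
  () => 20,                      // slot 4: a thunk
  () => 22,                      // slot 5: a thunk
];
Q[0] = Q;

// Each triplet is (dest, op, operand): Q[dest] = Q[op](Q[operand]).
const E = Uint16Array.from([
  1, 2, 4, // Q[1] = add(thunk20): still a function, nothing computed
  1, 1, 5, // Q[1] = Q[1](thunk22): saturated, but still an unforced thunk
  1, 3, 1, // Q[1] = force(Q[1]): the only step that actually computes
]);

const trace = [];
let B = 0;
while (B < E.length) {
  Q[E[B++]] = Q[E[B++]](Q[E[B++]]);
  trace.push(typeof Q[1]);
}
console.log(trace, Q[1]);
```

Two of the three triplets do no arithmetic at all; they only thread arguments into a partial application. That ratio is why a real instruction stream grows so large.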
### The IIFE-fold trap
Before folding these curried bodies you have to respect one detail, because getting it wrong silently corrupts everything downstream. Consider an opcode like this:
```javascript
function (A) {
  return function (E) {
    return (function (A) {
      return function () { return A(arguments); };
    })(A(E));
  };
}
```
A naive IIFE folder substitutes the body as return function () { return A(E)(arguments); }. That is wrong. The inner IIFE forces A(E) **once**, at the moment the closure is built. The naive fold re-defers it, so A(E) re-runs on every call, with a different arguments object each time. The correct fold binds A(E) once and evaluates it at depth two:
```javascript
function (v7) {
  return function (v8) {
    var v9 = v7(v8); // forced once, here
    return function () { return v9(arguments); };
  };
}
```
This is the kind of bug that does not throw; it just produces an IR that disagrees with the real VM in ways you will not notice until the recovered constants come out as garbage. Build a small unary solver (evaluate single-argument calls to a concrete value where you can) and an IIFE folder that gets this right, and the bodies become readable.
### There’s no constant pool — until there is
Once you fold the bodies and resolve the type coercion, you notice the VM has no constant pool. That does not mean it has no constants — it means they are synthesized on the fly:
```javascript
function (v66) {
  return function (v67) {
    return "false[object Window]"[v66()];
  };
}
```
The string "false[object Window]" is just false + window coerced; indexing it with v66() extracts a single character, and that index *is* the constant. Many opcodes do the same with different coerced primitives. So the constants are real; they are simply manufactured from coerced JavaScript values one character at a time. That tells you what the opening stretch of the program is doing: it is assembling a character lookup table. You could confirm that by emulating the triplet loop, but a curried VM means an instruction stream in the megabytes, so brute emulation does not scale. There is a faithful, static path instead.
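The coercion trick is easy to reproduce outside a browser. In this small sketch `fakeWindow` is a stand-in (an assumption for the demo) that stringifies the way a real `window` does, and the chosen indexes are arbitrary.

```javascript
// Symbol.toStringTag makes this object stringify as "[object Window]",
// which is what a real browser window does.
const fakeWindow = { [Symbol.toStringTag]: "Window" };
const pool = false + fakeWindow; // "false[object Window]"

// Same shape as the opcode above: a forced index thunk selects one character.
const opcode = (v66) => (v67) => pool[v66()];

const chars = [13, 18, 2].map((i) => opcode(() => i)());
console.log(pool, chars);
```

No string literal for any of the extracted characters ever appears in the source; the only "constant" is the numeric index fed into the thunk.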
### Lowering opcodes to an expression IR
Rather than hand-writing a bespoke matcher for each of the ~120 opcode indices, lower them **generically**: walk each function’s AST and translate every expression into your own small IR — one that only knows the handful of things these opcodes actually do. The pool collapses fast. A BinaryExpression becomes a BinOp; an indexed coercion becomes an Index that folds to a Const; the rest are small combinations of Call, Force, Index, and Cond.
The real work the translator does is not the lowering — it is **counting**. Because every opcode is curried, its index means nothing until it has been handed enough arguments and called enough times to fire. As you lower a body you count two things: the *arity* (how many curried arguments it takes) and whether the fully-applied body still returns one more thunk that needs a final forcing call. Their sum is exactly how many cycles the opcode needs before it is saturated and ready to emit. That number is the linchpin of the whole approach.
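The counting step might look like the sketch below. It assumes opcode bodies arrive as ESTree-shaped nodes (the shape parsers such as acorn or @babel/parser emit); the nodes here are built by hand so the example needs no parser, and `cycleCount`, `fn`, and `lit` are hypothetical helper names.

```javascript
// Hand-built ESTree-shaped nodes: function (params) { return <body>; }
const fn = (params, body) => ({
  type: "FunctionExpression",
  params: params.map((name) => ({ type: "Identifier", name })),
  body: { type: "BlockStatement", body: [{ type: "ReturnStatement", argument: body }] },
});
const lit = (value) => ({ type: "Literal", value });

// The expression a function returns, or null if it does not end in `return <expr>`.
function returned(node) {
  const stmts = node.body.body;
  const last = stmts[stmts.length - 1];
  return last && last.type === "ReturnStatement" ? last.argument : null;
}

// cycles = curried arity, plus one if the saturated body is still a zero-arg thunk.
function cycleCount(node) {
  let arity = 0;
  let cur = node;
  while (cur != null && cur.type === "FunctionExpression" && cur.params.length === 1) {
    arity += 1;
    cur = returned(cur);
  }
  const needsForce =
    cur != null && cur.type === "FunctionExpression" && cur.params.length === 0;
  return { arity, needsForce, cycles: arity + (needsForce ? 1 : 0) };
}

// function (a) { return function (b) { return function () { return 0; }; }; }
const addLike = fn(["a"], fn(["b"], fn([], lit(0))));
// function (v66) { return function (v67) { return "f"; }; }
const indexLike = fn(["v66"], fn(["v67"], lit("f")));

console.log(cycleCount(addLike), cycleCount(indexLike));
```

A real translator would compute this while lowering the body rather than in a separate walk, but the quantity is the same: how many applications the shadow walk must see before the opcode may fire.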
### Classifying shapes (roles)
A tree of BinOp / Index / Call nodes still does not tell you what an opcode *is*. Two opcodes can lower to structurally similar trees and mean completely different things to the VM. So a second, coarser pass — the classifier — matches each lowered body against a fixed catalogue of **roles**. Not one matcher per opcode index, but one matcher per kind of thing the VM does:
- **Leaves** — literals (LiteralNumber, LiteralBool, LiteralNull, LiteralWindow), slot/argument/window reads and writes (SlotRead, ArgsRead, IndexWrite), BinOp/UnaryOp/Ternary, and call shapes (CallN, NewN, MethodCallN, Apply).
- **Forcing glue** — Force, DoubleForce, PassthroughArg0: the no-op-looking opcodes whose entire job is to pull a value out of a thunk.
- **Combinators** — Compose, Pipe, Seq, Iife, With: the higher-order plumbing that wires other opcodes together.
- **Frame ops** — NewFrame, FramePush, Link, Body: the calling-convention machinery around the VM’s slot[2] activation record.
- **Control flow** — While, TryCatch: recognised by their thunk-wrapped shape.
Every match is purely **structural**. Ternary is “a Cond whose test forces argument 2, consequent forces argument 0, alternate forces argument 1.” Compose is “apply argument 0 to the result of applying argument 1 to argument 2.” Nothing is keyed to an opcode’s numeric index, which is exactly why the same matchers survive the polymorphism between samples. The classifier also records one extra bit per opcode: whether the body *needs a frame* (it invokes an argument with arguments, or touches slot[2][0]). That boolean decides whether the saturator may freely inline a body or must treat it as an opaque, effectful call.
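A structural classifier can be sketched in a few lines. The IR constructors (`Arg`, `Force`, `Cond`, `Call`) and the `ROLES` table are illustrative names, and only three of the roles are modelled; the Ternary and Compose definitions follow the wording above.

```javascript
// Tiny IR constructors.
const Arg = (n) => ({ kind: "Arg", n });
const Force = (value) => ({ kind: "Force", value });
const Cond = (test, consequent, alternate) => ({ kind: "Cond", test, consequent, alternate });
const Call = (callee, args) => ({ kind: "Call", callee, args });

const isArg = (node, n) => node.kind === "Arg" && node.n === n;
const forcesArg = (node, n) => node.kind === "Force" && isArg(node.value, n);

// One matcher per role, never per opcode index.
const ROLES = [
  ["Ternary", (b) => b.kind === "Cond" && forcesArg(b.test, 2) &&
    forcesArg(b.consequent, 0) && forcesArg(b.alternate, 1)],
  ["Compose", (b) => b.kind === "Call" && isArg(b.callee, 0) && b.args.length === 1 &&
    b.args[0].kind === "Call" && isArg(b.args[0].callee, 1) &&
    b.args[0].args.length === 1 && isArg(b.args[0].args[0], 2)],
  ["Force", (b) => forcesArg(b, 0)],
];

function classify(body) {
  const hit = ROLES.find(([, matches]) => matches(body));
  return hit ? hit[0] : "Unknown";
}

// Lowered bodies keyed by opcode index; the indexes are arbitrary on purpose.
const lowered = {
  17: Cond(Force(Arg(2)), Force(Arg(0)), Force(Arg(1))),
  93: Call(Arg(0), [Call(Arg(1), [Arg(2)])]),
  41: Force(Arg(0)),
  8: Arg(5),
};
const roles = Object.fromEntries(
  Object.entries(lowered).map(([index, body]) => [index, classify(body)])
);
console.log(roles);
```

Because `classify` never looks at the key, shuffling the opcode indexes between samples changes nothing about the result.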
### Flattening by shadow-execution
Now use that cycle count. Parse the flat Uint16Array into (dest, op, operand) triplets — JavaScript evaluates the assignment target before the right-hand side, so Q[E[B++]] = Q[E[B++]](Q[E[B++]]) reads in exactly that order. Each triplet means “apply whatever is in the operator slot to whatever is in the operand slot, write the result to the destination slot.”
Instead of running that, you **shadow** it. Build a frame mirroring Q, but the slots hold abstract values, not live functions: an opcode, a partially-applied opcode with some arguments collected, or a finished IR expression. Walk the triplets in the exact order the VM would. For each one, look at the operator slot:
- an opcode → start a fresh partial application with one argument;
- an already-partial opcode → apply another argument;
- an already-finished value → it is a real call, so emit it.
This is where the megabytes evaporate. The overwhelming majority of triplets exist only to thread one more argument into a partial — they do nothing on their own, mirroring how the curried thunks stay inert until saturated. Only when a partial’s argument count reaches its cycle count does it *fire*: you lower the opcode body to a concrete IR expression, and the slot graduates from partial to finished value. Cheap results (a literal, a bare slot reference) go straight into the slot; anything heavier is bound to a fresh temporary. Either way you record the slot’s current definition, so later reads pick up the latest expression — an SSA-flavoured value numbering over the VM’s register file. One more detail: opcode bodies *force* the thunks they are handed (the VM passes thunks, not finished values), so you must track when an argument is called inside a body too. As the opening stretch fires, the synthesized-constant bodies fold to single characters and the character lookup table assembles itself, exactly as predicted.
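The three-way dispatch on the operator slot can be sketched as below. This is a deliberately reduced model: the `OPS` metadata table, the string-based expressions, and the single `add` opcode are assumptions for the demo, and the temporaries, SSA bookkeeping, and in-body forcing described above are left out.

```javascript
// Hypothetical opcode metadata, as the lowering pass would have produced it.
const OPS = {
  add: { arity: 2, needsForce: true, lower: ([a, b]) => `(${a} + ${b})` },
};
const cyclesOf = (meta) => meta.arity + (meta.needsForce ? 1 : 0);
const value = (expr) => ({ kind: "value", expr });
const render = (v) =>
  v.kind === "value" ? v.expr : `${v.id}(${(v.args || []).map((a) => render(a)).join(", ")})`;

function shadowExecute(triplets, frame) {
  const emitted = [];
  let temps = 0;
  for (const [dest, opSlot, argSlot] of triplets) {
    const op = frame[opSlot];
    const arg = frame[argSlot];
    if (op.kind === "value") {
      // A finished value in operator position is a real call: emit it.
      const name = `t${temps++}`;
      emitted.push(`${name} = ${op.expr}(${render(arg)})`);
      frame[dest] = value(name);
      continue;
    }
    // An opcode starts a partial; a partial collects one more argument.
    const args = op.kind === "op" ? [arg] : [...op.args, arg];
    const meta = OPS[op.id];
    if (args.length < cyclesOf(meta)) {
      frame[dest] = { kind: "partial", id: op.id, args };
      continue;
    }
    // Saturated: fire. Only the first `arity` arguments carry data;
    // the trailing forcing call contributes nothing.
    frame[dest] = value(meta.lower(args.slice(0, meta.arity).map((a) => render(a))));
  }
  return { emitted, frame };
}

const frame = [
  { kind: "op", id: "add" }, // slot 0: an opcode
  value("20"),               // slot 1
  value("22"),               // slot 2
  value("undefined"),        // slot 3: scratch
  value("print"),            // slot 4: a finished callable value
];
const result = shadowExecute([[3, 0, 1], [3, 3, 2], [3, 3, 3], [3, 4, 3]], frame);
console.log(result.emitted);
```

Four triplets go in and a single emitted instruction comes out, which is the compression the paragraph above describes.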
### Saturation and convergence
The linear walk does not finish the job. Plenty of opcodes never reach their cycle count during the walk because they are handed to *other* opcodes as values rather than called directly — a combinator decides when (or whether) they fire. After flattening they show up as residual partials: correct, but opaque (v3[op](x)(y)). So run a second phase that does to the IR what the walk did to the bytecode — a fixpoint loop of three passes:
- **saturate** — find any residual partial whose collected argument count has reached its cycle count and inline the opcode body, replacing each Arg(n) with the supplied argument (hoisting any argument used more than once into a temporary so effects are not duplicated).
- **fold** — constant-fold and simplify: string-method folding, coercion collapse, unary/const arithmetic.
- **inline** — SSA-style cleanup: a slot or temporary written once and read once gets propagated into its single use; a pure definition with zero remaining reads is dropped, an effectful one demoted to a bare effect.
Each round feeds the next. Count residual partials between rounds; the moment a round fails to reduce the count, you have converged (cap it at, say, six rounds). For partials that are genuinely under-applied and can never saturate, an optional lambda pass eta-expands them into explicit arrows with fresh parameters, so even the leftover plumbing reads as real JavaScript. This is the line between “we emulated the loop” and “we devirtualized it”: the flatten walk transcribes the trace, and saturation collapses the composition that trace was threading together, so a deeply-curried combinator nest becomes a single readable expression.
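The convergence loop itself is generic. Below is a minimal sketch of the driver with a toy IR to exercise it; the toy only has `inline` and `fold` passes (no real saturation), and the `converge` signature is an assumption rather than the original tool's interface.

```javascript
// Run all passes per round; stop when a round fails to shrink the residual count.
function converge(ir, passes, countResidual, maxRounds = 6) {
  let current = ir;
  let remaining = countResidual(current);
  let rounds = 0;
  while (rounds < maxRounds) {
    for (const pass of passes) current = pass(current);
    rounds += 1;
    const after = countResidual(current);
    const progressed = after < remaining;
    remaining = after;
    if (!progressed) break;
  }
  return { ir: current, rounds, remaining };
}

// Toy IR: name -> number, or { add: [operand, operand] } where an operand
// is a number or the name of another definition.
const mapValues = (obj, f) =>
  Object.fromEntries(Object.entries(obj).map(([k, v]) => [k, f(v)]));
const inline = (ir) => mapValues(ir, (e) =>
  typeof e === "number"
    ? e
    : { add: e.add.map((x) => (typeof x === "string" && typeof ir[x] === "number" ? ir[x] : x)) });
const fold = (ir) => mapValues(ir, (e) =>
  typeof e !== "number" && e.add.every((x) => typeof x === "number") ? e.add[0] + e.add[1] : e);
const residual = (ir) => Object.values(ir).filter((e) => typeof e !== "number").length;

const result = converge(
  { a: 1, b: { add: ["a", 2] }, c: { add: ["b", "b"] }, d: { add: ["c", "ext"] } },
  [inline, fold],
  residual
);
console.log(result);
```

Here `d` depends on `ext`, which is never defined, so it plays the role of a genuinely under-applied partial: the loop stops after the first round that leaves the residual count unchanged instead of spinning until the cap.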
### Recovering control flow
Pure expressions are not enough — the program loops and catches exceptions, and a faithful transcription has to show that. The VM has **no jump opcodes**; control flow is encoded the same way as everything else, as opcodes whose bodies happen to have a particular shape. A loop opcode is a thunk wrapping a while whose test and body are both forced arguments. A try/catch opcode is a thunk wrapping a try whose handler re-invokes its argument with the caught value. The classifier tags exactly these as the While and TryCatch roles, and the lowering keeps while, for…in and try/catch as first-class IR nodes (WhileStmt, ForInStmt, TryCatchStmt) instead of flattening them into calls. When such an opcode saturates you emit a real while (…) { … } or try { … } catch (__err) { … }.
Loops force one bit of extra care. Value propagation normally threads a slot’s known value forward to its next read — but inside a loop body a slot written on one iteration is read on the next, so propagating a single-pass value would be wrong. Before propagating into a while, scan the loop for every slot it writes and mark those off-limits, so the loop’s own state never gets constant-folded out from under it.
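That write-set scan is small enough to show in full. The statement shapes (`SlotWrite`, `WhileStmt`) follow the node names used above, but the exact fields and the `safeKnownValues` helper are assumptions made for this sketch.

```javascript
// Collect every slot written anywhere inside a statement list, including nested loops.
function writtenSlots(stmts, out = new Set()) {
  for (const s of stmts) {
    if (s.kind === "SlotWrite") out.add(s.slot);
    if (s.kind === "WhileStmt") writtenSlots(s.body, out);
  }
  return out;
}

// Of the values known before the loop, keep only those the loop never overwrites.
function safeKnownValues(known, loop) {
  const blocked = writtenSlots(loop.body);
  return new Map([...known].filter(([slot]) => !blocked.has(slot)));
}

const known = new Map([
  [4, 0],           // a loop counter, initialised before the loop
  [7, "webdriver"], // a decoded constant the loop only reads
]);
const loop = {
  kind: "WhileStmt",
  test: { kind: "SlotRead", slot: 4 },
  body: [
    { kind: "SlotRead", slot: 7 },
    { kind: "SlotWrite", slot: 4, value: "slot4 + 1" },
  ],
};
const safe = safeKnownValues(known, loop);
console.log([...safe]);
```

Slot 7 may still be folded into the loop body, while slot 4 keeps its read so the counter is not frozen at its pre-loop value.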
### Bootstrapping and synthesized constants
The VM could trivially manufacture any ASCII character with a String.fromCharCode opcode — but that would be too telling, so it builds one itself. From coerced primitives it already has the characters c o n s t r u f m C h a d e, which is just enough to spell two words: constructor and fromCharCode. Indexing a string with ["constructor"] hands you the String constructor; a second index, ["fromCharCode"], reaches the static method. Where does an uppercase C come from? From a neat use of window.atob: encode something like "128false" and harvest characters out of the decoded bytes.
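The first half of that bootstrap can be reproduced with nothing but coerced primitives. In the sketch below the word "constructor" is spelled from harvested characters; "fromCharCode" is written literally, because the uppercase C and the h need the atob step, which is not reproduced here.

```javascript
// Strings obtained purely by coercing primitives.
const f = String(false);     // "false"
const t = String(true);      // "true"
const u = String(undefined); // "undefined"
const o = String({});        // "[object Object]"

// c o n s t r u c t o r, one harvested character at a time.
const ctorName =
  o[5] + o[1] + u[1] + f[3] + t[0] + t[1] + t[2] + o[5] + t[0] + o[1] + t[1];

// Indexing any string with "constructor" yields the String constructor.
const StringCtor = ""[ctorName];
const fromCharCode = StringCtor["fromCharCode"];
const word = fromCharCode(119, 101, 98);
console.log(ctorName, StringCtor === String, word);
```

Once `fromCharCode` is reachable, every further character costs only a number, which is why the VM stops harvesting and switches to decoding pools.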
With fromCharCode in hand the VM builds more vocabulary — and the bulk of it does not come character by character. It comes from **encoded Base64 blob pools** assembled during bootstrap. If they were plain Base64 you would atob them and read the words, so instead each blob is Base64-decoded and then run through a short chain of byte transformations. Those chains are *polymorphic* between samples — one pool might be SwapAdjacent, RotateRight, XorChain, XorChain. You do not guess the chain: you read it straight out of the IR, because by now the fold opcodes are named and their primitives, order, immediates and key lengths are sitting in the decode routine. You recognise each pass by what its loop body does: a loop combining |, <<, >>, & with 255 and 7 is a bit-rotate; a loop whose only real operator is ^ against a key byte is an XorChain; one writing to both i and i+1 is SwapAdjacent. Re-derive the chain per pool, pull the immediates and keystreams from the IR, replay the primitives, and out falls the real vocabulary: getImageData, createOscillator, __SENTRY__, webdriver, MAX_VERTEX_TEXTURE_IMAGE_UNITS, OfflineAudioContext, userAgent. So the VM *does* have a constant pool — it is just encoded. Once you substitute the decoded strings back in, an opaque signal read becomes window["webdriver"], canvas["getContext"], or "__SENTRY__" in window.
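Replaying a chain is mechanical once the passes are identified. The sketch below assumes plausible definitions for the three primitives (pairwise swap, per-byte bit rotation, repeating-key XOR) and invented immediates; the real semantics, keys, and order have to be read out of the recovered IR for each pool. The demo blob is built locally by running the inverse chain, so nothing here comes from a real sample.

```javascript
// Assumed primitive semantics; each takes and returns a Uint8Array.
const swapAdjacent = (bytes) => {
  const out = Uint8Array.from(bytes);
  for (let i = 0; i + 1 < out.length; i += 2) [out[i], out[i + 1]] = [out[i + 1], out[i]];
  return out;
};
const rotateRight = (n) => (bytes) => bytes.map((b) => ((b >>> n) | (b << (8 - n))) & 255);
const rotateLeft = (n) => rotateRight(8 - n);
const xorChain = (key) => (bytes) => bytes.map((b, i) => b ^ key[i % key.length]);

const toBytes = (s) => Uint8Array.from(s, (c) => c.charCodeAt(0));
const toText = (bytes) => String.fromCharCode(...bytes);
const run = (chain, bytes) => chain.reduce((acc, pass) => pass(acc), bytes);

// Decode chain in the order the text gives: SwapAdjacent, RotateRight, XorChain, XorChain.
const decodeChain = [swapAdjacent, rotateRight(3), xorChain([0x5a, 0x13]), xorChain([0x77])];
// Its inverse, used only to manufacture a demo blob: inverses in reverse order.
const encodeChain = [xorChain([0x77]), xorChain([0x5a, 0x13]), rotateLeft(3), swapAdjacent];

const blob = btoa(toText(run(encodeChain, toBytes("webdriver"))));
const naive = atob(blob);                                      // plain Base64 decode: garbage
const decoded = toText(run(decodeChain, toBytes(naive)));      // full chain: plaintext
console.log(blob, decoded);
```

Swapping two passes or changing one immediate in `decodeChain` turns the output back into noise, which is what makes a readable result such a strong correctness signal.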
### Side effects, cleanup, and checking the work
Everything folded so far has been pure — takes thunks, returns a value. But the program eventually has to *do* something, and the first thing it does is install itself with a saturated three-argument opcode of the shape window[v24()] = v25(). There is no special “install” instruction; the side effect is just an opcode that fired. When you transcribe it, an Assign / Call / New cannot be treated like a value — dropping or reordering it changes the program. So the rule is: the moment an opcode body produces an effect, pin it into the instruction stream in execution order, and let only its result flow onward. Value propagation may thread constants through pure slots, but it may never cross an effect it cannot prove pure — the instant you hit an opaque call or a window write, forget what you knew about the affected slot. IIFEs follow the same model: an Iife-tagged body is decomposed back into a sequence of instructions and re-emitted as (function(){…})() rather than inlined.
Faithfulness leaves litter. The bootstrap writes a constant into a slot[2] frame cell, reads it back a few instructions later, writes the next one, hundreds of times over. A dedicated slot[2] propagation pass walks the stream tracking each frame cell’s known value and substitutes reads, invalidating on any full replacement, frame push, or opaque call; a dead-store pass then removes pure writes that are never read before the next write. Finally, **verify**. It is easy to be confidently wrong about a decode chain, so run the recovered IR through a small abstract interpreter that speaks the same value universe — numbers, byte strings, ArrayBuffer/Uint8Array views, objects, arrays, atob, String.fromCharCode, and the frame model. If the chain is right, the interpreter reproduces the same readable vocabulary you derived statically; if you misread a pass or an immediate, the bytes come out as garbage and you know immediately. That is the difference between “this looks like a SwapAdjacent” and “running it produces webdriver.”
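The two cleanup passes can be sketched over a flat instruction list. The instruction kinds (`CellWrite`, `CellRead`, `OpaqueCall`, `Const`) are illustrative names, and only the opaque-call invalidation rule is modelled; full replacement and frame pushes would need the same treatment.

```javascript
// Forward pass: replace reads of a frame cell whose value is known.
function propagateCells(stream) {
  const known = new Map();
  return stream.map((ins) => {
    if (ins.kind === "CellWrite") { known.set(ins.cell, ins.value); return ins; }
    if (ins.kind === "OpaqueCall") { known.clear(); return ins; } // forget everything
    if (ins.kind === "CellRead" && known.has(ins.cell)) {
      return { kind: "Const", dest: ins.dest, value: known.get(ins.cell) };
    }
    return ins;
  });
}

// Backward pass: drop writes that are overwritten before anything can read them.
function removeDeadStores(stream) {
  const overwritten = new Set(); // cells rewritten later with no read in between
  const kept = [];
  for (let i = stream.length - 1; i >= 0; i--) {
    const ins = stream[i];
    if (ins.kind === "CellWrite") {
      if (overwritten.has(ins.cell)) continue; // never observed: drop it
      overwritten.add(ins.cell);
    } else if (ins.kind === "CellRead") {
      overwritten.delete(ins.cell);
    } else if (ins.kind === "OpaqueCall") {
      overwritten.clear(); // the callee may read any cell
    }
    kept.push(ins);
  }
  return kept.reverse();
}

const stream = [
  { kind: "CellWrite", cell: 0, value: "c" },
  { kind: "CellRead", cell: 0, dest: "t0" },
  { kind: "CellWrite", cell: 0, value: "o" },
  { kind: "CellRead", cell: 0, dest: "t1" },
  { kind: "OpaqueCall" },
  { kind: "CellRead", cell: 0, dest: "t2" },
];
const cleaned = removeDeadStores(propagateCells(stream));
console.log(cleaned.map((ins) => ins.kind));
```

The first write disappears because propagation removed its only reader; the second write survives because the opaque call might observe it, and the final read stays a read because nothing is known after the call.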
Zoom out and the whole pipeline is: lower each curried body to a tiny expression IR, classify every body into a role, shadow-execute the triplet stream to transcribe the trace, then saturate and clean until the composition collapses into readable JavaScript — all statically, without ever running the original sample. What started as a wall of Base64 ends as a program you can read: the cipher it builds, the pools it unpacks, the global it installs, and the list of every automation-framework, WebGL and DOM probe it goes looking for. That readability is the prize — the moment signals stop being opaque slot indices, the script stops being a black box and becomes something you can reason about. This write-up distills the excellent walkthrough at debug.cat — Imperva Reversal, which is well worth reading in full.
### Example
```javascript
// The single most instructive bug when folding a curried-thunk VM:
// the deferred-evaluation (IIFE-fold) trap.

// An opcode body, as it appears in the sample:
function (A) {
  return function (E) {
    return (function (A) {
      return function () { return A(arguments); };
    })(A(E)); // <-- the inner IIFE forces A(E) exactly ONCE
  };
}

// WRONG fold. The naive folder lifts A(E) into the returned closure,
// so it re-runs on every call with a different `arguments`:
function (A) {
  return function (E) {
    return function () { return A(E)(arguments); }; // re-defers A(E) -- BUG
  };
}

// CORRECT fold. Bind A(E) once, at the depth the IIFE forced it,
// then close over the bound value:
function (v7) {
  return function (v8) {
    var v9 = v7(v8); // forced once, here
    return function () { return v9(arguments); };
  };
}

// Why it matters: this bug never throws. It silently produces an IR that
// disagrees with the VM only at runtime -- you find out when your recovered
// constant pool decodes to garbage instead of "webdriver".
```
### FAQ
**Q: Why is this called devirtualization and not just deobfuscation?**
Deobfuscation cleans up code that is still structured as normal functions and control flow — renaming variables, decoding string arrays, unflattening switch dispatchers. A virtualized program has none of that: the logic has been compiled into bytecode that a small interpreter walks at runtime. Devirtualization means reconstructing that interpreter’s instruction set and replaying its program statically until you recover real JavaScript. It is a superset of deobfuscation aimed at a much harder shape.
**Q: Do you have to run the script to reverse it?**
No — the whole point of the approach is that it is static. You shadow-execute the bytecode: walk the triplet stream in the exact order the VM would, but keep abstract values (opcodes, partial applications, finished IR expressions) in the slots instead of live functions. Because curried thunks do nothing until saturated, you only emit an instruction when a partial reaches its known cycle count. Verification runs the recovered IR in a small sandboxed abstract interpreter, never the original sample.
**Q: Why are the opcodes curried thunks?**
Currying defers all behaviour into composition. Each opcode takes one argument and returns another function, so no single body reveals what it computes — the meaning only emerges from how the triplet loop threads arguments between them over thousands of steps. It also inflates the instruction stream into the megabytes, since most instructions exist only to feed one more argument into a partial. Saturation collapses all of that back down.
**Q: How do you recover the constants if there is no literal pool?**
Two ways. Small constants are synthesized by coercing JavaScript primitives — for example false + window becomes the string "false[object Window]", and indexing it yields a single character. Larger vocabulary (DOM method names, WebGL constants, automation-framework identifiers) lives in encoded Base64 blob pools that are decoded and then run through a polymorphic chain of byte transformations such as swap-adjacent, rotate, and XOR. You read the exact chain out of the recovered IR and replay the primitives to get the plaintext back.
**Q: Why verify with an abstract interpreter instead of trusting the IR?**
Because it is easy to be confidently wrong. A misread byte-transform pass or a wrong immediate produces an IR that looks plausible but decodes its constant pools to garbage. Running the recovered IR through an interpreter that understands the same value universe (typed arrays, atob, fromCharCode, the frame model) gives you cheap, decisive proof: if the readable vocabulary comes back out, your reconstruction is faithful; if garbage comes out, you know exactly where to look.
---
## What Is Lua Bytecode Virtualization?
URL: https://scrappey.com/qa/reverse-engineering/what-is-lua-bytecode-virtualization
**Lua bytecode virtualization is an obfuscation technique that replaces Lua's standard virtual machine with a custom, secret one, so the compiled script can only be run by an interpreter the protector ships alongside it.** Normal Lua compiles source into luac bytecode - a compact instruction set executed by the stock Lua VM, and readable with off-the-shelf tools. A virtualizer rewrites that bytecode into its own private instruction set, encodes it, and bundles a bespoke dispatch loop to run it. The original logic no longer lives in any function you can decompile; it lives in the semantics of opcodes only the custom VM understands. Recovering it - devirtualization - means reverse-engineering that VM. This entry follows the foundation laid out in birk.blog's Lua Virtualization series (Part 1: the internals of the Lua VM), which builds toward devirtualizing the well-known Luraph protector.
### Quick facts
- **Base VM:** Lua 5.1 - a register-based VM whose instructions mirror its C API
- **Compile path:** lua source -> luac bytecode (inspect with luac -l); decompile with unluac
- **Instruction format:** 32-bit: 6-bit opcode + A (8b), B (9b), C (9b), or Bx (18b) / sBx (signed)
- **What virtualization changes:** Renumbered/secret opcodes + encoded bytecode + a custom dispatch loop
- **Why stock tools fail:** unluac and luac -l only know the standard opcode set, not the custom VM
### From Lua source to luac bytecode
Lua is popular in games, config systems, and embedded scripting because it is a small VM whose instruction set mirrors its C API, making C/C++ bindings easy. When you run Lua, the source is compiled to luac bytecode. You can see it with the luac -l listing flag. Compiling print("Hello World!") on Lua 5.1 yields four instructions:
```text
GETGLOBAL 0 -1 ; R(0) := _G["print"]
LOADK 1 -2 ; R(1) := "Hello World!"
CALL 0 2 1 ; R(0)(R(1))
RETURN 0 1
```
Each instruction is an **opcode** plus register indexes. A *negative* index points into the constant table instead of the register stack, which is why GETGLOBAL 0 -1 takes the first constant ("print") as a global name and writes that global's value to register 0. lopcodes.h documents the semantics, e.g. OP_CALL is R(A), ... ,R(A+C-2) := R(A)(R(A+1), ... ,R(A+B-1)). Reverse the process with the decompiler **unluac** (java -jar unluac.jar luac.out) and you get working source back - but not byte-identical to the original, because several source forms compile to the same bytecode. That round-trip working at all is exactly what virtualization is designed to break.
### How the Lua VM is laid out
To understand what a virtualizer is hiding, you first need the stock layout. From lopcodes.h, an instruction is 32 bits packed as a **6-bit opcode** plus operands: A is 8 bits, B and C are 9 bits each, and Bx is the 18-bit fusion of B and C (with sBx its signed form). Constants come in a few types from lua.h - LUA_TNIL, LUA_TBOOLEAN, LUA_TNUMBER (every number is a double), and LUA_TSTRING (in a Lua 5.1 chunk the dumped size counts the trailing \0, so the string's actual length is that size minus 1).
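Unpacking those fields is a few shifts and masks. The sketch below follows the Lua 5.1 layout in lopcodes.h (from the low bits up: opcode, A, C, B); `decodeInstruction` and `encodeABC` are names chosen for this example, and the sample words are built by hand rather than read from a real chunk.

```javascript
// Bias used for the signed form: sBx = Bx - MAXARG_sBx, with MAXARG_sBx = (2^18 - 1) >> 1.
const MAXARG_sBx = 131071;

function decodeInstruction(word) {
  const bx = (word >>> 14) & 0x3ffff;
  return {
    opcode: word & 0x3f,       // bits 0-5
    A: (word >>> 6) & 0xff,    // bits 6-13
    C: (word >>> 14) & 0x1ff,  // bits 14-22
    B: (word >>> 23) & 0x1ff,  // bits 23-31
    Bx: bx,                    // bits 14-31, the C and B fields read as one value
    sBx: bx - MAXARG_sBx,
  };
}

// Inverse helper for building test words in the ABC format.
function encodeABC(opcode, A, B, C) {
  return (opcode | (A << 6) | (C << 14) | (B << 23)) >>> 0;
}

// LOADK is opcode 1 in Lua 5.1; "LOADK 1 -2" in a luac -l listing means A = 1, Bx = 1,
// because the listing prints a constant index as -1 - Bx.
const loadk = decodeInstruction(1 | (1 << 6) | (1 << 14));
// CALL is opcode 28 in Lua 5.1; "CALL 0 2 1" means A = 0, B = 2, C = 1.
const call = decodeInstruction(encodeABC(28, 0, 2, 1));
console.log(loadk, call);
```

A virtualized chunk keeps none of these guarantees, but a decoder like this is the baseline you compare a custom format against when working out what the protector changed.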
Code is organised into **function prototypes**. There is one main function with everything nested inside it; each prototype holds its instruction array (code + count), a constant array (sizek), nested sub-functions (sizep), and - unless compiled with the -s strip flag - debug data (line info, local names, upvalue names, source name). The subtle part is **upvalues and closures**. An upvalue is a variable from an enclosing scope captured by a nested function. When a CLOSURE opcode runs, the instructions immediately after it are not executed - they are metadata describing each captured upvalue: a MOVE means the upvalue is local to the enclosing function (in_stack = 1, the B register is its stack index), and a GETUPVAL means it comes from a further-out scope (in_stack = 0, B indexes the enclosing function's upvalue list). So a closure with 4 upvalues is followed by 4 metadata instructions that the VM skips. Knowing which trailing instructions are data, not code, is essential the moment the opcodes stop being standard.
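Telling code from upvalue metadata is a matter of skipping ahead after each CLOSURE. This sketch assumes instructions have already been decoded into `{ opcode, A, B, Bx }` objects and that each sub-prototype exposes its upvalue count as `nups`; the opcode numbers are the stock Lua 5.1 ones, and `splitCodeAndUpvalueData` is a hypothetical helper name.

```javascript
const OP_MOVE = 0, OP_GETUPVAL = 4, OP_RETURN = 30, OP_CLOSURE = 36;

function splitCodeAndUpvalueData(code, protos) {
  const executed = []; // program counters the VM really executes
  const captures = []; // decoded upvalue descriptions
  for (let pc = 0; pc < code.length; pc++) {
    const ins = code[pc];
    executed.push(pc);
    if (ins.opcode !== OP_CLOSURE) continue;
    const nups = protos[ins.Bx].nups;
    for (let k = 1; k <= nups; k++) {
      const meta = code[pc + k];
      captures.push({
        closureAt: pc,
        inStack: meta.opcode === OP_MOVE ? 1 : 0, // MOVE: local of the enclosing function
        index: meta.B,                            // stack index or enclosing upvalue index
      });
    }
    pc += nups; // the VM skips these pseudo-instructions
  }
  return { executed, captures };
}

const code = [
  { opcode: OP_CLOSURE, A: 0, Bx: 0 },
  { opcode: OP_MOVE, A: 0, B: 1 },     // metadata: capture local in register 1
  { opcode: OP_GETUPVAL, A: 0, B: 0 }, // metadata: capture enclosing upvalue 0
  { opcode: OP_RETURN, A: 0, B: 1 },
];
const split = splitCodeAndUpvalueData(code, [{ nups: 2 }]);
console.log(split);
```

A disassembler that misses this rule will print the two metadata entries as real MOVE and GETUPVAL instructions, and every later offset it reasons about will be off.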
### What "virtualization" means as obfuscation
Virtualization is the heaviest tier of code protection. Instead of merely renaming locals or encoding strings (ordinary deobfuscation territory), a virtualizer compiles the program down to a **brand-new instruction set that only its own interpreter understands**. In practice that means: the opcodes are renumbered or fully redesigned (so GETGLOBAL is no longer opcode 5, and may not exist as a single opcode at all), the bytecode blob is encoded or encrypted, and the protector ships a hand-written **dispatch loop** - a big switch/handler table - that decodes and executes the private instructions at runtime. Luraph is the best-known commercial example for Lua, cited here purely as a public reference for the technique.
The effect is that every stock tool stops working. luac -l and **unluac** only know the standard opcode set, so pointed at a virtualized chunk they produce garbage or refuse outright - there is no longer a one-to-one map from bytes to known semantics. The logic you want is smeared across the custom opcode handlers and the order the dispatch loop invokes them, which is the same structural trick used by obfuscated JavaScript VMs in the browser - including the interrogator scripts that anti-bot vendors ship for fingerprinting. The language differs; the shape is identical.
### How you devirtualize it
Recovering readable code from a virtualized script means reconstructing the custom VM, then replaying its program statically. A practical starting point is the Lua source itself: lvm.c handles opcode execution, ldump.c serialises bytecode, and print.c renders the luac -l listing. Patching ldump.c and print.c to emit extra debug output teaches you exactly how a chunk is laid out and what data it carries - the groundwork for parsing a non-standard one. From there the job is to map each custom opcode back to a known operation (load, call, arithmetic, jump, closure), recover the constant pool, and lift the dispatch trace into a high-level form you can read - the same flatten-and-saturate approach used to devirtualize a JavaScript VM.
Why does this matter beyond game scripts? Because the same VM-obfuscation pattern guards a lot of client-side anti-bot logic on the web, and teams scraping protected endpoints often hit it. Reversing a bespoke VM per target is slow and brittle, which is why many developers skip the reverse-engineering arms race entirely and let a managed web-data API such as Scrappey run a real browser and return the rendered result - the obfuscated VM executes as intended, server-side, and you consume the output. Reversing it yourself remains the right path when you need to understand or reimplement the logic; offloading it is the pragmatic path when you just need the data.
### Example
```text
# $ luac -l hello.lua (Lua 5.1) -> human-readable standard bytecode
main <hello.lua:0,0> (4 instructions)
1 GETGLOBAL 0 -1 ; R0 := _G["print"] (-1 => constant index, not a register)
2 LOADK 1 -2 ; R1 := "Hello World!"
3 CALL 0 2 1 ; R0(R1), 0 results
4 RETURN 0 1
# Each 32-bit instruction packs, from the low bits up: [ opcode:6 | A:8 | C:9 | B:9 ] (Bx = the C and B fields read as one 18-bit value)
# A virtualization obfuscator (e.g. Luraph) renumbers/redesigns these opcodes,
# encodes the bytecode, and ships its OWN dispatch loop to run them. Result:
# unluac and "luac -l" no longer understand the chunk. Recovering the logic
# means reverse-engineering that custom VM -- i.e. devirtualization.
```
### FAQ
**Q: What is the difference between Lua bytecode and Lua virtualization?**
Lua bytecode (luac) is the standard compiled form of a Lua program, executed by the normal Lua VM and readable with tools like luac -l and the unluac decompiler. Virtualization goes a step further: it recompiles the program into a custom, secret instruction set and ships a bespoke interpreter to run it, so the standard tools no longer apply. Plain bytecode is a known format; a virtualized chunk is a private one you must reverse-engineer.
**Q: Can I just use unluac to decompile a virtualized Lua script?**
No. unluac and luac -l only understand the standard Lua opcode set. A virtualizer renumbers or redesigns the opcodes, encodes the bytecode, and runs it through its own dispatch loop, so stock tools produce garbage or fail. Recovering the original logic requires reconstructing the custom VM - mapping its opcodes back to real operations, recovering the constant pool, and lifting the dispatch trace into readable code.
**Q: Why does Lua VM virtualization matter outside of games?**
Because the exact same pattern - compile logic to a private instruction set and hide it behind a custom interpreter - is used to obfuscate client-side anti-bot and fingerprinting scripts in the browser. The techniques for understanding the Lua VM transfer directly to devirtualizing an obfuscated JavaScript VM, which is why this is a foundational reverse-engineering skill rather than a Lua-only curiosity.
---
## What Are Common Lua Obfuscation Techniques?
URL: https://scrappey.com/qa/reverse-engineering/lua-obfuscation-techniques
**Lua obfuscation is the practice of rewriting a script so it still runs identically but actively resists reverse-engineering tools, ranging from cheap constant-hiding tricks up to full bytecode virtualization.** The guiding principle is not that the code becomes unbreakable - it is that it withstands *automation*. If the obfuscation turns a five-minute read into a week of manual expert work that cannot be scripted, it has done its job. The techniques below, drawn from a public survey of real-world Lua protectors, stack on top of each other: each one alone is weak, but combined they make the code irregular enough to defeat generic deobfuscators.
### Quick facts
- **Constant hiding:** Mixed Boolean-Arithmetic (MBA) and table-length (#{...}) numeric encoding
- **Source packing:** load / loadstring with byte-escaped strings (the weakest layer)
- **Noise:** Junk code: dead branches, pcall nonsense, no-op load() blocks
- **Structure:** Lambda-return entry points + control-flow flattening (state-machine while loops)
- **Strongest:** VM-based obfuscation - reimplement the Lua interpreter (LBI / IronBrew lineage)
### Hiding constants: MBA and table lengths
The cheapest layer hides literal values. **Mixed Boolean-Arithmetic (MBA)** rewrites a constant as a tangle of arithmetic - since Lua 5.1 lacks bitwise ops, it leans on multiply/divide/modulo/subtract. The catch: luac constant-folds a fully-inline expression straight back to the original number, so obfuscators force at least one operand to be a *variable* so the compiler must emit the full instruction sequence (LOADK, MUL, SUB, ADD...) instead of the folded constant. A second trick encodes numbers as the **length of a table**: #{"a", 12, {}, "foo", 42, true, "x", 99} evaluates to 8 at runtime, so the number 8 never appears in source - which breaks static analyzers that do not execute the code.
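From the analyst's side, the MBA layer falls to constant propagation: once the value of the forced variable is known, the whole expression folds again. The sketch below is a minimal illustration, assuming the expression uses only `+ - * / %` on non-negative numbers, a subset whose syntax and results happen to match between Lua 5.1 and Python. The function name `fold` is hypothetical.

```python
import ast
import operator

# Arithmetic operators shared by Lua 5.1 and Python for non-negative operands.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Mod: operator.mod}


def fold(expr, known):
    """Evaluate an arithmetic expression, substituting known variable values."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return known[node.id]  # KeyError if the variable's value is unknown
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported syntax: " + ast.dump(node))
    return walk(ast.parse(expr, mode="eval"))


mba = "((((a*10) - (102%23) + 9)*2 - (3*7+14)) + ((144/2)-(8*4))) - (((99-30)/3)-2)"
print(fold(mba, {"a": 7}))  # 122.0
```

A real deobfuscator would work on the Lua AST or bytecode rather than on source text, but the principle is the same: track which locals hold constants, then fold.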
### Packing, junk, and lambda entry points
The most abused (and weakest) technique is load/loadstring packing: the source is byte-escaped into a string like "\112\114\105..." and re-parsed at runtime - trivially recovered by just printing the decoded string. **Junk code** is more annoying than hard: dead branches (if 1 == 2 then ...), pcall wrappers around code that always errors, and no-op load() blocks. None of it executes, but a parser cannot tell junk from real logic, forcing a human to triage every block. A **lambda-return entry point** wraps the program in an anonymous function whose helpers are pulled from a table by obscure integer keys at runtime, adding a layer of indirection that combines nastily with junk and MBA.
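As a rough illustration of how little the packing layer protects, the byte-escaped string can be decoded without running any Lua at all. This is a minimal sketch that handles only decimal escapes (`\ddd`); the function name `decode_lua_escapes` and the sample payload are made up for the example.

```python
import re


def decode_lua_escapes(packed):
    """Turn decimal escapes such as \\112 into the characters they encode."""
    return re.sub(r"\\(\d{1,3})", lambda m: chr(int(m.group(1))), packed)


packed = r"\112\114\105\110\116\40\39\72\105\39\41"
print(decode_lua_escapes(packed))  # print('Hi')
```

Nested packing (a decoded string that itself calls loadstring) just means running the decoder again on the result.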
### Control flow flattening and the VM endgame
**Control-flow flattening** replaces ordinary sequential code with a state machine: a while true loop plus a state variable and a chain of if state == N branches, so the linear order of operations is scattered across cases dispatched by a value. The more states, the harder it is to follow - and it composes naturally with virtualization. The endgame is **VM-based obfuscation**: reimplement the Lua interpreter from scratch and run the program as custom (or re-encoded) bytecode on it. The first public Lua VM obfuscator, LBI, appeared over a decade ago; descendants like Rerubi, FiOne and FiThree followed, and most obfuscators seen in the wild today derive from **IronBrew** - identifiable by a signature quirk in its OP_JMP optimisation that copies forward into commercial and "skidded" protectors. This is the same pattern that hides client-side anti-bot logic in the browser, which is why a managed web-data API such as Scrappey - which runs the real script server-side and returns the result - sidesteps the need to reverse it at all. This survey follows birk.blog's Lua Virtualization Part 2.
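For the simplest flattened code, where every block ends by assigning a constant next state, the original order can be recovered by following those assignments. The sketch below assumes the blocks and their next states have already been extracted into a dictionary (a hypothetical intermediate form); conditional transitions would need a graph instead of a list.

```python
def linearize(blocks, start):
    """Follow next-state links from `start` and return block labels in execution order."""
    order, state, seen = [], start, set()
    while state is not None and state not in seen:
        seen.add(state)
        label, next_state = blocks[state]
        order.append(label)
        state = next_state
    return order


# States deliberately numbered out of order, as a flattener would emit them.
blocks = {
    7: ('print("B")', 2),
    0: ('print("A")', 7),
    2: ('print("C")', None),  # None = the branch that breaks out of the loop
}
print(linearize(blocks, start=0))
```

The `seen` set stops the walk if the state machine loops back on itself, which is where a real unflattener would switch to rebuilding loops rather than straight-line code.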
### Example
```lua
-- 1) MBA: force a variable so luac can't fold it back to 122
local a = 7
print(((((a*10) - (102%23) + 9)*2 - (3*7+14)) + ((144/2)-(8*4))) - (((99-30)/3)-2))
-- 2) Table-length encoding: the number 8 never appears literally
local n = #{"a", 12, {}, "foo", 42, true, "x", 99} -- = 8
-- 3) loadstring packing (weakest layer - decode and read)
loadstring("\112\114\105\110\116\40\39...\41")() -- = print('Hello World!'); Lua 5.2+ uses load() for strings
-- 4) Control-flow flattening: linear code becomes a state machine
local state = 0
while true do
if state == 0 then print("A"); state = 1
elseif state == 1 then print("B"); break end
end
```
### FAQ
**Q: Does Lua obfuscation make code impossible to reverse?**
No, and good obfuscation does not try to. The realistic goal is to resist automation - to make recovery require slow, manual, expert work that cannot be scripted. If an obfuscator turns a five-minute read into a week of effort that no generic deobfuscator can shortcut, it has succeeded, even though a determined human can still reverse it.
**Q: Why does luac undo my MBA-obfuscated constants?**
Because the Lua compiler constant-folds expressions made entirely of inline literals, collapsing the whole arithmetic tangle back to the original number. To prevent it, force at least one operand to be a variable (e.g. local a = 7), which makes the compiler emit the full instruction sequence instead of the folded constant.
**Q: What is the strongest Lua obfuscation technique?**
VM-based obfuscation (virtualization): the protector reimplements the Lua interpreter and runs your program as custom or re-encoded bytecode on it, so standard tools like unluac no longer apply. Most in-the-wild Lua VMs descend from IronBrew, recognisable by a signature quirk in its OP_JMP handling.
---
## What Is Dual-VM Lua Obfuscation?
URL: https://scrappey.com/qa/reverse-engineering/what-is-dual-vm-lua-obfuscation
**Dual-VM Lua obfuscation runs your program through two stacked virtual machines - a deserialization VM that decodes an encrypted blob into an instruction stream, and a "real" VM that executes it - wrapped in anti-tamper traps that crash on inspection.** It is the architecture behind the most resilient commercial Lua protectors. Luraph (in market since 2017 and largely immune to public deobfuscation) is the best-known example, referenced here purely as a public case study. Statically analyzing one means controlling the input to predict the bytecode, beating anti-beautify tricks, inlining factorized helper functions, stripping junk and control-flow noise, and finally lifting the VM's renumbered opcodes back to standard Lua semantics. This follows birk.blog's Lua Virtualization Part 3.
### Quick facts
- **Architecture:** Two stacked VMs sharing one handler: a deserialization VM feeding a real VM
- **Analysis start:** Minimal input (local foo = "bar" -> one LOADK) to get a baseline; 20 bytes -> ~65 KiB
- **Anti-beautify:** Invalid escapes (\!, \:, \#) and semicolon traps that break formatters
- **Dispatch loop:** repeat ... until false with nested if Enum < N - Enum is the opcode id
- **Anti-tamper:** pcall(unpack, {}, 0, 2147483647) segfaults; sensitive opcodes crash on read
### Getting a readable baseline
The way into a virtualized chunk is to control the input. Obfuscating something minimal like local foo = "bar" - which compiles to a single LOADK - yields a ~65 KiB file from a 20-byte source, but you know exactly what it must ultimately do. First you fight **anti-beautify** measures: invalid string escapes (\!, \:, \#) that Lua silently ignores but formatters choke on, and semicolon traps like (foo)[1] = "bar";(bar)[1] = "foo"; where stripping the semicolon makes Lua reinterpret "bar"(bar) as a call and crash. Then you locate the entry point - here a table returned and invoked by colon syntax (:Mj()(...)), which implicitly passes the table as Self.
### Peeling the abstraction: inlining, junk, control flow
The entry point is almost entirely nested function calls into factorized helpers. You recover the logic by **inlining**: a helper called once that just returns U[30446] becomes that index inline, peeling one layer at a time. Layered on top is **control-flow obfuscation** - helpers driven by a state variable (if L > 47 then ... L = 66) that you unwind statically when simple or dynamically when not - and **junk code** you detect by deleting blocks and re-running to see if the script still works. Unpacking main reveals it mostly populates a helper table (h_funcs) - string.match, string.byte, bit ops, the constant decryptor - which you rename from opaque integer keys (h_funcs[12] -> h_funcs["safe_tbl_unpack"]) to readable identifiers. This renaming, by tracing what each function does, is the most time-consuming part.
### The dispatch loop and lifting opcodes
Every VM is a fetch-execute loop. Here it is a repeat ... until false with a tree of if Enum < N branches, where **Enum** is the opcode id read from each instruction (e.g. 45 might be OP_ADD, 12 OP_MOVE). Using the standard Lua opcode definitions (covered in Part 1 of the same birk.blog series), you lift renumbered Enums back to real opcodes: some map directly (Stk[REG_B[VIP]] = nil is OP_LOADNIL), some are large (a full OP_CLOSURE with upvalue handling), and some are **custom** (an OP_XOR that has no native Lua 5.1 equivalent). Hardened VMs add **runtime anti-tamper**: a hidden pcall(unpack, {}, 0, 2147483647) reliably segfaults Lua 5.1, and sensitive opcodes like OP_CONCAT crash the moment you try to print their operands - so they leak nothing under naive inspection. The same VM-in-VM shape guards browser fingerprinting scripts.
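The lifting step can be pictured with a toy interpreter. The following Python sketch is not the protector's VM: the Enum numbers, the `(enum, a, b, c)` tuple format, and the `LIFT` table are all invented for illustration. It shows the general shape of a dispatch loop and how a lifting table turns an opaque Enum trace into named opcodes.

```python
# Hypothetical renumbering for one sample: Enum id -> lifted opcode name.
LIFT = {12: "OP_MOVE", 45: "OP_ADD", 3: "OP_LOADK", 30: "OP_RETURN"}


def run(insts, consts, nregs=4):
    """Interpret (enum, a, b, c) tuples and record the lifted trace."""
    stk, vip, trace = [None] * nregs, 0, []
    while True:
        enum, a, b, c = insts[vip]
        trace.append(LIFT.get(enum, "UNKNOWN_%d" % enum))
        if enum == 3:      # OP_LOADK: Stk[A] = constant[B]
            stk[a] = consts[b]
        elif enum == 12:   # OP_MOVE: Stk[A] = Stk[B]
            stk[a] = stk[b]
        elif enum == 45:   # OP_ADD: Stk[A] = Stk[B] + Stk[C]
            stk[a] = stk[b] + stk[c]
        elif enum == 30:   # OP_RETURN: hand back Stk[A]
            return stk[a], trace
        else:
            raise ValueError("unmapped Enum %d at VIP %d" % (enum, vip))
        vip += 1


program = [(3, 0, 0, 0), (3, 1, 1, 0), (45, 2, 0, 1), (30, 2, 0, 0)]
result, trace = run(program, consts=[40, 2])
print(result, trace)
```

Feeding the VM a minimal, known input keeps the trace short, which is what makes filling in a table like `LIFT` tractable by hand.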
### Example
```lua
-- The real VM is a fetch-execute loop; Enum is the (renumbered) opcode id
repeat
if Enum < 49 then
-- ... nested if-tree dispatching each opcode ...
end
VIP = (VIP + 1)
until false
-- Lifting renumbered Enums back to standard Lua opcodes:
Stk[REG_B[VIP]] = nil -- OP_LOADNIL
Stk[REG_B[VIP]] = Stk[REG_C[VIP]] * Stk[REG_A[VIP]] -- OP_MUL
VIP = REG_B[VIP] -- OP_JMP
Stk[REG_B[VIP]] = h_funcs["XOR"](Stk[REG_C[VIP]], Stk[REG_A[VIP]]) -- custom OP_XOR
-- Anti-tamper: this is hidden inside the VM and segfaults Lua 5.1 on purpose
pcall(unpack, {}, 0, 2147483647) -- lower the last argument by 1 to survive
```
### FAQ
**Q: Why do hardened Lua obfuscators use two VMs instead of one?**
One VM (the deserialization VM) decodes and decrypts an embedded blob into the instruction stream and enforces anti-tamper checks; the second (the real VM) executes the recovered program. Splitting the work hides the real opcodes and constants behind a decryption layer, so an analyst who reaches the inner VM still faces encrypted inputs produced by the outer one.
**Q: How do you find the real opcode behind an obfuscated Enum value?**
You match each Enum branch in the dispatch loop against the known semantics of standard Lua opcodes. Many are recognisable from their stack operations - a multiplication is OP_MUL, a jump that sets VIP is OP_JMP. Controlling the VM input narrows which opcodes can appear, making both the standard and the custom ones easier to identify.
**Q: Why does printing a VM register sometimes crash the script?**
Because hardened virtualizers attach anti-tamper hooks to sensitive opcodes (like OP_CONCAT) that would otherwise leak unencrypted constants. Inspecting those operands trips a deliberate crash - for example a hidden pcall(unpack, {}, 0, 2147483647) that segfaults Lua 5.1. You have to neutralise the trap before you can observe the value.
---
## What Is a Deserialization VM?
URL: https://scrappey.com/qa/reverse-engineering/what-is-a-deserialization-vm
**A deserialization VM is the outer virtual machine in a stacked virtualizer that turns an encrypted data blob into the instruction stream the real VM executes, while enforcing anti-tamper checks along the way.** Before any real logic runs, a pre-VM stage normalizes and decompresses the blob (in Luraph, the long string starting with LPH}), parses constants and function prototypes, and emits a structured table of instructions, register operands, and constants. Reversing it is easier than it looks because its output is structurally identical to a known deserialize function - giving you the exact shape to aim for. This follows birk.blog's Lua Virtualization Part 4, which devirtualizes Luraph as the public case study.
### Quick facts
- **Job:** Decode/decrypt the blob (e.g. LPH}...) into the real VM's instruction stream
- **Pipeline:** Normalize + decompress -> parse constants -> parse prototypes -> clear globals -> entrypoint
- **Output table:** Insts, REG_A/B/C, constants (encrypted), decrypted_constants, function_prototypes, stk_size
- **Anti-tamper:** Hijacks string __tostring; print() crashes unless you neutralize it
- **Scale:** ~20,000 deserialization instructions run before ~5,000 real-VM instructions
### The pre-VM pipeline and its output
Execution splits into three stages: a **pre-VM** stage that deserializes the raw blob, the **real VM** that runs the program, and a **post-VM** stage for error handling and return values. The pre-VM begins by normalizing and decompressing the blob, then parses the constants, then the function prototypes (each prototype's instructions extracted by a "get next function instructions" routine), clears the globals used as scratch interface, and reads the entrypoint index. Its output is one table whose fields you can map by matching against the plaintext deserialize routine: Insts, the register tables REG_A/REG_B/REG_C (all indexed by the virtual IP), function_prototypes, stk_size, and two constant tables. The raw constants table is still encrypted for the real VM; the decrypted_constants table holds them after runtime decryption - a concrete anchor for locating the decryption routine.
### The __tostring anti-tamper trap
A subtle protection nearly blocks analysis: simply calling print crashes, regardless of arguments. The reason is that print internally calls tostring, and in Lua even strings have a metatable - so the VM hijacks __tostring (via debug.setmetatable("", ...) / getmetatable("").__tostring) to detect and punish inspection. Two workarounds combine: use io.write (which writes strings and numbers directly without calling tostring, much like C's fputs) plus a custom recursive tostring for tables, and - the real fix - patch the deserializer so that during constant parsing any constant equal to "__tostring" is overwritten with an empty string, at the earliest point constants are plaintext. That defuses the trap before the deserialization VM can arm it.
### One VM at a time: return as dispatcher
Logging every opcode shows ~20,000 deserialization instructions execute before the real VM, which itself runs ~5,000. A defining design choice appears in how closures are handled: what looks like OP_RETURN (return true, REG_C[VIP], 0) is actually a **function dispatcher**. Rather than nesting a child VM inside the parent (the IronBrew model, where both stay alive), the VM runs inside a pcall and *returns metadata* signalling "continue in this closure". The enclosing VM is stopped before the next is dispatched, so only one VM instance exists at any moment - significantly cutting memory use. A second, hidden path invokes closures through a metatable on the prototypes table. Understanding this stage is the groundwork for intercepting the real instruction stream rather than reversing the deserializer in full.
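The "return as dispatcher" idea is a trampoline: each VM run returns a request naming the next closure instead of calling into it. The Python sketch below only models that control pattern, under the assumption that each prototype names at most one follow-up closure; `run_closure`, `trampoline`, and the `protos` table are hypothetical names, and the `live` list exists purely to show that no more than one VM instance is alive at a time.

```python
def run_closure(proto, protos, live):
    """Run one closure's VM; instead of recursing, return a dispatch request."""
    live.append(proto)           # this VM instance is now alive
    peak = len(live)
    target = protos[proto]       # hypothetical: name of the next closure, or None
    live.pop()                   # the VM stops BEFORE the next one starts
    if target is None:
        return ("done", proto), peak
    return ("continue", target), peak


def trampoline(entry, protos):
    """Driver loop: dispatch closures one at a time until one reports done."""
    live, max_live, current, visited = [], 0, entry, []
    while True:
        (status, target), peak = run_closure(current, protos, live)
        visited.append(current)
        max_live = max(max_live, peak)
        if status == "done":
            return visited, max_live
        current = target


protos = {"main": "helper", "helper": "leaf", "leaf": None}
print(trampoline("main", protos))
```

In a nested design the equivalent of `live` would grow with call depth; here it never exceeds one entry, which is the memory saving the text describes.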
### Example
```lua
-- The deserialized output table (offsets are per-sample "magic" numbers)
local Insts = EXECUTION_DATA[3] -- instruction stream
local REG_A = EXECUTION_DATA[5] -- A operands, indexed by VIP
local REG_B = EXECUTION_DATA[4]
local REG_C = EXECUTION_DATA[10]
local constants = EXECUTION_DATA[9] -- still ENCRYPTED for the real VM
local decrypted_constants = EXECUTION_DATA[8] -- plaintext after runtime decrypt
local function_prototypes = EXECUTION_DATA[7]
-- Anti-tamper: print() crashes because string's __tostring is hijacked.
-- Earliest fix - blank the constant during deserialization, before it's used:
if data == "__tostring" then data = "" end
```
### FAQ
**Q: What does the deserialization VM actually do?**
It is the outer stage that turns the encrypted data blob shipped with the script into the instruction stream the real VM runs. It normalizes and decompresses the blob, parses constants and function prototypes, and emits a structured table of instructions, register operands, and constants - some still encrypted for the inner VM to decrypt at runtime.
**Q: Why does calling print crash a virtualized Lua script?**
Because print calls tostring, and the obfuscator hijacks the string type's __tostring metamethod as an anti-tamper trap. Use io.write plus a custom tostring instead, and patch the deserializer so any constant equal to "__tostring" is blanked out at parse time, before the trap can be set.
**Q: Does the obfuscator run both VMs at the same time?**
Not in this design. What looks like a return opcode is a dispatcher that returns metadata telling the runtime to continue in the next closure, and the enclosing VM is stopped before the next one starts. So only a single VM instance is alive at any moment, which reduces memory use compared with the nested-VM (IronBrew) approach.
---
## What Is Polymorphic (Self-Modifying) Bytecode?
URL: https://scrappey.com/qa/reverse-engineering/what-is-polymorphic-bytecode
**Polymorphic bytecode is virtual-machine code that rewrites its own instructions at runtime before executing them, so the statically dumped instruction stream is intentionally misleading.** A naive dump shows NOPs where real operations belong, familiar opcodes sitting next to mysterious LOAD variants, and control flow that jumps around incoherently - because the program literally assembles itself during execution. The key realisation when devirtualizing a stacked virtualizer is that you do not need to reverse the deserialization VM at all: both VMs share one handler, so you can treat the outer VM as a black box and **intercept** the fully decoded IR right before the real VM consumes it. This follows birk.blog's Lua Virtualization Part 5, the payoff of the Luraph series.
### Quick facts
- **Definition:** Bytecode that patches its own instructions at runtime before they execute
- **Why dumps lie:** Static stream is full of NOPs and decoy jumps; real ops are written in at runtime
- **Shortcut:** Intercept deserialized_execution_data before the real VM - skip reversing the outer VM
- **Recovery:** Replay only the patch/mutation opcodes to resolve the true instruction stream
- **Automation:** Semi-automated: find offsets + interception manually, then script dump/lift/replay
### Intercept instead of reverse
Reversing a deserialization VM instruction-by-instruction is tedious and, it turns out, unnecessary. Because the same handler (h_funcs["VM"]) runs both the deserialization VM and the real VM, the value deserialized_execution_data - the fully unpacked IR, already through every decryption and transformation - exists in plain form for an instant between them. You insert a hook at that exact point and dump the entire IR to disk (e.g. JSON) without understanding how the outer VM produced it. The per-instruction fields (op, A, B, C, dec_const, func_proto, const) come straight out, and even a glance is revealing: strings like "print" and "Hello World!" sit in plain view, so you can already guess the original program. The "magic" offsets into the IR table differ per sample and must be resolved by hand.
### Why the dump is a decoy
Mapping each numeric op to its lifted opcode produces something that looks scrambled: NOPs where real instructions should be, OP_GETGLOBAL next to unknown LOAD_* variants, jumps bouncing around. That is deliberate - the stream is **not meant to be read statically**. Stepping through execution shows the trick: the VM loads its own instruction table onto the stack and *patches entries*. Instruction 16 begins life as a NOP; an earlier instruction writes 45 into Insts[16] (turning it into LOAD_DECRYPTED_STRING) and another writes 2 into its C operand - so by the time execution reaches instruction 16 it has become a string load. Only after these runtime mutations does the true control flow emerge, and the program resolves cleanly to print("Hello World!").
### Recovering it, and the limits of automation
The correct way to defeat self-modification is to **replay only the mutation opcodes**: implement handlers for the patch instructions and simulate execution of the dumped IR so you apply the rewrites without running any untrusted program logic, recovering the fully resolved stream. Full automation is possible in theory but non-trivial: a pipeline must identify deserialized_execution_data, find the interception point, resolve the per-sample magic offsets, map opcode ids, and resolve the self-modifying behaviour. Opcode recovery can be partly automated by pattern-matching against compiled luac output, but locating the interception point and offsets resists automation due to per-sample variability - so a **semi-automated** workflow (manual offsets, scripted dump/lift/replay) is the practical balance. Later versions move the constant pool out of the intercepted IR, so the instruction stream still recovers but constants need a separate solve. The same defensive lesson applies to browser anti-bot VMs: rather than chase a polymorphic client script, a managed API like Scrappey lets the script run as intended server-side and returns the result.
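A mutation replay can be sketched in a few lines. This is a simplified model, assuming the dumped IR is a dictionary of instruction records and that patch instructions can be applied in index order; a real replay has to follow the VM's actual control flow between patches. The opcode ids `PATCH_OP` and `PATCH_C` are invented for the example, while 45 for LOAD_DECRYPTED_STRING comes from the walkthrough above.

```python
# Hypothetical opcode ids for this sketch; real ids differ per sample.
PATCH_OP, PATCH_C, NOP, LOAD_DECRYPTED_STRING = 100, 101, 0, 45


def replay_patches(ir):
    """Apply only the self-modifying writes; never execute program logic."""
    resolved = {idx: dict(inst) for idx, inst in ir.items()}  # copy, keep the dump intact
    for idx in sorted(ir):
        inst = ir[idx]
        if inst["op"] == PATCH_OP:      # Insts[A] = B
            resolved[inst["A"]]["op"] = inst["B"]
        elif inst["op"] == PATCH_C:     # REG_C[A] = B
            resolved[inst["A"]]["C"] = inst["B"]
        # every other opcode is ignored on purpose
    return resolved


dump = {
    8: {"op": PATCH_C, "A": 16, "B": 2, "C": 0},
    10: {"op": PATCH_OP, "A": 16, "B": LOAD_DECRYPTED_STRING, "C": 0},
    16: {"op": NOP, "A": 0, "B": 0, "C": 148},
}
fixed = replay_patches(dump)
print(fixed[16])
```

Because only the two patch handlers are implemented, nothing from the protected program itself runs, which is what keeps this approach safe for untrusted samples.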
### Example
```text
# Dumped IR (decoy): NOPs and jumps where real ops should be
[15] OP_GETGLOBAL 1 0 0 ; func_proto = "print"
[16] NOP 0 0 148 ; dec_const = "Hello World!"
[17] OP_CALL 0 1 0
# Stepping execution shows instruction 16 is PATCHED by earlier instructions before it runs:
[ 9] Stk[3] = Insts -- load the opcode table onto the stack
[10] Insts[16] = 45 -- rewrite op 16 -> LOAD_DECRYPTED_STRING
[ 8] REG_C[16] = 2 -- patch op 16's C operand
[16] Stk[2] = "Hello World!" -- now a real string load
[17] Stk[1](Stk[2]) -- print("Hello World!")
# Recover by REPLAYING only the patch opcodes - never the program logic.
```
### FAQ
**Q: What is polymorphic or self-modifying bytecode?**
It is VM bytecode that rewrites its own instructions at runtime before executing them. The instruction stream you dump statically is a decoy full of NOPs and misleading jumps; the real operations are written in by earlier "patch" instructions during execution, so only the running program reveals the true logic.
**Q: How do you devirtualize self-modifying VM bytecode without running it?**
You intercept the fully decoded IR at the point the outer VM hands it to the real VM, then replay only the mutation opcodes - the instructions that patch other instructions - in a simulator. That applies the runtime rewrites and resolves the true instruction stream without executing any of the untrusted program logic.
**Q: Can devirtualizing a virtualizer be fully automated?**
Only partially. Opcode recovery can be automated by pattern-matching against compiled luac output, but identifying the IR, the interception point, and the per-sample magic offsets resists automation because they vary between samples. A semi-automated workflow - manual offsets, then scripted dumping, opcode mapping, and mutation replay - is the practical approach.
---
## What Is Dynamic IAT Resolution (Import Hashing)?
URL: https://scrappey.com/qa/reverse-engineering/what-is-dynamic-iat-resolution
**Dynamic IAT resolution (import hashing) is an anti-analysis technique where a binary hides which OS APIs it uses by resolving them at runtime from numeric hashes instead of listing them in a normal Import Address Table.** Combined with encrypted strings, it leaves a binary whose imports view is nearly empty and whose strings view is meaningless - a deliberate roadblock for static reverse-engineering. Recovering the real behaviour means decrypting the strings, then walking each DLL's Export Address Table (EAT) to match the binary's hashes back to real function names. This is a standard defensive malware-analysis skill; the public case study here is the REvil ransomware sample reversed in birk.blog's "Reversing REvil" Part 1, analyzed purely for education.
### Quick facts
- **Symptom:** Near-empty imports view + calls into "data" addresses holding invalid-looking constants
- **Strings:** No valid strings - typically RC4-encrypted arrays decrypted on demand
- **Import hashing:** Each API stored as a hash (e.g. seed 0x2B, h = h*0x10F + byte) resolved at runtime
- **Resolution:** Hash -> DLL DOS header -> Export Address Table walk -> function address
- **Tooling:** Binary Ninja MLIL + scripted annotation; rebuild the IAT as a struct
### Two roadblocks: encrypted strings and a missing IAT
Loading an obfuscated binary, two tells appear immediately. The **imports view is sparse** and functions "call into data" - e.g. a call to an address holding a constant like 0x42a2897c rather than a named import - which signals a dynamic IAT. And the **strings view has nothing meaningful**, implying encrypted strings. The strings come first: tracing from the entry point to a decryptor reveals a wrapper around an **RC4** routine that decrypts *arrays* of data given a base pointer, a key index, a key length, and a data length. Because every call references the same encrypted blob with different arguments, you can script the recovery: enumerate cross-references to the decrypt function, read each call's constant arguments, RC4-decrypt, and annotate the call site with the plaintext.
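The string-recovery step can be sketched as follows. RC4 itself is standard; everything about the blob layout is an assumption made for the example (key bytes at `key_index`, ciphertext immediately after the key), and `decrypt_entry` is a hypothetical helper rather than the sample's actual routine. In a disassembler script the four arguments would be read from each call site.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):                       # key-scheduling algorithm
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:                          # keystream generation
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def decrypt_entry(blob: bytes, key_index: int, key_len: int, data_len: int) -> bytes:
    """Assumed layout: key bytes at key_index, ciphertext immediately after the key."""
    key = blob[key_index:key_index + key_len]
    start = key_index + key_len
    return rc4(key, blob[start:start + data_len])


# Build a tiny stand-in blob so the sketch runs without a real sample.
blob = b"\x00\x00" + b"Key" + rc4(b"Key", b"kernel32.dll")
print(decrypt_entry(blob, key_index=2, key_len=3, data_len=12))
```

Looping this over every cross-reference to the decrypt function, and writing each plaintext back as a comment, is the scripted annotation pass described above.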
### Resolving the import hashes
With strings readable, DLL names (kernel32.dll, winhttp.dll, crypt32.dll...) surface, used by the import resolver. Each import is stored as a **hash**, not a name. The resolver formats the hash, derives a DLL hash to locate that module's DOS header, then walks its export directory - the **Export Address Table** and its parallel list of exported function names - hashing each export name with the same function until it matches, and returns that function's address. To rebuild the table, you reimplement the two small hash functions (here a seed of 0x2B and multiplier 0x10F over the name bytes, plus a formatting step), brute-force every entry in the binary's hash array against real DLL exports, and emit a struct of resolved names you apply over the IAT region. The empty void* slots become CreateFileW, WinHttpConnect, CryptAcquireContextW, GetKeyboardLayoutList, and so on.
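The brute-force matching can be sketched in Python. The hash parameters (seed 0x2B, multiplier 0x10F) and the 0x1FFFFF mask are the ones quoted in this entry; the sample's extra formatting step is deliberately left out, and the `exports` and `targets` lists are stand-ins for data that would really come from parsing DLLs and the binary's hash array.

```python
def import_hash(name: str) -> int:
    """Seed 0x2B, multiply by 0x10F, add each byte, wrapping at 32 bits."""
    h = 0x2B
    for byte in name.encode("ascii"):
        h = (h * 0x10F + byte) & 0xFFFFFFFF
    return h


def resolve(targets, exports, mask=0x1FFFFF):
    """Map each stored hash to the export name whose masked hash matches."""
    table = {import_hash(name) & mask: name for name in exports}
    return {t: table.get(t) for t in targets}


# Hypothetical inputs: in practice `exports` comes from parsing real DLLs and
# `targets` from the binary's hash array.
exports = ["CreateFileW", "WinHttpConnect", "CryptAcquireContextW", "GetKeyboardLayoutList"]
targets = [import_hash("WinHttpConnect") & 0x1FFFFF]
for value, name in resolve(targets, exports).items():
    print(hex(value), name)
```

Hashes with no match come back as `None`, which usually means the owning DLL has not been added to the export list yet. Precomputing the table once per DLL keeps the lookup cheap even for large hash arrays.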
### Why this matters beyond malware
Once the imports are named, the binary's intent becomes legible from its API surface alone - file enumeration, crypto, WinHTTP networking, service control, keyboard-layout checks - which is the whole point of recovering it. Import hashing and string encryption are not malware-only tricks; the same anti-analysis ideas appear in commercial packers, DRM, and aggressive software protection, so the technique generalises to any heavily-obfuscated binary. It is the native-code cousin of the bytecode virtualization and deobfuscation work that hides logic in scripting languages: in both cases the analyst restores a readable view by recovering the layer the protector removed - here the import table, there the instruction set. Analysing samples like this defensively is how detection and mitigation get built.
### Example
```cpp
#include <cstdint>
// The binary stores APIs as hashes, not names. Reproduce the hash...
uint32_t import_hash(const char* name) {
uint32_t h = 0x2B;
while (*name) { h = h * 0x10F + (uint8_t)*name; ++name; }
return h;
}
// ...then walk each DLL's Export Address Table and match by hash:
// 1. LoadLibraryExA(dll, NULL, DONT_RESOLVE_DLL_REFERENCES)
// 2. parse IMAGE_EXPORT_DIRECTORY -> AddressOfNames
// 3. for each export name: if ((import_hash(name) & 0x1FFFFF) == target) -> resolved
// Emit a struct you apply over the IAT region (0x41fc88) in the disassembler:
struct mw_IAT { void* CreateFileW; void* WinHttpConnect; void* GetKeyboardLayoutList; /* ... */ };
// Strings are RC4 arrays: decrypt(base, key_index, key_len, data_len) and annotate the call.
```
### FAQ
**Q: What is a dynamic IAT, and how do you spot one?**
A dynamic Import Address Table is built at runtime instead of being listed in the binary's headers, so the imports view looks nearly empty. Tell-tale signs are calls that target data addresses holding invalid-looking constants (the stored hashes) and a loop that resolves those constants into function pointers before use. The constants are API hashes the binary maps to addresses at runtime.
**Q: How do you recover function names from import hashes?**
You reimplement the binary's hashing function, then for each DLL it uses, walk the Export Address Table and hash every exported name until one matches the stored value - that export is the resolved import. Brute-forcing the whole hash array against real DLL exports lets you rebuild the import table as a struct and apply it over the binary in your disassembler.
**Q: Is import hashing only used by malware?**
No. While it is common in malware as an anti-analysis measure, the same import-hashing and string-encryption techniques appear in commercial packers, DRM, and software protection. The recovery workflow - decrypt strings, then resolve imports by walking export tables - generalises to any heavily obfuscated native binary, which is why it is a core defensive reverse-engineering skill.
---
# Python Web Scraping
Python is the most common language for web scraping. These guides cover the libraries, frameworks, and trade-offs you'll weigh when building scrapers in Python.
## What is the best framework for web scraping with Python?
URL: https://scrappey.com/qa/python-web-scraping/best-framework-python
If you want to pull data off websites with Python, the first decision is which tool to build on. The right choice depends on what you are scraping. This guide walks through the main web scraping options for Python and when each one fits.
### Quick facts
- **Best all-round:** Scrapy — async crawling at scale
- **Best for beginners:** requests + BeautifulSoup
- **Best for JS sites:** Playwright or Selenium
- **Best for hard targets:** A managed scraping API
- **Key trade-off:** Control & speed vs. setup effort
### Popular Frameworks Compared
1. Scrapy: The Enterprise Solution
Scrapy is a full framework built for large, ongoing scraping jobs. It does a lot for you out of the box:
- Asynchronous processing (fetches many pages at once instead of waiting for one to finish) for high-speed crawling
- Built-in support for following links and crawling entire sites
- Robust data processing pipeline
- Export data in multiple formats (JSON, CSV, XML)
- Middleware support for custom functionality
- Proxy and user-agent handling through downloader middleware (rotation needs a plugin or custom middleware)
- Automatic retry mechanisms
- Extensive configuration options
2. Beautiful Soup: The Beginner's Choice
Beautiful Soup is a simple library that reads HTML and lets you pick out the bits you want. It is the easiest place to start:
- Intuitive API for parsing HTML and XML
- Excellent documentation with many examples
- Works well with requests library
- Perfect for small to medium projects
- Gentle learning curve for beginners
- Multiple parser support (lxml, html5lib)
- CSS selectors via select() (for XPath, use lxml directly)
- Forgiving HTML parsing
3. Selenium: The Dynamic Content Master
Some sites build their content with JavaScript after the page loads, so the raw HTML is nearly empty. Selenium drives a real browser, so it sees the finished page just like a person would:
- Full browser automation capabilities
- Handles dynamic content loading
- Supports user interaction simulation
- Works with modern web applications
- Integrates with various browser drivers
- Screenshot capture functionality
- JavaScript execution support
- Wait conditions and timeouts
4. Playwright: The Modern Alternative
Playwright also drives a real browser, but it is newer and faster. It is gaining popularity:
- Modern browser automation
- Often faster than Selenium in practice
- Multiple browser support
- Network interception
- Mobile device emulation
- Automatic wait functionality
### Making Your Choice
To pick a framework, weigh these factors:
- **Project Scale**
  - Small projects: Beautiful Soup
  - Large projects: Scrapy
  - Dynamic sites: Selenium/Playwright
  - API scraping: Requests
- **Performance Requirements**
  - High-speed needs: Scrapy
  - Basic scraping: Beautiful Soup
  - JavaScript rendering: Selenium/Playwright
  - Memory efficiency: Scrapy
- **Learning Curve**
  - Beginners: Start with Beautiful Soup
  - Intermediate: Move to Selenium
  - Advanced: Master Scrapy
  - Modern needs: Consider Playwright
- **Project Requirements**
  - Data volume
  - Update frequency
  - JavaScript handling
  - Authentication needs
  - Advanced request handling requirements
### Best Practices
- **Framework Selection**
  - Start with simpler tools and graduate to more complex frameworks
  - Consider combining frameworks for different tasks
  - Always respect websites' robots.txt and scraping policies
  - Implement proper error handling and rate limiting
- **Performance Optimization**
  - Use async where possible
  - Implement proper caching
  - Handle rate limiting
  - Manage memory usage
- **Error Handling**
  - Implement retry mechanisms
  - Log errors properly
  - Handle timeouts
  - Validate data
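The error-handling points above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production client: `fetch_with_retries`, `validate_item`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP call you use (for example a small wrapper around requests.get with a timeout).

```python
import logging
import time

logger = logging.getLogger("scraper")


def fetch_with_retries(fetch, url, retries=3, delay=0.0):
    """Call fetch(url), retrying on any exception up to `retries` times."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in real code, catch specific exceptions
            last_error = exc
            logger.warning("attempt %d/%d failed for %s: %s",
                           attempt, retries, url, exc)
            time.sleep(delay)  # fixed pause between attempts
    raise RuntimeError(f"giving up on {url}") from last_error


def validate_item(item, required=("title", "price")):
    """Return True only if every required field is present and non-empty."""
    return all(item.get(field) not in (None, "") for field in required)
```

Passing the fetch function in as an argument keeps the retry logic independent of any one HTTP library, which also makes it easy to test with a fake.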
### Code Examples
Beautiful Soup Example
```python
import requests
from bs4 import BeautifulSoup

# Basic scraping setup
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all links
for link in soup.find_all('a'):
    print(link.get('href'))

# Using CSS selectors
content = soup.select('div.content p')
```
Scrapy Example
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Yield one dict per matching element on the page
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'price': item.css('span.price::text').get(),
                'url': item.css('a::attr(href)').get(),
            }
```
Selenium Example
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://example.com')

    # Wait up to 10 seconds for the button to be clickable, then click it
    element = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, 'myButton'))
    )
    element.click()
finally:
    driver.quit()  # always close the browser
```
There is no single best framework, only the best fit for your job. A good path is to learn with Beautiful Soup, then move up to Scrapy for big crawls or Selenium for interactive sites as your needs grow. For modern web applications, Playwright might be the best choice due to its robust features and better performance.
### FAQ
**Q: Is Scrapy overkill for a small scraper?**
For a handful of pages, yes. The requests library plus BeautifulSoup is quicker to write and easier to follow. Reach for Scrapy once you need concurrency (fetching many pages at the same time), automatic retries, data pipelines, and crawling across many pages.
**Q: Do I need a browser framework like Playwright?**
Only when the data is built by JavaScript in the browser, or appears after a click or scroll. If the HTML you need is already in the first response from the server, a plain HTTP client is far faster and lighter.
**Q: When should I use a scraping API instead of a framework?**
When your targets sit behind anti-bot WAFs (web application firewalls that block automated traffic), such as Cloudflare, DataDome, or Kasada. A managed API handles the hard parts for you - TLS fingerprints (the signature of your encrypted connection), proxies, and challenge-solving - so you do not have to build and maintain that layer yourself.
---
## How long does it take to learn web scraping in Python?
URL: https://scrappey.com/qa/python-web-scraping/time-to-learn-scraping
Most people can write a basic web scraping script in Python within a few weeks, but reaching a professional level takes several months. The timeline depends on your background and how often you practise. Here is what to expect at each stage of the journey.
### Quick facts
- **Basics:** 2–4 weeks (requests, BeautifulSoup)
- **Intermediate:** 1–2 months (Scrapy, dynamic sites)
- **Advanced:** 3–6 months (anti-bot, scale)
- **Prerequisite:** Basic Python + HTML/CSS
- **Fastest path:** Build real projects, not tutorials
### Basic Level (2-4 weeks)
In your first month you learn the core ideas and write simple scripts. The goal is to pull data off a plain, static web page.
HTML/CSS Fundamentals
You need to read a page's structure so you can point your code at the right piece of data:
- Understanding basic HTML structure
- Learning common CSS selectors (the patterns, like .price, that target elements)
- Identifying page elements and their relationships
- Working with developer tools in browsers
- Understanding DOM hierarchy (the tree of elements that makes up a page)
- Mastering XPath basics (another way to address elements by their path in the tree)
- Learning about HTML forms and inputs
- Understanding web page layouts
Python Basics for Scraping
Then you learn the Python tools that fetch pages and tidy up the results:
- Setting up your Python environment
- Working with requests library
- Understanding HTTP methods (GET to fetch, POST to send)
- Basic error handling
- String manipulation
- Regular expressions
- JSON and CSV processing
- File handling operations
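As a small illustration of the JSON and CSV processing item above, the sketch below turns a list of scraped records into CSV text and JSON Lines using only the standard library. The function names and the sample records are made up for the example.

```python
import csv
import io
import json


def to_csv(items, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


def to_json_lines(items):
    """Serialise a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in items)


# Sample records standing in for scraped data
items = [
    {"title": "Blue mug", "price": 4.5},
    {"title": "Red mug, large", "price": 6.0},
]
csv_text = to_csv(items, fieldnames=["title", "price"])
jsonl_text = to_json_lines(items)
```

The csv module quotes the second title because it contains a comma, which is the kind of detail that hand-rolled string joining tends to get wrong.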
First Scraping Projects
A first scraper is short: fetch a page, then pick out the parts you want with BeautifulSoup (a library that turns HTML into searchable objects).
```python
# Your first scraper
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract titles
titles = soup.find_all('h1')
for title in titles:
    print(title.text)

# Extract specific data
data = {
    'titles': [title.text for title in soup.find_all('h1')],
    'links': [a['href'] for a in soup.find_all('a', href=True)],
    'paragraphs': [p.text for p in soup.find_all('p')],
}
```
### Intermediate Level (1-2 months)
Next you meet pages that fight back a little: content that loads after the page does, and sites that need you to log in.
Advanced Techniques
- Working with APIs and JSON data
- Handling dynamic content loading (data that appears via JavaScript after load)
- Managing sessions and cookies (the tokens that keep you logged in across requests)
- Implementing pagination handling (following page 1, 2, 3 ...)
- Authentication and login handling
- Form submission automation
- File download management
- Data validation and cleaning
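Pagination handling from the list above usually comes down to one loop. The sketch below assumes a hypothetical `fetch_page(url)` callable that returns the items on a page plus the URL of the next page (or None on the last one); how you find that next URL depends on the site.

```python
def crawl_pages(fetch_page, start_url, max_pages=50):
    """Follow 'next' links until there are none left or max_pages is hit.

    fetch_page(url) must return (items, next_url), with next_url set to
    None on the last page.
    """
    url, seen, results = start_url, set(), []
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guard against pagination loops
        items, url = fetch_page(url)
        results.extend(items)
    return results
```

The `seen` set and the `max_pages` cap are there because real sites sometimes link page N back to page 1, or paginate without end.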
Browser Automation
When data only appears after JavaScript runs, you drive a real browser with Selenium. It clicks, types, and waits for elements just like a person would.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Setup browser automation
driver = webdriver.Chrome()
driver.get('https://example.com')

# Wait for dynamic content
wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
)

# Handle login forms
username = driver.find_element(By.ID, 'username')
password = driver.find_element(By.ID, 'password')
username.send_keys('user')
password.send_keys('pass')
driver.find_element(By.ID, 'login-button').click()

driver.quit()  # close the browser when done
```
### Advanced Level (3-6 months)
At this stage you build scrapers that run at scale and keep running reliably in production.
Enterprise Solutions
- Building scalable scrapers with Scrapy
- Implementing proxy rotation (spreading requests across many IP addresses)
- Handling anti-bot measures
- Database integration
- Distributed scraping systems (work split across many machines)
- Cloud deployment strategies
- Monitoring and alerting
- Performance optimization
Best Practices
A production Scrapy spider crawls links by rules, throttles itself to be polite, and wraps parsing in error handling so one bad page doesn't crash the run.
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AdvancedSpider(CrawlSpider):
    name = 'advanced_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    custom_settings = {
        'ROBOTSTXT_OBEY': True,      # skip URLs that robots.txt disallows
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 1.5,       # seconds between requests to a site
        'COOKIES_ENABLED': True,
    }

    # Follow every link that looks like a product page
    rules = (
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        try:
            yield {
                'title': response.css('h1::text').get(),
                'price': response.css('.price::text').get(),
                'description': response.css('.description::text').get(),
                'url': response.url,
            }
        except Exception as e:
            self.logger.error(f'Error parsing {response.url}: {e}')
```
### Factors Affecting Learning Time
The ranges above are averages. Three things push your own timeline faster or slower.
1. Prior Experience
The more of these you already have, the quicker scraping clicks:
- Programming background
- Web development knowledge
- Understanding of HTTP protocols
- Familiarity with HTML/CSS
- Database experience
- Network understanding
- Problem-solving skills
- Debugging experience
2. Learning Resources
Good material and people to ask shorten the road:
- Quality of tutorials
- Access to mentorship
- Practice projects
- Community support
- Documentation quality
- Code examples
- Video tutorials
- Interactive exercises
3. Time Investment
Consistent hands-on practice matters more than anything else:
- Daily practice hours
- Project complexity
- Learning consistency
- Hands-on experience
- Code review opportunities
- Real-world applications
- Debugging time
- Research dedication
### Tips for Success
- **Start Simple**
  - Begin with static websites
  - Master one tool before moving to the next
  - Build small, complete projects
  - Focus on fundamentals
- **Practice Regularly**
  - Code daily, even if briefly
  - Experiment with different websites
  - Document your learning
  - Join coding challenges
- **Join Communities**
  - Participate in forums
  - Share your projects
  - Learn from others' experiences
  - Contribute to open source
- **Build Portfolio Projects**
  - Create practical scrapers
  - Solve real-world problems
  - Document your solutions
  - Share your code
### Common Challenges and Solutions
A few problems trip up almost everyone. Here is what causes each one and how to deal with it.
1. Dynamic Content
When data loads via JavaScript, plain requests sees an empty page. Drive a real browser instead:
- Learn JavaScript basics
- Master Selenium/Playwright
- Understand AJAX requests (background calls that fetch data after load)
- Practice timing management
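Before reaching for a browser, it is worth checking whether the value you can see on screen is already in the raw HTML the server returns. The sketch below assumes you have that HTML as a string (for example from requests.get(url).text); `needs_browser` and the two sample pages are hypothetical.

```python
def needs_browser(raw_html, expected_text):
    """Return True when expected_text is missing from the server's raw HTML."""
    return expected_text.lower() not in raw_html.lower()


# A server-rendered page: the price is already in the HTML
STATIC_PAGE = '<html><body><span class="price">$19.99</span></body></html>'

# A JavaScript app shell: the price only appears after the script runs
SPA_SHELL = (
    '<html><body><div id="app"></div>'
    '<script src="/bundle.js"></script></body></html>'
)
```

If the check returns True, the data is probably rendered client-side, and the next things to try are the site's background API calls or a browser tool such as Selenium or Playwright.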
2. Anti-Scraping Measures
Sites detect bots and block them. Look more like a normal visitor:
- Implement delays
- Rotate user agents (the string that names your browser)
- Use proxy servers
- Handle CAPTCHAs
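Two of the items above, delays and user-agent rotation, can be sketched with the standard library alone. The user-agent strings below are placeholders rather than real browser values, and `next_headers` and `polite_delay` are hypothetical helper names.

```python
import itertools
import random

# Placeholder strings only; use current, real browser user agents in practice
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]
_agent_cycle = itertools.cycle(USER_AGENTS)


def next_headers():
    """Return request headers with the next user agent in the rotation."""
    return {"User-Agent": next(_agent_cycle)}


def polite_delay(base=1.0, jitter=0.5, rng=random):
    """Return a delay in seconds: base plus up to `jitter` of random extra."""
    return base + rng.uniform(0, jitter)
```

A typical loop would call time.sleep(polite_delay()) before each request and pass next_headers() as the request headers. This only covers the most basic signals; it does nothing about IP reputation or browser fingerprints.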
3. Data Quality
Scraped data is messy. Check and clean it before you trust it:
- Validate extracted data
- Clean and normalize
- Handle missing values
- Implement error checking
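A cleaning step along these lines might look like the sketch below. It assumes US-style prices such as "$1,299.00" and records with `title` and `price` keys; both function names are illustrative.

```python
import re


def parse_price(text):
    """Turn '$1,299.00' into 1299.0; return None if no number is found."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def clean_record(raw):
    """Normalise whitespace, parse the price, and mark missing values as None."""
    title = " ".join((raw.get("title") or "").split()) or None
    return {"title": title, "price": parse_price(raw.get("price"))}
```

Returning None for anything missing or unparseable keeps bad values visible downstream instead of silently turning them into 0 or an empty string.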
4. Performance
Big jobs need to be fast and efficient:
- Optimize requests
- Use async programming (fetch many pages at once instead of one at a time)
- Implement caching
- Monitor resource usage
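Caching is the simplest of these to add. The sketch below is a small in-memory cache with a time-to-live, assuming a hypothetical `fetch(url)` callable that returns the page body; the class name and the 300-second default are arbitrary choices for illustration.

```python
import time


class PageCache:
    """In-memory cache that forgets entries older than ttl seconds."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so tests can control time
        self._store = {}

    def get_or_fetch(self, url, fetch):
        entry = self._store.get(url)
        now = self.clock()
        if entry and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit: no request made
        body = fetch(url)
        self._store[url] = (now, body)
        return body
```

During development this saves re-downloading the same page every time you tweak a selector, which is both faster and kinder to the target site.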
Remember that learning web scraping is not just about coding - it's about understanding web technologies, respecting website policies, and building efficient, maintainable solutions. Take your time to build a solid foundation, and the advanced concepts will become easier to grasp.
### FAQ
**Q: Do I need to know Python before learning scraping?**
You do not need expert-level Python. If you are comfortable with the basics - loops, functions, and dictionaries - you can start, and pick up libraries like requests and BeautifulSoup as you go.
**Q: What is the hardest part to learn?**
Understanding how anti-bot systems work and scraping dynamic JavaScript content. Pulling data from static HTML is quick to learn. Reliably working with protected sites you are permitted to access is the part that takes the longest to master.
**Q: How do I practise effectively?**
Scrape real sites you actually care about instead of following tutorials passively. Every new site throws different structure, pagination, and blocking at you, which is exactly the practice that builds real skill.
---
## Which is better for web scraping: Python or JavaScript?
URL: https://scrappey.com/qa/python-web-scraping/python-vs-javascript-scraping
Both Python and JavaScript can scrape websites well, so the "right" one depends on your project, not on which language is objectively better. Picking the language that fits your web scraping goals saves you a lot of friction later. Below we compare what each language is good at and when to reach for it.
### Quick facts
- **Python strength:** Mature libraries, data tooling
- **JavaScript strength:** Native DOM, same-language as the page
- **Static HTML:** Python (requests + BeautifulSoup)
- **Heavy JS / SPA:** Either (Playwright works in both)
- **Verdict:** Python for most; JS if you live in Node
### Python Advantages
Python is the most popular language for scraping, mainly because of its libraries and how easy it is to read.
1. Rich Ecosystem
There is a ready-made tool for almost any scraping job:
- Many libraries to choose from (Scrapy, Beautiful Soup, Selenium)
- Mature frameworks for large-scale scraping
- Strong data processing capabilities
- Excellent documentation and community support
- Robust error handling mechanisms
- Built-in concurrency support (running many requests at once)
- Extensive third-party packages
- Active development community
2. Ease of Use
The code reads almost like plain English, which makes it friendly for beginners:
- Clean, readable syntax
- Straightforward implementation
- Great for beginners
- Extensive tutorial resources
- Consistent coding patterns
- Strong type hints support
- Clear error messages
- Intuitive debugging
3. Data Processing
Once you have scraped data, Python makes it easy to clean, analyze, and store:
- Powerful data analysis libraries (Pandas, NumPy)
- Excellent for data cleaning
- Built-in JSON handling
- Easy database integration
- Statistical analysis tools
- Machine learning capabilities
- Data visualization options
- Export flexibility
### JavaScript Advantages
JavaScript is the language browsers run, so it has a home-field advantage when a page builds its content on the fly (after the initial HTML loads). The first example below runs inside the page itself; the second drives a browser from Node.
1. Browser Integration
JavaScript can read and react to the page directly. The code below grabs headings, watches for content the page adds later, and logs the page's background API calls (AJAX - requests the page makes without reloading):
```javascript
// Direct DOM manipulation
const titles = document.querySelectorAll('h1');
titles.forEach(title => console.log(title.textContent));

// Handle dynamic content: react to nodes the page adds later
// (processElement is a placeholder; replace it with your own handler)
const processElement = node => console.log('Added:', node.textContent);
const observer = new MutationObserver(mutations => {
  mutations.forEach(mutation => {
    if (mutation.type === 'childList') {
      // Process new content
      mutation.addedNodes.forEach(processElement);
    }
  });
});
// Without observe() the callback never fires
observer.observe(document.body, { childList: true, subtree: true });

// Monitor AJAX requests made with fetch
const originalFetch = window.fetch;
window.fetch = async (...args) => {
  const response = await originalFetch(...args);
  console.log('Request:', args[0], 'Response:', response);
  return response;
};
```
2. Modern Frameworks
Tools like Puppeteer drive a real browser from code: open a page, block images to save bandwidth, wait for content to appear, then pull out the data you want.
```javascript
// Puppeteer example
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Intercept network requests and skip images
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (request.resourceType() === 'image') {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com');

  // Wait for dynamic content
  await page.waitForSelector('.dynamic-content');

  // Extract data
  const data = await page.evaluate(() => {
    const items = document.querySelectorAll('.item');
    return Array.from(items).map(item => ({
      title: item.querySelector('.title').textContent,
      price: item.querySelector('.price').textContent,
      url: item.querySelector('a').href
    }));
  });
  console.log(data);

  await browser.close();
})();
```
### Choosing Between Python and JavaScript
Use this as a quick rule of thumb: pick the language that matches what your project leans on most.
Use Python When:
- **Data Analysis is Priority**
With Pandas you can scrape a table and analyze it in just a few lines:
```python
# Python example with Pandas
import pandas as pd

# read_html returns a list with one DataFrame per <table> on the page
tables = pd.read_html('https://example.com/table')
df = tables[0]
df.to_csv('output.csv', index=False)

# Data processing (assumes the table has category, price and rating columns)
processed_df = df.groupby('category').agg({
    'price': ['mean', 'min', 'max'],
    'rating': 'mean',
}).round(2)

# Statistical analysis
print(processed_df.describe())
```
- **Building Large-Scale Scrapers**
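Scrapy handles the heavy lifting for big crawls, such as running many requests in parallel. Proxy rotation (swapping IP addresses so a site is less likely to block you) is added through a middleware: the ROTATING_PROXY_LIST setting below belongs to the third-party scrapy-rotating-proxies package, not to Scrapy itself.
```python
# Scrapy spider with advanced features
import scrapy


class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    custom_settings = {
        'CONCURRENT_REQUESTS': 32,
        'DOWNLOAD_DELAY': 1,
        # Read by scrapy-rotating-proxies; its middleware must also be
        # enabled in DOWNLOADER_MIDDLEWARES (see that package's README)
        'ROTATING_PROXY_LIST': [
            'proxy1.example.com',
            'proxy2.example.com',
        ],
    }

    def start_requests(self):
        for url in self.get_start_urls():
            yield scrapy.Request(
                url,
                callback=self.parse,
                errback=self.handle_error,
            )

    def get_start_urls(self):
        # Placeholder: load the URLs from a file, database, or sitemap
        return ['https://example.com/products']

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

    def handle_error(self, failure):
        # Called when a request fails (DNS error, timeout, HTTP error)
        self.logger.error(f'Request failed: {failure.request.url}')
```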
Use JavaScript When:
- **Dealing with Modern Web Apps**
Single-page apps (sites that render most of their content in the browser, like many Vue or React sites) are JavaScript's home turf. Playwright waits for that content, then reads it:
```javascript
// Playwright example
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Skip images to save bandwidth
  await page.route('**/*.{png,jpg,jpeg}', route => route.abort());
  await page.goto('https://spa-example.com');

  // Wait for client-side rendering
  await page.waitForSelector('.vue-rendered-content');

  // Extract dynamic data
  const data = await page.evaluate(() => window.__INITIAL_STATE__);
  console.log(data);

  await browser.close();
})();
```
- **Browser Extension Development**
Browser extensions are written in JavaScript, so it is the natural choice when scraping happens inside the user's own browser:
```javascript
// Chrome extension content script
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'scrape') {
    // querySelectorAll returns a NodeList, which has no map(); convert it first
    const data = Array.from(document.querySelectorAll('.target-element'))
      .map(el => el.textContent);
    sendResponse({ data });
  }
});
```
### Best Practices
These tips apply no matter which language you pick. Think them through before you write much code.
1. Project Assessment
Match the tool to the job by sizing up the work first:
- Evaluate target website technology
- Consider data processing needs
- Assess team expertise
- Review scaling requirements
- Analyze maintenance needs
- Consider deployment options
- Evaluate integration requirements
- Plan for updates
2. Performance Optimization
Keep the scraper fast and polite so it does not waste resources or get blocked:
- Choose appropriate libraries
- Implement caching strategies
- Optimize resource usage
- Monitor execution time
- Handle rate limiting
- Manage memory efficiently
- Implement error recovery
- Use appropriate timeouts
3. Maintenance Considerations
Websites change often, so plan for keeping the scraper working over time:
- Code readability
- Documentation standards
- Error handling
- Testing strategies
- Version control
- Dependency management
- Monitoring tools
- Backup procedures
### Hybrid Approach
Use Python for
- Data processing
- Storage management
- Complex algorithms
- API development
- Statistical analysis
- Machine-learning tasks
- Batch processing
- ETL operations
Use JavaScript for
- Dynamic content handling
- Real-time monitoring
- Browser automation
- Frontend integration
- Event handling
- Interactive scraping
- Client-side validation
- UI manipulation
### Security Considerations
Whichever language you use, scrape responsibly: stay within a site's limits and handle any data you collect carefully.
1. Rate Limiting
Do not hammer a server. Slow down, and back off harder each time you are refused (exponential backoff):
- Implement delays between requests
- Use exponential backoff
- Monitor response codes
- Respect robots.txt
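The points above can be sketched with the standard library. This is an illustration under assumptions, not a complete rate limiter: the function names are hypothetical, the backoff parameters are arbitrary, and the robots.txt text is a made-up sample that you would normally download from the site.

```python
from urllib.robotparser import RobotFileParser

# Status codes that usually mean "slow down": too many requests, unavailable
RETRY_STATUSES = {429, 503}


def should_back_off(status_code):
    """Return True when the response code suggests the server is refusing us."""
    return status_code in RETRY_STATUSES


def backoff_delays(base=1.0, factor=2.0, retries=5, cap=60.0):
    """Delays to wait after each refusal: base, base*factor, ... up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample robots.txt content for the example
ROBOTS = """User-agent: *
Disallow: /private/
"""
```

With the defaults, the waits double from 1 second to 16 seconds over five refusals; the cap stops the delay from growing without limit on long outages.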
2. Authentication
If you log in to scrape, keep credentials and sessions safe:
- Handle cookies securely
- Manage sessions properly
- Encrypt sensitive data
- Use secure connections
3. Data Privacy
If you collect personal data, follow the rules for storing and keeping it:
- Follow GDPR guidelines
- Handle personal data carefully
- Implement data retention policies
- Secure storage solutions
Remember that both languages have their strengths, and the best choice depends on your specific requirements. Consider factors like team expertise, project scale, and target website characteristics when making your decision.
### FAQ
**Q: Is Python or JavaScript faster for scraping?**
For just downloading pages over HTTP, they perform about the same. Python pulls ahead when you need to parse and crunch the data, while Node (JavaScript outside the browser) wins if your project is already JavaScript and you want to stay in one language end to end.
**Q: Can both handle JavaScript-rendered pages?**
Yes. Pages that build their content in the browser (client-side rendering) are no problem for either. Playwright drives a real browser and has official bindings for both languages; Puppeteer is a Node library, with pyppeteer as an unofficial Python port. So this is not a deciding factor.
**Q: Which has the better ecosystem?**
Python has the deeper set of scraping and data-science tools (Scrapy, pandas, lxml). Node has strong browser-automation tooling and is the better fit for full-stack JavaScript teams.
---
## Which is better: Scrapy or BeautifulSoup? (2026 Comparison)
URL: https://scrappey.com/qa/python-web-scraping/scrapy-vs-beautifulsoup
A practical comparison of two popular Python web-scraping tools: Scrapy and BeautifulSoup. Short answer: they solve different problems, so "better" depends on your project. This 2026 guide shows when to pick each.
### Quick facts
- **Scrapy is:** A full crawling framework
- **BeautifulSoup is:** An HTML parsing library
- **Concurrency:** Scrapy: built-in async; BS4: none
- **Learning curve:** BS4 easy; Scrapy steeper
- **Use together?:** Yes — or BS4 + requests for small jobs
### Quick Decision Guide
Use this as a fast gut-check. BeautifulSoup is a small library for reading HTML; Scrapy is a full framework for crawling lots of pages.
Choose Beautiful Soup when
- Building your first web scraper
- You need to scrape < 1000 pages
- Working with simple, static websites
- You want to combine it with the requests library
- You need quick prototypes
- Learning web scraping basics
- You have limited programming experience
- Working on small data-extraction tasks
Choose Scrapy when
- Building production-grade scrapers
- You need to scrape > 1000 pages
- You require high-performance crawling
- You want built-in data-processing pipelines
- You need concurrent request handling
- Working with complex scraping logic
- You have solid Python experience
- You need robust error handling
### Feature Comparison
The same job side by side. With BeautifulSoup you fetch the page yourself (here using the requests library) and then search the HTML. Scrapy bundles fetching and parsing into a "spider" - a class that defines what to crawl and how to read each page.
Beautiful Soup
```python
# Simple Beautiful Soup Example
from bs4 import BeautifulSoup
import requests


def scrape_page(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    return {
        'title': soup.find('h1').text.strip(),
        'price': soup.find('span', class_='price').text,
        'description': soup.find('div', class_='description').text,
    }
```
Scrapy
```python
# Equivalent Scrapy Example
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'product_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            # default='' avoids calling strip() on None when there is no <h1>
            'title': response.css('h1::text').get(default='').strip(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
        }
```
### Key Differences
The biggest gap is scale. BeautifulSoup does no fetching itself, so a typical requests + BeautifulSoup script downloads and parses one page at a time; Scrapy fetches many pages at once (asynchronously - meaning it doesn't wait for one request to finish before starting the next) and passes each item on as soon as it is scraped.
| Aspect | Beautiful Soup | Scrapy |
| --- | --- | --- |
| Performance | Sequential requests; good for small datasets | Asynchronous requests; handles millions of pages efficiently |
| Features | HTML parsing, navigation, search | Full framework with middleware, pipelines, settings |
| Learning curve | A few hours to basic proficiency | Several days to grasp the core concepts |
| Memory usage | Loads the entire HTML into memory | Streams data; more memory efficient |
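The sequential versus asynchronous difference is easier to see in a toy model. The sketch below uses Python's asyncio as a stand-in (Scrapy's own engine is built on Twisted, not on this code) and a fake fetch function that sleeps instead of making a request; every name in it is hypothetical.

```python
import asyncio


async def fake_fetch(url, stats):
    """Stand-in for an HTTP request; tracks how many are in flight at once."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # pretend to wait on the network
    stats["in_flight"] -= 1
    return f"<html>{url}</html>"


async def fetch_sequential(urls, stats):
    # Each request finishes before the next one starts
    return [await fake_fetch(url, stats) for url in urls]


async def fetch_concurrent(urls, stats, limit=3):
    semaphore = asyncio.Semaphore(limit)  # cap simultaneous requests

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url, stats)

    return await asyncio.gather(*(bounded(url) for url in urls))


urls = [f"https://example.com/page/{n}" for n in range(6)]
seq_stats = {"in_flight": 0, "peak": 0}
con_stats = {"in_flight": 0, "peak": 0}
seq_pages = asyncio.run(fetch_sequential(urls, seq_stats))
con_pages = asyncio.run(fetch_concurrent(urls, con_stats))
```

Both versions return the same pages in the same order; the difference is that the sequential run never has more than one request in flight, while the concurrent run keeps up to `limit` going at once. That cap plays a role similar to Scrapy's CONCURRENT_REQUESTS setting.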
### Best Practices
A few habits that keep each tool fast and polite (request delays and retries avoid hammering a site).
Beautiful Soup
- Use the lxml parser for better performance
- Implement proper error handling
- Add request delays
- Use session objects for efficiency
Scrapy
- Configure concurrent requests wisely
- Use item pipelines for data processing
- Implement retry middleware
- Monitor memory usage
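To make "item pipelines" concrete, here is a sketch of a class shaped like a Scrapy pipeline: a plain class with a process_item(item, spider) method that returns the item or rejects it. It is written without importing Scrapy so it runs on its own; in a real project you would raise scrapy.exceptions.DropItem instead of ValueError and register the class in the ITEM_PIPELINES setting. The class name is made up.

```python
class DuplicateUrlPipeline:
    """Reject items whose URL has already been seen in this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            # In a Scrapy project: raise scrapy.exceptions.DropItem(...)
            raise ValueError(f"duplicate item: {url}")
        self.seen_urls.add(url)
        return item
```

Keeping checks like this in a pipeline rather than in the spider means the parsing code stays focused on extraction.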
### Real-World Scenarios
Where each tool tends to fit best in practice.
Use Beautiful Soup for
- Scraping product details from small shops
- Extracting articles from blogs
- Parsing RSS feeds
- Quick data-extraction tasks
Use Scrapy for
- E-commerce price monitoring
- News aggregation services
- Search-engine indexing
- Large-scale data mining
### Integration Tips
Both tools shine when paired with the right companion. BeautifulSoup teams up with the requests library for simple jobs; Scrapy uses middleware - plug-in code that runs between Scrapy and the website, handling things like proxies and retries.
Beautiful Soup + Requests
- Perfect for simple APIs
- Good for authenticated sessions
- Easy to maintain
- Quick to implement
Scrapy + Middleware
- Ideal for complex workflows
- Built-in proxy support
- Robust error handling
- Scalable architecture
Remember: the choice between Beautiful Soup and Scrapy isn't about which is better, but about which tool better suits your needs. Beautiful Soup excels at simplicity and quick implementation, while Scrapy shines in production environments with complex requirements.
### FAQ
**Q: Are Scrapy and BeautifulSoup competitors?**
Not exactly. BeautifulSoup only parses HTML (it reads pages you already downloaded); Scrapy handles requests, concurrency, retries, and pipelines too. You can even use BeautifulSoup inside a Scrapy spider.
**Q: Which is faster?**
Scrapy, for anything involving many pages — its asynchronous engine fetches several at once instead of one after another. For a single page the difference is negligible.
**Q: Which should a beginner start with?**
requests + BeautifulSoup, to learn the fundamentals of fetching a page and pulling data out of it. Move to Scrapy when you need to crawl at scale.
---
## How to extract data from websites using Selenium Python? (2026 Guide)
URL: https://scrappey.com/qa/python-web-scraping/selenium-python-tutorial
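This guide shows how to extract data from websites with Selenium in Python: setting up the driver, waiting for elements, handling dynamic content, and a worked e-commerce example.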
### Quick facts
- **What it is:** Browser automation via WebDriver
- **Best for:** JS-rendered pages & interactions
- **Locators:** CSS selectors, XPath
- **Key skill:** Explicit waits over sleep()
- **Lighter alt:** Playwright (modern API)
### Quick Setup Guide
Selenium drives a real browser from your Python code: it opens pages, clicks buttons, and reads what loads — just like a person would. That makes it good for sites that build their content with JavaScript, where a plain HTTP request would only return an empty shell. The class below is a reusable starting point. headless=True runs Chrome with no visible window (faster on servers), and WebDriverWait lets the script pause until elements actually appear instead of guessing. The __enter__/__exit__ methods let you use it with Python's with statement so the browser always closes, even on errors.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


class ModernSeleniumScraper:
    def __init__(self, headless=True):
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, timeout=10)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()  # runs even if the with-block raised
```
### Essential Features
1. Finding Elements Smartly
The biggest cause of flaky scrapers is asking for an element before the page has finished drawing it. Instead of pausing a fixed number of seconds, an *explicit wait* keeps checking until the element appears (or gives up after the timeout). The helper below wraps that pattern and returns None instead of crashing if the element never shows. You can locate elements by ID, by CSS selector, or by XPath — a path-like query into the page's HTML structure.
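```python
# Add this method to ModernSeleniumScraper
def find_element_safely(self, by, value, timeout=10):
    try:
        # Build a wait with the requested timeout rather than the default one
        return WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((by, value))
        )
    except TimeoutException:
        print(f'Element {value} not found within {timeout} seconds')
        return None

# Usage examples:
with ModernSeleniumScraper() as scraper:
    scraper.driver.get('https://example.com')
    button = scraper.find_element_safely(By.ID, 'submit-button')
    heading = scraper.find_element_safely(By.CSS_SELECTOR, 'h1.title')
    link = scraper.find_element_safely(By.XPATH, '//a[contains(text(), "Next")]')
```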
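To show what an explicit wait does conceptually, here is a library-free sketch of the same polling idea. It is not Selenium's implementation, only an illustration: `wait_until` is a hypothetical helper, and `condition` is any function that returns a truthy value once the thing you are waiting for is ready.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns something truthy or timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause between checks
```

Compared with a fixed sleep, this returns as soon as the condition holds, so fast pages are not slowed down and slow pages still get the full timeout.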
2. Handling Dynamic Content
Many sites load data after the first paint — content appears as you scroll, or a button only becomes usable once it is fully rendered. element_to_be_clickable waits until an element is both visible and enabled. For infinite-scroll pages, the loop below keeps scrolling to the bottom and stops once the page height stops growing, meaning no new content is loading.
```python
import time  # needed for the pause between scrolls

# Add these methods to ModernSeleniumScraper
def wait_for_dynamic_content(self, selector, timeout=10):
    try:
        # Wait for element to be clickable
        return WebDriverWait(self.driver, timeout).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, selector))
        )
    except TimeoutException:
        print(f'Dynamic content not loaded: {selector}')
        return None

# Handle infinite scroll
def scroll_to_bottom(self):
    last_height = self.driver.execute_script('return document.body.scrollHeight')
    while True:
        self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # Allow content to load
        new_height = self.driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # height stopped growing: nothing new was loaded
        last_height = new_height
```
3. Real-World Example: E-commerce Scraper
Here is how the pieces fit together. This class extends the base scraper to pull product details from a page: it opens the URL, waits for the product container to load, then reads each field with a small helper. get_price shows a common cleanup step — stripping out the $ and commas so the price becomes a real number you can compare or store.
```python
class EcommerceScraper(ModernSeleniumScraper):
    def scrape_product_page(self, url):
        try:
            self.driver.get(url)
            # Wait for main content
            self.wait_for_dynamic_content('.product-container')
            # get_rating and get_reviews are not shown here
            return {
                'title': self.get_text('h1.product-title'),
                'price': self.get_price('.product-price'),
                'description': self.get_text('.product-description'),
                'rating': self.get_rating('.product-rating'),
                'reviews': self.get_reviews('.review-section'),
                'url': url
            }
        except Exception as e:
            print(f'Error scraping {url}: {e}')
            return None

    def get_text(self, selector):
        element = self.find_element_safely(By.CSS_SELECTOR, selector)
        return element.text.strip() if element else None

    def get_price(self, selector):
        price_elem = self.find_element_safely(By.CSS_SELECTOR, selector)
        if price_elem:
            # Strip the dollar sign and thousands separators before converting
            price_text = price_elem.text.strip().replace('$', '').replace(',', '')
            try:
                return float(price_text)
            except ValueError:
                return None
        return None
```
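The cleanup step in get_price can also be sketched as a standalone function that is easy to test without a browser. This is a minimal sketch, not part of the scraper class above: the name `parse_price` is hypothetical, and it assumes US-style formatting (comma for thousands, dot for decimals).

```python
import re

def parse_price(text):
    """Return the first number in text as a float, or None if there is none.

    Assumes US-style formatting: ',' separates thousands, '.' marks decimals.
    """
    # Drop thousands separators, then look for digits with an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

print(parse_price('$1,299.99'))
print(parse_price('Price: $45'))
print(parse_price('Out of stock'))
```

Returning None for text with no number keeps the behaviour consistent with the other helpers, which return None when a field is missing.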
### Best Practices
A few habits keep a Selenium scraper reliable and fast.
1. Error Handling
- Always use try-except blocks
- Implement timeouts
- Handle stale elements
- Log errors properly
(A "stale" element is one Selenium found earlier but the page has since reloaded, so the old reference no longer works — re-find it.)
2. Performance Optimization
- Use headless mode when possible
- Implement element caching
- Minimize page loads
- Clean up resources
3. Anti-Detection Measures
A normal browser controlled by Selenium leaves obvious traces that tell a site it is automated. These options remove some of the most visible ones — for example, the AutomationControlled flag and the "Chrome is being controlled by automated software" infobar. Note this only hides the basics; serious anti-bot systems look much deeper.
```python
def configure_stealth_options(self):
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-blink-features=AutomationControlled')
    # Ignored by newer Chrome versions; the excludeSwitches line below hides the infobar there
    options.add_argument('--disable-infobars')
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)
    return options
```
4. Data Validation
Before saving a result, check that the fields you actually need are present. This quick guard returns True only when title, price, and description all have values.
```python
def validate_extracted_data(self, data):
    required_fields = ['title', 'price', 'description']
    return all(data.get(field) for field in required_fields)
```
### Common Challenges & Solutions
1. Handling Popups
Cookie banners and newsletter popups often block the content you want. The pattern here waits briefly for a popup to appear, clicks its close button if found, and simply moves on (pass) if no popup shows up within the timeout.
```python
def handle_popup(self):
    try:
        popup = self.wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, 'popup'))
        )
        close_button = popup.find_element(By.CLASS_NAME, 'close-button')
        close_button.click()
    except TimeoutException:
        pass  # No popup found
```
2. Managing Sessions
To reach pages behind a login, Selenium can fill in the form just like a user. This method types the username and password, clicks submit, then waits for the dashboard to confirm the login worked. Because it uses the same browser session, later requests stay logged in.
```python
def login(self, username, password):
    self.driver.get('https://example.com/login')
    username_field = self.find_element_safely(By.ID, 'username')
    password_field = self.find_element_safely(By.ID, 'password')
    username_field.send_keys(username)
    password_field.send_keys(password)
    submit = self.find_element_safely(By.ID, 'login-button')
    submit.click()
    return self.wait_for_dynamic_content('.dashboard')
```
### Advanced Topics
1. Parallel Scraping
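Scraping pages one at a time is slow. A ThreadPoolExecutor runs several at once (here up to four workers), so multiple pages are fetched in parallel and the results are collected in the same order as the input URLs. In the code, scrape_single_page stands for your own per-page function. With Selenium, give each worker thread its own WebDriver instance, because a single driver is not safe to share across threads. Keep the worker count modest so you do not hammer the target site.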
```python
from concurrent.futures import ThreadPoolExecutor

def scrape_multiple_pages(urls, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(scrape_single_page, url) for url in urls]
        for future in futures:
            results.append(future.result())
    return results
```
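In a simple collection loop, one page that raises an exception aborts the whole batch when its result is read. The sketch below shows one way to keep the successful pages and record the failures separately. The names `scrape_all` and `fake_scrape` are hypothetical, and `fake_scrape` is only a stand-in for a real per-page function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, scrape_one, max_workers=4):
    """Run scrape_one(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL, so one failing
    page does not discard the pages that succeeded.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so results can be labelled
        future_to_url = {executor.submit(scrape_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = str(exc)
    return results, errors

# Stand-in for a real page scraper (hypothetical): fails on one URL
def fake_scrape(url):
    if 'broken' in url:
        raise ValueError('page failed to load')
    return {'url': url, 'length': len(url)}

results, errors = scrape_all(
    ['https://example.com/a', 'https://example.com/broken', 'https://example.com/b'],
    fake_scrape,
)
print(sorted(results))
print(errors)
```

Keying both dictionaries by URL means the caller can retry only the failed pages later.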
2. Custom Wait Conditions
Selenium's built-in waits cover most cases, but you can write your own. A custom condition is just a function that returns True when you are ready to continue. The example waits until an element's text differs from what it was before — handy after clicking something that updates a label in place.
```python
from selenium.webdriver.support.wait import WebDriverWait

def wait_for_text_change(self, element, original_text):
    def text_changed(driver):
        return element.text != original_text
    self.wait.until(text_changed)
```
Remember to always respect websites' terms of service and implement proper delays between requests to avoid overwhelming servers.
### FAQ
**Q: Why is my Selenium script not finding elements?**
Almost always a timing problem: your code looks for the element before the page has rendered it. Use explicit waits (WebDriverWait plus expected_conditions) so Selenium keeps checking until the element appears, rather than time.sleep, which guesses a fixed delay and is both brittle and slow.
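To see why an explicit wait beats a fixed sleep, it can help to look at the idea in plain Python. The following is a conceptual sketch of a polling loop, not Selenium's actual implementation: `wait_until` is a hypothetical helper, and the lambda is a stand-in for "the element has rendered".

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)

# Stand-in for "the element has rendered": becomes ready after 0.3 seconds
ready_at = time.monotonic() + 0.3
result = wait_until(lambda: time.monotonic() >= ready_at, timeout=2, poll_interval=0.05)
print(result)
```

The loop returns as soon as the condition holds, so a fast page costs almost nothing, while a slow page still gets the full timeout. A fixed sleep always costs the same amount and fails whenever the page is slower than the guess.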
**Q: Is Selenium detectable as a bot?**
Yes. A default WebDriver browser exposes signals like the navigator.webdriver flag (a JavaScript property that is true only for automated browsers), so protected sites can identify it quickly. A browser configured to behave like a normal user session, or a dedicated scraping API, presents a more consistent profile.
**Q: Should I use Selenium or Playwright?**
Playwright is the more modern choice: a cleaner async API, automatic waiting for elements, and better defaults out of the box. Selenium is still a solid pick for existing projects and supports the widest range of programming languages.
---
## Is Python good for web scraping? (2026 Analysis)
URL: https://scrappey.com/qa/python-web-scraping/python-scraping-benefits
Yes, Python is one of the most popular languages for web scraping — pulling data off web pages automatically. This is a 2026 look at why, with concrete examples and honest trade-offs.
### Quick facts
- **Ecosystem:** Scrapy, BeautifulSoup, lxml, pandas
- **Readability:** Low-boilerplate, fast to prototype
- **Data pipeline:** Seamless into pandas/NumPy
- **Community:** Largest scraping community
- **Weak spot:** CPU-bound parsing vs compiled langs
### Key Advantages
Three things make Python a strong fit for scraping: a deep library ecosystem, very readable code, and the ability to scale up when you need speed.
1. Rich Ecosystem
- **Specialized Libraries** — each tool does one job well, and you mix them as needed:
- Requests: fetching pages over HTTP
- Beautiful Soup: reading and searching the HTML you get back
- Scrapy: a full framework for large, enterprise scraping jobs
- Selenium: driving a real browser for sites that need clicks and JavaScript
- Playwright: a modern, faster take on browser automation
- LXML: very fast HTML parsing
- aiohttp: making many requests at once (async)
- **Community Support** — you rarely get stuck alone:
- Active Stack Overflow community
- Regular library updates
- Extensive documentation
- Numerous tutorials
- Code examples
- Open-source contributions
- Bug fixes and improvements
- Security updates
2. Code Simplicity
A working scraper is just a few lines: fetch the page, parse it, pick out what you want.
```python
# Beautiful Soup Example
from bs4 import BeautifulSoup
import requests

def simple_scraper(url):
    # Get webpage content
    response = requests.get(url)
    # Parse HTML
    soup = BeautifulSoup(response.text, 'lxml')
    # Extract data
    data = {
        'title': soup.find('h1').text.strip(),
        'paragraphs': [p.text for p in soup.find_all('p')],
        'links': [a['href'] for a in soup.find_all('a', href=True)]
    }
    return data
```
3. Performance Capabilities
When one page at a time is too slow, async code fetches many URLs in parallel without waiting for each to finish.
```python
# Async Scraping Example
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def async_scraper(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        return await asyncio.gather(*tasks)

async def fetch_url(session, url):
    async with session.get(url) as response:
        html = await response.text()
        soup = BeautifulSoup(html, 'lxml')
        return {
            'url': url,
            'title': soup.find('h1').text.strip() if soup.find('h1') else None
        }

# Usage
urls = ['https://example1.com', 'https://example2.com']
results = asyncio.run(async_scraper(urls))
```
### Industry Applications
Here is where teams actually put Python scrapers to work.
1. Data Mining
For example, a price monitor that checks product pages and alerts you when a price changes:
```python
# Example: Price Monitoring System
class PriceMonitor:
    def __init__(self):
        self.session = requests.Session()
        self.db = Database()  # Your database connection

    def monitor_prices(self, product_urls):
        for url in product_urls:
            price = self.extract_price(url)
            if self.is_price_changed(url, price):
                self.notify_price_change(url, price)
                self.db.update_price(url, price)

    def extract_price(self, url):
        response = self.session.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        price_elem = soup.find('span', class_='price')
        # Strip the dollar sign before converting to a number
        return float(price_elem.text.strip().replace('$', ''))
```
2. Research Automation
- Academic data collection
- Market research
- Competitive analysis
- Trend monitoring
3. Content Aggregation
- News collection
- Social media monitoring
- Product catalogs
- Review aggregation
### Enterprise Benefits
At larger scale, Python helps in three areas: scaling across machines, staying easy to maintain, and plugging into the rest of your stack.
1. Scalability
Tools like Celery (a task queue that spreads jobs across many workers) let you scrape thousands of URLs in parallel:
```python
# Example: Distributed Scraping with Celery
import requests
from bs4 import BeautifulSoup
from celery import Celery

app = Celery('scraper', broker='redis://localhost:6379/0')

@app.task
def scrape_url(url):
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')
        return {
            'url': url,
            'status': 'success',
            'data': extract_data(soup)  # extract_data: your own parsing function
        }
    except Exception as e:
        return {
            'url': url,
            'status': 'error',
            'error': str(e)
        }
```
2. Maintenance
- Clear syntax for debugging
- Easy to modify and extend
- Optional static type checking (with type hints)
- Comprehensive logging
3. Integration
- Database connectivity
- API development
- Cloud deployment
- Monitoring tools
### ROI Factors
ROI here means return on investment — what you get back for the time and money spent. Python pays off in three ways.
1. Development Speed
- Rapid prototyping
- Quick iterations
- Extensive libraries
- Code reusability
2. Resource Efficiency
- Modest memory use for HTTP-only scrapers
- Mostly I/O-bound work, so CPU is rarely the bottleneck
- Bandwidth optimization
- Cost-effective scaling
3. Team Productivity
- Easy to learn
- Good readability
- Strong debugging tools
- Extensive documentation
### Best Practices
A few habits keep a scraper reliable as it grows: organize your code, handle errors, and tune for performance.
1. Code Organization
Wrapping the session, parser, and logging in one class keeps the code tidy and reusable:
```python
# Example: Structured Scraping Project
import logging

import requests
from bs4 import BeautifulSoup

class WebScraper:
    def __init__(self):
        self.session = self.setup_session()
        self.parser = 'lxml'
        self.logger = self.setup_logging()

    def setup_session(self):
        session = requests.Session()
        # update() keeps the session's other default headers
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        return session

    def setup_logging(self):
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(__name__)

    def scrape(self, url):
        try:
            response = self.session.get(url, timeout=10)
            soup = BeautifulSoup(response.text, self.parser)
            return self.parse_content(soup)  # parse_content: your own extraction method
        except Exception as e:
            self.logger.error(f'Error scraping {url}: {e}')
            return None
```
2. Error Handling
- Comprehensive exception handling
- Retry mechanisms
- Logging and monitoring
- Data validation
3. Performance Optimization
- Connection pooling
- Async operations
- Caching strategies
- Resource cleanup
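As one illustration of a caching strategy, the sketch below keeps each fetched page for a fixed time so repeat requests skip the network. It is a minimal in-memory example under stated assumptions: `TTLCache` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call.

```python
import time

class TTLCache:
    """Keep fetched pages for ttl seconds so repeat requests skip the network."""

    def __init__(self, fetch, ttl=300.0):
        self.fetch = fetch      # callable: url -> page content
        self.ttl = ttl
        self._store = {}        # url -> (expires_at, content)

    def get(self, url):
        now = time.monotonic()
        entry = self._store.get(url)
        if entry and entry[0] > now:
            return entry[1]     # still fresh: serve from cache
        content = self.fetch(url)
        self._store[url] = (now + self.ttl, content)
        return content

# Stand-in fetcher (hypothetical) that records how often it is called
calls = []
def fake_fetch(url):
    calls.append(url)
    return f'<html>{url}</html>'

cache = TTLCache(fake_fetch, ttl=60)
first = cache.get('https://example.com')
second = cache.get('https://example.com')
print(len(calls))
```

A short time-to-live suits pages that change often, such as prices. Static reference pages can be cached for much longer.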
Python's combination of simplicity, powerful libraries, and extensive community support makes it an excellent choice for web scraping projects of any scale.
### FAQ
**Q: Why is Python so popular for scraping?**
Its syntax is short and readable, its library ecosystem is mature, and scraped data flows straight into analysis tools like pandas (a popular Python data-table library). That combination makes it one of the quickest languages for going from idea to working scraper.
**Q: Is Python fast enough for large-scale scraping?**
Yes. Most scraping time is spent waiting on the network (I/O), not on the language itself, and async frameworks like Scrapy keep many requests running at once. Parsing HTML in pure Python can be a bottleneck, but lxml fixes that.
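The point about waiting on the network can be demonstrated without touching a real site. In this sketch, `fake_fetch` is a hypothetical stand-in that only sleeps for 0.2 seconds to simulate latency, so the timings show the effect of concurrency rather than real network behaviour.

```python
import asyncio
import time

async def fake_fetch(url, latency=0.2):
    # Stand-in for a network request: the coroutine just waits
    await asyncio.sleep(latency)
    return url

async def fetch_sequentially(urls):
    return [await fake_fetch(url) for url in urls]

async def fetch_concurrently(urls):
    return await asyncio.gather(*(fake_fetch(url) for url in urls))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

urls = [f'https://example.com/page/{i}' for i in range(5)]
seq_result, seq_seconds = timed(fetch_sequentially(urls))
con_result, con_seconds = timed(fetch_concurrently(urls))
print(f'sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s')
```

The sequential run takes roughly the sum of all the waits, while the concurrent run takes roughly as long as the slowest single wait. That is why the language's raw speed matters less than how many requests are in flight.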
**Q: What are Python's limits for scraping?**
Heavy CPU-bound HTML parsing is slower than in compiled languages. And like any tool, Python cannot handle anti-bot defences on its own — that still needs proxies (relays that swap your IP address) and fingerprint handling (looking like a real browser).
---
## What does BeautifulSoup do in Python? (Complete Guide 2026)
URL: https://scrappey.com/qa/python-web-scraping/beautifulsoup-explained
BeautifulSoup is a Python library for reading HTML. You give it the raw HTML of a web page (a long string of tags), and it turns that into a tree of objects you can search and pull data from - like grabbing every link, the page title, or the contents of a table. This guide explains what BeautifulSoup does and how to use it.
### Quick facts
- **What it is:** HTML/XML parsing library
- **Pair with:** requests (fetching)
- **Find elements:** find / find_all / select
- **Parsers:** html.parser, lxml, html5lib
- **Does NOT:** Fetch pages or run JavaScript
### Core Functionality
BeautifulSoup does three core jobs: it parses HTML into a searchable tree, it lets you navigate and find elements in that tree, and it lets you extract the text and attributes you care about.
1. HTML/XML Parsing
First you hand the raw HTML to BeautifulSoup along with a *parser* - the engine that reads the tags and builds the tree. The three common choices trade speed for forgiveness with messy HTML:
```python
from bs4 import BeautifulSoup

# Different parser options
soup = BeautifulSoup(html_doc, 'lxml')         # Fastest
soup = BeautifulSoup(html_doc, 'html.parser')  # Built-in
soup = BeautifulSoup(html_doc, 'html5lib')     # Most lenient
# Handle encoding (from_encoding applies when html_doc is bytes)
soup = BeautifulSoup(html_doc, 'lxml', from_encoding='utf-8')
```
2. Navigation & Search
Once you have the tree, you can walk it like a family tree (parents, children, siblings) or search it directly. find returns the first match, find_all returns every match, and select takes a CSS selector - the same syntax you would use in a stylesheet:
```python
import re

# Tree Navigation
parent = element.parent
children = element.children
siblings = element.next_siblings
# Finding Elements
elements = soup.find_all(['h1', 'h2', 'h3'])  # Multiple tags
div = soup.find('div', class_='content')      # With class
links = soup.select('div.content > a')        # CSS selector
heading = soup.find(id='main-title')          # By ID
# Advanced Search (string= is the current name for the older text= argument)
matches = soup.find_all(string=re.compile('pattern'))
elements = soup.find_all(attrs={'data-id': True})
```
3. Data Extraction
After you find an element, you read its visible text with .text or one of its attributes (like a link's href or an image's src) with .get(). The class below wraps these into helpers that return None instead of crashing when an element is missing:
```python
class ContentExtractor:
    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'lxml')

    def get_text(self, selector):
        element = self.soup.select_one(selector)
        return element.text.strip() if element else None

    def get_attribute(self, selector, attribute):
        element = self.soup.select_one(selector)
        return element.get(attribute) if element else None

    def get_structured_data(self):
        return {
            'title': self.get_text('h1'),
            'description': self.get_text('.description'),
            'image_url': self.get_attribute('img.main', 'src'),
            'links': [a['href'] for a in self.soup.select('a[href]')],
            'metadata': {
                'author': self.get_text('.author'),
                'date': self.get_text('.date'),
                'category': self.get_text('.category')
            }
        }
```
### Common Operations
1. Cleaning HTML
Real pages are full of clutter - scripts, styling, comments, and tracking attributes. You can strip these out so only useful content remains. decompose() deletes a tag and everything inside it; extract() pulls a node out of the tree:
```python
from bs4 import BeautifulSoup, Comment

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    # Remove unwanted tags
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Clean attributes
    allowed_attrs = ['href', 'src', 'alt']
    for tag in soup.find_all(True):
        # Iterate over a copy because attributes are deleted inside the loop
        for attr in list(tag.attrs):
            if attr not in allowed_attrs:
                del tag[attr]
    return str(soup)
```
2. Handling Tables
HTML tables are a common scraping target. The pattern is: read the header cells (<th>), then walk each row (<tr>) and pair its cells with those headers, producing one clean dictionary per row:
```python
def parse_table(table_element):
    data = []
    headers = []
    # Extract headers
    for th in table_element.find_all('th'):
        headers.append(th.text.strip())
    # Extract rows
    for row in table_element.find_all('tr'):
        cells = row.find_all(['td', 'th'])
        # Skip the header row itself
        if cells and not all(cell.text.strip() in headers for cell in cells):
            row_data = [cell.text.strip() for cell in cells]
            data.append(dict(zip(headers, row_data)))
    return data
```
3. Form Handling
Sometimes you need to understand a form before submitting it - where it posts to, which method it uses, and what fields it expects. This reads a <form> and lists each input with its name, type, and default value:
```python
def extract_form_data(form_element):
    form_data = {
        'action': form_element.get('action'),
        'method': form_element.get('method', 'get'),
        'fields': []
    }
    for input_tag in form_element.find_all(['input', 'select', 'textarea']):
        field = {
            'name': input_tag.get('name'),
            'type': input_tag.get('type', 'text'),
            'value': input_tag.get('value', ''),
            'required': input_tag.get('required') is not None
        }
        form_data['fields'].append(field)
    return form_data
```
### Best Practices
1. Performance Optimization
For faster, leaner parsing:
- Use lxml parser for speed
- Cache parsed BeautifulSoup objects
- Use specific searches over general ones
- Minimize DOM traversals
2. Error Handling
Web pages change, so an element you expect may be missing. Wrap extraction in a guard that logs the problem and returns None instead of crashing the whole scraper:
```python
def safe_extract(soup, selector, attribute=None):
    try:
        element = soup.select_one(selector)
        if element:
            return element.get(attribute) if attribute else element.text.strip()
    except Exception as e:
        logging.error(f'Error extracting {selector}: {e}')
    return None
```
3. Memory Management
Parsed trees can be large, so free what you no longer need:
- Use decompose() to remove unused elements
- Clear soup objects when done
- Use generators for large files
- Implement cleanup routines
### Advanced Features
1. Custom Filters
When tag name, class, and ID are not enough, you can pass find_all your own function. It runs against every tag and keeps the ones that return True - here, only <div> elements that have a content class and contain a paragraph:
```python
def custom_filter(tag):
    return (tag.name == 'div' and
            tag.has_attr('class') and
            'content' in tag['class'] and
            tag.find('p'))

matches = soup.find_all(custom_filter)
```
2. Document Modification
BeautifulSoup can also rewrite the tree, not just read it. You can add classes, create brand-new tags with new_tag, wrap existing content, and clean up text in place:
```python
def enhance_html(soup):
    # Add classes
    for paragraph in soup.find_all('p'):
        paragraph['class'] = paragraph.get('class', []) + ['enhanced']
    # Create new elements
    new_div = soup.new_tag('div', attrs={'class': 'wrapper'})
    soup.body.wrap(new_div)
    # Modify text (string= is the current name for the older text= argument)
    for text in soup.find_all(string=True):
        if text.parent.name not in ['script', 'style']:
            text.replace_with(text.string.strip())
    return soup
```
3. Encoding Handling
Pages can arrive in different character encodings (the byte-to-character mapping, like UTF-8). If you guess wrong, text comes out garbled. The chardet library detects the likely encoding so you can decode the bytes correctly before parsing:
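```python
import chardet

def handle_encoding(html_content):
    # Detect encoding (html_content must be bytes)
    detected = chardet.detect(html_content)
    # Fall back to UTF-8 if detection returns no encoding
    encoding = detected['encoding'] or 'utf-8'
    soup = BeautifulSoup(html_content.decode(encoding), 'lxml')
    return soup
```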
BeautifulSoup is a powerful library that makes HTML parsing in Python intuitive and efficient. Understanding these patterns and best practices will help you build robust and maintainable web scraping solutions.
### FAQ
**Q: Does BeautifulSoup download web pages?**
No. It only parses HTML you already have - it never makes a network request itself. Pair it with requests (or another HTTP client) to fetch the page, then hand that HTML to BeautifulSoup to read it.
**Q: Can BeautifulSoup handle JavaScript-rendered content?**
No - it parses static HTML only, the raw markup the server sends. If content is added later by JavaScript running in the browser, BeautifulSoup never sees it. For that you need a browser tool like Playwright or Selenium to render the page first, then parse the result.
**Q: Which parser should I use?**
lxml is the usual choice: fastest and forgiving with the messy HTML real sites produce. html.parser is built into Python (no extra install) but slower; html5lib follows the HTML standard most closely but is the slowest of the three.
---
## Which Python libraries are best for web scraping? (2026 Guide)
URL: https://scrappey.com/qa/python-web-scraping/python-scraping-libraries
If you want to scrape websites with Python, the first decision is which library to use. There are a handful of popular ones, and each fits a different kind of job. This guide walks through the main options for web scraping and helps you pick the right tool for your needs.
### Quick facts
- **Fetching:** requests, httpx, curl_cffi
- **Parsing:** BeautifulSoup, lxml, parsel
- **Frameworks:** Scrapy
- **Browsers:** Playwright, Selenium
- **Pick by:** Static vs dynamic + scale
### Popular Libraries Overview
1. Requests + Beautiful Soup Combination
The most popular starting point. Requests downloads the page (it sends the HTTP request), and Beautiful Soup reads the returned HTML so you can pull out the pieces you want.
```python
import requests
from bs4 import BeautifulSoup

def basic_scraper(url):
    # Send HTTP request
    response = requests.get(url)
    # Parse HTML content
    soup = BeautifulSoup(response.text, 'lxml')
    # Extract data
    title = soup.find('h1').text.strip()
    paragraphs = [p.text for p in soup.find_all('p')]
    return {
        'title': title,
        'content': paragraphs
    }
```
**Best for:**
- Learning web scraping
- Small to medium projects
- Static websites
- Quick prototypes
2. Scrapy Framework
The professional's choice for large-scale web scraping. Instead of a single library, it gives you a full framework that handles fetching pages, following links, and saving results.
```python
import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    start_urls = ['https://example.com/news']

    def parse(self, response):
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'summary': article.css('p.summary::text').get(),
                'date': article.css('time::attr(datetime)').get()
            }
        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
**Best for:**
- Production environments
- Large-scale scraping
- Performance-critical projects
- Distributed scraping
3. Selenium WebDriver
Selenium drives a real browser through code, so it can handle pages that only finish loading after JavaScript runs. Use it when a plain HTTP request returns an empty or half-built page.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class DynamicScraper:
    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def scrape_dynamic_content(self, url):
        self.driver.get(url)
        # Wait for dynamic content to load
        content = self.wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
        )
        return content.text
```
**Best for:**
- JavaScript-heavy websites
- Sites requiring login
- Interactive web applications
- Complex user interactions
4. HTTPX + Playwright
A modern combo. HTTPX is an async-capable alternative to Requests with optional HTTP/2 support, and Playwright drives a browser like Selenium but is newer and waits for elements automatically. Use HTTPX for plain requests and Playwright when a page needs a real browser.
```python
from playwright.async_api import async_playwright
import httpx

async def modern_scraper():
    async with httpx.AsyncClient() as client:
        # Handle regular HTTP requests
        response = await client.get('https://api.example.com/data')
        api_data = response.json()
    # Use Playwright's async API so every call can be awaited
    async with async_playwright() as p:
        # Handle complex JavaScript pages
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        content = await page.content()
        await browser.close()
    return api_data, content
```
**Best for:**
- Modern web applications
- Sites with anti-bot measures
- Complex JavaScript rendering
- High-performance needs
### Python web scraping library comparison (2026)
There is no single "best" Python web scraping library — there is a best tool for each *layer* of the job. Most scrapers combine an HTTP client (to fetch the page) with a parser (to extract data), and reach for a browser engine only when the page needs JavaScript. This table maps the main options to the layer they belong to.
| Library | Layer | Runs JavaScript? | Anti-bot help | Best for | Install |
|---|---|---|---|---|---|
| Requests | HTTP client | No | None | Simple static pages & JSON APIs | pip install requests |
| HTTPX | HTTP client | No | HTTP/2, async | Async & concurrent fetching | pip install httpx |
| curl_cffi | HTTP client | No | TLS/JA3 impersonation | Beating TLS-fingerprint blocks | pip install curl_cffi |
| BeautifulSoup | HTML parser | n/a | n/a | Beginner-friendly extraction | pip install beautifulsoup4 |
| lxml | HTML/XML parser | n/a | n/a | Fast parsing with XPath | pip install lxml |
| selectolax | HTML parser | n/a | n/a | Fastest parsing at high volume | pip install selectolax |
| Scrapy | Framework | Add-on | Partial | Large crawls with pipelines | pip install scrapy |
| Selenium | Browser automation | Yes | Weak | Legacy dynamic pages | pip install selenium |
| Playwright | Browser automation | Yes | Better than Selenium | Modern JS-rendered pages | pip install playwright |
| Crawlee | Framework | Yes | Built-in | Production crawlers | pip install crawlee |
**The 90% rule:** for static sites, Requests + BeautifulSoup (or lxml/selectolax for speed) covers most jobs. Switch to Playwright for JavaScript-rendered pages, and to Scrapy or Crawlee when you are crawling thousands of pages and need retries, queues, and pipelines. Reach for curl_cffi when a site blocks you on your TLS fingerprint before you can even parse anything.
### Choosing the Right Library
Pick based on your experience level and how hard the target sites are.
For Beginners:
- Start with Requests + Beautiful Soup
- Learn basic HTML and CSS selectors
- Practice with static websites
- Understand HTTP basics
For Intermediate Users:
- Explore Selenium for dynamic content
- Learn about async programming
- Handle more complex scenarios
- Implement error handling
For Advanced Users:
- Master Scrapy for large projects
- Implement distributed systems
- Handle anti-scraping measures
- Optimize performance
### Best Practices
Whichever library you choose, these habits keep your scraper reliable and considerate.
1. Respect Websites
- Read robots.txt
- Implement delays
- Don't overload servers
- Handle errors gracefully
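Reading robots.txt does not need a third-party library. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt, so it runs without a network request. The user agent name `my-scraper` and the file contents are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for illustration)
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed here
parser.parse(robots_txt.splitlines())

print(parser.can_fetch('my-scraper', 'https://example.com/products'))
print(parser.can_fetch('my-scraper', 'https://example.com/admin/users'))
print(parser.crawl_delay('my-scraper'))
```

Against a live site, the same parser can load the file itself with `set_url()` followed by `read()`. The crawl delay, when a site declares one, is a reasonable minimum pause between requests.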
2. Data Management
- Store data properly
- Implement backups
- Validate extracted data
- Handle duplicates
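Handling duplicates can be as simple as remembering a fingerprint of every record already seen. This is a minimal sketch under stated assumptions: `record_key` and `deduplicate` are hypothetical names, and it treats two records as duplicates when the chosen fields (here just `url`) match.

```python
import hashlib
import json

def record_key(record, fields):
    """Build a stable fingerprint from the fields that define a duplicate."""
    payload = json.dumps([record.get(f) for f in fields], default=str)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

def deduplicate(records, fields=('url',)):
    """Yield each record once, keeping the first occurrence."""
    seen = set()
    for record in records:
        key = record_key(record, fields)
        if key not in seen:
            seen.add(key)
            yield record

rows = [
    {'url': 'https://example.com/p/1', 'price': 10.0},
    {'url': 'https://example.com/p/2', 'price': 12.5},
    {'url': 'https://example.com/p/1', 'price': 10.0},  # scraped twice
]
unique = list(deduplicate(rows))
print(len(unique))
```

Because `deduplicate` is a generator, it can filter a long stream of records without holding them all in memory. Only the set of fingerprints grows.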
3. Code Organization
- Use proper error handling
- Implement logging
- Write clean, maintainable code
- Document your code
### Common Challenges and Solutions
Two problems show up in almost every project. Here is how to handle them.
1. Rate Limiting
Rate limiting is when a site throttles or blocks clients that send requests too fast. Pause a random amount between requests so your traffic is spread out and looks less robotic.
```python
from random import uniform
from time import sleep

import requests

def rate_limited_request(url, min_delay=1, max_delay=3):
    sleep(uniform(min_delay, max_delay))  # Random pause before each request
    return requests.get(url, timeout=10)
```
2. Error Handling
Requests fail sometimes. Retry on failure, waiting longer after each attempt (this is called exponential backoff), and give up only after a few tries.
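```python
from time import sleep

import requests

def resilient_scraper(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error
            sleep(2 ** attempt)  # Wait 1s, then 2s, then 4s, ...
```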
Remember: The best library depends on your specific needs. Start simple and upgrade as your requirements grow more complex.
### The hard part: getting blocked
Picking a library is the easy part. The reason most Python scrapers fail in production is not parsing — it is that the target *blocks the request before it returns real HTML*. No parser can extract data from a 403, a CAPTCHA page, or a Cloudflare challenge.
The libraries above do not solve this on their own. Requests sends a TLS handshake no browser sends, so a TLS fingerprint check flags it instantly. Selenium leaks navigator.webdriver and other automation tells. Working with modern anti-bot stacks (Cloudflare, DataDome, Akamai) means rotating residential proxies, matching a real browser fingerprint, and keeping all of those *coherent* — a moving target that is a project in itself.
When DIY stops scaling, a managed web scraping API like Scrappey handles the proxies, fingerprinting, and JS rendering server-side, so your Python code goes back to being a simple HTTP request plus your favourite parser:
### Example
```python
import requests
# When a site blocks plain Requests/Selenium, route the fetch through a
# scraping API. Proxies, browser fingerprint, JS rendering and CAPTCHAs are
# handled server-side -- your parser code stays exactly the same.
resp = requests.post(
'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
json={
'cmd': 'request.get',
'url': 'https://example.com/protected',
},
timeout=120,
)
html = resp.json()['solution']['response']
# ...then parse 'html' with BeautifulSoup / lxml / selectolax as usual.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)
```
### FAQ
**Q: What is the minimal stack to start?**
requests to download the page and BeautifulSoup to read the HTML. That covers most static sites. Only add a browser tool like Selenium or Playwright when the content is built by JavaScript and is not in the raw HTML.
**Q: When do I need curl_cffi instead of requests?**
When a site checks your TLS handshake, the encrypted greeting your client sends when starting an https connection. Plain requests has a handshake that screams "Python script." curl_cffi can reproduce a real browser's TLS/JA3 fingerprint (JA3 is a signature derived from that handshake), helping you slip past basic fingerprint checks.
**Q: Is Scrapy a parser or a framework?**
A framework. It does far more than parse: it handles fetching, concurrency (many requests at once), retries, and pipelines for processing data. For extraction it uses parsel, which supports XPath and CSS selectors. You can still plug in BeautifulSoup if you prefer it.
**Q: What is the best Python library for web scraping in 2026?**
There is no single best library — pick by layer. For static pages, Requests (HTTP) plus BeautifulSoup or lxml (parsing) is the standard combination. For JavaScript-rendered pages, Playwright is the modern default over Selenium. For large crawls, use Scrapy or Crawlee. For sites that block you on your TLS fingerprint, curl_cffi impersonates a real browser handshake.
**Q: Do I need Scrapy, or are Requests and BeautifulSoup enough?**
For a handful of pages or a one-off script, Requests + BeautifulSoup is simpler and enough. Scrapy earns its complexity once you are crawling thousands of URLs and need built-in request scheduling, retries, concurrency, deduplication, and item pipelines. Crawlee is a newer alternative that adds first-class browser support and anti-blocking out of the box.
**Q: Which Python library can run JavaScript-rendered pages?**
Requests and BeautifulSoup cannot execute JavaScript — they only see the initial HTML. To render JS you need a browser engine: Playwright (recommended in 2026) or Selenium. A faster alternative is to skip the browser entirely and call the page’s underlying JSON API directly. See our guide on scraping JavaScript-rendered pages with Python.
**Q: How do I keep my Python scraper from getting blocked?**
Use realistic headers and a real browser TLS fingerprint (curl_cffi), rotate residential proxies, throttle request rate, and keep your fingerprint and IP geolocation coherent. Against serious anti-bot vendors this becomes a full-time effort, which is why many teams route hard targets through a managed scraping API that handles proxies, fingerprinting, and JS rendering server-side.
---
## What are the best practices for web scraping? (2026 Guide)
URL: https://scrappey.com/qa/python-web-scraping/scraping-best-practices
Best practices for web scraping are the habits that keep your scraper reliable, polite to the sites you collect from, and unlikely to get you blocked or into legal trouble. This is the 2026 guide.
### Quick facts
- **Respect:** robots.txt & Terms of Service
- **Rate limit:** Throttle + randomise delays
- **Identify:** Rotate realistic user agents
- **Be resilient:** Retries, backoff, caching
- **Store:** Deduplicate & validate output
### Ethical Considerations
Scraping ethically means treating a website like a guest, not a freeloader: take only what you need and don't slow the site down for real users.
1. Respect Website Policies
- Always check robots.txt first (the file at the site root that says which paths bots may visit)
- Follow site terms of service
- Implement proper delays between requests
- Honor crawl-delay directives (a robots.txt line telling bots how long to wait between hits)
- Stay within rate limits
- Identify your scraper (User-Agent)
- Request permission when needed
- Cache data when allowed
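The robots.txt and crawl-delay checks in the list above can be sketched with the standard library's urllib.robotparser. The rules, the bot name, and the URLs here are invented for illustration; a real scraper would more likely point the parser at the live file with set_url() and read() rather than an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "ResponsibleBot"  # assumed bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str) -> bool:
    """Return True if robots.txt lets our user agent fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


# Seconds to wait between requests, or None if no Crawl-delay line applies.
delay = parser.crawl_delay(USER_AGENT)

print(allowed("https://example.com/products"))
print(allowed("https://example.com/private/report"))
print(delay)
```

Checking allowed() before every fetch and sleeping for delay seconds between requests covers the first and fourth bullets; the Terms of Service still have to be read by a person.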
2. Resource Management
Two simple habits do most of the work: a rate limiter (caps how many requests you send per minute) and a cache (reuses a page you already fetched instead of asking for it again). The example below checks the cache first, only hits the network when allowed, and stores successful responses.
import requests
# RateLimiter and Cache are placeholder helpers, not library classes:
# supply your own implementations.
class ResponsibleScraper:
    def __init__(self):
        self.session = requests.Session()
        self.rate_limiter = RateLimiter(max_requests=10, time_window=60)
        self.cache = Cache()
    def fetch_url(self, url):
        # Check cache first
        if cached := self.cache.get(url):
            return cached
        # Respect rate limits
        with self.rate_limiter:
            response = self.session.get(
                url,
                headers={'User-Agent': 'ResponsibleBot/1.0'}
            )
        # Cache valid responses
        if response.status_code == 200:
            self.cache.set(url, response.text)
        return response.text
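The RateLimiter and Cache names used above are not library classes. One possible way to fill them in, as a minimal standard-library sketch: a sliding-window limiter used as a context manager, and an in-memory cache with a time-to-live. The ttl parameter and the injectable clock and sleep arguments are assumptions added here so the behaviour can be tested without real waiting.

```python
import threading
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_requests per time_window seconds."""

    def __init__(self, max_requests, time_window,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.time_window = time_window
        self.clock = clock    # injectable so the limiter can be tested
        self.sleep = sleep
        self.sent = deque()   # timestamps of recent requests
        self.lock = threading.Lock()

    def __enter__(self):
        with self.lock:
            now = self.clock()
            # Forget requests that have left the window.
            while self.sent and now - self.sent[0] >= self.time_window:
                self.sent.popleft()
            if len(self.sent) >= self.max_requests:
                # Wait until the oldest request ages out of the window.
                self.sleep(self.time_window - (now - self.sent[0]))
                self.sent.popleft()
            self.sent.append(self.clock())
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions


class Cache:
    """In-memory cache; entries expire after ttl seconds."""

    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        saved_at, value = entry
        if self.clock() - saved_at > self.ttl:
            del self.store[key]
            return None
        return value

    def set(self, key, value):
        self.store[key] = (self.clock(), value)
```

An in-memory dict is lost when the process exits; for long crawls a disk-backed cache is probably the better fit, but the get/set interface can stay the same.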
### Technical Best Practices
1. Error Handling
Networks fail, pages time out, and servers return errors. A robust scraper expects this: it retries with backoff (waiting a little longer after each failure) and logs problems instead of crashing. The class below mounts an automatic 3-retry policy and wraps each fetch in try/except so one bad URL never stops the run.
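import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
class RobustScraper:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.retries = Retry(total=3, backoff_factor=1)
        self.session = requests.Session()
        # Mount the retry policy for both schemes; most sites are https.
        adapter = HTTPAdapter(max_retries=self.retries)
        self.session.mount('http://', adapter)
        self.session.mount('https://', adapter)
    def safe_scrape(self, url):
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            # parse_content is your own parsing method (not shown here)
            return self.parse_content(response.text)
        except requests.RequestException as e:
            self.logger.error(f'Failed to fetch {url}: {e}')
            return None
        except Exception as e:
            self.logger.error(f'Error processing {url}: {e}')
            return None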
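To make "waiting a little longer after each failure" concrete, here is a small standard-library sketch of exponential backoff. It is not how urllib3's Retry is implemented; the function names, the doubling schedule, the 30-second cap, and the optional jitter are all illustrative assumptions.

```python
import random
import time


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Yield the wait before each retry: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def retry(func, attempts=3, base=1.0, jitter=0.0, sleep=time.sleep):
    """Call func(); after an exception, wait and try again, up to `attempts` retries."""
    last_error = None
    for delay in [0.0, *backoff_delays(attempts, base)]:
        if delay:
            # Optional random jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, jitter))
        try:
            return func()
        except Exception as error:  # narrow this to network errors in real code
            last_error = error
    raise last_error


print(list(backoff_delays(4)))
```

Passing sleep as a parameter is only a testing convenience: a test can record the requested waits instead of actually sleeping.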
2. Performance Optimization
Fetching pages one at a time is slow because most of the time is spent waiting on the network. Async code lets you wait on many requests at once. The example uses a Semaphore (a counter that caps how many requests run in parallel — here 10) so you go fast without flooding the site.
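import asyncio
import aiohttp
async def optimized_scraper(urls):
    # One shared semaphore caps concurrency at 10 requests in flight.
    # (Create it with a plain assignment: "async with Semaphore() as sem" binds None.)
    sem = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        tasks = [
            asyncio.create_task(bounded_fetch(url, session, sem))
            for url in urls
        ]
        return await asyncio.gather(*tasks)
async def bounded_fetch(url, session, sem):
    # Each fetch holds one semaphore slot while its request is running.
    async with sem:
        async with session.get(url) as response:
            return await response.text()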
### Data Management
1. Storage Best Practices
Scraped pages are messy, so check and tidy each record before saving it. The manager below only stores data that passes validation, and trims stray whitespace from text fields first.
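# Database is a placeholder for your own storage layer.
class DataManager:
    def __init__(self):
        self.db = Database()
        self.validator = DataValidator()
    def store_data(self, data):
        # validate_item is the check DataValidator exposes
        if self.validator.validate_item(data):
            self.db.insert(self.clean_data(data))
    def clean_data(self, data):
        # Trim stray whitespace from string fields; leave other types untouched
        return {
            key: value.strip() if isinstance(value, str) else value
            for key, value in data.items()
        }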
2. Validation & Cleaning
Validation is your early warning that a page changed or returned junk. This checker rejects a record if required fields are missing, the URL is malformed, or the timestamp isn't a number.
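from urllib.parse import urlparse
class DataValidator:
    def validate_item(self, item):
        required_fields = ['title', 'url', 'timestamp']
        # Check required fields
        if not all(field in item for field in required_fields):
            return False
        # Validate URL format
        if not self.is_valid_url(item['url']):
            return False
        # Validate data types
        if not isinstance(item['timestamp'], (int, float)):
            return False
        return True
    def is_valid_url(self, url):
        # Minimal check: a string with an http(s) scheme and a host
        if not isinstance(url, str):
            return False
        parts = urlparse(url)
        return parts.scheme in ('http', 'https') and bool(parts.netloc)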
### Security Considerations
1. Authentication Handling
If a site needs a login, handle credentials carefully and keep the connection encrypted. The example posts login details over HTTPS with verify=True, which checks the site's SSL certificate (SSL/TLS is the encryption behind https) so you aren't tricked into talking to an imposter server.
import requests
class SecureScraper:
    def __init__(self):
        self.session = requests.Session()
        # load_credentials is your own method, e.g. reading environment variables
        self.credentials = self.load_credentials()
    def login(self):
        return self.session.post(
            'https://example.com/login',
            data=self.credentials,
            headers={'User-Agent': 'SecureBot/1.0'},
            verify=True  # SSL verification (the requests default, shown for emphasis)
        )
2. Data Protection
If you collect anything sensitive, encrypt it before it touches disk. Here Fernet (a ready-made symmetric encryption helper from Python's cryptography library) scrambles the data with a secret key so a leaked database is useless without that key.
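import json
from cryptography.fernet import Fernet
class DataProtection:
    def __init__(self):
        # load_key and Database are placeholders; keep the key outside the database
        self.encryption_key = load_key()
        self.db = Database()
    def store_sensitive_data(self, data):
        encrypted_data = self.encrypt_data(data)
        self.db.store(encrypted_data)
    def encrypt_data(self, data):
        # Serialise to JSON bytes, then encrypt with the symmetric key
        return Fernet(self.encryption_key).encrypt(
            json.dumps(data).encode()
        )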
### Monitoring & Maintenance
1. Health Checks
A scraper can quietly break when a site changes its layout, so watch its vital signs. The monitor below tracks memory use, success rate, response time, and recent errors, and fires an alert when something looks wrong.
class ScraperMonitor:
    def check_health(self):
        metrics = {
            'memory_usage': self.get_memory_usage(),
            'success_rate': self.calculate_success_rate(),
            'average_response_time': self.get_avg_response_time(),
            'errors_last_hour': self.count_recent_errors()
        }
        if self.should_alert(metrics):
            self.send_alert(metrics)
2. Logging Best Practices
Good logs are how you find out what went wrong after the fact. This setup timestamps every message and writes it both to a file (scraper.log) and to the screen, so you can debug live or review later.
import logging
def setup_logging():
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler('scraper.log'),
            logging.StreamHandler()
        ]
    )
Remember: Good web scraping practices ensure sustainability, reliability, and respect for web resources while maintaining high-quality data collection.
### FAQ
**Q: Is web scraping legal?**
Scraping data that is already public is generally allowed, but it depends on where you are (jurisdiction), the site's Terms of Service, and what kind of data it is — personal data carries extra obligations. When in doubt, read the site's terms and check the law that applies to you.
**Q: How do I avoid overloading a target site?**
Slow down your request rate, add randomised delays between requests, scrape during off-peak hours, and cache responses so you never re-download the same page when you don't have to.
**Q: How do I keep a scraper from breaking?**
Add retries with exponential backoff (wait longer after each failed attempt), watch for layout changes on the pages you scrape, validate the fields you extract, and set up alerts for sudden drops in your success rate.
---
## How to Scrape JavaScript-Rendered Pages With Python (2026 Guide)
URL: https://scrappey.com/qa/python-web-scraping/scrape-javascript-rendered-pages-python
**To scrape a JavaScript-rendered page in Python you need something that executes the page’s JavaScript before you read the HTML.** A plain requests.get() only returns the initial HTML the server sends, which on a modern single-page app is an almost empty shell — the real content is injected later by JavaScript running in a browser. The three reliable fixes are: drive a real browser with Playwright or Selenium, or skip the browser entirely and call the JSON API the page itself calls.
### Quick facts
- **Why it happens:** Content is rendered client-side; the server returns an empty HTML shell
- **How to detect it:** View source shows no data, but the rendered page (DevTools Elements) does
- **Best tool (2026):** Playwright — auto-waiting, modern API, harder to detect than Selenium
- **Fastest method:** Call the underlying JSON/XHR API directly (no browser needed)
- **Avoid:** requests-html and Pyppeteer — both effectively unmaintained
### Why Requests + BeautifulSoup returns an empty page
When you scrape a JavaScript-heavy site with the usual stack, you often get nothing back:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://quotes.toscrape.com/js/')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select('.quote')) # => [] (empty!)
The list is empty even though the page clearly shows quotes in your browser. The reason: requests downloads only the HTML the server sends, and on a client-side-rendered page that HTML is a near-empty skeleton plus a bundle of JavaScript. The quotes only appear after that JavaScript runs in a browser and fetches the data. requests never runs JavaScript, so it never sees them.
**Quick test:** right-click → *View Page Source* (the raw HTML requests sees). If your data is missing there but present in the *Elements* tab of DevTools (the rendered DOM), the page is JavaScript-rendered and you need one of the methods below.
### Method 1: Playwright (recommended in 2026)
Playwright drives a real Chromium/Firefox/WebKit browser, so the JavaScript runs exactly as it would for a human. It has auto-waiting built in (no manual sleep() calls) and a cleaner API than Selenium. Install it once with pip install playwright then playwright install chromium.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://quotes.toscrape.com/js/')
    # Wait for the selector to appear after JS renders it.
    page.wait_for_selector('.quote')
    # Triple quotes let the JavaScript snippet span several lines.
    quotes = page.eval_on_selector_all(
        '.quote',
        '''els => els.map(e => ({
            text: e.querySelector(".text").innerText,
            author: e.querySelector(".author").innerText
        }))'''
    )
    for q in quotes:
        print(q['author'], '-', q['text'])
    browser.close()
Playwright also exposes the rendered HTML via page.content() if you prefer to hand it to BeautifulSoup. Use page.wait_for_selector() or page.wait_for_load_state('networkidle') instead of fixed delays so the script is both faster and more reliable.
### Method 2: Selenium
Selenium is the older, most widely documented option. Since Selenium 4.6, **Selenium Manager** downloads the matching browser driver automatically — you no longer manage chromedriver by hand. Install with pip install selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
opts = Options()
opts.add_argument('--headless=new')
driver = webdriver.Chrome(options=opts) # driver auto-managed
try:
    driver.get('https://quotes.toscrape.com/js/')
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.quote'))
    )
    for el in driver.find_elements(By.CSS_SELECTOR, '.quote'):
        text = el.find_element(By.CSS_SELECTOR, '.text').text
        author = el.find_element(By.CSS_SELECTOR, '.author').text
        print(author, '-', text)
finally:
    driver.quit()
Selenium works, but it is heavier and easier for anti-bot systems to detect (it leaks navigator.webdriver and other automation signals). For new projects, Playwright is the better default; keep Selenium for code that already depends on it.
### Method 3: Call the hidden JSON API directly (fastest)
Here is the trick most tutorials skip. A JavaScript page does not invent its data — it *fetches* it from a backend API, usually as JSON. If you call that endpoint directly, you get clean structured data with no browser at all: far faster and lighter than Playwright or Selenium.
Open DevTools → **Network** tab → filter by **Fetch/XHR** → reload the page and watch the requests. Find the one returning your data and copy its URL:
import requests
# The endpoint the page's own JavaScript calls (found in the Network tab).
api = 'https://quotes.toscrape.com/api/quotes'
page = 1
# Pagination is usually just a query parameter:
while True:
    data = requests.get(api, params={'page': page}, timeout=10).json()
    for q in data['quotes']:
        print(q['author']['name'], '-', q['text'])
    # Stop once the API reports there is no further page.
    if not data.get('has_next'):
        break
    page = data['page'] + 1
When it works, this is always the best option — no rendering overhead, structured JSON, trivial pagination. Watch for endpoints that require headers, a token, or a signature; copy those from the Network request too. If the API is locked behind anti-bot protection, fall through to the browser methods or a scraping API.
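The pagination pattern can be pulled into a reusable generator. This is a sketch under assumptions: the quotes and has_next field names mirror the example response above and will differ on other APIs, and fetch_page stands in for whatever function performs the real HTTP call. A dictionary plays the API here so the example runs offline.

```python
def paginate(fetch_page, start=1, max_pages=100):
    """Yield every item from a paginated JSON API.

    fetch_page(n) must return a dict shaped like the example response:
    {"quotes": [...], "has_next": bool}. max_pages is a safety stop.
    """
    page = start
    for _ in range(max_pages):
        data = fetch_page(page)
        yield from data.get("quotes", [])
        if not data.get("has_next"):
            break
        page += 1


# Stand-in for requests.get(...).json(), so the sketch runs offline.
FAKE_API = {
    1: {"quotes": ["a", "b"], "has_next": True},
    2: {"quotes": ["c"], "has_next": False},
}

items = list(paginate(FAKE_API.__getitem__))
print(items)
```

The max_pages cap is a guard against an endpoint that keeps reporting a next page; without it a buggy or hostile API could keep the loop running indefinitely.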
### Which method to use — and what about blocking
| Method | Speed | Runs JS | Best for |
| --- | --- | --- | --- |
| Hidden JSON API | Fastest | No (not needed) | When you can find the endpoint |
| Playwright | Medium | Yes | Modern SPAs, the default browser choice |
| Selenium | Slow | Yes | Legacy projects already on Selenium |
All three break the same way: the site *blocks* you. Headless browsers are detectable (the Cloudflare and DataDome challenge pages render no useful HTML), and hidden APIs are often guarded by the same fingerprinting. Rendering the JavaScript is only half the battle; passing the anti-bot check is the other half.
A managed scraping API like Scrappey renders the JavaScript *and* handles proxies, fingerprinting, and CAPTCHAs in one call, returning the fully rendered HTML — no browser to run or detect:
### Example
```python
import requests
# Render JS + pass anti-bot in one request. The API runs a real browser
# server-side and returns the fully rendered HTML.
resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={
        'cmd': 'request.get',
        'url': 'https://quotes.toscrape.com/js/',
    },
    timeout=120,
)
html = resp.json()['solution']['response']
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for q in soup.select('.quote'):
    print(q.select_one('.author').text, '-', q.select_one('.text').text)
```
### FAQ
**Q: Why does requests return an empty page for some sites?**
Because those pages are rendered client-side. The server sends a near-empty HTML shell plus JavaScript, and the actual content is only added after that JavaScript runs in a browser. requests never executes JavaScript, so it only ever sees the empty shell. You need a browser engine (Playwright or Selenium) or you can call the JSON API the page fetches its data from.
**Q: Is Playwright or Selenium better for JavaScript-rendered pages?**
For new projects in 2026, Playwright is the better default: it has built-in auto-waiting, a cleaner API, supports Chromium/Firefox/WebKit, and is somewhat harder to detect. Selenium is still fine if you already have a codebase built on it, and since Selenium 4.6 it auto-manages the browser driver. Avoid requests-html and Pyppeteer — both are effectively unmaintained.
**Q: How do I find the hidden API a JavaScript page uses?**
Open your browser DevTools, go to the Network tab, filter by Fetch/XHR, and reload the page. Look for the request that returns your data (usually JSON). Copy its URL, method, and any required headers or tokens, then replicate it with requests. This is the fastest method because it skips browser rendering entirely and returns structured data.
**Q: My headless browser still gets blocked — what now?**
Rendering JavaScript does not, on its own, satisfy anti-bot detection. Headless browsers leak automation signals (navigator.webdriver, fingerprint mismatches) that Cloudflare, DataDome, and Akamai flag, returning a challenge page with no real content. You need realistic fingerprints and residential proxies — or route the request through a scraping API that does all of that server-side and returns the rendered HTML.
---
## How to Parse HTML in Python (2026 Guide)
URL: https://scrappey.com/qa/python-web-scraping/python-parse-html
**To parse HTML in Python you load the markup into a parser that turns it into a navigable tree, then select the elements you want with CSS selectors or XPath.** The most popular parser is BeautifulSoup for its forgiving, beginner-friendly API; lxml is the fast workhorse with full XPath support; and selectolax is the fastest option for high-volume parsing. The standard library also ships html.parser, but a dedicated library is almost always the better choice.
### Quick facts
- **Easiest to learn:** BeautifulSoup (beautifulsoup4) — forgiving, readable API
- **Fastest with XPath:** lxml — C-backed, supports CSS and XPath
- **Fastest overall:** selectolax — Modest/Lexbor engine, ideal at scale
- **No install needed:** html.parser (stdlib) — basic, slower, less robust
- **Selectors:** CSS selectors (all) or XPath (lxml, parsel)
### HTML is a tree
Before parsing, it helps to see HTML as what it is: a nested **tree** of elements. <html> contains <body>, which contains <div>s, which contain <p>s and <a>s. A parser reads the raw markup string and builds this tree (the DOM) in memory, so instead of hunting through text with fragile string operations or regex, you *navigate*: "find every <a> inside the element with class product."
You select nodes in that tree two ways: **CSS selectors** (div.product > a) which every library below supports, or **XPath** (//div[@class="product"]/a) which lxml and parsel support and which can do things CSS cannot, like selecting by visible text or walking back up to a parent. See our XPath for web scraping guide for the full syntax.
**Do not parse HTML with regular expressions.** HTML is not a regular language; nested tags, optional attributes, and broken markup will break any regex eventually. Use a real parser.
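A short illustration of why, using only the standard library. The markup and the regex are made up for the demonstration: the pattern is the kind people typically write, and it silently misses links as soon as the attribute order or quoting changes, while a parser built on html.parser still finds them.

```python
import re
from html.parser import HTMLParser

HTML = '<div class="product"><a class="buy" href="/p/1">one</a><a href=/p/2>two</a></div>'

# A typical regex: assumes href comes first and is double-quoted.
regex_links = re.findall(r'<a href="([^"]*)"', HTML)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whatever the attribute order or quoting."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(HTML)

print(regex_links)      # the regex matches neither link
print(collector.links)  # the parser finds both
```

The regex could be patched for these two cases, but each patch invites the next exception; a parser handles the whole class of variations at once.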
### BeautifulSoup — the beginner-friendly default
BeautifulSoup wraps a parser (use lxml as the backend for speed) in a famously forgiving API. Install with pip install beautifulsoup4 lxml.
from bs4 import BeautifulSoup
html = '''
<div class="product">
<h2>Wireless Mouse</h2>
<span class="price">$24.99</span>
<a href="/p/123">details</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# By tag, by class, by CSS selector:
print(soup.h2.text) # Wireless Mouse
print(soup.find('span', class_='price').text) # $24.99
print(soup.select_one('.product a')['href']) # /p/123
# Loop over many elements:
for a in soup.select('a[href]'):
    print(a['href'], a.text.strip())
Key methods: find() / find_all() for tag-and-attribute matching, select() / select_one() for CSS selectors, .text for inner text, and ['attr'] for attributes. It tolerates broken, real-world HTML gracefully, which is why it is the most common starting point.
### lxml — fast parsing with XPath
lxml is a C-backed library that is both fast and the standard way to use XPath in Python. Install with pip install lxml cssselect; the separate cssselect package is what powers the .cssselect() method.
from lxml import html as lxml_html
doc = lxml_html.fromstring(html) # same markup as above
# CSS selectors:
print(doc.cssselect('.product h2')[0].text_content()) # Wireless Mouse
# XPath — selecting by attribute, text, or relationship:
print(doc.xpath('//span[@class="price"]/text()')[0]) # $24.99
print(doc.xpath('//a/@href')[0]) # /p/123
# XPath can select by visible text (CSS cannot):
link = doc.xpath('//a[contains(text(), "details")]/@href')
print(link) # ['/p/123']
Reach for lxml when you need XPath’s power (selecting by text, axes like parent:: / following-sibling::) or when BeautifulSoup is too slow. Many people use both: BeautifulSoup with features='lxml' gives the friendly API on top of the fast parser.
### selectolax — the fastest parser
When you are parsing millions of pages, parser speed matters. selectolax wraps the C-based Modest/Lexbor engine and is typically several times faster than lxml and an order of magnitude faster than BeautifulSoup. Install with pip install selectolax.
from selectolax.parser import HTMLParser
tree = HTMLParser(html) # same markup as above
print(tree.css_first('h2').text()) # Wireless Mouse
print(tree.css_first('.price').text()) # $24.99
print(tree.css_first('.product a').attributes['href']) # /p/123
for node in tree.css('a[href]'):
    print(node.attributes['href'], node.text(strip=True))
The API is CSS-selector based (css() / css_first()). It does not support XPath, but for the common case of CSS-selecting at high volume it is the fastest option in Python. Most competitor guides never mention it — it is the easiest performance win at scale.
### Which parser should you use?
| Parser | Speed | Selectors | Best for | Install |
| --- | --- | --- | --- | --- |
| BeautifulSoup | Slowest | CSS, find() | Learning, readable code, messy HTML | beautifulsoup4 |
| lxml | Fast | CSS + XPath | XPath power, speed | lxml |
| selectolax | Fastest | CSS | High-volume parsing | selectolax |
| html.parser | Slow | via BeautifulSoup | Zero dependencies | stdlib |
| parsel | Fast | CSS + XPath | Scrapy projects | parsel |
**Rule of thumb:** start with BeautifulSoup (on the lxml backend) for readability, switch to lxml directly when you need XPath, and to selectolax when speed at scale matters.
One caveat that no parser fixes: **you can only parse HTML you actually received.** If the page is JavaScript-rendered, the data will not be in the HTML at all, and if the site blocks you, you will be parsing a CAPTCHA or 403 page. A scraping API like Scrappey returns the fully rendered HTML so your parser always has real markup to work with.
### Example
```python
# Get real, rendered HTML first -- then parse it with any library.
import requests
from bs4 import BeautifulSoup
resp = requests.post(
    'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY',
    json={'cmd': 'request.get', 'url': 'https://example.com/products'},
    timeout=120,
)
html = resp.json()['solution']['response']
soup = BeautifulSoup(html, 'lxml')
for card in soup.select('.product'):
    name = card.select_one('h2').get_text(strip=True)
    price = card.select_one('.price').get_text(strip=True)
    print(name, price)
```
### FAQ
**Q: What is the best way to parse HTML in Python?**
For most people, BeautifulSoup with the lxml backend is the best starting point — it has a forgiving, readable API and handles messy real-world HTML well. Use lxml directly when you need XPath, and selectolax when you need maximum speed for high-volume parsing. Avoid parsing HTML with regular expressions; HTML is not a regular language and regex breaks on nested or malformed markup.
**Q: Is lxml faster than BeautifulSoup?**
Yes. lxml is C-backed and significantly faster than BeautifulSoup’s default parser. In fact, BeautifulSoup can use lxml as its backend (BeautifulSoup(html, "lxml")), giving you the friendly API on top of the fast parser. For the absolute fastest parsing, selectolax (Modest/Lexbor engine) is typically several times faster than lxml.
**Q: Should I use CSS selectors or XPath?**
CSS selectors are shorter and supported by every parser, so use them for most selecting. Switch to XPath when you need something CSS cannot do: selecting an element by its visible text, walking up to a parent, or selecting previous siblings. lxml and parsel support XPath; BeautifulSoup and selectolax do not (BeautifulSoup offers find() / find_all() alongside CSS selectors, while selectolax is CSS-only).
**Q: Why is my parsed data empty even though I see it in the browser?**
Almost always because the page is JavaScript-rendered: the data is injected by JavaScript after the page loads, so it is not in the HTML that requests downloaded. Parsing cannot recover data that was never in the markup. Render the page with Playwright or Selenium, call the underlying JSON API, or use a scraping API that returns the fully rendered HTML.
---
# Web Technologies
The web protocols and primitives every scraper developer should understand — HTTP, cookies, and REST APIs.
## What is HTTP? (Complete Guide 2026)
URL: https://scrappey.com/qa/web-technologies/what-is-http
HTTP (HyperText Transfer Protocol) is the set of rules browsers and servers use to talk to each other on the web. Every time you load a page or call an API, your client sends an HTTP request and the server sends back a response. It is the foundation of how data moves across the web.
### Quick facts
- **Stands for:** HyperText Transfer Protocol
- **Model:** Stateless request–response
- **Methods:** GET, POST, PUT, DELETE…
- **Status codes:** 2xx ok, 4xx client, 5xx server
- **Secure variant:** HTTPS (TLS-encrypted)
### Core Concepts
HTTP works as a simple back-and-forth conversation: the client asks, the server answers. Three things define that conversation — the cycle itself, the method (what you want to do), and the status code (how it went).
1. Request-Response Cycle
One round trip, four steps:
- Client sends request
- Server processes request
- Server sends response
- Client receives response
2. HTTP Methods
The method is the verb of the request — it tells the server what action you want. GET reads, POST creates, and so on:
# Common HTTP Methods
GET /api/users # Retrieve data
POST /api/users # Create new data
PUT /api/users/123 # Update existing data
DELETE /api/users/123 # Remove data
PATCH /api/users/123 # Partial update
HEAD /api/status # Get headers only
OPTIONS /api/users # Get allowed methods
3. Status Codes
Every response carries a three-digit status code. The first digit tells you the family: 2xx worked, 3xx redirected, 4xx is your mistake, 5xx is the server's.
- **2xx Success**
- 200: OK
- 201: Created
- 204: No Content
- **3xx Redirection**
- 301: Moved Permanently
- 302: Found
- 304: Not Modified
- **4xx Client Errors**
- 400: Bad Request
- 401: Unauthorized
- 403: Forbidden
- 404: Not Found
- 429: Too Many Requests
- **5xx Server Errors**
- 500: Internal Server Error
- 502: Bad Gateway
- 503: Service Unavailable
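In scraper code the status family usually decides what happens next. A minimal sketch using the standard library's http.HTTPStatus follows; the function names and the retry policy (retry 429 and 5xx, nothing else) are assumptions for illustration, not a rule from the HTTP specification.

```python
from http import HTTPStatus


def status_family(code: int) -> str:
    """Map a status code to its family using the first digit."""
    families = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return families.get(code // 100, "other")


def should_retry(code: int) -> bool:
    """Assumed policy: retry rate limits (429) and server errors (5xx) only."""
    return code == HTTPStatus.TOO_MANY_REQUESTS or 500 <= code <= 599


for code in (200, 301, 404, 429, 503):
    print(code, HTTPStatus(code).phrase, status_family(code), should_retry(code))
```

Most other 4xx codes are left out of the retry set on purpose: a 403 or 404 will normally come back the same on the next attempt, so retrying only adds load.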
### Headers
Headers are key-value lines of metadata that travel with every request and response. They describe things like what format you want, who you are, and how the data may be cached — the actual content (if any) comes after them.
1. Common Request Headers
Sent by the client to describe the request and identify itself:
Accept: application/json
Authorization: Bearer token123
Content-Type: application/json
User-Agent: Mozilla/5.0
Cookie: session=abc123
2. Common Response Headers
Sent back by the server to describe the response:
Content-Type: application/json
Cache-Control: max-age=3600
Set-Cookie: session=abc123
Access-Control-Allow-Origin: *
### Security Features
Plain HTTP sends everything as readable text, so anyone between you and the server could read or change it. HTTPS and authentication close those gaps.
1. HTTPS
HTTPS is HTTP wrapped in TLS/SSL — the encryption layer behind the padlock in your browser. It scrambles the traffic and proves you are really talking to the right server:
- TLS/SSL encryption
- Certificate validation
- Secure communication
- Data privacy
2. Authentication Methods
Common ways to prove who you are so the server grants access:
- Basic Auth
- Bearer Tokens
- OAuth 2.0
- API Keys
### Best Practices
A few conventions make HTTP APIs predictable: design URLs around resources, return errors in a consistent shape, and tell clients what they can cache.
1. RESTful Design
REST means URLs name *things* (resources), and the method decides the action. Same URL, different verb, different result:
# Resource-based URLs
GET /api/articles # List articles
GET /api/articles/123 # Get specific article
POST /api/articles # Create article
PUT /api/articles/123 # Update article
DELETE /api/articles/123 # Delete article
2. Error Handling
Return a clear status code plus a structured body so callers can react in code:
{
  "error": {
    "code": 404,
    "message": "Resource not found",
    "details": "Article with ID 123 does not exist"
  }
}
3. Caching Strategies
Caching headers let clients reuse a previous response instead of re-downloading it. ETag is a fingerprint of the content; Last-Modified is its timestamp:
# Cache Control Headers
Cache-Control: public, max-age=3600
ETag: "33a64df551425fcc55e4d42a148795d9f25f89d4"
Last-Modified: Tue, 21 Oct 2025 07:28:00 GMT
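On the client side, these headers are used by sending them back on the next request: ETag goes into If-None-Match and Last-Modified into If-Modified-Since, and the server answers 304 Not Modified when nothing has changed. The helper names below are hypothetical, and the sketch only builds headers and chooses a body; no network call is made.

```python
def conditional_headers(cached):
    """Build revalidation headers from a previously cached response's headers."""
    headers = {}
    if "ETag" in cached:
        headers["If-None-Match"] = cached["ETag"]
    if "Last-Modified" in cached:
        headers["If-Modified-Since"] = cached["Last-Modified"]
    return headers


def pick_body(status, new_body, cached_body):
    """304 Not Modified carries no body, so reuse the cached copy."""
    return cached_body if status == 304 else new_body


cached = {
    "ETag": '"33a64df551425fcc55e4d42a148795d9f25f89d4"',
    "Last-Modified": "Tue, 21 Oct 2025 07:28:00 GMT",
}
print(conditional_headers(cached))
print(pick_body(304, "", "<html>cached</html>"))
```

For a scraper this saves bandwidth on both sides: an unchanged page costs one small header exchange instead of a full download.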
### Common Use Cases
Almost anything that moves data over the web runs on HTTP. Three everyday examples:
1. API Communication
Code talking to a service — here a Python script reads data and then sends an authenticated request:
import requests
# Making HTTP requests
response = requests.get('https://api.example.com/users')
data = response.json()
# Handling authentication
headers = {'Authorization': 'Bearer token123'}
response = requests.post('https://api.example.com/login', headers=headers)
2. Web Browsers
Everything a browser does is HTTP under the hood:
- Page loading
- Resource fetching
- Form submission
- AJAX requests
3. Web Services
Servers talking to each other:
- REST APIs
- Microservices
- Webhooks
- Server-side rendering
### Performance Tips
Two simple ideas make HTTP faster: avoid repeating work (reuse connections), and send less data over the wire.
1. Connection Management
Opening a connection is slow, so reuse and spread it:
- Keep-alive connections
- Connection pooling
- DNS caching
- Load balancing
2. Data Optimization
Shrink the payload before it travels:
- Compression (gzip)
- Minification
- Content negotiation
- Partial responses
### Debugging Tools
When a request misbehaves, you need to see exactly what was sent and received. Browser dev tools and command-line clients both let you inspect the raw traffic.
1. Browser Tools
The browser's developer tools show every request a page makes:
- Network inspector
- Request/response viewer
- Headers analyzer
- Performance metrics
2. Command Line
For quick tests without a browser, fire requests straight from the terminal:
# Using curl
curl -X GET https://api.example.com/users
# Using wget
wget https://api.example.com/data.json
# Using httpie
http GET api.example.com/users Authorization:"Bearer token123"
Remember: HTTP is the foundation of data communication on the web, and understanding its principles is crucial for web development and API integration.
### FAQ
**Q: What is the difference between HTTP and HTTPS?**
HTTPS is just HTTP run over TLS — the encryption layer that scrambles the connection and verifies the server. The rules and meaning of requests stay identical; only the transport is made private and tamper-resistant.
**Q: Why does HTTP being "stateless" matter for scraping?**
Stateless means each request stands alone — the server does not remember you from one request to the next. A logged-in "session" is faked by sending cookies and headers every time, so your scraper must replay them on each request to stay logged in.
**Q: What HTTP headers matter most when scraping?**
The big ones are User-Agent, Accept, Accept-Language, Referer, and Cookie. If they are missing or do not look consistent with a real browser, anti-bot systems often flag the request as automated.
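As a rough sketch of what "consistent with a real browser" can look like, the Python below builds a request with those headers set together. The header values and the build_request helper are illustrative assumptions, not values any particular site requires; copy real ones from your own browser's network tab.

```python
import urllib.request

# Illustrative header values only; they are not tied to any specific site.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}


def build_request(url, cookie_header=None):
    """Return a Request carrying one consistent set of browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    if cookie_header:
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


# Nothing is sent here; the request object is only constructed.
req = build_request("https://example.com/products", cookie_header="sessionId=abc123")
```

Keeping the headers in one place makes it harder to send a Chrome User-Agent alongside, say, a missing Accept-Language, which is the kind of mismatch anti-bot systems look for.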
---
## What are the 3 types of HTTP cookies? (2026 Guide)
URL: https://scrappey.com/qa/web-technologies/http-cookies
An HTTP cookie is a small piece of data a website asks your browser to store and then send back on every later request to that site. Because plain HTTP has no memory between requests, cookies are how a site remembers who you are - keeping you logged in, holding your cart, or tracking you across pages. The three main types are session, persistent, and third-party cookies. This 2026 guide explains each one.
### Quick facts
- **Session:** Cleared when the browser closes
- **Persistent:** Stored until an expiry date
- **Third-party:** Set by another domain (tracking)
- **Key attributes:** Secure, HttpOnly, SameSite
- **For scrapers:** Persist cookies across requests
### 1. Session Cookies
A session cookie is temporary: it lives only while the browser is open and disappears the moment you close it. It is the right choice for short-lived state that should not stick around, like the fact that you are currently logged in.
Characteristics
- Temporary storage in browser memory
- Deleted automatically when browser closes
- No expiration date set
- Shorter-lived than persistent cookies, which limits exposure
- Shared across tabs and windows of the same browser profile
Implementation Example
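# Flask session cookie example
from flask import Flask, session
app = Flask(__name__)
app.secret_key = 'your-secret-key'  # signs the session cookie; keep it private
@app.route('/login')
def login():
    # Flask keeps the session in a signed cookie that has no expiry by default,
    # so it ends with the browser session
    session['user_id'] = '123'
    return 'Session cookie set'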
Common Use Cases
- User authentication sessions
- Shopping cart data
- Form wizard progress
- Temporary preferences
- Server-side session tracking
### 2. Persistent Cookies
A persistent cookie is saved to disk and carries an explicit expiry date, so it survives closing the browser and restarting the computer. It stays until that date passes (which can be years away), letting a site remember you across visits.
Characteristics
- Stored on user's disk
- Survive browser restarts
- Have specific expiration date
- Can last for years
- Accessible until expiration
Implementation Example
// Setting a persistent cookie
document.cookie = 'username=john; max-age=31536000; path=/' // lasts one year from the moment it is set
// Reading persistent cookies
function getCookie(name) {
const value = `; ${document.cookie}`;
const parts = value.split(`; ${name}=`);
if (parts.length === 2) return parts.pop().split(';').shift();
}
Common Use Cases
- Remember me functionality
- Language preferences
- Theme settings
- User tracking
- Personalization features
### 3. Third-Party Cookies
A third-party cookie is set by a domain different from the site you are actually visiting - for example, an ad network or analytics service embedded in the page. Because the same outside domain can recognise you across many sites, these cookies are mainly used to track you across the web. Modern browsers increasingly block them, and privacy laws tightly restrict them.
Characteristics
- Set by domains other than current website
- Used for cross-site tracking
- Often blocked by modern browsers
- Subject to strict privacy laws
- Require explicit user consent in many regions
Implementation Example
<!-- Third-party cookie from ad network -->
<script async src="https://ad-network.com/tracker.js"></script>
<!-- Google Analytics loader (legacy analytics.js; its _ga cookies are set first-party on the site's own domain) -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
</script>
Common Use Cases
- Advertising tracking
- Analytics data collection
- Social media widgets
- Cross-site user tracking
- Retargeting campaigns
### Security Best Practices
Cookies often hold sensitive data like login sessions, so set them carefully. The flags below tell the browser how a cookie may be used.
1. Cookie Flags
Set-Cookie: sessionId=abc123; HttpOnly; Secure; SameSite=Strict
- **HttpOnly**: Prevents JavaScript access
- **Secure**: Only sent over HTTPS
- **SameSite**: Controls cross-site behavior
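- **Domain**: Sets which hosts receive the cookie (omit it and only the exact host that set it does; set it and subdomains are included)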
- **Path**: Restricts cookie access path
In short: **HttpOnly** hides the cookie from page scripts so a cross-site scripting attack cannot steal it; **Secure** sends it only over HTTPS (encrypted) connections; and **SameSite** limits whether it travels on requests coming from other sites.
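From the scraper's side you read these flags instead of setting them. The sketch below parses a Set-Cookie value with Python's standard http.cookies module; cookie_flags is a hypothetical helper name, and it assumes the header contains a single cookie.

```python
from http.cookies import SimpleCookie


def cookie_flags(set_cookie_value):
    """Report the security flags carried by one Set-Cookie header value."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    morsel = next(iter(parsed.values()))  # assumes exactly one cookie in the header
    return {
        "name": morsel.key,
        "http_only": bool(morsel["httponly"]),
        "secure": bool(morsel["secure"]),
        "same_site": morsel["samesite"] or None,
    }


flags = cookie_flags("sessionId=abc123; HttpOnly; Secure; SameSite=Strict")
```

Note that a scraper working at the HTTP level still sees HttpOnly cookies; the flag only hides them from JavaScript running in a page.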
2. Implementation Guidelines
# Secure cookie setting in Python/Flask
from flask import Flask, make_response
app = Flask(__name__)
@app.route('/set-cookie')
def set_secure_cookie():
    resp = make_response('Cookie set')
    resp.set_cookie(
        'user_id',
        'abc123',
        secure=True,        # HTTPS only
        httponly=True,      # hidden from JavaScript
        samesite='Strict',  # not sent on cross-site requests
        max_age=3600        # 1 hour
    )
    return resp
3. Privacy Considerations
- Implement cookie consent
- Respect user preferences
- Minimize data collection
- Follow GDPR guidelines
- Regular cookie cleanup
### Debugging Tools
When cookies misbehave, inspect them from both sides: the browser (what the client stored) and the server (what it received).
1. Browser DevTools
// Console commands for cookie management
// List all cookies
console.log(document.cookie)
// Clear cookies
document.cookie.split(';').forEach(cookie => {
document.cookie = cookie.replace(/^ +/, '').replace(/=.*/, '=;expires=' + new Date().toUTCString() + ';path=/');
});
2. Server-Side Inspection
# Flask route to inspect cookies (debug only; never expose this in production)
from flask import request, session
@app.route('/debug/cookies')
def debug_cookies():
    return {
        'cookies': dict(request.cookies),
        'session': dict(session),
        'headers': dict(request.headers)
    }
Remember: Always handle cookies with security in mind and respect user privacy preferences. Stay updated with the latest browser policies and privacy regulations regarding cookie usage.
### FAQ
**Q: Why do scrapers need to handle cookies?**
Sites store login state, anti-bot clearance tokens, and session IDs in cookies. If your scraper drops them, you look like a brand-new visitor on every request, which often triggers blocks.
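A minimal sketch of that replay step, using only Python's standard library: collect the Set-Cookie values a response sent and turn them into the Cookie header for the next request. The cookie names and values are made up, and cookie_header_from is a hypothetical helper that ignores domain, path, and expiry rules.

```python
from http.cookies import SimpleCookie


def cookie_header_from(set_cookie_values):
    """Turn a list of Set-Cookie header values into one Cookie request header."""
    jar = {}
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            jar[name] = morsel.value  # a later value replaces an earlier one
    return "; ".join(f"{name}={value}" for name, value in jar.items())


# Hypothetical response headers from a login step.
received = [
    "sessionId=abc123; Path=/; HttpOnly; Secure",
    "clearance=xyz789; Max-Age=1800; Path=/",
]
replay = cookie_header_from(received)
```

For real crawls, a proper cookie jar (for example http.cookiejar.CookieJar, or the session object of your HTTP library) applies the scoping and expiry rules this sketch skips.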
**Q: What is the difference between session and persistent cookies?**
Session cookies live only until the browser (or session) ends; persistent cookies carry an expiry date and survive restarts. Anti-bot clearance is often a short-lived persistent cookie.
**Q: Are third-party cookies relevant to scraping?**
Rarely for the actual data extraction, but they power tracking and some anti-bot vendors. Safari and Firefox already block or isolate them by default, so depending on them is fragile.
---
## What is a REST API? (Complete Guide 2026)
URL: https://scrappey.com/qa/web-technologies/rest-api-explained
A REST API is a standard way for programs to read and change data over the web using ordinary HTTP requests. This is the complete 2026 guide.
### Quick facts
- **REST:** Representational State Transfer
- **Transport:** HTTP verbs on resource URLs
- **Format:** Usually JSON
- **Stateless:** Each request is self-contained
- **Auth:** API keys, OAuth, tokens
### What is REST?
REST (Representational State Transfer) is a set of conventions for building web APIs. Instead of inventing a custom protocol, you expose your data as "resources" (like users or posts) and let clients act on them with normal HTTP requests. Because it reuses HTTP, REST is simple to build, easy to scale, and works with any language that can make a web request.
### Core Principles
1. Stateless Communication
- Each request contains all necessary information
- No client context stored on server
- Improves scalability and reliability
- Easier to cache and debug
Stateless means the server remembers nothing between requests: every call must carry everything needed to handle it (like a login token). Any server in a pool can answer any request, which is why this scales well.
2. Standard HTTP Methods
Each HTTP method maps to a basic data operation, summarized as CRUD (Create, Read, Update, Delete):
# CRUD Operations
GET /api/users # Read users
POST /api/users # Create user
PUT /api/users/1 # Update user
DELETE /api/users/1 # Delete user
# Additional Methods
PATCH /api/users/1 # Partial update
HEAD /api/users # Get headers only
### Implementation Examples
Here is a minimal API written with Flask, a small Python web framework. Each function below is an "endpoint" — a URL the client can call.
1. Basic REST API in Python
from flask import Flask, jsonify, request
app = Flask(__name__)
users = []  # in-memory store for the example; a real API would use a database
# GET endpoint
@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify({
        'users': users,
        'total': len(users)
    })
# POST endpoint
@app.route('/api/users', methods=['POST'])
def create_user():
    user = request.json
    users.append(user)
    return jsonify(user), 201
# PUT endpoint
@app.route('/api/users/<int:user_id>', methods=['PUT'])
def update_user(user_id):
    # Find the first user with a matching id, or None if there is no match
    user = next((u for u in users if u['id'] == user_id), None)
    if user:
        user.update(request.json)
        return jsonify(user)
    return jsonify({'error': 'User not found'}), 404
2. Response Formats
Responses are usually JSON. A good API keeps a consistent shape: data on success, a clear error object on failure.
// Success Response
{
  "data": {
    "id": 1,
    "name": "John Doe",
    "email": "john.doe@example.com"
  },
  "meta": {
    "timestamp": "2025-01-20T10:00:00Z"
  }
}
// Error Response
{
"error": {
"code": "NOT_FOUND",
"message": "User not found",
"details": "No user exists with ID 123"
}
}
### Best Practices
1. URL Structure
Treat each URL as a path to a resource. Use plural nouns, nest related resources, and put options in query parameters (the part after ?):
# Resource Hierarchy
/api/v1/users # User collection
/api/v1/users/{id} # Specific user
/api/v1/users/{id}/posts # User's posts
/api/v1/users/{id}/posts/{id} # Specific post
# Query Parameters
/api/v1/users?role=admin # Filtering
/api/v1/users?sort=name # Sorting
/api/v1/users?page=2&limit=10 # Pagination
2. Authentication
Authentication proves who is calling. A common approach is JWT (JSON Web Token) — the server hands back a signed token at login, and the client sends it on every later request to prove access:
# JWT Authentication Example
from flask_jwt_extended import jwt_required, create_access_token
@app.route('/api/login', methods=['POST'])
def login():
username = request.json.get('username')
password = request.json.get('password')
if authenticate_user(username, password):
access_token = create_access_token(identity=username)
return jsonify({'token': access_token})
return jsonify({'error': 'Invalid credentials'}), 401
@app.route('/api/protected', methods=['GET'])
@jwt_required()
def protected_route():
return jsonify({'message': 'Access granted'})
3. Rate Limiting
Rate limiting caps how many requests a caller can make in a given window, protecting the server from overload or abuse:
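from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
# Flask-Limiter 3.x: key_func is the first argument, app is passed by keyword
limiter = Limiter(
    get_remote_address,
    app=app,
    default_limits=["200 per day", "50 per hour"]
)
@app.route('/api/users')
@limiter.limit("1 per second")
def get_users():
    return jsonify(users)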
### Common Features
1. Pagination
When a collection is large, return it in pages instead of all at once. The client asks for a page number and size; the server returns that slice plus totals:
from math import ceil
@app.route('/api/users')
def get_users():
    # Clamp both values so page and limit are always at least 1
    page = max(int(request.args.get('page', 1)), 1)
    limit = max(int(request.args.get('limit', 10)), 1)
    start = (page - 1) * limit
    end = start + limit
    return jsonify({
        'data': users[start:end],
        'meta': {
            'total': len(users),
            'page': page,
            'limit': limit,
            'pages': ceil(len(users) / limit)
        }
    })
2. Filtering and Sorting
Let clients narrow and order results through query parameters, so they fetch only what they need:
@app.route('/api/users')
def get_users():
    # Start from a copy so the list exists even when no filter is given
    filtered_users = list(users)
    # Filtering
    role = request.args.get('role')
    if role:
        filtered_users = [u for u in filtered_users if u.get('role') == role]
    # Sorting
    sort_by = request.args.get('sort')
    if sort_by:
        filtered_users.sort(key=lambda x: x[sort_by])
    return jsonify(filtered_users)
### Security Considerations
1. Input Validation
Never trust incoming data. Validate it against a schema (a definition of allowed fields and types) before using it, and reject anything malformed:
from marshmallow import Schema, fields, ValidationError
class UserSchema(Schema):
    name = fields.Str(required=True)
    email = fields.Email(required=True)
    age = fields.Int(validate=lambda n: n >= 0)
@app.route('/api/users', methods=['POST'])
def create_user():
    schema = UserSchema()
    try:
        data = schema.load(request.json)
        # Process validated data
        return jsonify(data), 201
    except ValidationError as err:
        # err.messages maps each invalid field to its error list
        return jsonify(err.messages), 400
2. CORS Handling
CORS (Cross-Origin Resource Sharing) is the browser rule that controls which websites may call your API from their own pages. Configure it to allow only the origins, methods, and headers you trust:
from flask_cors import CORS
# Configure CORS
CORS(app, resources={
r"/api/*": {
"origins": ["https://allowed-domain.com"],
"methods": ["GET", "POST", "PUT", "DELETE"],
"allow_headers": ["Content-Type", "Authorization"]
}
})
Remember: A well-designed REST API should be intuitive, consistent, and secure while following established conventions and best practices.
### FAQ
**Q: Should I use a REST API instead of scraping?**
If an official API gives you the data you need, use it — APIs are more stable, faster, and explicitly allowed. Reach for scraping only when no API exists, or when the API leaves out data you actually need.
**Q: What makes an API "RESTful"?**
Four things: each resource has its own URL, you act on it with standard HTTP verbs (GET/POST/PUT/DELETE), every request is stateless (self-contained), and the server returns meaningful status codes. Responses are conventionally sent as JSON.
**Q: How do I find a site's hidden API?**
Open your browser's network tab and watch the XHR/fetch requests (the background calls a page makes) as it loads. Many sites fill their pages from internal JSON endpoints that you can often call directly yourself.
---
## IPv4 vs IPv6
URL: https://scrappey.com/qa/web-technologies/ipv4-vs-ipv6
**IPv4 and IPv6 are the two versions of the Internet Protocol that give every device online an address.** Think of an IP address like a postal address for a computer: it's how traffic knows where to go. IPv4 uses 32-bit addresses — about 4.3 billion of them, now effectively used up — written as dotted decimals like 192.0.2.1. IPv6 uses much longer 128-bit addresses, giving a practically unlimited pool, written in hexadecimal like 2001:db8::1. Both deliver traffic across the internet. They differ in address format and availability, and — crucially for scraping — in how much websites and anti-bot systems trust them.
### Quick facts
- **IPv4 address:** 32-bit, dotted decimal (192.0.2.1)
- **IPv6 address:** 128-bit, hex with colons (2001:db8::1)
- **Address space:** IPv4 ~4.3 billion (exhausted) vs IPv6 ~340 undecillion
- **Adoption:** IPv4 universal; IPv6 ~40%+ of traffic and rising
- **Scraping relevance:** IPv4 residential is more widely trusted; IPv6 /64 ranges are easier to block in bulk
### The key differences between IPv4 and IPv6
The headline difference is size. IPv4's 32-bit space caps out at ~4.3 billion addresses, which the internet ran out of years ago. IPv6's 128-bit space is effectively limitless. That scarcity is why IPv4 leans heavily on NAT (Network Address Translation: many devices sharing one public address, like an office full of phones behind a single front-desk number), whereas IPv6 can give every device its own unique public address. The notation differs too: IPv4 is four dotted decimal octets, IPv6 is eight hex groups separated by colons (with :: as shorthand for a single run of all-zero groups). IPv6 also simplified the packet header and builds in features that were add-ons in IPv4, such as stateless autoconfiguration (devices set up their own address automatically). IPsec (encryption and authentication at the IP layer) was designed alongside IPv6 and was originally mandatory to implement; current specifications only recommend it, and its availability does not mean traffic is encrypted by default.
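The size and notation differences are easy to check with Python's standard ipaddress module. This is a small sketch using the documentation addresses from the text, not real hosts.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

# max_prefixlen is the address width in bits, so 2**width is the space size.
v4_total = 2 ** v4.max_prefixlen   # 32-bit space
v6_total = 2 ** v6.max_prefixlen   # 128-bit space

# "::" is shorthand; the exploded form shows all eight hex groups.
v6_full = v6.exploded

print(v4.version, v4_total)
print(v6.version, v6_full)
```

The same module parses either version from a string, which is handy when a proxy list mixes both.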
### Why IPv6 exists
IPv6 was created for one main reason: to solve IPv4 address exhaustion. As billions of phones, servers, and IoT devices (everyday internet-connected gadgets) came online, the 4.3-billion IPv4 ceiling became a hard limit — kept alive only by NAT and a secondary market where IPv4 address blocks are bought and sold. IPv6's enormous space removes that constraint. Adoption is gradual because the two protocols can't talk to each other directly. So the internet runs both side by side (called dual-stack) during the long transition, which means most networks still need working IPv4 alongside IPv6.
### IPv4 vs IPv6 for proxies and web scraping
For scraping, the practical question isn't which protocol is 'better' — it's which one your target sites trust. Many sites and anti-bot vendors still treat IPv4 residential addresses (real home internet connections) as the most human-looking. Large blocks of IPv6 are easier to fingerprint and ban in bulk: a single ISP can hand out a whole /64 (a huge range of addresses) to one customer, so anti-bot systems can block a suspicious IPv6 range wholesale. Datacenter IPv6 in particular is often distrusted. The upshot: for protected targets, residential proxies on IPv4 are usually the most reliable choice, while IPv6 can be fine for IPv6-only sites or high-volume, low-sensitivity crawling. If you're seeing blocks that line up with IP version, switching to trusted IPv4 residential addresses is the first thing to try.
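To make the bulk-blocking point concrete, here is a sketch of how a defender might group IPv6 clients by their /64 prefix, again with Python's ipaddress module. The addresses come from the reserved documentation range, and prefix_of is a hypothetical helper, not a description of how any specific anti-bot vendor works.

```python
import ipaddress
from collections import Counter


def prefix_of(address, prefix_len=64):
    """Return the enclosing network (default /64) for an IPv6 address string."""
    return ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)


# Sample data only: addresses from the 2001:db8::/32 documentation range.
clients = [
    "2001:db8:aaaa:1::10",
    "2001:db8:aaaa:1::beef",
    "2001:db8:aaaa:1:ffff::1",
    "2001:db8:bbbb:2::1",
]

# Count requests per /64 instead of per address.
hits_per_prefix = Counter(str(prefix_of(ip)) for ip in clients)
```

Three different-looking addresses collapse into one prefix, so rotating within a single /64 gains a scraper very little once that prefix is flagged.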
### FAQ
**Q: Is IPv6 better than IPv4?**
Technically yes — vastly more addresses, no need for NAT, and a cleaner packet header. But 'better' depends on what you're doing. For web scraping, IPv4 residential IPs are often more trusted by anti-bot systems, so the newer protocol isn't automatically the more effective one.
**Q: Should I use IPv4 or IPv6 proxies for scraping?**
For protected sites, prefer IPv4 residential proxies — they're the most widely trusted. Use IPv6 for IPv6-only targets, or for large-scale, low-sensitivity crawling where its huge, cheap address pool is an advantage.
**Q: Can a website block all of IPv6?**
It can block large IPv6 ranges easily. Because a single subscriber may control an entire /64 (a big chunk of addresses), anti-bot systems often ban at the prefix level, knocking out many addresses at once. That bulk-blockability is one reason IPv6 can be riskier for scraping.
**Q: What's the difference between IPv4 and IPv6 in one line?**
IPv4 = 32-bit addresses (~4.3 billion, now exhausted, written as dotted decimals); IPv6 = 128-bit addresses (near-limitless, written as hex with colons) created to replace it.
---
## How to Use Basic Auth with curl
URL: https://scrappey.com/qa/web-technologies/curl-basic-auth
**To send HTTP Basic Authentication with curl, use the -u (or --user) flag: curl -u username:password https://example.com.** curl Base64-encodes the username:password pair and adds an Authorization: Basic <token> header for you. That's the whole job — the rest is handling special characters, hiding the password from your shell history, and remembering that Basic Auth is only safe over HTTPS.
### Quick facts
- **Flag:** -u / --user
- **Syntax:** curl -u user:pass https://example.com
- **What it does:** Base64-encodes user:pass into an Authorization: Basic header
- **Manual header:** -H "Authorization: Basic <base64>"
- **Security:** Only over HTTPS — Basic Auth is encoded, not encrypted
### The -u flag (the normal way)
Pass the credentials as username:password after -u:
curl -u myuser:mypassword https://example.com/api/data
curl encodes them and sends Authorization: Basic bXl1c2VyOm15cGFzc3dvcmQ=. --user is the long form of -u and behaves identically. Basic is curl's default auth scheme, so you don't need --basic.
**Hide the password from your shell history.** Give only the username and curl will prompt for the password interactively (it won't be echoed or stored):
curl -u myuser https://example.com/api/data
# Enter host password for user 'myuser':
### Setting the Authorization header manually
If you already have the Base64 token (or want full control), skip -u and send the header yourself with -H:
# Build the token, then send it
TOKEN=$(printf 'myuser:mypassword' | base64)
curl -H "Authorization: Basic $TOKEN" https://example.com/api/data
This is exactly what -u produces — useful when the token is supplied by another system, or when you're translating a curl command into application code.
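When translating into application code, the header can be built by hand. A minimal Python sketch, where basic_auth_header is a hypothetical helper name and the credentials are the placeholder ones used above:

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP Basic Authentication."""
    pair = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(pair).decode("ascii")


header = basic_auth_header("myuser", "mypassword")
print(header)
```

Most HTTP libraries offer a built-in way to do this, so hand-rolling it is mainly useful for understanding or debugging what goes over the wire.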
### Special characters and spaces
If the username or password contains shell-special characters ($, !, spaces, @), wrap the pair in **single quotes** so your shell doesn't expand them:
curl -u 'new_user:my$ecret p@ss!' https://example.com
If the password itself contains a colon, only the first colon is treated as the separator, so a colon in the password is fine — but a colon in the *username* is not allowed by the Basic Auth spec.
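The first-colon rule is easy to see from the receiving side. This Python sketch decodes a Basic token and splits it the way a server is expected to; parse_basic_auth is a hypothetical helper and the credentials are invented.

```python
import base64


def parse_basic_auth(header_value):
    """Split an 'Authorization: Basic ...' value into (username, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic auth header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the two fields.
    username, _, password = decoded.partition(":")
    return username, password


# Round-trip a password that itself contains colons.
token = base64.b64encode(b"myuser:pa:ss:word").decode("ascii")
user, password = parse_basic_auth("Basic " + token)
```

Because everything after the first colon is treated as the password, a colon in the username would silently shift part of it into the password field.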
### Reusing credentials with .netrc
To keep credentials out of the command line entirely, store them in a ~/.netrc file and let curl read them with -n / --netrc:
# ~/.netrc (chmod 600 it)
machine example.com login myuser password mypassword
curl -n https://example.com/api/data
# or point at a specific file:
curl --netrc-file ./secrets.netrc https://example.com/api/data
### Security: only over HTTPS
**Basic Auth is encoded, not encrypted.** The Base64 token is trivially reversible. Over plain http:// anyone on the network can read your password. Always use an https:// endpoint so TLS encrypts the header in transit.
For scraping behind a login, Basic Auth is rarely the blocker — anti-bot systems care about your IP reputation and browser fingerprint, not your auth header. If an authenticated endpoint returns 403 or 429 even with correct credentials, the fix is proxies and a real browser profile, not the auth flag.
### Example
```bash
# 1) Simplest: -u user:password (curl adds the Authorization header)
curl -u myuser:mypassword https://example.com/api/data
# 2) Prompt for the password (keeps it out of shell history)
curl -u myuser https://example.com/api/data
# 3) Send the header yourself
curl -H "Authorization: Basic $(printf 'myuser:mypassword' | base64)" \
https://example.com/api/data
# 4) Special characters -> single-quote the pair
curl -u 'new_user:my$ecret p@ss!' https://example.com/api/data
```
### FAQ
**Q: How do I send basic auth with curl?**
Use the -u (or --user) flag: curl -u username:password https://example.com. curl Base64-encodes the pair and adds an Authorization: Basic header automatically. Use HTTPS so the credentials are encrypted in transit.
**Q: How do I set the Authorization header manually in curl?**
Build the Base64 token and pass it with -H: curl -H "Authorization: Basic $(printf 'user:pass' | base64)" https://example.com. This produces the same header that -u would generate.
**Q: How do I handle special characters in a curl password?**
Wrap the user:password pair in single quotes so the shell does not expand characters like $, !, or spaces: curl -u 'user:my$pass!' https://example.com. A colon is allowed in the password (only the first colon separates user from password) but not in the username.
**Q: Is curl basic auth secure?**
Only over HTTPS. Basic Auth Base64-encodes the credentials, which is not encryption — over plain HTTP they can be read on the network. Always use an https:// endpoint, and prefer .netrc or a password prompt over putting the password directly in the command.
---
## How to Send and Receive JSON with curl
URL: https://scrappey.com/qa/web-technologies/curl-json
**To POST JSON with curl, set the content type and pass the body: curl -X POST -H "Content-Type: application/json" -d '{"key":"value"}' https://example.com/api.** Since curl 7.82.0 there's a shortcut — --json — that sets both the Content-Type and Accept headers to application/json and sends the body in one flag. To *read* JSON back, request it with an Accept header and pipe the response to jq.
### Quick facts
- **POST JSON:** -X POST -H "Content-Type: application/json" -d '{...}'
- **Shortcut (curl 7.82+):** --json '{...}'
- **From a file:** -d @data.json
- **Read JSON back:** -H "Accept: application/json" … | jq
- **Gotcha:** Without Content-Type the server may reject or misparse the body
### POST JSON data
Three pieces: -X POST for the method, -H "Content-Type: application/json" so the server parses the body as JSON, and -d for the payload:
curl -X POST https://example.com/api/users \
-H "Content-Type: application/json" \
-d '{"name": "Ada", "role": "admin"}'
Use **single quotes** around the JSON so your shell leaves the double quotes inside it alone. (Note: -d implies POST, so -X POST is technically optional — but it's good to keep it explicit.)
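The same three pieces (method, Content-Type, body) carry over to application code. Below is a rough Python equivalent using only the standard library; build_json_post is a hypothetical helper, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_json_post(url, payload):
    """Build (but do not send) a POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_json_post("https://example.com/api/users", {"name": "Ada", "role": "admin"})
# Passing req to urllib.request.urlopen() would actually send it.
```

Serializing with json.dumps sidesteps the shell-quoting problem entirely, since the quotes never pass through a shell.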
### The --json shortcut (curl 7.82.0+)
Modern curl has a dedicated flag that replaces the -X/-H/-d trio. --json sets Content-Type: application/json *and* Accept: application/json, and sends the body:
curl --json '{"name": "Ada", "role": "admin"}' https://example.com/api/users
Check your version with curl --version; if it's older than 7.82.0, use the explicit -H/-d form above.
### Sending JSON from a file
For large or reusable payloads, put the JSON in a file and reference it with @:
curl -X POST https://example.com/api/users \
-H "Content-Type: application/json" \
-d @payload.json
--json @payload.json works the same way on curl 7.82.0+. Use --data-binary @file.json if you need to preserve newlines exactly (-d strips them).
### GET and parse a JSON response
To fetch JSON, ask for it with an Accept header and pipe the output to jq for readable, queryable output:
# Pretty-print the whole response
curl -s -H "Accept: application/json" https://example.com/api/users | jq
# Pull out one field
curl -s -H "Accept: application/json" https://example.com/api/users/1 | jq '.email'
-s silences the progress meter so only the JSON reaches jq.
### Common JSON gotchas
- **Forgetting Content-Type.** Without it, -d defaults to application/x-www-form-urlencoded and many APIs reject the body or fail to parse it.
- **Shell quoting.** Double quotes inside double quotes break the payload — wrap JSON in single quotes, or read from a file.
- **Getting HTML instead of JSON.** If an API returns an HTML block page rather than JSON, you're probably being filtered as a bot. The issue is your fingerprint or IP, not your curl syntax, and the usual fix is to route the call through a proxy or scraping API that presents a real browser profile.
### Example
```bash
# POST JSON (classic, works on every curl version)
curl -X POST https://example.com/api/users \
-H "Content-Type: application/json" \
-d '{"name": "Ada", "role": "admin"}'
# POST JSON (curl 7.82.0+ shortcut — sets Content-Type AND Accept)
curl --json '{"name": "Ada", "role": "admin"}' https://example.com/api/users
# POST JSON from a file
curl -X POST https://example.com/api/users \
-H "Content-Type: application/json" -d @payload.json
# GET JSON and parse it with jq
curl -s -H "Accept: application/json" https://example.com/api/users | jq '.[].email'
```
### FAQ
**Q: How do I POST JSON with curl?**
Use curl -X POST -H "Content-Type: application/json" -d '{"key":"value"}' https://example.com/api. The Content-Type header tells the server to parse the body as JSON. On curl 7.82.0+ you can use the --json shortcut instead, which sets both Content-Type and Accept.
**Q: What does the --json flag do in curl?**
Added in curl 7.82.0, --json '{...}' is a shortcut that sends the JSON body and sets both Content-Type: application/json and Accept: application/json in one flag, replacing the -X POST, -H, and -d combination.
**Q: How do I send JSON from a file with curl?**
Reference the file with an @ prefix: curl -X POST -H "Content-Type: application/json" -d @data.json https://example.com/api. Use --data-binary @data.json if you need to preserve newlines exactly, since -d strips them.
**Q: How do I get and read a JSON response with curl?**
Request JSON with -H "Accept: application/json" and pipe the response to jq for pretty-printing or field extraction: curl -s -H "Accept: application/json" https://example.com/api | jq '.field'. The -s flag hides the progress meter so only JSON reaches jq.
---
## How to Make curl Ignore SSL Certificate Errors
URL: https://scrappey.com/qa/web-technologies/curl-ignore-ssl-certificate
**To make curl ignore SSL certificate errors, add the -k (or --insecure) flag: curl -k https://example.com.** This tells curl to skip certificate verification, so it connects even to a server with a self-signed, expired, or mismatched certificate instead of failing with SSL certificate problem. It's the right tool for local and staging environments — but it disables the protection TLS exists to provide, so it should never be a production fix.
### Quick facts
- **Flag:** -k / --insecure
- **Syntax:** curl -k https://example.com
- **What it does:** Skips SSL/TLS certificate verification
- **Safer alternative:** --cacert <file> to trust a specific CA/cert
- **Warning:** Disables MITM protection — testing/dev only
### The -k / --insecure flag
When curl refuses to connect with an error like SSL certificate problem: self-signed certificate or certificate has expired, -k skips the check:
curl -k https://self-signed.local/api
# long form
curl --insecure https://self-signed.local/api
curl still negotiates an encrypted TLS connection — it just stops verifying that the certificate is trusted and matches the host. That distinction matters: the traffic is encrypted, but you've given up the guarantee that you're talking to the *right* server.
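The same split between "encrypted" and "verified" shows up in application code. As an illustration, this Python sketch builds two TLS contexts with the standard ssl module: one with default verification, and one configured roughly like curl -k. It only constructs the contexts and opens no connection.

```python
import ssl

# Default context: certificate and hostname verification are both on.
verified = ssl.create_default_context()

# Rough equivalent of curl -k: still TLS, but nothing is verified.
insecure = ssl.create_default_context()
insecure.check_hostname = False      # must be turned off before verify_mode
insecure.verify_mode = ssl.CERT_NONE

print(verified.verify_mode, insecure.verify_mode)
```

Either context still encrypts the traffic; only the first one checks who is on the other end.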
### When it is safe to use -k
Skipping verification is reasonable only when you already trust the connection by other means:
- A **local server** or **staging environment** using a self-signed certificate.
- An **internal tool** with a misconfigured or not-yet-issued cert.
- **Quick debugging** to confirm the SSL error is the only thing blocking a request.
**Never use -k in production.** Disabling verification opens you to man-in-the-middle attacks: anyone who can intercept the connection can impersonate the server and you'll never know. Treat -k as a temporary workaround, not a fix.
### The safer fix: trust the right certificate
Instead of ignoring all verification, point curl at the specific CA or certificate it should trust with --cacert (or --capath for a directory):
# Trust a specific CA bundle / self-signed cert
curl --cacert /path/to/ca.pem https://internal.example.com
# Provide a client certificate (mutual TLS)
curl --cert client.pem --key client.key https://example.com
This keeps verification on — you still get MITM protection — while accepting the certificate you actually expect. It's the correct long-term solution for internal services with their own CA.
### SSL errors when scraping
If you hit SSL errors against a *public* site that loads fine in a browser, -k usually isn't the answer. The common causes are a missing or outdated CA bundle on your machine, or an anti-bot layer terminating the TLS handshake because it doesn't like your TLS fingerprint. In the second case the certificate is valid; the block is happening at the handshake. A scraping API that presents a real browser's TLS profile usually resolves it. Sites returning 403 or Cloudflare errors after the handshake need the same treatment.
### Example
```bash
# Ignore SSL verification (testing / self-signed / staging ONLY)
curl -k https://self-signed.local/api
curl --insecure https://self-signed.local/api
# Safer: trust a specific CA or certificate (keeps verification on)
curl --cacert /path/to/ca.pem https://internal.example.com
# Mutual TLS with a client certificate
curl --cert client.pem --key client.key https://example.com
```
### FAQ
**Q: How do I make curl ignore SSL certificate errors?**
Add the -k or --insecure flag: curl -k https://example.com. This skips certificate verification so curl connects even to servers with self-signed, expired, or mismatched certificates. The connection is still encrypted, but curl no longer verifies you are talking to the right server, so use it for testing only.
**Q: What is the difference between -k and --insecure in curl?**
Nothing — -k is the short form and --insecure is the long form of the same option. Both disable SSL/TLS certificate verification for the request.
**Q: Is it safe to use curl -k?**
Only for local servers, staging, internal tools, or quick debugging where you already trust the connection. In production it exposes you to man-in-the-middle attacks because it disables the verification that ensures you are connected to the genuine server. Use --cacert to trust a specific certificate instead.
**Q: How do I fix a curl SSL error without ignoring it?**
Point curl at the certificate it should trust with --cacert /path/to/ca.pem (or --capath for a directory), or update your system CA bundle if a public site is failing. For mutual TLS, supply a client certificate with --cert and --key. This keeps verification — and MITM protection — enabled.
---
# Web Automation
Browser automation, headless browsers, and how the major anti-bot vendors detect and block scrapers.
## What is Puppeteer? (Complete Guide 2026)
URL: https://scrappey.com/qa/web-automation/puppeteer-introduction
Puppeteer is a Node.js tool that lets your code drive a real Chrome browser automatically — clicking, typing, and reading pages just like a person would. This is a complete 2026 guide to what it is and how to use it.
### Quick facts
- **What it is:** Node.js headless-Chrome library
- **Controls:** Chrome/Chromium via DevTools Protocol
- **Best for:** JS rendering, screenshots, PDFs
- **Language:** JavaScript / TypeScript
- **Cross-browser alt:** Playwright
### What is Puppeteer?
Puppeteer is a Node.js library that gives you a simple API to control Chrome or Chromium from your code. It talks to the browser through the DevTools Protocol — the same behind-the-scenes channel Chrome's own developer tools use to inspect and command a page. Google's Chrome team maintains it, and it's a great fit for automating browsers, running tests, and scraping data.
### Key Features
1. Browser Automation
```javascript
const puppeteer = require('puppeteer');

async function automateWebsite() {
  // Launch browser
  const browser = await puppeteer.launch({
    headless: true, // Run without a visible window (the older 'new' value is deprecated)
    defaultViewport: { width: 1920, height: 1080 }
  });
  // Create new page
  const page = await browser.newPage();
  // Navigate to website
  await page.goto('https://example.com', {
    waitUntil: 'networkidle0'
  });
  // Close browser
  await browser.close();
}
```
The script above opens a browser (here in "headless" mode, meaning no visible window), opens a tab, loads a page, and waits until network traffic settles before closing.
2. Screenshot & PDF Generation
Because Puppeteer drives a real browser, it can also save what the page looks like — as an image or a PDF.
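```javascript
const puppeteer = require('puppeteer');

async function captureContent() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Load a page first; otherwise the capture is of a blank tab
  await page.goto('https://example.com');
  // Take screenshot
  await page.screenshot({
    path: 'screenshot.png',
    fullPage: true
  });
  // Generate PDF
  await page.pdf({
    path: 'document.pdf',
    format: 'A4'
  });
  await browser.close();
}
```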
### Common Use Cases
1. Web Scraping
You can pull data out of a page by running code inside the browser. page.evaluate runs your function in the page's own JavaScript context, so it can read the live DOM (the page structure) and hand the result back to your script.
```javascript
const puppeteer = require('puppeteer');

async function scrapeData() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Extract data (runs inside the page, not in Node.js)
  const data = await page.evaluate(() => {
    // Fall back to null if the page has no <h1>
    const title = document.querySelector('h1')?.innerText ?? null;
    const paragraphs = Array.from(
      document.querySelectorAll('p')
    ).map(p => p.innerText);
    return { title, paragraphs };
  });
  console.log(data);
  await browser.close();
}
```
2. Form Automation
Puppeteer can also fill in and submit forms — typing into fields and clicking buttons for you.
```javascript
const puppeteer = require('puppeteer');

async function fillForm() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/form');
  // Fill form fields
  await page.type('#username', 'testuser');
  await page.type('#password', 'password123');
  // Click submit and wait for the resulting navigation together,
  // so the navigation is not missed if it starts immediately
  await Promise.all([
    page.waitForNavigation(),
    page.click('#submit-button')
  ]);
  await browser.close();
}
```
### Best Practices
1. Resource Management
Each browser uses real memory and CPU, so always close it when you're done. Wrapping launch and close in a small class keeps that cleanup in one place. The launch flags below (like --no-sandbox) are common when running inside Docker or other Linux containers.
```javascript
const puppeteer = require('puppeteer');

class PuppeteerManager {
  constructor() {
    this.browser = null;
  }

  async initialize() {
    this.browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage'
      ]
    });
  }

  async cleanup() {
    if (this.browser) {
      await this.browser.close();
      this.browser = null; // Allow a later initialize() to start fresh
    }
  }
}
```
2. Error Handling
Pages fail, time out, or change without warning. Use try/finally so the browser always closes even when something breaks, and waitForSelector to pause until the element you need actually appears.
```javascript
const puppeteer = require('puppeteer');

async function robustScraping() {
  let browser = null;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Set timeout for operations (milliseconds)
    page.setDefaultTimeout(10000);
    await page.goto('https://example.com');
    // Wait for specific element
    await page.waitForSelector('.content');
  } catch (error) {
    console.error('Scraping failed:', error);
  } finally {
    // Runs on success and on failure, so the browser never leaks
    if (browser) {
      await browser.close();
    }
  }
}
```
### Performance Optimization
1. Resource Blocking
A full browser downloads images, stylesheets, fonts, and more. If you only need the text, you can cancel those requests to load pages much faster. Request interception lets you inspect each request and either abort (block) or continue (allow) it.
```javascript
const puppeteer = require('puppeteer');

async function optimizedBrowsing() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Block unnecessary resources
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (
      request.resourceType() === 'image' ||
      request.resourceType() === 'stylesheet'
    ) {
      request.abort();
    } else {
      request.continue();
    }
  });
  await page.goto('https://example.com');
  await browser.close();
}
```
2. Parallel Processing
To scrape many URLs faster, open several tabs in one browser and work through them at the same time instead of one after another.
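```javascript
const puppeteer = require('puppeteer');

// Hypothetical per-URL step: load the page and return its title
async function processUrl(page, url) {
  await page.goto(url);
  return { url, title: await page.title() };
}

async function parallelScraping(urls, concurrency = 5) {
  const browser = await puppeteer.launch();
  try {
    const results = new Array(urls.length);
    let next = 0;
    // Each worker owns one tab and pulls the next URL when it is free,
    // so no tab is ever asked to load two URLs at once
    const workers = Array.from({ length: Math.min(concurrency, urls.length) }, async () => {
      const page = await browser.newPage();
      while (next < urls.length) {
        const index = next++;
        results[index] = await processUrl(page, urls[index]);
      }
    });
    await Promise.all(workers);
    return results;
  } finally {
    await browser.close();
  }
}
```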
Remember: Puppeteer is a powerful tool for web automation, but use it responsibly and respect websites' terms of service and robots.txt directives.
### FAQ
**Q: Puppeteer or Playwright?**
Playwright is a newer tool from the same lineage that works across multiple browser engines (Chromium, Firefox, and WebKit, the engine behind Safari). It also waits for elements automatically and supports several programming languages. Puppeteer is Node.js-only and built around Chrome (recent versions can also drive Firefox), but it's lighter if that's all you need.
**Q: Is Puppeteer detectable?**
Yes. Default headless Chrome gives off telltale signs that it's automated rather than a real person. Stealth plugins hide some of those signs, but well-built anti-bot systems still catch it. Running a real browser or using a managed scraping service is harder to detect.
**Q: Can Puppeteer run with a visible browser?**
Yes. Launch it with headless: false and you'll see a real Chrome window doing the work, which is handy when you're debugging which elements to click or why a flow breaks.
---
## How to handle CAPTCHA in web scraping? (2026 Solutions)
URL: https://scrappey.com/qa/web-automation/handle-captcha-scraping
A CAPTCHA is a test a website shows to tell humans apart from bots (the name stands for "Completely Automated Public Turing test to tell Computers and Humans Apart"). In web scraping of sites you are permitted to access, encountering one usually pauses your workflow. This reference covers the main CAPTCHA types you will meet in 2026 and how teams deal with them on services they are authorized to use.
### Quick facts
- **Common types:** reCAPTCHA, hCaptcha, Turnstile, image
- **How they fire less:** Consistent configuration and reasonable pacing
- **Common triggers:** Low-reputation IPs, inconsistent fingerprints, speed
- **If one still appears:** Solver services or a managed API
- **Common setup:** Residential proxies + a real browser
### Common CAPTCHA Types
CAPTCHAs come in three broad generations. Knowing which one a site uses explains how the verification step works.
1. Text-Based CAPTCHA
The oldest kind: read some characters and type them back. Easiest to automate.
- Simple text recognition
- Distorted characters
- Math problems
- Word problems
2. Image-Based CAPTCHA
You click pictures that match a prompt ("select all traffic lights"). Harder, because it needs visual understanding.
- Select specific images
- Identify objects
- Solve visual puzzles
- reCAPTCHA v2
3. Modern CAPTCHA
The newest kind often shows no puzzle at all. Instead it watches how you behave and what your browser looks like, then scores how human you seem.
- reCAPTCHA v3
- hCaptcha
- Behavioral analysis
- Browser fingerprinting
### Why verification steps appear
A verification step is far more likely to appear when traffic looks automated rather than human. A few things drive that, and they are weighed together rather than one request at a time:
- **IP reputation** — datacenter addresses (cloud and server-farm ranges) carry less trust than residential ones (the kind an ISP assigns to a home connection).
- **Browser-environment consistency** — a headless browser whose environment contradicts a normal Chrome (automation flags, missing APIs, a mismatched timezone or locale) stands out from a real one.
- **Request pacing and patterns** — sudden bursts, headers a real browser always sends going missing, and cookies that are not echoed back all read as scripted.
Because these signals combine, changing one in isolation rarely stops challenges. For someone building automation against a service they are authorized to use, the practical takeaway is that a consistent, browser-like configuration and reasonable pacing are what make verification steps appear less often.
### Where solver services and managed APIs fit
When a challenge still appears on a service you are permitted to access, teams generally rely on a real browser session — or a managed scraping API that runs one for them — rather than handling puzzles by hand. Dedicated CAPTCHA-solving services also exist as a fallback for image and token challenges, but they add cost and latency, and many sites' terms of service restrict automated solving, so confirm you are permitted before relying on one. This page is a reference on how the pieces fit together, not a step-by-step solving guide.
### Responsible Request Practices
Verification challenges fire less often when automation behaves like ordinary traffic, and the same habits reduce load on the services you are permitted to access — good etiquette regardless of CAPTCHAs.
1. Reasonable pacing
Sending requests faster than a person would is one of the clearest automated signals. Keeping a steady delay between requests, comfortably under a sensible per-minute limit, both looks more natural and eases strain on the server.
2. Spreading requests across IPs
A large volume of requests from a single address stands out quickly. Rotating through a pool of proxies you are authorized to use spreads the load and lets you retry elsewhere when one fails. Use only proxies and targets you have permission to access.
Always confirm that automated access is allowed by the service's terms of use. Many sites add verification specifically to manage automated traffic.
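The two habits above can be sketched in a few lines. This is a minimal illustration, assuming a fixed minimum interval and a plain list of proxy identifiers; the names `createPacer` and `createProxyPool` are hypothetical, and a production setup would add jitter, per-host limits, and health checks.

```javascript
// Returns a function that says how long to wait before the next request
// so that requests are at least minIntervalMs apart.
function createPacer(minIntervalMs) {
  let nextAllowedAt = 0;
  return function waitTimeMs(now = Date.now()) {
    const wait = Math.max(0, nextAllowedAt - now);
    nextAllowedAt = Math.max(now, nextAllowedAt) + minIntervalMs;
    return wait;
  };
}

// Round-robin over a proxy list, skipping entries marked as failed.
function createProxyPool(proxies) {
  const failed = new Set();
  let cursor = 0;
  return {
    next() {
      for (let i = 0; i < proxies.length; i++) {
        const proxy = proxies[cursor++ % proxies.length];
        if (!failed.has(proxy)) return proxy;
      }
      return null; // every proxy has been marked as failed
    },
    markFailed(proxy) {
      failed.add(proxy);
    },
  };
}

// Usage: three back-to-back requests at a 2-second minimum spacing
const waitTimeMs = createPacer(2000);
console.log(waitTimeMs(0), waitTimeMs(0), waitTimeMs(0)); // 0 2000 4000
```

The pacer returns a wait time instead of sleeping, which keeps it easy to test and lets the caller decide how to delay.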
### FAQ
**Q: What is the best way to deal with CAPTCHAs?**
Most CAPTCHAs fire because of low-trust signals — datacenter IPs (server-farm addresses, not real homes), inconsistent browser fingerprints, and unusually fast requests. A consistent configuration and reasonable pacing mean challenges appear less often. When one still appears on a service you are authorized to access, a real browser session or managed API can handle the verification step.
**Q: Are CAPTCHA-solving services reliable?**
For image and token CAPTCHAs they do work, but they add delay and cost money per solve, and many sites' terms of service restrict automated solving. Treat them as a fallback for services you are permitted to access — reducing the signals that trigger challenges is cheaper and more sustainable.
**Q: Why do I suddenly get CAPTCHAs mid-scrape?**
Usually something in the session started to look more automated partway through — a sudden burst of requests, a lower-reputation IP, or a browser fingerprint that drifted out of sync. Slowing down, using residential IPs (addresses tied to home connections), and keeping your request headers consistent all help, and they ease load on the site as well.
---
## How Cloudflare Works (2026)
URL: https://scrappey.com/qa/web-automation/how-cloudflare-detects-bots
Cloudflare's Bot Management is a security layer that decides whether each visitor to a website is a human or an automated script. It sits in front of roughly 20% of the public web — including major retail, jobs, review, and listings sites — so it's the WAF (Web Application Firewall — a filter that screens incoming traffic) that most developers run into first when they start working with HTTP clients and headless browsers.
This is a reference on what Cloudflare actually measures, how its scoring pipeline is structured, and what each detection layer means for someone building automation. It is not a how-to.
### Quick facts
- **Coverage:** ~20% of the public web
- **Key signals:** TLS/JA3, HTTP/2 fingerprint, ML bot score
- **Clearance:** cf_clearance cookie
- **Challenge:** Turnstile / managed challenge
- **Best approach:** Real-browser execution + residential IPs
### What Cloudflare Bot Management is
Cloudflare works as a reverse proxy at the CDN edge — meaning it sits between the visitor and the real server, so every request passes through Cloudflare first. Each request to a protected site is scored by a single global machine-learning model trained on roughly 20% of all internet traffic. In a few milliseconds the model returns a **bot score from 1–99** (1 = almost certainly a bot, 99 = almost certainly a human), and the site's WAF rules decide what to do with it — let it through, show a JavaScript challenge, show a managed challenge, or block it outright.
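To make the score-then-rule flow concrete, here is a minimal sketch of how a site's rules might map a bot score to an action. The thresholds and the `RULES` table are hypothetical: each site operator writes its own WAF rules, so these cut-offs are illustrative and not Cloudflare defaults.

```javascript
// Hypothetical rule table: the first rule whose ceiling covers the score wins.
const RULES = [
  { maxScore: 1, action: 'block' },
  { maxScore: 14, action: 'managed_challenge' },
  { maxScore: 29, action: 'js_challenge' },
  { maxScore: 99, action: 'allow' },
];

function actionForScore(score) {
  if (!Number.isInteger(score) || score < 1 || score > 99) {
    throw new RangeError('bot score must be an integer from 1 to 99');
  }
  return RULES.find((rule) => score <= rule.maxScore).action;
}

console.log(actionForScore(1), actionForScore(10), actionForScore(20), actionForScore(80));
// block managed_challenge js_challenge allow
```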
When a request fails, you typically see one of these:
- error 1020 — you tripped an access rule.
- error 1015 — you're being rate limited (too many requests too fast).
- A managed challenge page (Turnstile).
- A silent 403 carrying a cf-ray header (Cloudflare's request ID).
The scorer doesn't care *why* you're automating. A price-comparison crawler and a credential-stuffing bot look identical to it; it only sees signals, not intentions.
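When logging failures, it helps to sort responses into the outcomes listed above. The sketch below is a rough classifier under stated assumptions: `headers` is a plain object with lowercase keys, error pages include their numeric code in the body, and challenge responses carry a `cf-mitigated: challenge` header. The function name and return labels are hypothetical.

```javascript
// Classifies a response into the failure modes listed above.
// Body matching is a heuristic and only runs on error statuses,
// so a normal page that happens to contain "1020" is not misread.
function classifyCloudflareResponse({ status, headers = {}, body = '' }) {
  if (headers['cf-mitigated'] === 'challenge') return 'managed-challenge';
  if (status === 403 && /\b1020\b/.test(body)) return 'access-rule';
  if (status === 429 && /\b1015\b/.test(body)) return 'rate-limited';
  if (status === 403 && 'cf-ray' in headers) return 'silent-403';
  return 'unclassified';
}

console.log(
  classifyCloudflareResponse({ status: 403, headers: { 'cf-ray': 'example-id' } })
); // silent-403
```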
### The four signal categories
1. IP address reputation
Cloudflare keeps a reputation database keyed by ASN (Autonomous System Number, the ID of the network operator an IP range belongs to), built from traffic it has already seen across its whole network. Where your IP comes from sets your starting score:
- **Datacenter IPs** (AWS, GCP, Azure, DigitalOcean, OVH…) — pre-scored low. A request from a known cloud range starts with a poor score before any other check even runs.
- **Residential IPs** — the kind ISPs hand out to home internet connections, treated as much more trustworthy.
- **Mobile IPs** — assigned to cell towers and carrier CGNAT pools (shared mobile-network addresses). These get the highest baseline trust, because the pools are small and rotate naturally as phones move around.
On the very first request of a session — before any JavaScript or fingerprint data exists — IP reputation is the single biggest input to the score.
2. JavaScript fingerprinting and challenges
Plain HTTP clients (requests, axios, curl) just fetch pages; they don't run JavaScript. Cloudflare exploits this with a JS challenge — a script that must compute a token from values scattered around the page. No JavaScript engine, no token, no entry.
Headless browsers (real browsers driven by code, with no visible window) *do* run JavaScript, but their environment differs from a normal Chrome in dozens of small ways: the navigator.webdriver flag, missing plugins, the shape of window.chrome, canvas and WebGL outputs (how the browser draws graphics), font enumeration, timezone and locale mismatches, even the order in which the permissions APIs respond. Cloudflare hashes all of that into a fingerprint and compares it against known-automation patterns. Cloudflare Turnstile is the part of this pipeline the user actually sees.
3. HTTP and TLS fingerprinting
Before a single line of HTML is exchanged, Cloudflare can already fingerprint you from the **TLS handshake** and from how your client speaks **HTTP/2**. TLS is the encryption layer behind https; the handshake is the setup conversation that starts every connection, and the parameters a client offers in it are commonly summarised as a JA3 or JA4 fingerprint.
- Most scraping libraries still default to HTTP/1.1. Real Chrome and Firefox stopped doing that years ago.
- libcurl and Go's net/http produce JA3 signatures that don't match any real browser, even when they do negotiate HTTP/2.
- HTTP/2 fingerprinting digs deeper still: the order of pseudo-headers, the SETTINGS frame values, and window-update sizes all leak which client you really are.
So a User-Agent: Chrome header on a Python requests call is contradicted by the TLS handshake long before anyone reads the headers — the disguise is blown at the door.
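To see why the handshake gives a client away, it helps to look at how a JA3 fingerprint is built: five ClientHello fields written as decimal numbers, joined with dashes and commas, then hashed with MD5. The sketch below follows that published format; the input values are illustrative and not a capture from any real client, and the helper names are hypothetical.

```javascript
const crypto = require('node:crypto');

// GREASE values (0x0a0a, 0x1a1a, ...) are random placeholders that browsers
// insert; JA3 ignores them so the hash stays stable across connections.
const isGrease = (v) => (v & 0x0f0f) === 0x0a0a && (v >> 8) === (v & 0xff);

// Builds the JA3 string and its MD5 hash from ClientHello fields.
function ja3({ version, ciphers, extensions, curves, pointFormats }) {
  const join = (values) => values.filter((v) => !isGrease(v)).join('-');
  const text = [version, join(ciphers), join(extensions), join(curves), join(pointFormats)].join(',');
  return { text, hash: crypto.createHash('md5').update(text).digest('hex') };
}

// Illustrative values only
const hello = {
  version: 771, // TLS 1.2 as advertised in the ClientHello
  ciphers: [0x0a0a, 4865, 4866, 4867],
  extensions: [0, 23, 65281, 10, 11],
  curves: [29, 23, 24],
  pointFormats: [0],
};
console.log(ja3(hello).text); // 771,4865-4866-4867,0-23-65281-10-11,29-23-24,0
```

Because the cipher and extension lists are hashed in the order the client sent them, two clients offering the same ciphers in a different order produce different hashes. Changing the User-Agent header does not touch any of these fields.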
4. Behavioural and pattern analysis
Cloudflare logs every connection, so your behaviour over time is just as visible as any single request:
- Missing headers a real browser always sends (Sec-Fetch-*, Accept-Language, sec-ch-ua).
- Payloads sent in the wrong order or encoding.
- Cookies from the previous response that the next request fails to echo back.
- Hits on URLs no human ever visits — honeypot links hidden in the page's DOM specifically to catch crawlers that blindly follow every link.
- Bursty timing: 200 requests in 5 seconds, then silence.
All of this feeds Cloudflare's ML pattern analysis, which can flag a whole session even when each individual request looks fine on its own.
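Two of the patterns above are easy to check in your own request logs before a WAF does it for you. The following is a minimal sketch, not Cloudflare's logic: the header list expands Sec-Fetch-* to its common members, and the window and limit passed to `isBursty` are whatever you consider reasonable for the service in question.

```javascript
// Headers named in the list above. Note that sec-ch-ua is sent by
// Chromium-based browsers, so its absence only matters for a Chrome profile.
const EXPECTED_HEADERS = [
  'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest',
  'accept-language', 'sec-ch-ua',
];

// Returns the expected headers that a request is missing (case-insensitive).
function missingBrowserHeaders(headers) {
  const present = new Set(Object.keys(headers).map((name) => name.toLowerCase()));
  return EXPECTED_HEADERS.filter((name) => !present.has(name));
}

// True if any window of windowMs contains more than maxRequests timestamps.
function isBursty(timestampsMs, windowMs, maxRequests) {
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    while (sorted[end] - sorted[start] > windowMs) start++;
    if (end - start + 1 > maxRequests) return true;
  }
  return false;
}

console.log(missingBrowserHeaders({ 'User-Agent': 'x', 'Accept-Language': 'en' }));
// [ 'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest', 'sec-ch-ua' ]
```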
### What this means for developers
The key takeaway from the four-signal model is that fixing one layer rarely moves the score. A residential proxy sitting on top of a fingerprint that screams HeadlessChrome will still fail; so will a fully patched browser running on a flagged AWS IP. The tooling generally falls into three buckets:
- **HTTP clients with browser-impersonating TLS** — curl_cffi, curl-impersonate, tls-client. These match the TLS/HTTP/2 layer but can't run JS challenges.
- **Patched browsers** — Playwright with fingerprint-consistency plugins, patchright, Camoufox. These cover JS execution and the fingerprint surface but cost a lot per request.
- **Managed scraping APIs** — services that combine the two and handle proxy rotation and session continuity behind a single endpoint.
Reusing the same session value across requests keeps your cookies and trust score warm. Spinning up a fresh session for every request looks far more scripted than one that browses steadily for a few minutes.
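Session continuity mostly comes down to echoing back the cookies a site has set. The class below is a deliberately tiny cookie jar to show the idea; the class name is hypothetical, the cookie values are placeholders, and it ignores Domain, Path and Expires, which a real client (or a library such as tough-cookie) would honour.

```javascript
class SessionCookies {
  constructor() {
    this.cookies = new Map();
  }

  // Accepts the Set-Cookie header values from one response.
  store(setCookieHeaders) {
    for (const header of setCookieHeaders) {
      const pair = header.split(';')[0]; // name=value comes before attributes
      const eq = pair.indexOf('=');
      if (eq > 0) this.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }
  }

  // Value for the Cookie header of the next request in the same session.
  header() {
    return [...this.cookies].map(([name, value]) => `${name}=${value}`).join('; ');
  }
}

const session = new SessionCookies();
session.store(['cf_clearance=abc123; Path=/; Secure; HttpOnly', 'lang=en; Path=/']);
console.log(session.header()); // cf_clearance=abc123; lang=en
```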
### Sites commonly fronted by Cloudflare
Cloudflare is the most widely deployed WAF on the web. Frequently studied targets span major retail, jobs, review, listings, and logistics sites. Many large sites rotate between Cloudflare, Akamai, DataDome and PerimeterX depending on traffic, so the detection logic you hit is rarely the same from day to day.
### Summary
Cloudflare never makes a flat yes/no call. It blends four things into one continuous bot score: IP reputation, JavaScript execution and fingerprint, TLS/HTTP/2 handshake characteristics, and behavioural patterns over time. Any one of the four can pull the score below the WAF's threshold and get you blocked. And because the detection model keeps evolving, anything that relies on beating a single signal will eventually break.
### FAQ
**Q: Why does Cloudflare block my HTTP client but not my browser?**
Your client's TLS/JA3 and HTTP/2 fingerprints don't match a real browser. Cloudflare reads those during the connection setup, before any HTML is sent, so only a genuine Chrome-style handshake gets through.
**Q: What is the cf_clearance cookie?**
It's the token Cloudflare hands you once you pass a challenge — proof that you cleared the check. Reusing it from the same IP and fingerprint keeps you cleared; sending it from a different IP is a red flag.
**Q: Is a residential proxy alone enough to pass Cloudflare?**
No. IP reputation is only one of the four signals. You also need a browser-matching TLS fingerprint and consistent headers, or Cloudflare's bot score will still flag the session.
---
## How PerimeterX (HUMAN) Works (2026)
URL: https://scrappey.com/qa/web-automation/how-perimeterx-detects-bots
PerimeterX, now branded as **HUMAN Security**, is one of the more elaborate anti-bot WAFs (Web Application Firewalls - security layers that sit in front of a website and filter traffic). It guards high-value retail, real-estate, and crowdfunding sites, and is widely studied because its signature press and hold challenge measures input characteristics that aren't trivial to reproduce from scripted code.
This page explains how PerimeterX is built, what it measures, and what each detection layer means for someone building automation.
### Quick facts
- **Now branded:** HUMAN Security
- **Signature challenge:** "Press and hold" button
- **Cookies:** _px3, _pxhd, _pxvid
- **Measures:** Input dynamics & sensor telemetry
- **Best approach:** Real browser + consistent, rate-limited request behavior
### What PerimeterX is
PerimeterX is a reverse-proxy WAF: it sits between visitors and the real site, so every request passes through it first. At its edge it builds a per-request **trust score** from four inputs — the visitor's IP reputation, sensor data gathered in the browser, TLS handshake details (TLS is the encryption layer behind https), and behavioural telemetry (signals about how the visitor acts).
If a request scores low on trust, the visitor sees one of these:
- A silent **403** or **429** error with x-px-block headers.
- A **press and hold** human-verification challenge that measures touch pressure, mouse velocity, hold duration and tiny involuntary movements.
- A full block page with a ref ID.
The scorer is intent-blind: it doesn't care what you're trying to do. Any client that doesn't look like a real browser gets the same low score whatever its purpose.
### The four signal categories
1. IP address reputation
- **Datacenter IPs** (AWS, GCP, Azure, DigitalOcean, OVH…) — pre-scored low, because real people rarely browse from servers. Many cloud IP ranges are blanket-blocked on protected sites before any fingerprint check even runs.
- **Residential IPs** — the addresses ISPs hand out to home connections, treated as much higher trust.
- **Mobile IPs** — cell tower and CGNAT pools (where a carrier shares one IP among many phones), given the highest baseline trust because these pools rotate naturally.
IP reputation dominates the very first request's score, before the sensor has reported anything.
2. JavaScript sensor and the _px* cookie chain
This is the layer PerimeterX is best known for. Every protected page loads a heavily-obfuscated **sensor script** (JavaScript deliberately scrambled so it's hard to read or tamper with). It runs in the browser and collects hundreds of data points: canvas/WebGL fingerprints (how your graphics hardware draws an image), audio context, installed fonts, screen metrics, timezone, language, plugin list, navigator.webdriver (a flag automation tools often leave on), the shape of window.chrome, the entropy of your mouse movements (entropy here means how random and human-like they look), and more.
The sensor sends an encrypted payload back to PerimeterX's collector, and the reply sets the cookie chain:
- _pxhd — long-lived device hash.
- _pxvid — visitor ID.
- _px3 — short-lived session token.
If that payload is missing or stale, PerimeterX fires the press and hold challenge — which itself records timing and pressure data that a scripted click can't easily produce.
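A quick way to see where a session stands is to check which of the three cookies it is actually sending. This is a minimal sketch: the cookie names come from the list above, while the function name, the parsing, and the notion of a "complete" chain are simplifications for illustration. It says nothing about whether the values are still fresh.

```javascript
const PX_COOKIES = ['_pxhd', '_pxvid', '_px3'];

// Reports which of the PerimeterX cookies are absent from a Cookie header.
function pxCookieStatus(cookieHeader) {
  const names = new Set(
    cookieHeader
      .split(';')
      .map((part) => part.split('=')[0].trim())
      .filter(Boolean)
  );
  const missing = PX_COOKIES.filter((name) => !names.has(name));
  return { complete: missing.length === 0, missing };
}

console.log(pxCookieStatus('_pxhd=a; _pxvid=b'));
// { complete: false, missing: [ '_px3' ] }
```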
3. HTTP and TLS fingerprinting
Before any HTML is exchanged, PerimeterX fingerprints the client from the **TLS handshake** (the opening exchange that sets up the https encryption — its exact shape is summarised as a JA3/JA4 signature) and from HTTP/2 behaviour.
- Most scraping libraries still default to HTTP/1.1. Real Chrome and Firefox haven't for years, so that alone stands out.
- libcurl and Go's net/http produce JA3 signatures that don't match any real browser, even when they do negotiate HTTP/2.
- HTTP/2 fingerprinting tracks the order of pseudo-headers, SETTINGS frame values, and window-update sizes — low-level details a real browser sets in a consistent way.
4. Behavioural and pattern analysis
PerimeterX also runs continuous machine-learning analysis on your connection history, looking for tell-tale patterns:
- Headers a real browser always sends that are missing (Sec-Fetch-*, Accept-Language, sec-ch-ua).
- The _px* cookies missing, or sent from a different IP than the one that originally created them.
- Hits on honeypot links (links hidden from humans but visible to bots).
- Bursty timing — requests fired faster or more regularly than a person would.
- Identical sensor payloads reused across pages instead of fresh ones.
### What this means for developers
The four signals are judged together, so fixing one alone rarely moves the score much. These are the broad categories of tooling that show up in real-world workflows:
- **HTTP clients with browser-impersonating TLS** — curl_cffi, curl-impersonate, tls-client. They copy a real browser's handshake, but they can't run the sensor script.
- **Stealth-patched browsers** — patchright, Camoufox, and Playwright with stealth plugins, which run the sensor inside a genuine browser context.
- **Managed scraping APIs** — services that bundle proxies, patched browsers and session persistence behind a single endpoint.
Reusing a session matters a lot on PerimeterX: the _px* cookies and the built-up behavioural state are far harder to recreate from scratch on every request than to keep warm across one ongoing session.
### Sites commonly fronted by PerimeterX
Real-estate, marketplaces, ticketing and sneaker resale dominate the list of protected sites. Many of these sites rotate between PerimeterX, Cloudflare, Akamai and DataDome depending on traffic conditions, so the same site can serve different defences at different times.
### Summary
PerimeterX builds a continuous trust score from four things: IP reputation, the _px* JavaScript sensor and its cookie chain, TLS/HTTP/2 fingerprints, and behavioural patterns watched over time. The press and hold challenge is the most visible way to fail, but it's downstream of the sensor — by the time it appears, your score has already dropped. And like any modern WAF, its detection logic keeps changing on a rolling basis.
### FAQ
**Q: What is the PerimeterX press-and-hold challenge?**
A button you have to hold down while PerimeterX measures pointer pressure, tiny movements, and timing. These signals are hard to fake convincingly without a real input device and a real browser.
**Q: Which sites use PerimeterX/HUMAN?**
High-value real-estate, marketplace, and crowdfunding sites. Many rotate between vendors, so the same site may show different challenges at different times.
**Q: Can an HTTP client pass PerimeterX?**
Rarely on protected paths. The _px tokens are created by client-side JavaScript and sensor collection, so the reliable route is to run that JavaScript inside a real browser context rather than a plain HTTP client.
---
## How DataDome Works (2026)
URL: https://scrappey.com/qa/web-automation/how-datadome-detects-bots
DataDome is a bot-blocking service that sits in front of roughly 1,200 enterprise sites — major e-commerce, classifieds, news, and travel sites. It has a reputation for catching automation that slips past Cloudflare without trouble, so it is worth understanding on its own. Its design is unusual in three ways: it trains a separate machine-learning (ML) model for each site, it scores requests at the application server instead of at the CDN edge (the network of servers that delivers a site close to its users), and it runs a WebAssembly (WASM — compiled code that runs at near-native speed in the browser) challenge inside the visitor's browser.
This is a reference on how DataDome is structured and what each detection layer measures.
### Quick facts
- **Coverage:** ~1,200 enterprise sites
- **Model:** Per-site machine-learning
- **Cookie:** datadome
- **Known for:** Catching bots that pass Cloudflare
- **Best approach:** Residential IPs + real-browser fingerprint
### What DataDome is
DataDome is a reverse-proxy WAF (a web application firewall that inspects traffic before it reaches the site). It runs at the **application server**, not at the CDN edge. Every request is forwarded to DataDome's scoring service, which decides allow-or-block and answers in roughly 2 ms. The scoring is customer-specific: around 85,000 ML models in total, tuned per protected site, so the very same TLS, browser and proxy combination can pass on one DataDome customer and fail on another.
When a request looks untrustworthy, the visitor gets one of:
- A silent **403** (the HTTP code for "forbidden") with the x-datadome header set.
- A **GeeTest-style slider captcha** served inline.
- A block page with a Reference #.
### The four signal categories
1. IP address reputation
Where your request comes from carries the most weight: IP reputation accounts for roughly 25–30% of the score on its own — the heaviest single input.
- **Datacenter IPs** (AWS, GCP, Azure, DigitalOcean, OVH…) — these belong to cloud and hosting providers, so they are pre-scored low. DataDome maintains one of the more accurate datacenter-range databases in the industry; many of these ranges are blanket-blocked on protected sites before any other check runs.
- **Residential IPs** — assigned by ISPs to home connections, higher baseline trust.
- **Mobile IPs** — cell tower and CGNAT pools (where many phones share one address), highest baseline trust.
2. The WASM boring_challenge and the datadome cookie
DataDome's signature component is the **WASM boring_challenge** — a small program (a state machine, written in Rust and compiled to WebAssembly) that runs in the browser. It produces a token that's POSTed to js.datadome.co, which then sets the datadome cookie — the pass that authorizes future requests.
Because the challenge is real WASM running against real browser APIs, it can't be solved without an actual browser to execute it. It also times the CPU using SIMD (instructions that crunch several numbers at once) in a way that exposes headless environments — browsers with no visible window — which no stealth-browser JavaScript patch covers. Alongside this, the sensor collects the usual fingerprint surface (canvas, WebGL, audio, fonts, screen metrics, timezone, navigator.webdriver, window.chrome) and feeds it into the WASM state.
3. HTTP and TLS fingerprinting
DataDome is one of the few WAFs that publicly markets HTTP/2 fingerprinting as a detection layer. The idea: the low-level details of how your client talks HTTP and TLS (the encryption behind https) form a fingerprint that often does not match a real browser.
- Most scraping libraries still default to HTTP/1.1. Real Chrome and Firefox haven't in years.
- libcurl and Go's net/http produce JA3 signatures — a hash of their TLS handshake — that don't match any real browser, even when they negotiate HTTP/2.
- HTTP/2 fingerprinting tracks pseudo-header order, SETTINGS frame values, and window-update sizes — small ordering and timing choices that differ between real browsers and libraries.
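To make the JA3 idea in the list above concrete, here is a minimal sketch of how a JA3 hash is derived: five ClientHello fields are joined with commas, the values inside each field are joined with dashes, and the resulting string is MD5-hashed. The numeric values in the example are illustrative assumptions, not a capture from any real client, and the sketch skips details such as GREASE filtering.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: commas between fields, dashes within a field."""
    fields = [
        str(tls_version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(ja3):
    """JA3 fingerprints are the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative (made-up) ClientHello values.
example = ja3_string(
    tls_version=771,
    ciphers=[4865, 4866, 4867],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)
fingerprint = ja3_hash(example)
```

Because the field order is part of the string, two clients that offer the same cipher suites in a different order produce different hashes, which is why a library's default handshake rarely matches a browser's.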
4. Behavioural and pattern analysis
DataDome also runs continuous ML pattern analysis on your connection history, watching for things a normal user would not do:
- The datadome cookie sent from a different IP than the one that minted it.
- Reused sensor payloads across pages instead of fresh ones per navigation.
- Honeypot link hits — clicks on links a human cannot see.
- Bursty request timing.
- Missing real-browser headers (Sec-Fetch-*, Accept-Language, sec-ch-ua).
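The header check in the last bullet can be approximated on the client side before sending anything. The sketch below assumes a hypothetical helper, `missing_browser_headers`, and a hand-picked list of header names based on the ones mentioned above; real browsers differ (for example, sec-ch-ua is sent by Chromium-based browsers only), so treat the list as a starting point.

```python
# Header names a Chromium-style navigation request normally carries (assumed list).
EXPECTED_HEADERS = (
    "accept-language",
    "sec-ch-ua",
    "sec-fetch-dest",
    "sec-fetch-mode",
    "sec-fetch-site",
)


def missing_browser_headers(headers):
    """Return the expected header names that are absent, ignoring case."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_HEADERS if name not in present]


# A bare-bones scripted request typically sends little beyond a User-Agent.
scripted = {"User-Agent": "python-requests/2.x", "Accept": "*/*"}
gaps = missing_browser_headers(scripted)
```

Running a check like this over a scraper's outgoing headers is a cheap way to spot the most obvious gap before looking at harder signals.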
### What this means for developers
Because each site gets its own model, there is no single "DataDome solution" — a setup that works on a news customer may fail on an e-commerce one with stricter scoring. Three patterns are common in production:
- **Look in the initial HTML first.** Many DataDome-protected Next.js sites embed the full page state in a __NEXT_DATA__ script tag. If the data is already in the first HTML response, the WASM challenge never runs — there is no follow-up request (XHR) for it to gate. For those cases, curl_cffi plus a residential proxy is enough.
- **Mobile or ISP residential proxies for XHR endpoints** — IP weighting is so heavy that simply switching from a datacenter IP to mobile-4G frequently flips a session from blocked to 200 OK with no other change.
- **Real browser execution** when the page actually runs the WASM challenge — for example Camoufox with its IP, timezone and locale all matching, or a managed scraping API.
DataDome is especially sensitive to IP/cookie mismatches — a datadome cookie minted on one IP looks suspicious when sent from another — so keeping one stable exit IP per session matters.
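For the "look in the initial HTML first" pattern, a short sketch of pulling the embedded state out of a __NEXT_DATA__ script tag is shown here. It uses only the standard library; the function name `extract_next_data` and the sample page are assumptions for illustration, and the shape of the JSON differs from site to site.

```python
import json
import re

# Next.js embeds page state as JSON inside <script id="__NEXT_DATA__" ...>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up sample of a first HTML response.
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Demo product", "price": 19.99}}}'
    "</script></body></html>"
)
state = extract_next_data(sample_html)
```

If the function returns None, the data is probably fetched by a later request, and one of the other two patterns applies.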
### Sites commonly fronted by DataDome
E-commerce, classifieds, news and travel dominate the list of protected sites. Many of these rotate between DataDome, Cloudflare, Akamai and PerimeterX depending on conditions, so the same site may not always use DataDome.
### Summary
DataDome scores each request in about 2 ms against a per-site ML model, weighing four things: IP reputation (25–30% of the score), the WASM boring_challenge and its datadome cookie, TLS and HTTP/2 fingerprints, and behavioural patterns. Because each customer gets its own model, detection behaviour varies between sites even when the underlying signals don't — which is the main reason a setup that works on one DataDome target may not carry over to another.
### FAQ
**Q: Why does DataDome catch bots that pass Cloudflare?**
It trains a separate ML model for each site, using device, network, and behavioural signals, so it adapts to each target instead of applying one generic ruleset. A generic anti-fingerprinting setup that passes Cloudflare can still look anomalous to it.
**Q: What triggers a DataDome block?**
Datacenter IPs, fingerprint inconsistencies, and behavioural anomalies each count against a request; once the combined score crosses the site's blocking threshold, DataDome returns a 403 with a challenge page.
**Q: Which sites use DataDome?**
Major e-commerce, classifieds, news, and travel platforms, among roughly 1,200 enterprise sites.
---
## How Akamai Bot Manager Works (2026)
URL: https://scrappey.com/qa/web-automation/how-akamai-detects-bots
Akamai Bot Manager is a bot-blocking firewall — one of the oldest and most widely deployed on the internet. It runs on Akamai's CDN (content delivery network — the servers that sit in front of a website to serve it faster), already the largest CDN in the world, so it inspects traffic before it ever reaches the real site. It guards enterprise retail, jobs, social, and logistics sites. Its inner workings are unusually well-understood, thanks to two decades of research into the _abck cookie it relies on.
This is a reference on what Akamai measures and how its scoring pipeline is structured.
### Quick facts
- **Runs on:** Akamai CDN edge (largest CDN)
- **Sensor cookie:** _abck
- **Telemetry:** sensor_data payload
- **Signals:** TLS, behaviour, device integrity
- **Best approach:** Valid _abck via real browser sensor
### What Akamai Bot Manager is
Akamai is a reverse-proxy WAF — a security gateway (Web Application Firewall) that every request passes through before reaching the website behind it. It runs at the CDN edge, meaning on the nearest Akamai server to the visitor. Each request gets a trust score based on four things: IP reputation (does this address look trustworthy?), the _abck sensor data, the way the connection's encryption is set up (the TLS handshake), and behavioural telemetry (how the visitor acts). Akamai's ASN database — its map of which company owns which blocks of IP addresses — is unusually thorough, because the company has been routing internet traffic for over two decades.
Low-trust requests surface as one of:
- A silent **403** (access denied) with an invalidated _abck cookie. Akamai marks the cookie with ~-1~ or ~0~ to signal the session is burned.
- A **Pardon Our Interruption** block page.
- A redirect to a CAPTCHA endpoint.
### The four signal categories
1. IP address reputation
Where your IP address comes from sets a starting trust level:
- **Datacenter IPs** (AWS, GCP, Azure, DigitalOcean, OVH…) — addresses owned by cloud/hosting companies, pre-scored low. Even "clean" datacenter ranges tend to be flagged because Akamai's ASN data is broad.
- **Residential IPs** — assigned by ISPs to home connections, higher baseline trust.
- **Mobile IPs** — cell tower and CGNAT pools (the shared addresses phone carriers hand out), highest baseline trust.
2. The Akamai sensor and the _abck cookie
This is the layer Akamai is best known for. Every protected page loads a deliberately scrambled **sensor script** — usually from a path like /akam/13/... or /_bm/_data. This script quietly inventories your browser: canvas and WebGL fingerprints (tiny images and 3D scenes it draws, whose exact pixels vary by device), audio context, installed fonts, screen metrics, timezone, language, plugin list, the navigator.webdriver flag (true when automation is driving the browser), the exact shape of the window.chrome object, plus how your mouse moves and how fast you type.
The sensor then POSTs an encrypted bundle of all this back to the edge, which sets or refreshes the **_abck cookie**. That cookie has a fixed internal layout (~timestamp~status~hash~), and a valid one is required for later requests. A malformed or stale _abck is the single most common reason automated clients get a 403. Akamai also specifically tests for navigator.webdriver, the headless Chrome user-agent marker (the giveaway string when Chrome runs with no visible window), and inconsistencies in the permissions API.
3. HTTP and TLS fingerprinting
Akamai is widely credited with pioneering HTTP/2 fingerprinting in the WAF space — identifying clients by the low-level quirks of how they speak the protocol, not by what they claim to be.
- Most scraping libraries still default to HTTP/1.1. Real Chrome and Firefox haven't in years.
- libcurl and Go's net/http produce JA3 signatures — fingerprints of the TLS (https encryption) handshake — that don't match any real browser.
- HTTP/2 fingerprinting tracks pseudo-header order, SETTINGS frame values, and window-update sizes — connection-setup details a browser sends automatically and a script usually gets subtly wrong.
4. Behavioural and pattern analysis
Akamai correlates behaviour across sessions — once an IP/fingerprint combo builds up a low score, even a fresh _abck cookie won't rescue it. Signals include:
- Missing real-browser headers (Sec-Fetch-*, Accept-Language, sec-ch-ua).
- _abck or bm_sz cookies from the previous response sent from a different IP (a sign cookies are being shared around).
- Honeypot link hits — clicking links hidden from real users but visible to scrapers.
- Bursty timing — many requests fired faster than a human could.
- Identical sensor payloads reused across pages.
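The "bursty timing" signal above is about requests arriving at machine-regular or machine-fast intervals. As a rough illustration, the sketch below generates randomised delays around a base interval. The helper name, the 2-second base and the 50% jitter are all assumptions, and spacing requests out does nothing for the other signals in the list.

```python
import random


def jittered_delays(count, base=2.0, jitter=0.5, rng=None):
    """Return `count` delays in seconds, each within base * (1 +/- jitter)."""
    rng = rng or random.Random()
    low, high = base * (1 - jitter), base * (1 + jitter)
    return [rng.uniform(low, high) for _ in range(count)]


# Five delays between 1.0 and 3.0 seconds; seeded here only so the example repeats.
delays = jittered_delays(5, rng=random.Random(42))
```

In a crawl loop, each value would be passed to `time.sleep()` before the next request, so no two gaps are identical.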
### What this means for developers
The _abck cookie is the focal point: nearly every Akamai workflow comes down to minting a valid one and keeping it valid. Three general tooling categories:
- **HTTP clients with browser-impersonating TLS** — curl_cffi, curl-impersonate, tls-client. These copy a real browser's handshake, but they can't run the JavaScript sensor, so they can't mint a real _abck.
- **Stealth-patched browsers** — Camoufox, patchright, Playwright with stealth plugins. These run the sensor inside a genuine browser, so the cookie comes out valid.
- **Managed scraping APIs** — services that bundle proxies, patched browsers and session persistence behind a single endpoint.
Reusing the same session value across requests keeps the _abck/bm_sz cookies and the trust score warm. Starting a fresh session every request forces the sensor to re-validate from scratch each time — which is exactly what scripted clients look like.
### Sites commonly fronted by Akamai
E-commerce, ticketing, jobs, logistics and social sites. Many of these rotate between Akamai, Cloudflare, DataDome and PerimeterX depending on conditions.
### Summary
Akamai builds a continuous trust score from four inputs: IP reputation, the _abck JavaScript sensor, TLS/HTTP/2 fingerprints, and behaviour tracked across sessions. The _abck cookie is the single most telling signal — its internal ~status~ field says outright whether the session is trusted, burned, or in a challenge state. Akamai pushes sensor updates on a rolling basis, so the exact on-the-wire details change often while the four-layer structure stays the same.
### FAQ
**Q: What is the _abck cookie?**
It is Akamai's core bot-tracking cookie. You get a valid _abck only by submitting realistic sensor_data — the telemetry collected by the page's JavaScript. A missing or invalid one keeps you stuck behind challenges.
**Q: What is sensor_data?**
An encoded report of pointer movement, device details, and browser environment signals. Akamai checks it on its own servers, so it has to come from a real browser actually interacting with the page — you can't fake it convincingly by hand.
**Q: Which sites use Akamai Bot Manager?**
Major retail, jobs, social, and logistics sites, since it runs natively on Akamai's widely used CDN.
---
## How Kasada Works (2026)
URL: https://scrappey.com/qa/web-automation/how-kasada-detects-bots
Kasada is an anti-bot WAF — a security layer that sits in front of a website and decides which visitors to let through. What makes it stand out is its client-side challenge: it sends your browser a tiny custom program (a "VM" running scrambled, hard-to-read bytecode) plus a math puzzle your browser actually has to compute, called proof-of-work. It protects real-estate, retail, and hospitality sites, and is notable because that bytecode is rewritten every few weeks, so approaches that rely on reading the code statically stop working quickly.
This is a reference on how Kasada is built and what it measures.
### Quick facts
- **Challenge file:** ips.js / p.js (VM bytecode)
- **Headers:** x-kpsdk-ct, x-kpsdk-cd
- **Distinct signal:** Real proof-of-work computation
- **Block style:** Silent 429 / 403
- **Best approach:** Real-browser VM execution
### What Kasada is
Kasada is a reverse-proxy WAF: it stands between visitors and the real server (the origin), inspecting every request before passing it on. It gives each request a trust score based on the reputation of your IP address, the output of its in-browser VM, the details of your TLS handshake (TLS is the encryption layer behind https), and how you behave on the page.
Requests it does not trust get one of two responses:
- A silent **429 Too Many Requests** carrying x-kpsdk-* headers — often on your very first request. That is a giveaway that you are dealing with Kasada, not a sign that you actually sent too many requests.
- A **proof-of-work challenge** served from /ips.js or /p.js — a puzzle the browser must solve before any further request is accepted.
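The first response type is easy to misread as rate limiting. A small sketch of telling the two apart, based only on the description above (a 429 that carries x-kpsdk-* headers), might look like this. The function name is hypothetical.

```python
def looks_like_kasada_block(status_code, headers):
    """True when a 429 response also carries any x-kpsdk-* header."""
    has_kpsdk = any(name.lower().startswith("x-kpsdk-") for name in headers)
    return status_code == 429 and has_kpsdk


# Made-up response headers for illustration.
kasada_like = looks_like_kasada_block(429, {"X-KPSDK-CT": "abc", "Content-Type": "text/html"})
plain_limit = looks_like_kasada_block(429, {"Retry-After": "30"})
```

A scraper can use a check like this to decide between backing off (ordinary rate limiting) and switching to a browser that can run the challenge.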
### The four signal categories
1. IP address reputation
Where your connection comes from sets a starting trust level:
- **Datacenter IPs** (AWS, GCP, Azure, DigitalOcean, OVH…) — addresses owned by cloud providers, scored low from the start. Kasada is especially harsh here; most cloud ranges are blocked outright on protected sites.
- **Residential IPs** — addresses ISPs hand out to home internet connections. Higher baseline trust.
- **Mobile IPs** — addresses from cell towers and carrier networks (CGNAT). Highest baseline trust.
2. The Kasada VM and proof-of-work
This is Kasada's most distinctive layer. Instead of shipping a readable (if minified) detection script, Kasada sends a **VM-bytecode payload** (ips.js / p.js): scrambled instructions run by a tiny custom interpreter bundled in the same file. That VM creates two headers that authorise your next requests:
- **x-kpsdk-ct** — a client token tied to your device fingerprint (the unique profile Kasada builds of your browser).
- **x-kpsdk-cd** — the proof-of-work result, a calculation that costs a real browser roughly 100ms of CPU time. Headless browsers or attempts to compute it elsewhere get throttled.
The VM also gathers the usual fingerprint surface — canvas/WebGL rendering, audio context, installed fonts, screen size, timezone, the navigator.webdriver flag, the shape of the window.chrome object, the plugin list — and folds it into the bytecode's internal state. Because that bytecode rotates frequently, pre-built solvers stop working within weeks.
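For readers new to the term, the sketch below shows what a proof-of-work puzzle is in general: the client searches for a nonce whose hash falls below a target, which is costly to find and cheap to verify. This is a generic SHA-256 illustration, not Kasada's actual algorithm, which is not public and is not described in this section. The challenge string and difficulty are assumptions.

```python
import hashlib
from itertools import count


def _digest_value(challenge, nonce):
    """Hash challenge and nonce together and read the digest as a big integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def verify_pow(challenge, nonce, difficulty_bits):
    """Cheap check: the hash must have `difficulty_bits` leading zero bits."""
    return _digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))


def solve_pow(challenge, difficulty_bits):
    """Expensive search: try nonces in order until one verifies."""
    for nonce in count():
        if verify_pow(challenge, nonce, difficulty_bits):
            return nonce


# 12 bits of difficulty means roughly 4,096 hashes on average.
nonce = solve_pow("demo-challenge", 12)
```

Raising the difficulty by one bit doubles the expected work for the solver while verification stays a single hash, which is what makes the scheme useful for throttling automated clients.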
3. HTTP and TLS fingerprinting
Before any HTML is exchanged, Kasada fingerprints you from the **TLS handshake** (summarised as a JA3/JA4 hash) and from how you speak HTTP/2.
- Most scraping libraries still default to the older HTTP/1.1. Real Chrome and Firefox stopped doing that years ago.
- libcurl and Go's net/http produce JA3 signatures that match no real browser.
- HTTP/2 fingerprinting tracks the order of pseudo-headers, SETTINGS frame values, and window-update sizes — low-level details that differ between real browsers and bots.
4. Behavioural and pattern analysis
Kasada also runs continuous machine-learning analysis looking for tell-tale patterns:
- Missing headers that real browsers always send (Sec-Fetch-*, Accept-Language, sec-ch-ua).
- x-kpsdk-* tokens from an earlier response now arriving from a different IP.
- The same proof-of-work result reused across many requests instead of being recomputed for each page load.
- Hits on honeypot links (decoy links invisible to humans but followed by bots).
- Bursty, machine-like timing.
### What this means for developers
With Kasada, the VM matters more than anything else: without a valid x-kpsdk-cd, no other signal can save your request. There are three broad categories of tooling:
- **HTTP clients with browser-impersonating TLS** — curl_cffi, curl-impersonate, tls-client. These copy a real browser's handshake, but they cannot run the VM.
- **Full-stack real browsers** — Camoufox, patchright, Playwright with consistent browser configuration. These run the VM inside a real browser, so the proof-of-work gets computed.
- **Managed scraping APIs** — services that keep up with the frequent ips.js rotations for you.
Reusing the same session value across requests keeps your x-kpsdk-* tokens and trust score warm. Computing a fresh proof-of-work on every single request is both slow and a strong sign of automation.
### Sites commonly fronted by Kasada
Mostly real-estate, retail, hospitality and ticketing sites. Many of these switch between Kasada, Cloudflare, Akamai, DataDome and PerimeterX depending on conditions.
### Summary
Kasada builds a running trust score from four things: IP reputation, the ips.js VM and its proof-of-work output, TLS/HTTP/2 fingerprints, and behavioural patterns. The VM is the most distinctive piece — trying to crack it by reading the code stops working quickly because the bytecode rotates, which is why running it in a real browser is the dominant approach in production.
### FAQ
**Q: What makes Kasada different from other WAFs?**
It sends your browser a custom VM running scrambled bytecode plus a real proof-of-work puzzle the browser has to compute. The bytecode is rewritten often, so trying to crack it by reading the code stops working within weeks.
**Q: What are the x-kpsdk headers?**
x-kpsdk-ct is a client token tied to your device, and x-kpsdk-cd is the proof-of-work result. Without a valid x-kpsdk-cd, no other signal will get your request through.
**Q: Why do I get a 429 on my very first request?**
That silent 429 with x-kpsdk headers is Kasada's low-trust response, not real rate limiting. It simply means you have not solved the VM challenge yet.
---
## How Imperva (Incapsula) Works (2026)
URL: https://scrappey.com/qa/web-automation/how-imperva-detects-bots
Imperva is a security service that filters traffic before it reaches a website, blocking what it thinks are bots and scrapers. It was historically known as **Incapsula** and is one of the oldest anti-bot **WAFs** (Web Application Firewalls, guards that inspect every request) still in use. It fronts large jobs and retail sites, along with many financial-services and ticketing sites. You can spot it from its "Request unsuccessful. Incapsula incident ID: ..." block page and its incap_ses_* / visid_incap_* cookie chain.
This is a reference on what Imperva measures and how its detection model is structured.
### Quick facts
- **Formerly:** Incapsula
- **Block page:** "Incapsula incident ID: …"
- **Cookies:** reese84, ___utmvc, incap_ses
- **Signals:** TLS, JS challenge, reputation
- **Best approach:** Real browser + clean residential IPs
### What Imperva is
Imperva is a reverse-proxy WAF, meaning it sits in front of the real server and every visitor passes through it first. It gives each request a trust score based on four things: the reputation of the IP address, the output of its _Incapsula_Resource sensor (a script it runs in your browser), the TLS handshake (the encrypted setup at the start of every https connection), and how the visitor behaves over time.
Requests it doesn't trust get one of these responses:
- A silent **403** ("forbidden") with an incident ID in the body — the classic Incapsula block page.
- A **JavaScript challenge** served from /_Incapsula_Resource?.... Your browser must run this code and set the incap_ses_* cookie before the request is allowed through.
- A **reCAPTCHA** puzzle on more sensitive pages.
### The four signal categories
1. IP address reputation
Imperva keeps its own threat-intelligence feed — a list of suspect IP addresses. Most known cloud IP ranges are already flagged before any other check happens, so where your traffic comes from matters a lot.
- **Datacenter IPs** (AWS, GCP, Azure, DigitalOcean, OVH…) — pre-scored low.
- **Residential IPs** (home internet connections) — higher baseline trust.
- **Mobile IPs** — highest baseline trust.
2. The _Incapsula_Resource sensor and the Incapsula cookie chain
This is where Imperva does most of its detection. Every protected page either includes a sensor script inline or redirects (302) to one (/_Incapsula_Resource?SWJIYLWA=...). The script runs in the browser and collects a fingerprint: canvas/WebGL drawing output, audio characteristics, installed fonts, screen size, timezone, language, plugin list, the navigator.webdriver flag (which automation tools often leave set), the shape of the window.chrome object, and similar details.
The sensor sends an encrypted report back to Imperva's edge, which then sets the cookies needed for future requests:
- **visid_incap_<site_id>** — a long-lived visitor ID tied to your device fingerprint.
- **incap_ses_<num>_<site_id>** — a short-lived session token that authorises the actual request.
- **nlbi_<site_id>** — a load-balancer hint that also carries trust state.
If any cookie in this chain is missing — or if an incap_ses_* is sent from a different IP than the one it was created on — the request is dropped.
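Since a missing link in the chain is enough to get a request dropped, it can help to check a session's cookie jar before reusing it. The sketch below works on a plain list of cookie names and uses the three prefixes listed above. The function name and the sample site ID are made up.

```python
# Cookie-name prefixes from the chain described above.
CHAIN_PREFIXES = ("visid_incap_", "incap_ses_", "nlbi_")


def missing_incapsula_cookies(cookie_names):
    """Return the chain prefixes that no cookie in the jar starts with."""
    names = list(cookie_names)
    return [
        prefix
        for prefix in CHAIN_PREFIXES
        if not any(name.startswith(prefix) for name in names)
    ]


# Hypothetical jar: the short-lived session token has expired or was never set.
jar = ["visid_incap_1234567", "nlbi_1234567", "some_other_cookie"]
gaps = missing_incapsula_cookies(jar)
```

An empty result only shows that the names are present; it says nothing about whether the values are still accepted or bound to the current IP.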
3. HTTP and TLS fingerprinting
Before any page content is exchanged, Imperva fingerprints the client from the **TLS handshake** (JA3/JA4 are standard ways to summarise that handshake into an ID) and from how it speaks HTTP/2.
- Most scraping libraries still default to HTTP/1.1. Real Chrome and Firefox haven't in years.
- libcurl and Go's net/http produce JA3 signatures that don't match any real browser.
- HTTP/2 fingerprinting tracks pseudo-header order, SETTINGS frame values, and window-update sizes — small protocol details that differ between real browsers and bots.
4. Behavioural and pattern analysis
Imperva also runs continuous machine-learning analysis looking for patterns that real users don't produce:
- Missing headers that real browsers always send (Sec-Fetch-*, Accept-Language, sec-ch-ua).
- incap_ses_* / visid_incap_* cookies sent from a different IP than the one that minted them.
- Identical sensor payloads reused across pages (a real browser produces fresh ones).
- Hits on honeypot links (links hidden from humans but visible to scrapers).
- Bursty timing — requests arriving faster than a person could click.
### What this means for developers
The Incapsula cookie chain is the heart of the problem: most Imperva work comes down to producing a valid chain and keeping the IP and cookies aligned. There are three broad tooling categories:
- **HTTP clients with browser-impersonating TLS** — curl_cffi, curl-impersonate, tls-client. These match the browser's handshake but can't mint a real incap_ses_*, because they don't actually run the sensor script.
- **Stealth-patched browsers** — Camoufox, patchright, Playwright with stealth plugins. These run the sensor in a real browser, so it can produce the cookies.
- **Managed scraping APIs** — services that handle proxies, patched browsers, and session persistence for you.
Imperva is especially strict about IP/cookie consistency — an incap_ses_* created on one IP is rejected when sent from another — so keeping a stable exit IP for each session matters more here than usual.
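One way to keep the IP and cookies aligned is to pin each logical session to a single proxy. The sketch below maps a session ID to one entry in a fixed proxy list by hashing the ID, so the same session always gets the same exit. The class name and proxy URLs are hypothetical, and it assumes each proxy URL corresponds to one stable exit IP.

```python
import hashlib


class StickyProxyPool:
    """Deterministically assign one proxy per session ID."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)

    def proxy_for(self, session_id):
        # Hash the session ID so the choice is stable across calls and restarts.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._proxies)
        return self._proxies[index]


# Placeholder proxy endpoints.
pool = StickyProxyPool([
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
])
first = pool.proxy_for("session-42")
second = pool.proxy_for("session-42")
```

The trade-off is that changing the proxy list reshuffles assignments, so sessions should be retired when the pool changes.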
### Sites commonly fronted by Imperva
You'll find Imperva across e-commerce, financial services, jobs, social, gaming, and ticketing, including many regional banking and insurance portals. Many of these sites rotate between Imperva, Cloudflare, Akamai, DataDome and PerimeterX.
### Summary
Imperva builds a continuous trust score from four inputs: IP reputation, the _Incapsula_Resource JavaScript sensor with its cookie chain, TLS/HTTP/2 fingerprints, and behavioural patterns. The incap_ses_* / visid_incap_* cookie chain and the IP it's bound to are the most telling signals — most failed sessions trace back to either a broken chain or an IP mismatch. As with any modern WAF, the sensor is updated on a rolling basis, so what works today may need adjusting later.
### FAQ
**Q: How do I recognise an Imperva block?**
The classic block page reads "Request unsuccessful. Incapsula incident ID: …". If you see that, the JavaScript challenge or the IP reputation check failed.
**Q: What is the reese84 cookie?**
It is Imperva's clearance token, created by the sensor (the challenge script that runs in your browser). A valid reese84 is required before protected requests will succeed.
**Q: Which sites use Imperva/Incapsula?**
Major jobs, retail, financial-services and ticketing sites use it — it is one of the longest-running anti-bot WAFs around.
---
## How to scrape dynamic JavaScript content? (2026 Guide)
URL: https://scrappey.com/qa/web-automation/dynamic-content-scraping
Dynamic content is anything a page loads *after* the initial HTML arrives — usually pulled in by JavaScript running in your browser. Because the data is not in the first response, a plain HTTP fetch comes back half-empty. This guide shows how to scrape it in 2026.
### Quick facts
- **Problem:** Data loads after initial HTML
- **Option 1:** Headless browser renders the JS
- **Option 2:** Call the hidden JSON/XHR API
- **Fastest:** API endpoint if you can find it
- **Find it:** Browser DevTools network tab
### Understanding Dynamic Content
"Dynamic" means the content shows up only after JavaScript runs in the browser, not in the raw HTML the server first sends. It arrives through one of these common patterns:
1. Types of Dynamic Loading
- AJAX requests (background calls that fetch data without reloading the page)
- Infinite scroll (more items load as you scroll down)
- Lazy loading (content loads only when it scrolls into view)
- WebSocket updates (a live connection that streams new data)
- React/Vue.js state changes (the framework re-renders the page in place)
### Solution Approaches
The reliable fix is to run a real browser that executes the JavaScript, then either read the page once it has rendered or capture the API responses it fetches in the background. Below are three approaches.
1. Using Selenium
Selenium drives a real Chrome browser. Here it scrolls to the bottom repeatedly until the page stops growing, which is how you exhaust an infinite-scroll feed before reading the items.
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
class DynamicScraper:
    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
    def scrape_infinite_scroll(self, url, scroll_pause=2):
        self.driver.get(url)
        last_height = self.driver.execute_script('return document.body.scrollHeight')
        while True:
            # Scroll down
            self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            # Wait for new content
            time.sleep(scroll_pause)
            # Calculate new scroll height
            new_height = self.driver.execute_script('return document.body.scrollHeight')
            # Break if no more content
            if new_height == last_height:
                break
            last_height = new_height
        # Extract content
        elements = self.driver.find_elements(By.CSS_SELECTOR, '.content-item')
        return [elem.text for elem in elements]
2. Using Playwright
Playwright is a newer, faster browser-automation tool. The key trick is waiting: networkidle means "wait until network traffic settles" and wait_for_selector means "wait until this element actually exists" — both ensure the dynamic content has arrived before you read it. A SPA (single-page app) is a site like a React/Vue app that renders everything with JavaScript.
from playwright.sync_api import sync_playwright
class ModernScraper:
    def __init__(self):
        # Start Playwright's sync API and launch headless Chromium
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch()
    def scrape_spa(self, url):
        # Sync API: methods are called directly, without async/await
        page = self.browser.new_page()
        # Navigate and wait for network idle
        page.goto(url, wait_until='networkidle')
        # Wait for specific content
        page.wait_for_selector('.dynamic-content')
        # Extract data
        data = page.evaluate('''
            () => {
                const items = document.querySelectorAll('.item');
                return Array.from(items).map(item => ({
                    title: item.querySelector('.title').innerText,
                    description: item.querySelector('.desc').innerText
                }));
            }
        ''')
        page.close()
        return data
3. Intercepting AJAX Requests
Instead of reading rendered HTML, you can capture the raw API responses the page fetches in the background. Here a proxy (mitmproxy, an HTTP proxy that sits between browser and server) watches traffic and saves any JSON coming back from an api URL — often the cleanest source of the data.
# interceptor.py, loaded with: mitmdump -s interceptor.py
import json
class AjaxInterceptor:
    def __init__(self):
        self.data = []
    def request(self, flow):
        # Add custom headers
        flow.request.headers['X-Requested-With'] = 'XMLHttpRequest'
    def response(self, flow):
        # Capture API responses
        if 'api' in flow.request.pretty_url:
            try:
                self.data.append(json.loads(flow.response.content))
            except ValueError:
                # Body was not valid JSON (or not valid UTF-8); skip it
                pass
# mitmproxy loads addons from this module-level list
addons = [AjaxInterceptor()]
# Usage with Selenium (separate script): route Chrome through the proxy.
# HTTPS pages also need the mitmproxy CA certificate trusted by the browser.
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=localhost:8080')
driver = webdriver.Chrome(options=options)
### Best Practices
1. Handling Loading States
The most common bug is reading the page too early. Wait for the network to go quiet, wait for the loading spinner to disappear, then confirm the real content has appeared — in that order.
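# Playwright raises its own TimeoutError, which the built-in one does not catch
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
class LoadingHandler:
    def wait_for_load(self, page):
        # Wait for network idle
        page.wait_for_load_state('networkidle')
        # Check loading indicators
        try:
            page.wait_for_selector('.loading-spinner', state='hidden')
        except PlaywrightTimeoutError:
            pass
        # Ensure content is ready
        page.wait_for_selector('.content-loaded')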
2. Error Recovery
Dynamic pages are flaky, so expect failures. Return None instead of crashing when an element never shows up, and retry transient errors with exponential backoff — each retry waits longer (2, 4, 8 seconds) so you do not hammer the site.
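import logging
import time
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
logger = logging.getLogger(__name__)
class ResilientScraper:
    def safe_extract(self, page, selector, timeout=5000):
        # Return None instead of raising when the element never appears
        try:
            element = page.wait_for_selector(selector, timeout=timeout)
            return element.text_content()
        except PlaywrightTimeoutError:
            logger.warning(f'Element {selector} not found')
            return None
    def retry_action(self, action, max_retries=3):
        # Synchronous, to match the sync Playwright page used above
        for attempt in range(max_retries):
            try:
                return action()
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: 2, 4, 8... seconds
                time.sleep(2 ** (attempt + 1))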
Remember: Dynamic content scraping requires patience and proper waiting mechanisms. Always respect the website's resources and implement appropriate delays.
### FAQ
**Q: How do I know if content is dynamic?**
View the page source (Ctrl+U) or fetch it with curl — both show the raw HTML before any JavaScript runs. If the data is missing there but appears in the rendered page you see in the browser, it is injected by JavaScript after load, which means it is dynamic.
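The same check can be scripted. The sketch below fetches the raw HTML with the standard library and tests whether a value you can see in the rendered page is present in it. Both function names are hypothetical, and the comparison is a plain whitespace-normalised substring match, so it can miss text that is split across tags.

```python
import urllib.request


def fetch_raw_html(url, timeout=10):
    """Download the HTML exactly as the server sends it (no JavaScript runs)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def is_injected_by_js(raw_html, visible_text):
    """True when text seen in the browser is absent from the raw HTML."""
    normalise = lambda s: " ".join(s.split()).lower()
    return normalise(visible_text) not in normalise(raw_html)


# Made-up raw response: an empty app shell that JavaScript fills in later.
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
dynamic = is_injected_by_js(shell, "Today's price: 19.99")
```

In practice you would pass `fetch_raw_html(url)` as the first argument and a value copied from the rendered page as the second.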
**Q: Is rendering with a browser always required?**
No. Often the page fetches its data from a JSON API you can call directly — far faster and lighter than spinning up a browser. Open your browser's network tab first and look for the request that returns the data; if you find one, call it yourself and skip the browser entirely.
**Q: Why is my headless browser slow?**
Rendering full pages is resource-heavy — every image, font, and script costs time and memory. Block the images and fonts you do not need, reuse browser contexts instead of launching a fresh browser each time, and prefer the underlying API when one exists.
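Blocking images and fonts can be done with Playwright's request routing. The sketch below registers a handler on every request and aborts the ones whose resource type is in a blocklist. The function names and the choice of blocked types are assumptions, and Playwright itself must be installed separately for `fetch_html` to run.

```python
# Resource types to skip; extend with "media" or "stylesheet" if the data allows.
BLOCKED_RESOURCE_TYPES = {"image", "font"}


def should_block(resource_type):
    """Decide from Playwright's request.resource_type whether to skip a request."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def handle_route(route):
    """Route handler: abort blocked resource types, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def fetch_html(url):
    """Render a page with images and fonts blocked and return its HTML."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", handle_route)
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

Keeping the decision in a separate `should_block` function makes the blocklist easy to adjust per site without touching the browser code.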
---
## Web Scraping vs API: Which Should You Choose? (2026 Comparison)
URL: https://scrappey.com/qa/web-automation/scraping-vs-api
Web scraping and APIs are the two main ways to pull data off a website. An API hands you clean, ready-to-use data the site officially provides; scraping means reading the site's pages yourself and extracting what you need. This guide compares the two so you can pick the right one.
### Quick facts
- **API:** Structured, stable, permitted
- **Scraping:** Works on any visible page
- **Use API when:** It exists and exposes your data
- **Scrape when:** No API or it omits fields
- **Maintenance:** API low; scraping higher
### Key Differences
The core trade-off: an API is a front door the site built for you, with clear rules and clean data. Scraping is reading the public web page like a browser would and pulling values out of the HTML yourself. Here is how they compare.
Data access
| Aspect | Official API | Web Scraping |
| --- | --- | --- |
| Data format | Structured (JSON / XML) | HTML parsing required |
| Rate limits | Clearly defined | Unknown / undocumented |
| Documentation | Available | None |
| Data structure | Stable | May change without notice |
| Support | Official | None |
In short: an API gives you tidy JSON or XML (machine-readable data formats) plus docs and stable fields. With scraping you parse raw HTML, with no docs and no promise the page won't change tomorrow.
Implementation example
The code below shows both. The API version asks for data and gets JSON back. The scraping version downloads the page and digs the values out of the HTML using BeautifulSoup (a Python library for reading HTML).
# API Approach
import requests
def fetch_api_data(api_key):
    headers = {'Authorization': f'Bearer {api_key}'}
    response = requests.get('https://api.example.com/data', headers=headers)
    return response.json()
# Scraping Approach
from bs4 import BeautifulSoup
def scrape_website_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    data = {
        'title': soup.find('h1').text,
        'content': [p.text for p in soup.find_all('p')]
    }
    return data
### When to Choose Each
Use this as a quick decision guide. If the site offers an official API that has the data you need, start there. Reach for scraping when no API exists, the API is too limited, or it costs too much.
Use an API when
- Official access is available
- Your budget allows for API costs
- You need a stable data structure
- Real-time data is required
- The rate limits are acceptable
Use web scraping when
- No API is available
- API costs are too high
- You need custom data extraction
- Historical data is required
- You need a flexible solution
### Best Practices
Whichever route you take, wrap the request in error handling so one bad response doesn't crash your program. The two patterns below show clean, reusable starting points.
1. API Integration
Reuse one Session object so your auth headers are set once, and call raise_for_status() to turn error responses (like a 401 or 500) into exceptions you can catch and log.
```python
import logging
import requests

logger = logging.getLogger(__name__)

class APIClient:
    def __init__(self, api_key):
        # One Session reuses connections and sends the auth header on every call
        self.session = requests.Session()
        self.session.headers.update({
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        })

    def get_data(self, endpoint, params=None):
        try:
            response = self.session.get(
                f'https://api.example.com/{endpoint}', params=params, timeout=10
            )
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return response.json()
        except requests.exceptions.RequestException as e:
            logger.error(f'API request failed: {e}')
            return None
```
2. Scraping Implementation
Set a realistic User-Agent (the header that tells a site which browser is calling) so requests look like a normal browser, and again catch errors instead of letting them bubble up.
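```python
import logging
import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

class WebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def scrape_data(self, url):
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()  # do not parse a 403 or 500 error page
            soup = BeautifulSoup(response.text, 'lxml')
            return self.extract_data(soup)
        except Exception as e:
            logger.error(f'Scraping failed: {e}')
            return None

    def extract_data(self, soup):
        # Placeholder: site-specific parsing goes here
        return {'title': soup.title.string if soup.title else None}
```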
Remember: Always check terms of service and legal implications before choosing either approach.
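Neither client above retries a failed request. A common refinement is to wait and try again when the server answers 429 (too many requests) or a 5xx error, honoring a numeric Retry-After header when one is sent. The sketch below is one way to do that; the function names, the injected `fetch` callable, and the default delays are illustrative assumptions, not part of any library.

```python
import time

# Status codes worth retrying (assumed set; adjust for your target)
RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(base * (2 ** attempt), cap)  # exponential backoff: 1s, 2s, 4s, ...

def get_with_retries(fetch, url, max_attempts=4, sleep=time.sleep):
    """Call fetch(url) -> (status, headers, body) until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, headers.get('Retry-After')))
    return status, body
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic independent of the HTTP library, so the same function can wrap a requests session or an API client.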
### FAQ
**Q: Is using an API always better than scraping?**
When an official API gives you the exact data you need, yes — it is more stable and explicitly allowed. But APIs are often rate-limited (capped on how many calls you can make), paywalled, or missing fields, and that is when scraping wins.
**Q: Is scraping a site with an API against the rules?**
It depends on the site's Terms of Service. Some sites want you to use their API instead of scraping; others allow both. Read the terms, and prefer the API when it covers what you need.
**Q: Which is cheaper to run?**
APIs usually cost less to maintain because they don't break when a site changes its layout, but they may charge you per call. Scraping moves the cost to engineering time plus proxy and anti-bot infrastructure to keep your requests getting through.
---
## Residential vs Datacenter Proxies: Which to Choose? (2026 Guide)
URL: https://scrappey.com/qa/web-automation/proxy-types-comparison
A proxy is a middleman server that fetches web pages on your behalf, so the target site sees the proxy's IP address instead of yours. The two main kinds are **residential** proxies (IPs borrowed from real home internet connections) and **datacenter** proxies (IPs that live in cloud server farms). This guide compares the two and helps you pick the right one for your project in 2026.
### Quick facts
- **Residential:** Real ISP IPs — high trust
- **Datacenter:** Cloud IPs — fast & cheap
- **Mobile:** Carrier IPs — highest trust
- **Speed:** Datacenter fastest
- **For hard targets:** Residential or mobile
### Residential Proxies
Residential proxies route your traffic through real home internet connections, so to a website you look like an ordinary person browsing from their living room. That authenticity is their main strength.
Characteristics
- Real IP addresses from ISPs (internet service providers — the companies that give homes their internet)
- Associated with actual devices
- Higher success rates
- Better for avoiding blocks
- More expensive
- Slower than datacenter proxies
- Geographically diverse
- More legitimate looking
Use Cases
The example below rotates through a pool of residential proxies, picking the next one for each request and retrying up to three times if a request fails:
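```python
import asyncio
import aiohttp

class ResidentialProxyManager:
    def __init__(self, proxy_pool):
        self.proxies = proxy_pool   # e.g. ['user:pass@host:port', ...]
        self.current = 0
        self.success_rates = {}     # proxy URL -> [successes, attempts]

    def get_next_proxy(self):
        proxy = self.proxies[self.current]
        self.current = (self.current + 1) % len(self.proxies)
        # aiohttp expects one proxy URL string, not a requests-style dict
        return f'http://{proxy}'

    def update_success_rate(self, proxy, success):
        stats = self.success_rates.setdefault(proxy, [0, 0])
        stats[0] += int(success)
        stats[1] += 1

    async def make_request(self, url):
        timeout = aiohttp.ClientTimeout(total=30)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            for _ in range(3):  # Retry mechanism
                proxy = self.get_next_proxy()
                try:
                    async with session.get(url, proxy=proxy) as response:
                        if response.status == 200:
                            self.update_success_rate(proxy, True)
                            return await response.text()
                        # A non-200 answer (403, 429, ...) counts as a failure
                        self.update_success_rate(proxy, False)
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    self.update_success_rate(proxy, False)
        raise Exception('All proxy attempts failed')
```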
### Datacenter Proxies
Datacenter proxies use IP addresses from cloud servers. They are fast and cheap, but because thousands of them come from the same hosting companies, websites recognise them as non-human traffic and block them more readily.
Characteristics
- Cloud-based IP addresses
- Faster response times
- More likely to be blocked
- Less expensive
- Easier to detect
- Better for high-volume scraping
- Limited geographic diversity
- More suitable for non-sensitive targets
Implementation Example
This rotator cycles endlessly through a proxy list, skips any that have been banned, and refreshes the whole list once more than half are banned:
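```python
from itertools import cycle

class DatacenterProxyRotator:
    def __init__(self, proxy_list):
        self.proxy_list = list(proxy_list)  # keep the list: a cycle object has no len()
        self.proxies = cycle(self.proxy_list)
        self.banned_proxies = set()
        self.timeout = 10

    def get_proxy(self):
        # Try each proxy at most once so the loop ends even if every proxy is banned
        for _ in range(len(self.proxy_list)):
            proxy = next(self.proxies)
            if proxy not in self.banned_proxies:
                return proxy
        raise RuntimeError('All proxies are banned')

    def mark_banned(self, proxy):
        self.banned_proxies.add(proxy)
        # Refresh once more than half of the pool is banned
        if len(self.banned_proxies) > len(self.proxy_list) * 0.5:
            self.refresh_proxies()

    def refresh_proxies(self):
        # Placeholder: a real implementation would fetch a fresh list from the provider
        self.banned_proxies.clear()
        self.proxies = cycle(self.proxy_list)
```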
### Choosing the Right Type
The right choice comes down to a trade-off: residential proxies cost more but get blocked less, while datacenter proxies are cheaper and faster but easier to detect. Use the matrix below to decide.
Decision Matrix
- **Choose Residential When:**
  - Scraping sensitive websites
  - Need high success rates
  - Geographic targeting important
  - Budget allows for higher costs
- **Choose Datacenter When:**
  - Speed is priority
  - Scraping non-sensitive sites
  - Large volume of requests needed
  - Cost-effectiveness required
Implementation Strategy
A practical approach is to use both: send sensitive targets (like ecommerce sites) through residential proxies and everything else through cheaper datacenter ones. The manager below picks the proxy type based on the kind of site:
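```python
class HybridProxyManager:
    def __init__(self):
        # ResidentialProxyPool and DatacenterProxyPool are placeholders for your own
        # pool classes; each is assumed to expose an async get_proxy() method
        self.residential = ResidentialProxyPool()
        self.datacenter = DatacenterProxyPool()
        self.site_categories = {
            'ecommerce': 'residential',
            'public_data': 'datacenter'
        }

    async def get_proxy_for_site(self, url, site_type):
        # Unknown site types fall through to the cheaper datacenter pool
        if self.site_categories.get(site_type) == 'residential':
            return await self.residential.get_proxy()
        return await self.datacenter.get_proxy()
```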
Remember: Choose proxy type based on your specific needs, target websites, and budget constraints.
### FAQ
**Q: When are datacenter proxies good enough?**
When the site has no serious bot defenses and you need lots of fast requests where the IP's reputation is not closely checked. Datacenter proxies are fast and cheap, but they are the first to be blocked on protected targets.
**Q: Why are residential proxies more expensive?**
They route through real consumer devices on ISP networks, which are limited in supply and billed by the amount of data you use (bandwidth). That real-user authenticity is exactly what earns them higher trust from websites.
**Q: Do I need mobile proxies?**
Only for the toughest targets, which trust mobile carrier IPs the most, or when you need to scrape the mobile version of a site. They are the most expensive tier, so reach for them only when residential proxies are not enough.
---
## How to Scrape Emails from Websites Legally (2026 Guide)
URL: https://scrappey.com/qa/web-automation/email-scraping-guide
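Email addresses are personal data, so scraping them raises legal questions that ordinary scraping does not. This 2026 guide covers the compliance checklist, a robots.txt-aware scraper, and how to store and use the addresses you collect responsibly.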
### Quick facts
- **Key risk:** Privacy and anti-spam law (GDPR, CAN-SPAM)
- **Safer scope:** Public business contacts
- **Respect:** robots.txt & Terms of Service
- **Need:** A lawful basis to process data
- **Avoid:** Mass unsolicited outreach
### Legal Considerations
Before you collect a single address, the rule of thumb is: scrape only what is public, stay within what the site allows, and respect the privacy laws that cover personal data. The checklist below covers the basics.
1. Compliance Requirements
- Check website's robots.txt
- Respect terms of service
- Follow data protection laws
- Obtain necessary permissions
- Implement rate limiting
- Store data securely
- Honor opt-out requests
A quick gloss on two of these: **robots.txt** is a file at the site's root that tells crawlers which paths they may visit, and **rate limiting** means capping how fast you send requests so you do not overload the server. The example below wires both ideas into a simple scraper.
2. Implementation Example
```python
import asyncio
import logging
import re
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import aiohttp
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

class LegalEmailScraper:
    def __init__(self, delay_seconds=3.0):
        self.visited_urls = set()
        self.email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
        self.robots_parser = RobotFileParser()
        self.delay_seconds = delay_seconds  # pause before each request (rate limiting)

    async def can_scrape(self, url):
        # Check robots.txt; read() blocks, so run it in a worker thread
        self.robots_parser.set_url(urljoin(url, '/robots.txt'))
        await asyncio.to_thread(self.robots_parser.read)
        return self.robots_parser.can_fetch('*', url)

    async def scrape_emails(self, url, depth=2):
        self.visited_urls.add(url)
        if not await self.can_scrape(url):
            return set()
        emails = set()
        try:
            await asyncio.sleep(self.delay_seconds)
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    text = await response.text()
            # Extract emails
            found_emails = self.email_pattern.findall(text)
            emails.update(self.validate_emails(found_emails))
            # Follow links if depth allows
            if depth > 0:
                soup = BeautifulSoup(text, 'lxml')
                for link in soup.find_all('a', href=True):
                    next_url = urljoin(url, link['href'])
                    if next_url not in self.visited_urls:
                        sub_emails = await self.scrape_emails(next_url, depth - 1)
                        emails.update(sub_emails)
        except Exception as e:
            logger.error(f'Error scraping {url}: {e}')
        return emails

    def validate_emails(self, emails):
        # Remove common false positives. is_valid_email() is not defined in this
        # guide and must be supplied, for example as a method on this class.
        return {email for email in emails if self.is_valid_email(email)}
```
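The scraper above calls `is_valid_email()` without defining it. One possible version is sketched here as a standalone function: it rejects the false positives a loose email regex tends to produce, such as retina image filenames (`logo@2x.png`) and documentation placeholder domains. The suffix and domain lists are assumptions to extend for your own data, not an exhaustive filter.

```python
import re

EMAIL_RE = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
# File extensions that a loose pattern mistakes for top-level domains
ASSET_SUFFIXES = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp', '.css', '.js')
# Domains reserved for documentation, never real mailboxes
PLACEHOLDER_DOMAINS = {'example.com', 'example.org', 'example.net'}

def is_valid_email(email):
    """Return True if the string looks like a real, contactable address."""
    email = email.strip().lower()
    if not EMAIL_RE.match(email):
        return False
    domain = email.rpartition('@')[2]
    if email.endswith(ASSET_SUFFIXES):
        return False  # e.g. logo@2x.png from image filenames
    if domain in PLACEHOLDER_DOMAINS:
        return False
    if '..' in email or domain.startswith(('.', '-')):
        return False  # malformed local part or domain
    return True
```

A syntax check like this only removes obvious noise; it says nothing about whether the mailbox exists or whether you are allowed to contact it.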
### Best Practices
1. Rate Limiting
Slow yourself down on purpose. A rate limiter (here, 20 requests per minute) waits before each request so you never hammer a server faster than a normal visitor would.
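```python
class RateLimitedScraper:
    def __init__(self, requests_per_minute=20):
        # RateLimiter is a placeholder: any async context manager that
        # delays entry to stay under the given rate will do
        self.rate_limit = RateLimiter(requests_per_minute)

    async def scrape_with_limits(self, url):
        async with self.rate_limit:
            # scrape_page() stands for your own fetch-and-parse coroutine
            return await self.scrape_page(url)
```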
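The `RateLimiter` used above is not defined in this guide. A minimal sketch of one, assuming a simple fixed spacing between requests rather than a token bucket, could look like this. The class name matches the example; everything else is an illustrative choice.

```python
import asyncio
import time

class RateLimiter:
    """Async context manager that spaces entries at least 60/rate seconds apart."""

    def __init__(self, requests_per_minute=20):
        self.interval = 60.0 / requests_per_minute
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0  # monotonic time of the next free slot

    async def __aenter__(self):
        async with self._lock:  # serialise waiters so slots are handed out in order
            now = time.monotonic()
            wait = self._next_allowed - now
            if wait > 0:
                await asyncio.sleep(wait)
            self._next_allowed = max(now, self._next_allowed) + self.interval
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions raised inside the block

async def demo():
    limiter = RateLimiter(requests_per_minute=600)  # one slot every 0.1 s
    start = time.monotonic()
    for _ in range(3):
        async with limiter:
            pass  # a real scraper would fetch a page here
    return limiter.interval, time.monotonic() - start
```

Create the limiter inside a running coroutine, as `demo()` does, so the lock is bound to the right event loop on older Python versions.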
2. Data Protection
Email addresses are personal data, so do not leave them lying around in plain text. The example encrypts each address before saving it, using Fernet (a symmetric-encryption helper from Python's cryptography library) so the stored value is unreadable without the key.
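```python
from cryptography.fernet import Fernet

class SecureEmailStorage:
    def __init__(self, encryption_key=None):
        # In production, load the key from a secrets manager or environment variable:
        # a key generated here and never saved makes stored data unrecoverable
        self.encryption_key = encryption_key or Fernet.generate_key()
        self.cipher_suite = Fernet(self.encryption_key)

    def store_email(self, email):
        encrypted_email = self.cipher_suite.encrypt(email.encode())
        # save_to_database() is a placeholder for your own persistence layer
        return self.save_to_database(encrypted_email)
```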
Remember: Always prioritize legal compliance and respect website owners' rights when scraping email addresses.
### Collecting contact emails responsibly from pages you're permitted to scrape
Two practical problems show up once you move past a handful of pages. The first is that addresses are rarely sitting in plain text. Sites hide them on purpose: Cloudflare email obfuscation stores the address in a data-cfemail attribute and decodes it in the browser; HTML entity encoding swaps characters for their numeric codes; [at]/[dot] munging spells out the symbols to fool simple scanners; and some pages only show contact details after JavaScript runs. A plain regex (pattern-matching) over the raw HTML will miss all of these, so you need to decode the Cloudflare scheme, turn entities back into normal characters, and render JS-heavy pages in a real browser before you extract anything.
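The three static schemes described above can be undone without a browser. The sketch below decodes Cloudflare's `data-cfemail` payload (a hex string whose first byte is an XOR key for the rest), unescapes HTML entities, and reverses `[at]`/`[dot]` munging before running the email pattern. The function names are illustrative, and `encode_cfemail` exists only so the example can be checked against itself.

```python
import html
import re

CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')
AT_RE = re.compile(r'\s*[\[\(]\s*at\s*[\]\)]\s*', re.IGNORECASE)
DOT_RE = re.compile(r'\s*[\[\(]\s*dot\s*[\]\)]\s*', re.IGNORECASE)
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def decode_cfemail(hex_string):
    """Decode the hex payload of a Cloudflare data-cfemail attribute."""
    data = bytes.fromhex(hex_string)
    key = data[0]  # the first byte is the XOR key
    return bytes(b ^ key for b in data[1:]).decode('utf-8')

def encode_cfemail(email, key=0x5A):
    """Inverse of decode_cfemail, included only to build test input."""
    return bytes([key] + [b ^ key for b in email.encode('utf-8')]).hex()

def extract_emails(raw_html):
    """Collect addresses hidden by cfemail, HTML entities, or [at]/[dot] munging."""
    emails = {decode_cfemail(h) for h in CFEMAIL_RE.findall(raw_html)}
    text = html.unescape(raw_html)   # &#64; -> @, &#46; -> .
    text = AT_RE.sub('@', text)      # [at] or (at) -> @
    text = DOT_RE.sub('.', text)     # [dot] or (dot) -> .
    emails.update(EMAIL_RE.findall(text))
    return emails
```

The munging patterns are deliberately loose, so a sentence that happens to contain "(at)" can produce a false match; filter the results before storing them. Pages that build addresses in JavaScript still need a rendered DOM first.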
The second problem is that contact and team pages, the very pages that list these addresses, are often the most heavily defended on a site. They sit behind anti-bot detection and rate limits, so crawling them too aggressively gets your IP blocked fast. The fix is to pair gentle rate limiting and proxy rotation (cycling through different IP addresses) with the legal and consent practices above. A web scraping API rolls the rendering, the decoding-friendly HTML, and the IP rotation into one call, which keeps a compliant email-collection workflow from falling apart the moment it hits a protected page.
### FAQ
**Q: Is scraping email addresses legal?**
It depends on where you are and what you do with the data. Personal data is protected by laws such as the GDPR (the EU privacy law), which require a lawful basis for collecting it. Public business contacts carry less risk, but sending unsolicited bulk email can still break anti-spam laws even if the scraping itself was fine.
**Q: What is the safest way to collect emails?**
Stick to publicly listed business contacts, obey robots.txt and the site's Terms of Service, write down your lawful basis for collecting the data, and always offer an opt-out. Before running any large campaign, consult a lawyer.
**Q: Can I email everyone I scrape?**
No. Anti-spam and privacy laws (CAN-SPAM in the US, the GDPR and ePrivacy rules in the EU, CASL in Canada) restrict unsolicited contact: depending on the jurisdiction, they require consent or another lawful basis, plus a working unsubscribe link. Having someone's address does not give you permission to email it.
---
## What's the Difference Between Web Crawling and Scraping? (2026 Guide)
URL: https://scrappey.com/qa/web-automation/crawling-vs-scraping
Crawling and scraping are two different jobs that often work together. **Crawling** is how you *find* pages: a program follows links from page to page, the way a search engine does. **Scraping** is how you *pull data out* of a page once you have it. This guide explains how they differ and when you need each.
### Quick facts
- **Crawling:** Discovers & follows links
- **Scraping:** Extracts data from pages
- **Crawler output:** A set of URLs
- **Scraper output:** Structured records
- **Combined:** Crawl to find, scrape to extract
### Key Differences
Web Crawling
- Automated browsing through websites
- Following links systematically
- Used for indexing and discovery
- Broader in scope
- Focus on navigation
- Used by search engines
- Handles multiple domains
- Maps website structures
Web Scraping
- Extracting specific data
- Targeted data collection
- Used for data extraction
- Narrower in scope
- Focus on data gathering
- Used by businesses
- Often single-domain focused
- Creates structured datasets
### Implementation Examples
Here are minimal Python examples of each. The crawler keeps a queue of links to visit and walks outward from a starting page; the scraper takes one page and pulls out the fields you want.
1. Basic Crawler
```python
import logging
from collections import deque
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

class WebCrawler:
    def __init__(self, start_url, max_depth=3):
        self.visited = set()
        self.to_visit = deque([(start_url, 0)])  # FIFO queue gives a breadth-first crawl
        self.max_depth = max_depth

    async def crawl(self):
        while self.to_visit:
            url, depth = self.to_visit.popleft()
            if depth > self.max_depth or url in self.visited:
                continue
            self.visited.add(url)
            try:
                links = await self.extract_links(url)
                for link in links:
                    self.to_visit.append((link, depth + 1))
            except Exception as e:
                logger.error(f'Error crawling {url}: {e}')

    async def extract_links(self, url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
        soup = BeautifulSoup(text, 'lxml')
        return [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
```
2. Basic Scraper
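```python
import aiohttp
from bs4 import BeautifulSoup

class WebScraper:
    def __init__(self, url):
        self.url = url
        self.data = []

    async def scrape(self):
        async with aiohttp.ClientSession() as session:
            async with session.get(self.url) as response:
                text = await response.text()
        return self.extract_data(text)

    def extract_data(self, html):
        soup = BeautifulSoup(html, 'lxml')
        h1 = soup.find('h1')  # None on pages without a top-level heading
        return {
            'title': h1.text.strip() if h1 else None,
            'content': [p.text for p in soup.find_all('p')],
            'metadata': self.extract_metadata(soup)
        }

    def extract_metadata(self, soup):
        # Placeholder: collect <meta name="..." content="..."> pairs
        metas = soup.find_all('meta', attrs={'name': True})
        return {m['name']: m.get('content', '') for m in metas}
```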
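The basic crawler above queues every link it finds, including `mailto:` links, off-site links, and the same page repeated under different `#fragment` anchors. A small link filter, sketched here with the standard library only, keeps a crawl on one site and avoids duplicate visits. The function names are illustrative.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def normalize_link(base_url, href):
    """Resolve href against base_url and drop the #fragment; None for non-HTTP links."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None  # mailto:, tel:, javascript: and similar
    return absolute

def same_site_links(base_url, hrefs):
    """Keep only unique, normalized links on the same host as base_url."""
    host = urlparse(base_url).netloc.lower()
    seen, result = set(), []
    for href in hrefs:
        link = normalize_link(base_url, href)
        if link and urlparse(link).netloc.lower() == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

In the crawler, this would wrap the list returned by `extract_links()` before the links are appended to the queue.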
### Combined Approach
In practice you usually combine them: crawl first to discover the pages you care about, then scrape each one for data. The example below wires the crawler and scraper together.
Crawler-Scraper Integration
```python
class SmartDataCollector:
    def __init__(self, start_url):
        self.crawler = WebCrawler(start_url)
        self.scraper = WebScraper(None)
        self.data_store = []

    def should_scrape(self, url):
        # Placeholder filter: override to keep only the page types you want
        return True

    async def collect_data(self):
        # First crawl to find relevant pages
        await self.crawler.crawl()
        # Then scrape each discovered page
        for url in self.crawler.visited:
            if self.should_scrape(url):
                self.scraper.url = url
                data = await self.scraper.scrape()
                self.data_store.append(data)
```
Remember: Choose between crawling and scraping based on your specific data collection needs and goals.
### FAQ
**Q: What is the core difference between crawling and scraping?**
Crawling is about discovery — traversing links to find pages. Scraping is about extraction — pulling specific data out of those pages. Search engines crawl; data projects usually crawl then scrape.
**Q: Do I always need both?**
No. If you already have the URLs, you only scrape. If you first need to discover pages across a site, you crawl to build the list of URLs, then scrape each one.
**Q: Is a crawler the same as a spider?**
Yes — "spider" is just another name for a web crawler, the program that follows links to discover pages.
---
## What are Headless Browsers and When to Use Them? (2026 Guide)
URL: https://scrappey.com/qa/web-automation/headless-browsers-guide
A headless browser is a real web browser (like Chrome or Firefox) that runs without a visible window, controlled entirely by code instead of by a person clicking and typing. It loads pages, runs JavaScript, and renders content exactly as a normal browser does; you simply never see the screen. This 2026 guide explains what headless browsers are and when to use them.
### Quick facts
- **What it is:** A browser with no visible window
- **Why use it:** Render JS-heavy / dynamic pages
- **Popular tools:** Playwright, Puppeteer, Selenium
- **Cost:** Heavier than HTTP clients
- **Avoid when:** Static HTML or an API exists
### What are Headless Browsers?
Headless browsers are web browsers without a graphical user interface (no on-screen window) that you control programmatically - through a script rather than a mouse and keyboard. Because they run the full browser engine, they can execute JavaScript and build the page exactly as a user would see it. That makes them essential for web automation, automated testing, and scraping JavaScript-heavy websites where the content only appears after scripts run.
### Popular Options
1. Chrome Headless
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class HeadlessChrome:
    def __init__(self):
        options = Options()
        options.add_argument('--headless=new')  # New headless mode
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome(options=options)

    def get_page_content(self, url):
        self.driver.get(url)
        return self.driver.page_source
```
2. Playwright
```python
from playwright.sync_api import sync_playwright

class PlaywrightBrowser:
    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=True)

    def scrape_spa(self, url):
        # The sync API is imported above, so these are plain calls (no async/await)
        page = self.browser.new_page()
        try:
            page.goto(url, wait_until='networkidle')
            # Wait for dynamic content
            page.wait_for_selector('.dynamic-content')
            return page.content()
        finally:
            page.close()

    def close(self):
        self.browser.close()
        self.playwright.stop()
```
### Use Cases
1. JavaScript Rendering
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class DynamicContentScraper:
    def __init__(self):
        self.browser = HeadlessChrome()

    def get_rendered_content(self, url):
        self.browser.driver.get(url)
        # Wait up to 10 seconds for the JavaScript-rendered element to appear
        element = WebDriverWait(self.browser.driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-data'))
        )
        # Extract data after rendering
        return {
            'title': self.browser.driver.title,
            'content': element.text
        }
```
2. Performance Testing
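```python
class PerformanceTester:
    def __init__(self):
        self.browser = PlaywrightBrowser()

    def measure_load_time(self, url):
        # The wrapper stores the Playwright browser in its .browser attribute
        page = self.browser.browser.new_page()
        try:
            # Load the page before measuring; networkidle lets the load event finish
            page.goto(url, wait_until='networkidle')
            # Measure performance metrics (performance.timing is deprecated
            # but still widely supported)
            return page.evaluate("""
                () => {
                    const timing = window.performance.timing;
                    return {
                        loadTime: timing.loadEventEnd - timing.navigationStart,
                        domReady: timing.domContentLoadedEventEnd - timing.navigationStart
                    }
                }
            """)
        finally:
            page.close()
```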
Remember: headless browsers are powerful, but they run a full browser engine, so they use far more CPU and memory than a simple HTTP request. Reach for them only when you actually need JavaScript or interaction.
### FAQ
**Q: When should I use a headless browser?**
Use one when the data is rendered by JavaScript or only appears after interactions like clicks, scrolls, or logins. If the page is static HTML, or the site offers an API you can call directly, a plain HTTP client is far faster and cheaper.
**Q: Are headless browsers easy to detect?**
Default headless modes give themselves away through telltale signs - for example the navigator.webdriver flag (a browser property that is set to true when automation is driving the browser) and missing window properties that a real browser would have. Stealth-patched builds hide most of these signals, but heavily instrumented sites can still flag them.
**Q: Which headless browser tool should I pick?**
Playwright is the best default for new projects: it works across browsers, waits for elements automatically, and supports several programming languages. Choose Puppeteer for Chrome-only Node.js work, and Selenium when you need the widest support for older or less common languages.
---
# Web Scraping by Language
Language-by-language guides to web scraping — the right libraries, runnable code, and how to get past anti-bot blocking in Java, C#, Go, Ruby, PHP, R, Node.js and the command line.
## Web Scraping With Java: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-java
**Web scraping with Java means fetching a web page over HTTP and extracting structured data from its HTML, usually with Jsoup for static pages and Selenium or Playwright for JavaScript-rendered ones.** Java is a strong choice for production scrapers: it is fast, strongly typed, and has first-class concurrency (virtual threads since Java 21), which matters when you are fetching thousands of pages. The standard stack in 2026 is the built-in HttpClient plus Jsoup for parsing.
### Quick facts
- **Static parsing:** Jsoup 1.22.x — fetch + parse + CSS selectors in one library
- **JavaScript pages:** Selenium 4.x (auto driver) or HtmlUnit / Playwright for Java
- **HTTP client:** java.net.http.HttpClient (built into Java 11+; virtual threads in 21+)
- **Concurrency:** ExecutorService / virtual threads for parallel fetching
- **Build tool:** Maven or Gradle dependency on org.jsoup:jsoup
### Your first Java scraper with Jsoup
Jsoup is the workhorse of Java scraping: it fetches a page, parses the HTML into a tree, and lets you select elements with CSS selectors — all in one dependency. Add it with Maven (org.jsoup:jsoup:1.22.2) or Gradle, then:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BookScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://books.toscrape.com/")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .timeout(10_000)
                .get();
        Elements books = doc.select("article.product_pod");
        for (Element book : books) {
            String title = book.selectFirst("h3 > a").attr("title");
            String price = book.selectFirst(".price_color").text();
            System.out.println(title + " | " + price);
        }
    }
}
```

Jsoup.connect(url).get() does the HTTP request and returns a parsed Document. From there, select() takes any CSS selector and selectFirst() returns a single element. .text() reads inner text; .attr("href") reads an attribute. Always set a realistic, current userAgent rather than relying on the library default, which is easy for sites to spot and block.
### Following pagination and crawling
Most real jobs span many pages. Jsoup makes it easy to read the "next" link and follow it. Resolve relative URLs with absUrl() so links work no matter how they are written in the HTML:
```java
String url = "https://books.toscrape.com/";
while (url != null) {
    Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
    for (Element book : doc.select("article.product_pod")) {
        System.out.println(book.selectFirst("h3 > a").attr("title"));
    }
    Element next = doc.selectFirst("li.next > a");
    url = (next != null) ? next.absUrl("href") : null; // null stops the loop
}
```

For large crawls, fetch pages in parallel. Java 21 virtual threads make this cheap: one Executors.newVirtualThreadPerTaskExecutor() can run thousands of concurrent fetches without exhausting OS threads. Throttle politely so you do not hammer the target.
### Scraping JavaScript-rendered pages with Selenium
Jsoup only sees the HTML the server returns — it does not run JavaScript. For pages that render content client-side, drive a real browser with Selenium. Since Selenium 4.6, Selenium Manager downloads the matching driver automatically:
```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import java.util.List;

public class DynamicScraper {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options); // driver auto-managed
        try {
            driver.get("https://quotes.toscrape.com/js/");
            List<WebElement> quotes = driver.findElements(By.cssSelector(".quote"));
            for (WebElement q : quotes) {
                String text = q.findElement(By.cssSelector(".text")).getText();
                String author = q.findElement(By.cssSelector(".author")).getText();
                System.out.println(author + ": " + text);
            }
        } finally {
            driver.quit();
        }
    }
}
```

**HtmlUnit** is a lighter, GUI-less alternative that runs some JavaScript without a real browser, and **Playwright for Java** is the modern, faster option. But every browser-driving approach is heavier and more detectable than a plain HTTP fetch.
### Which Java scraping library should you use?
| Library | Type | Runs JS? | Best for |
| --- | --- | --- | --- |
| Jsoup | Fetch + parse | No | Static pages (the default choice) |
| HttpClient | HTTP client (JDK) | No | APIs, custom requests, async |
| HtmlUnit | Headless browser | Partial | Light JS without a real browser |
| Selenium | Browser automation | Yes | Full JS rendering, widest docs |
| Playwright (Java) | Browser automation | Yes | Modern JS pages, faster than Selenium |
For 90% of jobs, **Jsoup alone** is enough. Add a browser tool only when the data is genuinely rendered by JavaScript.
### The hard part: handling anti-bot blocking
Java code is rarely the reason a scraper fails — anti-bot defenses are. Jsoup sends a TLS handshake and header set that anti-bot systems (Cloudflare, DataDome, Akamai) recognise as non-browser traffic, and headless Selenium leaks automation signals. You cannot parse a 403 or a CAPTCHA page.
Handling this means rotating residential proxies and matching a real browser TLS fingerprint, which is a project of its own. A managed scraping API handles all of that server-side; your Java code just POSTs the target URL and parses the returned HTML with Jsoup as usual:
### Example
```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrapingApiScraper {
    public static void main(String[] args) throws Exception {
        String payload = """
            {"cmd": "request.get", "url": "https://example.com/protected"}
            """;
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        // resp.body() holds JSON; the rendered HTML is at solution.response.
        // Parse it with Jsoup exactly as you would a normal page.
        System.out.println(resp.body());
    }
}
```
### FAQ
**Q: What is the best library for web scraping with Java?**
Jsoup is the best default — it fetches and parses HTML with CSS selectors in a single dependency and covers most static sites. Add Selenium or Playwright for Java only when the page renders its content with JavaScript. For raw HTTP control and async requests, the JDK built-in java.net.http.HttpClient pairs well with Jsoup for parsing.
**Q: Can Java scrape JavaScript-rendered websites?**
Yes, but not with Jsoup alone — Jsoup does not execute JavaScript. Use Selenium (which auto-manages its browser driver since version 4.6), Playwright for Java, or HtmlUnit for lighter cases. Alternatively, find the JSON API the page calls and request it directly with HttpClient, which is faster than driving a browser.
**Q: Why does my Java scraper get blocked, and how do I fix it?**
Scrapers usually get blocked because their traffic does not look like a browser's: the User-Agent, request rate, IP reputation, or TLS handshake gives them away. Set a realistic User-Agent rather than the library default, throttle your request rate, and rotate residential proxies. Against serious anti-bot vendors you also need a real browser TLS fingerprint, which is hard to maintain in Java directly. Many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.
**Q: Is Java good for web scraping compared to Python?**
Java is excellent for large, long-running, production scrapers thanks to its speed, strong typing, and first-class concurrency (virtual threads in Java 21). Python wins on ecosystem breadth and quick scripting. If you already run a JVM stack or need high-throughput concurrent crawling, Java is a very solid choice.
---
## Web Scraping With C#: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-csharp
**Web scraping with C# means using .NET's HttpClient to fetch a page and a parser like HtmlAgilityPack or AngleSharp to extract data from the HTML.** HtmlAgilityPack is the long-standing standard (XPath-based); AngleSharp is the modern, standards-compliant alternative with CSS selectors and a built-in loader. For JavaScript-rendered pages, Selenium for .NET, PuppeteerSharp, or Playwright for .NET drive a real browser.
### Quick facts
- **Classic parser:** HtmlAgilityPack 1.12.x — XPath selectors, very widely used
- **Modern parser:** AngleSharp 1.4.x — HTML5-compliant, CSS selectors, can fetch pages
- **HTTP client:** System.Net.Http.HttpClient (async/await)
- **JavaScript pages:** Selenium .NET, PuppeteerSharp, or Playwright for .NET
- **Package source:** NuGet — HtmlAgilityPack, AngleSharp, Selenium.WebDriver
### Your first C# scraper with HtmlAgilityPack
HtmlAgilityPack parses HTML into a tree you query with XPath. Pair it with HttpClient to fetch the page. Install via NuGet: dotnet add package HtmlAgilityPack.
```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main()
    {
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        string html = await http.GetStringAsync("https://books.toscrape.com/");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var books = doc.DocumentNode.SelectNodes("//article[@class='product_pod']");
        if (books == null) return;

        foreach (var book in books)
        {
            string title = book.SelectSingleNode(".//h3/a").GetAttributeValue("title", "");
            string price = book.SelectSingleNode(".//p[@class='price_color']").InnerText;
            Console.WriteLine($"{title} | {price}");
        }
    }
}
```
HtmlAgilityPack uses XPath: SelectNodes() returns all matches (or null when there are none), SelectSingleNode() returns one. GetAttributeValue() reads an attribute with a fallback; InnerText reads text. If you prefer CSS selectors, add the HtmlAgilityPack.CssSelectors.NetCore package.
### AngleSharp — the modern alternative
AngleSharp is a newer, fully HTML5-compliant parser with CSS selectors and a built-in document loader, so it can fetch *and* parse in one step. Most competitor guides skip it — it is cleaner for modern code. Install: dotnet add package AngleSharp.
```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

class Program
{
    static async Task Main()
    {
        // WithDefaultLoader() lets the browsing context download pages itself.
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var doc = await context.OpenAsync("https://books.toscrape.com/");

        foreach (var book in doc.QuerySelectorAll("article.product_pod"))
        {
            string title = book.QuerySelector("h3 a").GetAttribute("title");
            string price = book.QuerySelector(".price_color").TextContent;
            Console.WriteLine($"{title} | {price}");
        }
    }
}
```
QuerySelectorAll() and QuerySelector() are the same CSS-selector APIs you know from the browser DOM, which makes AngleSharp very natural to use. WithDefaultLoader() enables the HTTP fetch so context.OpenAsync(url) downloads the page for you.
### Scraping JavaScript-rendered pages
Neither parser runs JavaScript. For client-side-rendered pages, drive a browser with Selenium for .NET (dotnet add package Selenium.WebDriver). Selenium Manager handles the driver automatically:
```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless=new");

using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("https://quotes.toscrape.com/js/");

foreach (var quote in driver.FindElements(By.CssSelector(".quote")))
{
    string text = quote.FindElement(By.CssSelector(".text")).Text;
    string author = quote.FindElement(By.CssSelector(".author")).Text;
    Console.WriteLine($"{author}: {text}");
}

driver.Quit();
```
**PuppeteerSharp** and **Playwright for .NET** are modern alternatives. Playwright in particular has cleaner auto-waiting and multi-browser support, and is the recommended choice for new dynamic-scraping projects in .NET.
### Which C# scraping library should you use?
| Library | Type | Selectors | Best for |
| --- | --- | --- | --- |
| HtmlAgilityPack | Parser | XPath (CSS via add-on) | Static pages, the established choice |
| AngleSharp | Parser + loader | CSS | Modern HTML5 parsing, clean API |
| Selenium .NET | Browser automation | CSS/XPath | JavaScript pages, widest docs |
| PuppeteerSharp | Browser automation | CSS/XPath | Chrome-only headless control |
| Playwright .NET | Browser automation | CSS/XPath | Modern JS pages, multi-browser |
Start with **HtmlAgilityPack or AngleSharp** for static pages; reach for a browser tool only when the content is JavaScript-rendered.
### The hard part: handling anti-bot blocking
The .NET code is the easy part. Sites behind major anti-bot systems block HttpClient on its TLS fingerprint and headers, and they flag headless Selenium on its automation signals. A parser cannot extract data from a challenge page.
Solving this means residential proxy rotation, a real browser fingerprint, and CAPTCHA handling. A scraping API does all of it server-side — your C# code posts the URL and parses the returned HTML with HtmlAgilityPack or AngleSharp:
### Example
```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var resp = await http.PostAsJsonAsync(
            "https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY",
            new { cmd = "request.get", url = "https://example.com/protected" });
        resp.EnsureSuccessStatusCode();

        var json = await resp.Content.ReadFromJsonAsync<JsonElement>();
        // Fully rendered, unblocked HTML -- hand it to HtmlAgilityPack/AngleSharp.
        string html = json.GetProperty("solution").GetProperty("response").GetString() ?? "";
        // Print a short preview; Substring(0, 200) alone throws on shorter strings.
        Console.WriteLine(html.Substring(0, Math.Min(200, html.Length)));
    }
}
```
### FAQ
**Q: What is the best library for web scraping with C#?**
HtmlAgilityPack is the established standard (XPath-based) and AngleSharp is the modern, HTML5-compliant alternative with CSS selectors and a built-in page loader. Both are excellent for static pages — choose AngleSharp for cleaner CSS-selector code, HtmlAgilityPack if you prefer XPath or need its huge body of examples. Add Selenium or Playwright for .NET for JavaScript-rendered pages.
**Q: Can C# scrape dynamic, JavaScript-heavy sites?**
Yes. HtmlAgilityPack and AngleSharp only see server HTML, so for JavaScript-rendered content drive a real browser with Selenium for .NET, PuppeteerSharp, or Playwright for .NET. Playwright is the recommended modern choice. You can also call the page’s underlying JSON API directly with HttpClient, which avoids running a browser entirely.
**Q: HtmlAgilityPack or AngleSharp — which should I choose?**
Use AngleSharp if you want CSS selectors, strict HTML5 parsing, and a built-in loader that fetches the page for you. Use HtmlAgilityPack if you prefer XPath, need maximum compatibility, or are maintaining existing code. Both are actively used in 2026; AngleSharp tends to feel more modern for new projects.
**Q: Why does my C# scraper get blocked, and how do I fix it?**
Set a realistic User-Agent, throttle requests, and rotate residential proxies. For sites behind major anti-bot systems you also need a browser-grade TLS fingerprint, which is hard to do from raw HttpClient. Routing those requests through a scraping API that handles proxies, fingerprinting, and challenges server-side is the most reliable approach.
---
## Web Scraping With Go (Golang): A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-golang
**Web scraping with Go (Golang) means using net/http or the Colly framework to fetch pages and goquery to extract data with jQuery-like selectors.** Go is a great fit for scraping at scale: it compiles to a single fast binary and its goroutines make concurrent fetching simple and cheap. The common stack is net/http + goquery for simple jobs, Colly for full crawlers, and chromedp for JavaScript-rendered pages.
### Quick facts
- **HTML parsing:** goquery v1.12.x — jQuery-style CSS selectors
- **Crawl framework:** Colly v2.2.x — collectors, callbacks, built-in concurrency
- **JavaScript pages:** chromedp (drives Chrome via CDP) or go-rod
- **Concurrency:** Goroutines + channels — cheap parallel fetching
- **Module paths:** github.com/gocolly/colly/v2, github.com/PuerkitoBio/goquery
### Your first Go scraper with net/http + goquery
The simplest Go scraper uses the standard library's net/http to fetch a page and goquery to parse it. Install goquery with go get github.com/PuerkitoBio/goquery.
```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	res, err := http.Get("https://books.toscrape.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("article.product_pod").Each(func(i int, s *goquery.Selection) {
		title, _ := s.Find("h3 a").Attr("title")
		price := s.Find(".price_color").Text()
		fmt.Printf("%s | %s\n", title, price)
	})
}
```
goquery mirrors jQuery: Find() takes a CSS selector, Each() iterates matches, Text() reads text, and Attr() returns an attribute plus an "exists" boolean. It is the most idiomatic way to parse HTML in Go.
### Crawling with Colly
For anything beyond a single page, Colly is the framework of choice. It gives you a collector with event callbacks, automatic link-following, and built-in concurrency and rate-limiting. Install: go get github.com/gocolly/colly/v2.
```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("books.toscrape.com"),
		colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
	)

	// Extract each product.
	c.OnHTML("article.product_pod", func(e *colly.HTMLElement) {
		title := e.ChildAttr("h3 a", "title")
		price := e.ChildText(".price_color")
		fmt.Printf("%s | %s\n", title, price)
	})

	// Follow the "next" pagination link.
	c.OnHTML("li.next > a", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Request.AbsoluteURL(e.Attr("href")))
	})

	c.Visit("https://books.toscrape.com/")
}
```
Colly's callback model (OnHTML, OnRequest, OnError) keeps crawlers readable as they grow. Enable parallelism with colly.Async(true) and a LimitRule to control concurrency and delay per domain.
### Concurrent fetching with goroutines
Go's superpower for scraping is cheap concurrency. To fetch many URLs in parallel, launch goroutines and bound them with a semaphore channel so you do not overwhelm the target:
```go
package main

import (
	"log"
	"net/http"
	"sync"
)

func fetchAll(urls []string) {
	sem := make(chan struct{}, 10) // max 10 in flight
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire
			defer func() { <-sem }() // release
			res, err := http.Get(url)
			if err != nil {
				log.Printf("fetch %s: %v", url, err) // report instead of failing silently
				return
			}
			defer res.Body.Close()
			// ...parse res.Body with goquery...
		}(u)
	}
	wg.Wait()
}

func main() {
	fetchAll([]string{"https://books.toscrape.com/"})
}
```
This pattern scrapes hundreds of pages concurrently while keeping a hard cap on simultaneous requests. Colly offers the same via its async mode and limit rules if you prefer the framework to manage it.
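The function above does not hand anything back to the caller. As a minimal sketch of one way to collect a result per URL under the same semaphore cap, the version below takes the fetch step as a parameter and returns results in input order. The names fetchAllResults, result, and httpStatus are hypothetical and exist only for this illustration; a real scraper would likely parse the body rather than record the status code.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// result pairs a URL with the outcome of fetching it (hypothetical type).
type result struct {
	URL    string
	Status int
	Err    error
}

// fetchAllResults runs fetch for every URL with at most limit calls in
// flight and returns one result per URL, in the same order as the input.
func fetchAllResults(urls []string, limit int, fetch func(string) (int, error)) []result {
	results := make([]result, len(urls))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			status, err := fetch(u)
			// Each goroutine writes only its own index, so no mutex is needed.
			results[i] = result{URL: u, Status: status, Err: err}
		}(i, u)
	}
	wg.Wait()
	return results
}

// httpStatus is one possible fetch function: GET the URL and return its status code.
func httpStatus(url string) (int, error) {
	res, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()
	return res.StatusCode, nil
}

func main() {
	urls := []string{
		"https://books.toscrape.com/",
		"https://books.toscrape.com/catalogue/page-2.html",
	}
	for _, r := range fetchAllResults(urls, 2, httpStatus) {
		if r.Err != nil {
			fmt.Printf("%s: error: %v\n", r.URL, r.Err)
			continue
		}
		fmt.Printf("%s: HTTP %d\n", r.URL, r.Status)
	}
}
```

Passing the fetch step in as a function keeps the concurrency logic separate from the HTTP details, which also makes it easy to exercise with a stub instead of a live site.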
### Which Go scraping library should you use?
| Library | Type | Runs JS? | Best for |
| --- | --- | --- | --- |
| net/http + goquery | Fetch + parse | No | Simple static scraping |
| Colly | Crawl framework | No | Multi-page crawls, concurrency built in |
| chromedp | Browser automation | Yes | JavaScript-rendered pages |
| go-rod | Browser automation | Yes | Modern browser control with fine-grained options |
Use **net/http + goquery** for quick jobs, **Colly** for real crawlers, and **chromedp** or **go-rod** only when the page needs JavaScript.
### The hard part: handling anti-bot blocking
Go is fast, but raw speed does not help when a site's anti-bot defenses reject the request. A net/http request has a TLS and header signature anti-bot vendors flag immediately, and even chromedp leaks headless signals. You cannot parse a Cloudflare or DataDome challenge page.
The fix (residential proxies, a real browser TLS fingerprint, and CAPTCHA handling) is a lot to maintain in Go. A scraping API handles it server-side; your Go code POSTs the URL and parses the returned HTML with goquery:
### Example
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	payload, err := json.Marshal(map[string]string{
		"cmd": "request.get",
		"url": "https://example.com/protected",
	})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post(
		"https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY",
		"application/json",
		bytes.NewReader(payload),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Decode only the field we need instead of chaining unchecked type assertions.
	var result struct {
		Solution struct {
			Response string `json:"response"`
		} `json:"solution"`
	}
	if err := json.Unmarshal(body, &result); err != nil {
		log.Fatal(err)
	}

	// Fully rendered, unblocked HTML -- parse it with goquery as usual.
	html := result.Solution.Response
	if len(html) > 200 {
		html = html[:200] // print a short preview only
	}
	fmt.Println(html)
}
```
### FAQ
**Q: What is the best library for web scraping with Go?**
For simple jobs, the standard net/http plus goquery (jQuery-style CSS selectors) is the idiomatic choice. For real crawlers, Colly adds collectors, automatic link-following, concurrency, and rate-limiting. For JavaScript-rendered pages, chromedp or go-rod drive a real Chrome browser. Most Go scrapers start with goquery and adopt Colly as the crawl grows.
**Q: Can Go scrape JavaScript-rendered pages?**
Yes — goquery and Colly only see server HTML, so for JavaScript-rendered content use chromedp (which controls Chrome over the DevTools Protocol) or go-rod. Both run a real browser. As a faster alternative, inspect the Network tab, find the JSON API the page calls, and request it directly with net/http.
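As a rough sketch of the "request the JSON API directly" route, the program below fetches one page of JSON with net/http and decodes it with encoding/json. The endpoint URL and the field names (has_next, quotes, text, author.name) are assumptions made for illustration; check the real request and response in your browser's Network tab before relying on them. The names quotePage and parseQuotePage are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// quotePage models only the fields this sketch reads.
// The JSON shape is an assumption, not a documented contract.
type quotePage struct {
	HasNext bool `json:"has_next"`
	Quotes  []struct {
		Text   string `json:"text"`
		Author struct {
			Name string `json:"name"`
		} `json:"author"`
	} `json:"quotes"`
}

// parseQuotePage decodes a raw JSON response body into a quotePage.
func parseQuotePage(body []byte) (quotePage, error) {
	var page quotePage
	err := json.Unmarshal(body, &page)
	return page, err
}

func main() {
	// Assumed endpoint, used here only as an example target.
	res, err := http.Get("https://quotes.toscrape.com/api/quotes?page=1")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer res.Body.Close()

	if res.StatusCode != http.StatusOK {
		fmt.Println("unexpected status:", res.Status)
		return
	}

	body, err := io.ReadAll(res.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}

	page, err := parseQuotePage(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, q := range page.Quotes {
		fmt.Printf("%s: %s\n", q.Author.Name, q.Text)
	}
	fmt.Println("more pages:", page.HasNext)
}
```

Decoding into a struct with only the fields you need keeps the code tolerant of extra fields the API may add later.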
**Q: Is Go good for web scraping at scale?**
Very. Go compiles to a fast single binary and its goroutines make concurrent fetching cheap and simple, so it handles high-throughput crawling well with modest resource use. Colly’s built-in async mode and limit rules make large, polite crawls straightforward. It is one of the best languages for performance-critical scraping.
**Q: How do I avoid blocks when scraping with Go?**
Set a realistic User-Agent, throttle your requests, rotate residential proxies, and cap concurrency per domain. Against Cloudflare, DataDome, or Akamai you also need a browser-grade TLS fingerprint, which is significant work in Go. Many teams route hard targets through a scraping API that manages proxies, fingerprinting, and challenges server-side.
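Throttling can be as simple as pausing between sequential requests. The sketch below is one minimal way to do that with the standard library; visitPolitely is a hypothetical helper name, the 500 ms delay is an arbitrary example value, and the visit callback only prints the URL where a real scraper would fetch and parse it.

```go
package main

import (
	"fmt"
	"time"
)

// visitPolitely calls visit for each URL in order, sleeping for interval
// between consecutive calls so the target is never hit back-to-back.
func visitPolitely(urls []string, interval time.Duration, visit func(string)) {
	for i, u := range urls {
		if i > 0 {
			time.Sleep(interval) // pause before every request except the first
		}
		visit(u)
	}
}

func main() {
	urls := []string{
		"https://books.toscrape.com/",
		"https://books.toscrape.com/catalogue/page-2.html",
	}
	// Stand-in for a real fetch: just report which URL would be requested.
	visitPolitely(urls, 500*time.Millisecond, func(u string) {
		fmt.Println("visiting", u)
	})
}
```

For concurrent crawls, the same idea combines with a semaphore cap, or with Colly's limit rules if the framework manages the crawl.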
---
## Web Scraping With Ruby: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-ruby
**Web scraping with Ruby means fetching a page with an HTTP gem like HTTParty and parsing the HTML with Nokogiri, which supports both CSS selectors and XPath.** Nokogiri is the de-facto standard parser in the Ruby ecosystem. For JavaScript-rendered pages you drive a real browser with Selenium or Watir, and for full crawlers there are frameworks built on these gems.
### Quick facts
- **HTML parsing:** Nokogiri — CSS and XPath, the Ruby standard
- **HTTP client:** HTTParty or Faraday (or built-in net/http)
- **JavaScript pages:** Selenium WebDriver or Watir
- **Form/session helper:** Mechanize — cookies, forms, link-following
- **Install:** gem install nokogiri httparty
### Your first Ruby scraper with HTTParty + Nokogiri
The classic Ruby stack is HTTParty to fetch and Nokogiri to parse. Install both with gem install nokogiri httparty, then:
```ruby
require 'httparty'
require 'nokogiri'

response = HTTParty.get('https://books.toscrape.com/',
  headers: { 'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' })
doc = Nokogiri::HTML(response.body)

doc.css('article.product_pod').each do |book|
  title = book.at_css('h3 a')['title']
  price = book.at_css('.price_color').text
  puts "#{title} | #{price}"
end
```
doc.css(selector) returns all matching nodes; at_css returns the first. Access attributes with node['attr'] and text with .text. Nokogiri also supports XPath via doc.xpath('//...') when you need it. Always pass a realistic User-Agent, because the default identifies your script as a bot.
### Following pagination
To crawl multiple pages, read the "next" link and resolve it against the current URL with URI.join:
```ruby
require 'httparty'
require 'nokogiri'
require 'uri'

url = 'https://books.toscrape.com/'
while url
  doc = Nokogiri::HTML(HTTParty.get(url).body)
  doc.css('article.product_pod h3 a').each { |a| puts a['title'] }

  # Resolve the relative "next" href against the page we just fetched.
  nxt = doc.at_css('li.next a')
  url = nxt ? URI.join(url, nxt['href']).to_s : nil # nil ends the loop
end
```
For sites that need cookies, logins, or form submissions, the **Mechanize** gem wraps Nokogiri with session and form handling, so you do not manage cookies by hand.
### Scraping JavaScript-rendered pages
Nokogiri only parses the HTML you give it — it does not run JavaScript. For client-side-rendered pages, drive a browser with Selenium (gem install selenium-webdriver):
```ruby
require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')
driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('https://quotes.toscrape.com/js/')
driver.find_elements(css: '.quote').each do |q|
  text = q.find_element(css: '.text').text
  author = q.find_element(css: '.author').text
  puts "#{author}: #{text}"
end

driver.quit
```
**Watir** is a friendlier wrapper around Selenium that reads more like natural Ruby. Either way, a browser is heavier and easier to detect than a plain HTTP request.
### Which Ruby gem should you use?
| Gem | Type | Runs JS? | Best for |
| --- | --- | --- | --- |
| Nokogiri | HTML/XML parser | No | Parsing with CSS and XPath, the standard |
| HTTParty | HTTP client | No | Simple, readable requests |
| Faraday | HTTP client | No | Middleware, advanced configuration |
| Mechanize | HTTP + parse + session | No | Logins, forms, cookies |
| Selenium / Watir | Browser automation | Yes | JavaScript-rendered pages |
For most jobs, **HTTParty + Nokogiri** is all you need; add Mechanize for sessions and Selenium only for JavaScript.
### The hard part: handling anti-bot blocking
The gem you choose rarely decides success — anti-bot defenses do. HTTParty sends a TLS fingerprint and headers that Cloudflare, DataDome, and Akamai recognise as non-browser, and headless Selenium leaks automation signals. Nokogiri cannot parse a 403 or CAPTCHA page.
Handling modern anti-bot stacks means residential proxies and a real browser fingerprint. A scraping API handles all of that server-side, so your Ruby code posts the URL and parses the returned HTML with Nokogiri as usual:
### Example
```ruby
require 'httparty'
require 'json'
require 'nokogiri'
resp = HTTParty.post(
'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY',
headers: { 'Content-Type' => 'application/json' },
body: { cmd: 'request.get', url: 'https://example.com/protected' }.to_json
)
# Fully rendered, unblocked HTML -- parse it with Nokogiri as usual.
html = resp.parsed_response['solution']['response']
doc = Nokogiri::HTML(html)
puts doc.at_css('title')&.text
```
### FAQ
**Q: What is the best gem for web scraping with Ruby?**
Nokogiri is the standard HTML/XML parser — it supports both CSS selectors and XPath and is what almost every Ruby scraper uses. Pair it with HTTParty or Faraday to fetch pages, and use Mechanize when you need cookies, logins, or form submissions. For JavaScript-rendered pages, add Selenium or Watir.
**Q: Can Ruby scrape JavaScript-rendered pages?**
Yes, but not with Nokogiri alone, which only parses static HTML. Drive a real browser with selenium-webdriver or Watir (a friendlier Selenium wrapper) for client-side-rendered content. Alternatively, find the JSON API the page calls in your browser’s Network tab and request it directly with HTTParty.
**Q: Does Nokogiri support XPath?**
Yes. Nokogiri supports both CSS selectors (doc.css) and XPath (doc.xpath), so you can use whichever fits. CSS is shorter for most selections; XPath is more powerful when you need to select by text content or navigate to parent and sibling nodes.
**Q: Why does my Ruby scraper get blocked, and how do I fix it?**
Set a realistic User-Agent, throttle your requests, and rotate residential proxies. Against serious anti-bot vendors you also need a browser-grade TLS fingerprint, which is hard to maintain from HTTParty. Many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.
---
## Web Scraping With PHP: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-php
**Web scraping with PHP means fetching pages with the Guzzle HTTP client and extracting data with Symfony's DomCrawler component, which supports CSS selectors and XPath.** This Guzzle + DomCrawler stack is the modern, correct-for-2026 approach. The older Goutte library is deprecated — its functionality was merged into Symfony's BrowserKit and HttpClient. For JavaScript-rendered pages, Symfony Panther drives a real browser.
### Quick facts
- **HTTP client:** Guzzle — the standard PHP HTTP client
- **HTML parsing:** Symfony DomCrawler (+ CssSelector) — CSS and XPath
- **Goutte status:** Deprecated — use Symfony BrowserKit HttpBrowser instead
- **JavaScript pages:** Symfony Panther or php-webdriver
- **Install:** composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector
### Your first PHP scraper with Guzzle + DomCrawler
Install the modern stack with Composer: composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector. The CssSelector component lets DomCrawler accept CSS selectors (not just XPath).
```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['headers' => ['User-Agent' => 'Mozilla/5.0']]);
$html = $client->get('https://books.toscrape.com/')->getBody()->getContents();

$crawler = new Crawler($html);
$crawler->filter('article.product_pod')->each(function (Crawler $node) {
    $title = $node->filter('h3 a')->attr('title');
    $price = $node->filter('.price_color')->text();
    echo "$title | $price\n";
});
```
filter() takes a CSS selector and returns a new Crawler; each() iterates matches; text() reads text and attr() reads an attribute. For XPath instead of CSS, use filterXPath('//...').
### The modern replacement for Goutte
Many older tutorials still teach **Goutte**. It is now *deprecated and archived* — its code was merged into Symfony. The direct replacement is Symfony\Component\BrowserKit\HttpBrowser, which adds clicking links and submitting forms on top of DomCrawler. Install with composer require symfony/browser-kit symfony/http-client:
```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://books.toscrape.com/');

while (true) {
    $crawler->filter('article.product_pod h3 a')->each(function ($node) {
        echo $node->attr('title') . "\n";
    });

    $next = $crawler->filter('li.next a');
    if ($next->count() === 0) break;
    $crawler = $browser->click($next->link()); // follow pagination
}
```
This gives you the same convenience Goutte offered (a browsing session that follows links and submits forms) using maintained Symfony components.
### Scraping JavaScript-rendered pages with Panther
Guzzle and DomCrawler do not run JavaScript. For client-side-rendered pages, **Symfony Panther** drives a real Chrome/Firefox browser and shares the DomCrawler API you already know. Install: composer require symfony/panther.
```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'https://quotes.toscrape.com/js/');
$client->waitFor('.quote'); // wait for JS to render

$crawler = $client->getCrawler();
$crawler->filter('.quote')->each(function ($node) {
    $text = $node->filter('.text')->text();
    $author = $node->filter('.author')->text();
    echo "$author: $text\n";
});

$client->quit();
```
Panther uses the same selectors as DomCrawler, so moving from static to dynamic scraping is mostly swapping the client. **php-webdriver** is the lower-level alternative if you need direct Selenium control.
### Which PHP library should you use?
| Library | Type | Runs JS? | Best for |
| --- | --- | --- | --- |
| Guzzle | HTTP client | No | Fetching pages and APIs |
| Symfony DomCrawler | HTML parser | No | Extraction with CSS and XPath |
| BrowserKit (HttpBrowser) | HTTP + browsing session | No | Links, forms; the Goutte replacement |
| Symfony Panther | Browser automation | Yes | JavaScript-rendered pages |
| Goutte | (deprecated) | No | Avoid; use BrowserKit instead |
The canonical 2026 stack is **Guzzle + Symfony DomCrawler**, with HttpBrowser for sessions and Panther for JavaScript.
### The hard part: handling anti-bot blocking
The PHP code is straightforward; anti-bot defenses are the real obstacle. Guzzle's TLS fingerprint and headers flag it as non-browser to Cloudflare, DataDome, and Akamai, and headless Panther is detectable too. DomCrawler cannot extract data from a challenge page.
Solving this means residential proxies, a real browser fingerprint, and CAPTCHA handling. A scraping API does it server-side — your PHP code posts the URL with Guzzle and parses the returned HTML with DomCrawler:
### Example
```php
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client();
$resp = $client->post('https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY', [
'json' => ['cmd' => 'request.get', 'url' => 'https://example.com/protected'],
]);
$data = json_decode($resp->getBody()->getContents(), true);
// Fully rendered, unblocked HTML -- parse it with DomCrawler as usual.
$html = $data['solution']['response'];
$crawler = new Crawler($html);
echo $crawler->filter('title')->text();
```
### FAQ
**Q: What is the best library for web scraping with PHP?**
The modern stack is Guzzle (HTTP client) plus Symfony DomCrawler with the CssSelector component (parsing). This pairing is maintained, supports CSS and XPath, and is the correct choice in 2026. Add Symfony BrowserKit (HttpBrowser) for sessions and forms, and Symfony Panther for JavaScript-rendered pages.
**Q: Is Goutte still used for PHP web scraping?**
No — Goutte is deprecated and archived. Its functionality was merged into Symfony, and the direct replacement is Symfony\Component\BrowserKit\HttpBrowser (built on symfony/browser-kit and symfony/http-client). Tutorials that still lead with Goutte are out of date; use Guzzle + DomCrawler, with HttpBrowser when you need a browsing session.
**Q: Can PHP scrape JavaScript-rendered websites?**
Yes. Guzzle and DomCrawler only see server HTML, so for JavaScript-rendered pages use Symfony Panther, which drives a real Chrome or Firefox browser and shares the DomCrawler selector API. php-webdriver is a lower-level Selenium alternative. You can also call the page’s underlying JSON API directly with Guzzle.
**Q: Why does my PHP scraper get blocked, and how do I fix it?**
Set a realistic User-Agent, throttle requests, and rotate residential proxies. For sites behind major anti-bot systems you also need a browser-grade TLS fingerprint, which Guzzle cannot provide on its own. Routing those requests through a scraping API that handles proxies, fingerprinting, and challenges server-side is the most reliable fix.
---
## Web Scraping With R: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-r
**Web scraping with R means using the rvest package to download and parse HTML into tidy data frames, with CSS selectors or XPath.** rvest is the standard, tidyverse-friendly scraping package. For finer control over requests use httr2 (the modern successor to httr), the polite package to respect robots.txt and rate limits, and RSelenium or chromote for JavaScript-rendered pages.
### Quick facts
- **Core package:** rvest — read_html, html_elements, html_text2
- **HTTP requests:** httr2 (modern successor to httr)
- **Politeness:** polite — robots.txt + rate-limiting wrapper
- **JavaScript pages:** RSelenium or chromote
- **Install:** install.packages(c("rvest","httr2","polite"))
### Your first R scraper with rvest
rvest reads a page and lets you select nodes with CSS or XPath, then pull out text or attributes into a vector — which drops straight into a data frame. Install with install.packages("rvest").
```r
library(rvest)
library(dplyr)

page <- read_html("https://books.toscrape.com/")
books <- page %>% html_elements("article.product_pod")

titles <- books %>% html_element("h3 a") %>% html_attr("title")
prices <- books %>% html_element(".price_color") %>% html_text2()

df <- tibble(title = titles, price = prices)
print(df)
write.csv(df, "books.csv", row.names = FALSE)
```
Note the current rvest API: html_elements() (plural) returns all matches, html_element() (singular) returns one per parent, html_text2() reads cleaned text, and html_attr() reads an attribute. Older tutorials use the superseded html_nodes()/html_node() names; prefer the new ones in new code.
### Requests and politeness with httr2 + polite
For custom headers, query parameters, or POST requests, use **httr2** — the modern successor to httr. It builds requests with a readable pipeline:
```r
library(httr2)
library(rvest)

page <- request("https://books.toscrape.com/") %>%
  req_user_agent("Mozilla/5.0 (compatible; research-bot)") %>%
  req_perform() %>%
  resp_body_html()

page %>% html_elements(".price_color") %>% html_text2()
```
To scrape responsibly, the **polite** package wraps rvest to honour robots.txt, identify your scraper, and rate-limit automatically, a best practice almost every competitor guide omits:
```r
library(polite)
library(rvest)

session <- bow("https://books.toscrape.com/",
               user_agent = "polite R scraper ([email protected])")

prices <- scrape(session) %>%
  html_elements(".price_color") %>%
  html_text2()
```
bow() introduces you to the host and checks robots.txt; scrape() then fetches within those rules.
### Scraping JavaScript-rendered pages
rvest only reads the HTML the server returns — it does not run JavaScript. For client-side-rendered pages, **RSelenium** drives a real browser:
```r
library(RSelenium)

driver <- rsDriver(browser = "chrome", chromever = "latest", verbose = FALSE)
remote <- driver$client

remote$navigate("https://quotes.toscrape.com/js/")
quotes <- remote$findElements(using = "css", ".quote .text")
texts <- sapply(quotes, function(q) q$getElementText()[[1]])
print(texts)

remote$close()
driver$server$stop()
```
**chromote** is a newer, lighter alternative that controls headless Chrome directly without a separate Selenium server, and is worth preferring for new projects. As always, you can also find the page's JSON API and fetch it directly with httr2.
### Which R package should you use?
| Package | Role | Runs JS? | Best for |
| --- | --- | --- | --- |
| rvest | Download + parse | No | The standard: CSS/XPath to data frames |
| httr2 | HTTP client | No | Custom headers, POST, APIs |
| polite | Politeness wrapper | No | robots.txt + rate-limiting |
| RSelenium | Browser automation | Yes | JavaScript-rendered pages |
| chromote | Headless Chrome | Yes | Lighter JS rendering, no Selenium server |
Start with **rvest**, add **httr2** for request control and **polite** for good manners; reach for RSelenium or chromote only when JavaScript is involved.
### The hard part: handling anti-bot blocking
R is excellent for analysing scraped data, but rvest's requests carry a TLS fingerprint and headers that Cloudflare, DataDome, and Akamai flag as non-browser. You cannot turn a 403 or CAPTCHA page into a data frame.
Handling serious anti-bot protection — residential proxies and a real browser fingerprint — is awkward to build in R. A scraping API handles it server-side; you POST the URL with httr2 and parse the returned HTML with rvest:
### Example
```r
library(httr2)
library(rvest)
resp <- request("https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY") %>%
req_body_json(list(cmd = "request.get", url = "https://example.com/protected")) %>%
req_perform()
# Fully rendered, unblocked HTML -- parse it with rvest as usual.
html <- resp_body_json(resp)$solution$response
read_html(html) %>% html_elements(".price_color") %>% html_text2()
```
### FAQ
**Q: What is the best package for web scraping with R?**
rvest is the standard — it downloads and parses HTML with CSS selectors or XPath and produces tidy vectors and data frames. Pair it with httr2 (the modern successor to httr) for request control and the polite package to respect robots.txt and rate limits. For JavaScript-rendered pages, use RSelenium or the lighter chromote.
**Q: Should I use httr or httr2 in R?**
Use httr2. It is the modern successor to httr, with a cleaner request-building pipeline and better defaults. httr still works and appears in older tutorials, but new R scraping code should use httr2 (or just rvest, which handles the request for you with read_html).
**Q: Can R scrape JavaScript-rendered pages?**
Yes, but not with rvest alone, which only reads server HTML. Use RSelenium to drive a real browser, or chromote to control headless Chrome without a separate Selenium server. Alternatively, locate the JSON API the page calls in your browser’s Network tab and request it directly with httr2 — usually the fastest option.
**Q: Why does my R scraper get blocked, and how do I scrape politely?**
Use the polite package, which checks robots.txt, identifies your scraper, and rate-limits automatically. Set a descriptive User-Agent and throttle your requests. Against serious anti-bot vendors you also need residential proxies and a browser-grade fingerprint, which is hard in R — routing those requests through a scraping API that handles it server-side is the most reliable approach.
---
## Web Scraping With Node.js: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-nodejs
**Web scraping with Node.js means fetching a page (with Axios or the built-in fetch) and parsing it with Cheerio for static sites, or driving a real browser with Playwright or Puppeteer for JavaScript-rendered ones.** JavaScript is a natural fit for scraping because the same language runs in the browser you are scraping. The 2026 stack is Axios + Cheerio for static pages, Playwright for dynamic pages, and Crawlee as the production crawler framework.
### Quick facts
- **Static parsing:** Cheerio v1.x — fast, jQuery-like server-side parsing
- **HTTP client:** Axios or the built-in fetch / undici (node-fetch is legacy)
- **JavaScript pages:** Playwright (recommended) or Puppeteer
- **Crawler framework:** Crawlee — queues, proxies, retries built in
- **Install:** npm install axios cheerio
### Your first Node.js scraper with Axios + Cheerio
The canonical static-scraping combo is Axios to fetch and Cheerio to parse. Cheerio gives you a jQuery-like $ API on the server. Install with npm install axios cheerio.
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  const { data: html } = await axios.get('https://books.toscrape.com/', {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' },
  });
  const $ = cheerio.load(html);
  $('article.product_pod').each((i, el) => {
    const title = $(el).find('h3 a').attr('title');
    const price = $(el).find('.price_color').text();
    console.log(`${title} | ${price}`);
  });
})();
```
Cheerio mirrors jQuery: $(selector) selects, .find() drills down, .text() reads text, .attr() reads attributes, and .each() iterates. Modern Node (18+) also ships a global fetch, so you can drop Axios for simple GETs if you prefer zero dependencies for the HTTP layer.
### Scraping JavaScript-rendered pages with Playwright
Cheerio only parses static HTML — it does not run JavaScript. For client-side-rendered pages, Playwright drives a real browser with built-in auto-waiting. Install with npm install playwright then npx playwright install chromium.
```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://quotes.toscrape.com/js/');
  await page.waitForSelector('.quote'); // wait for JS to render
  // $$eval passes ALL matching elements as an array ($eval passes only the first).
  const quotes = await page.$$eval('.quote', (els) =>
    els.map((e) => ({
      text: e.querySelector('.text').innerText,
      author: e.querySelector('.author').innerText,
    }))
  );
  console.log(quotes);
  await browser.close();
})();
```
A common production pattern is **hybrid**: let Playwright render the page, grab await page.content(), then parse that HTML with Cheerio; you get the browser's rendering with Cheerio's fast, familiar extraction. Puppeteer is the Chrome-only alternative; Playwright is the recommended default in 2026 for its multi-browser support and cleaner API.
### Production crawlers with Crawlee
For real crawlers — queues, retries, proxy rotation, and automatic scaling — **Crawlee** is the framework most competitor guides miss. It wraps Cheerio and Playwright with production concerns built in. Install with npm install crawlee.
```javascript
// ESM syntax with top-level await: save as crawler.mjs or set "type": "module".
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, enqueueLinks }) {
    // Collect the rows first so the async write can be awaited once.
    const items = $('article.product_pod')
      .map((i, el) => ({
        title: $(el).find('h3 a').attr('title'),
        price: $(el).find('.price_color').text(),
      }))
      .get();
    await Dataset.pushData(items);
    // Automatically follow pagination links.
    await enqueueLinks({ selector: 'li.next a' });
  },
});

await crawler.run(['https://books.toscrape.com/']);
```
Crawlee handles the request queue, concurrency, retries, and result storage for you, and you can swap CheerioCrawler for PlaywrightCrawler when a target needs JavaScript: same structure, real browser underneath.
### Which Node.js library should you use?
| Library | Type | Runs JS? | Best for |
| --- | --- | --- | --- |
| Axios / fetch | HTTP client | No | Fetching pages and APIs |
| Cheerio | HTML parser | No | Fast static parsing (jQuery-like) |
| Playwright | Browser automation | Yes | JavaScript pages (the default) |
| Puppeteer | Browser automation | Yes | Chrome-only headless control |
| Crawlee | Crawler framework | Yes (optional) | Production crawlers at scale |
Start with **Axios + Cheerio**, add **Playwright** for JavaScript, and adopt **Crawlee** when you are running a real, ongoing crawl.
### The hard part: handling anti-bot blocking
The Node code is the easy part; anti-bot defenses are what break scrapers. Axios sends a TLS fingerprint no browser sends, and headless Playwright leaks automation signals that Cloudflare, DataDome, and Akamai flag. Once you are served a 403 or CAPTCHA page, Cheerio has nothing useful to parse.
Handling this means residential proxies and a real browser fingerprint — and keeping them coherent. A scraping API handles it server-side, so your Node code posts the URL and parses the returned HTML with Cheerio:
### Example
```javascript
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
  const { data } = await axios.post(
    'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY',
    { cmd: 'request.get', url: 'https://example.com/protected' }
  );
  // Fully rendered, unblocked HTML -- parse it with Cheerio as usual.
  const html = data.solution.response;
  const $ = cheerio.load(html);
  console.log($('title').text());
})();
```
### FAQ
**Q: What is the best library for web scraping with Node.js?**
For static pages, Axios (or the built-in fetch) plus Cheerio is the standard — Cheerio gives you fast, jQuery-like parsing on the server. For JavaScript-rendered pages, Playwright is the recommended browser-automation choice in 2026, with Puppeteer as the Chrome-only alternative. For production crawlers with queues, retries, and proxy rotation, use Crawlee.
**Q: Is Playwright or Puppeteer better for Node.js scraping?**
Playwright is the better default in 2026: it supports Chromium, Firefox, and WebKit, has built-in auto-waiting, and a cleaner API. Puppeteer is Chrome/Chromium-only but is still solid and well documented. Both run a real browser, so both are heavier and more detectable than a plain Axios + Cheerio request.
**Q: Can I use Cheerio for JavaScript-rendered pages?**
Not directly — Cheerio only parses the static HTML you give it and does not execute JavaScript. The common pattern is to render the page with Playwright or Puppeteer, take page.content(), and then parse that HTML with Cheerio. Alternatively, call the JSON API the page fetches its data from and skip the browser entirely.
**Q: Why does my Node.js scraper get blocked, and how do I fix it?**
Scrapers usually get blocked on request rate, IP reputation, or a fingerprint that looks automated. Use realistic headers, throttle requests, and rotate residential proxies. Against Cloudflare, DataDome, or Akamai you also need a browser-grade TLS fingerprint, which Axios and even headless Playwright struggle with. Many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.
---
## Web Scraping With curl: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/web-scraping-with-curl
**Web scraping with curl means fetching pages directly from the command line, setting headers, cookies, and proxies with curl's flags, then piping the output to a parser like pup (HTML) or jq (JSON).** curl is perfect for quick scrapes, testing requests, and shell pipelines — no programming language required. Its limits are that it cannot run JavaScript and sends a non-browser fingerprint, so anti-bot systems can identify it as automated.
### Quick facts
- **Fetch a page:** curl -sL --compressed -A "<user-agent>" URL
- **Custom headers:** -H "Header: value" (repeatable)
- **Cookies/session:** -c cookies.txt to save, -b cookies.txt to send
- **Proxy:** -x http://user:pass@host:port
- **Parse output:** pipe to pup (HTML) or jq (JSON)
### Fetching pages with curl
The basics: follow redirects with -L, stay quiet with -s, request a compressed response with --compressed, and always set a realistic User-Agent with -A (curl's default UA, curl/<version>, is an obvious automation signal and is often blocked).
```bash
# GET a page, follow redirects, set a browser User-Agent, save to file:
curl -sL --compressed \
  -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
  https://books.toscrape.com/ -o page.html

# Send extra headers (repeat -H as needed):
curl -s \
  -H "Accept-Language: en-US,en;q=0.9" \
  -H "Referer: https://www.google.com/" \
  https://example.com/

# Show response headers and timing without the body:
curl -sI https://example.com/
curl -s -o /dev/null -w "time_total: %{time_total}s\n" https://example.com/
```
-o file saves the body; -I fetches headers only; -w prints timing and status metrics. These cover inspecting and downloading almost any page.
### POST, cookies, and proxies
Real scraping needs logins, form posts, and proxies. curl handles all of them:
```bash
# POST a form (login) and save the session cookies:
curl -s -c cookies.txt \
  -d "username=jane&password=secret" \
  https://example.com/login

# Reuse those cookies on a protected page:
curl -s -b cookies.txt https://example.com/account

# POST JSON to an API endpoint:
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d '{"query":"laptops","page":1}' \
  https://api.example.com/search

# Route the request through a proxy (rotate by changing -x):
curl -s -x http://user:pass@host:8000 https://example.com/
```
-c writes a cookie jar, -b sends it back, so you can keep a session across requests. -x sends the request through an HTTP or SOCKS proxy, which is how you rotate IPs from the shell.
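Rotating by changing -x is easy to script once the proxy list lives in one place. A minimal Python sketch, assuming a hypothetical two-proxy pool (the hostnames and credentials below are placeholders, not real endpoints): it assigns proxies round-robin and builds the curl argument list for each URL, using only the -s and -x flags shown above.

```python
import itertools
import shlex

# Hypothetical proxy pool; replace with the endpoints your provider gives you.
PROXIES = [
    "http://user:pass@proxy-a.example:8000",
    "http://user:pass@proxy-b.example:8000",
]


def build_commands(urls, proxies=PROXIES):
    """Pair each URL with the next proxy in round-robin order."""
    pool = itertools.cycle(proxies)
    return [["curl", "-s", "-x", next(pool), url] for url in urls]


commands = build_commands([
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
])
for argv in commands:
    print(shlex.join(argv))  # the shell command each list corresponds to
```

Passing each list to subprocess.run(argv, capture_output=True, text=True) would run the request; keeping the command as a list rather than a string avoids shell-quoting problems.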
### Parsing HTML and JSON in the terminal
Here is what most curl tutorials skip: you can extract data without leaving the shell. Pipe HTML to pup (CSS selectors) or htmlq, and JSON to jq.
```bash
# Extract every product link (CSS selector -> href attribute) with pup:
curl -sL https://books.toscrape.com/ | pup 'h3 a attr{href}'

# Pull the text of each price into a list:
curl -sL https://books.toscrape.com/ | pup '.price_color text{}'

# Parse a JSON API response with jq:
curl -s https://api.example.com/items | jq -r '.items[].name'

# Combine: titles + prices into a quick CSV with htmlq + paste:
curl -sL https://books.toscrape.com/ | htmlq -t 'article.product_pod h3 a' > /tmp/t.txt
curl -sL https://books.toscrape.com/ | htmlq -t '.price_color' > /tmp/p.txt
paste -d',' /tmp/t.txt /tmp/p.txt
```
This terminal-native pipeline (curl | pup | ...) is fast, scriptable, and needs no programming language. Install pup, htmlq, and jq from your package manager.
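One caveat with the paste step: paste does no CSV quoting, so a title that itself contains a comma produces a row with too many columns. If that matters, a short Python sketch using the standard csv module can pair the two lists and quote where needed. The sample titles and prices here are made up for illustration; in practice you would read the lines of /tmp/t.txt and /tmp/p.txt instead.

```python
import csv
import io


def to_csv(titles, prices):
    """Pair titles with prices by position and return properly quoted CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "price"])
    writer.writerows(zip(titles, prices))
    return buffer.getvalue()


# Illustrative sample data; the second title contains a comma.
titles = ["Plain Title", "Title, With a Comma"]
prices = ["51.77", "17.93"]

output = to_csv(titles, prices)
print(output)
```

The csv writer wraps only the comma-containing title in quotes, so spreadsheet tools and parsers still see two columns per row. Like paste, zip pairs by position, so both inputs need the same number of lines.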
### What curl cannot do (and how to fix it)
curl has two hard limits. First, it **cannot run JavaScript** — on a client-side-rendered page it only sees the empty HTML shell, so the data is not there to extract. Second, curl sends a TLS fingerprint and header set that no real browser produces, so anti-bot systems (Cloudflare, DataDome, Akamai) flag it immediately — and no combination of flags changes that.
You can address part of this with curl-impersonate (a build that reproduces a browser's TLS handshake), but JavaScript rendering and CAPTCHA challenges are out of reach for curl alone. A scraping API keeps the simple curl workflow but renders JavaScript server-side — you still pipe the result to pup or jq:
### Example
```bash
# Render JS + clear anti-bot in one curl call, then parse with jq/pup.
curl -s -X POST \
  'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"cmd":"request.get","url":"https://example.com/protected"}' \
  | jq -r '.solution.response' > rendered.html

# rendered.html now holds the fully rendered, unblocked page:
pup 'h3 a attr{title}' < rendered.html
```
### FAQ
**Q: Can you use curl for web scraping?**
Yes. curl fetches pages from the command line and supports custom headers, cookies, POST data, and proxies, so it is great for quick scrapes, testing requests, and shell pipelines. Pipe its output to pup or htmlq to extract HTML by CSS selector, or to jq to parse JSON APIs. Its limits are that it cannot run JavaScript and is easily fingerprinted and blocked.
**Q: How do I parse HTML from curl output?**
Pipe curl to a command-line HTML parser. pup takes CSS selectors (e.g. curl -sL URL | pup "h3 a attr{href}"), and htmlq is a similar tool. For JSON API responses, pipe to jq instead. This keeps the whole scrape in the terminal with no programming language required.
**Q: Why does curl get blocked when scraping?**
curl sends a TLS handshake and header set that no real browser produces, so anti-bot systems flag it as automated and return a 403 or CAPTCHA. Setting a browser User-Agent helps a little but does not change the TLS fingerprint. curl-impersonate reproduces a browser handshake; for JavaScript rendering you need a headless browser or a scraping API.
**Q: Can curl scrape JavaScript-rendered pages?**
No. curl only downloads the raw HTML the server returns and cannot execute JavaScript, so on a client-side-rendered page the data simply is not present. You need a real browser (Playwright, Selenium) or a scraping API that renders the page server-side and returns the final HTML, which you can then pipe to pup or jq.
---
## XPath for Web Scraping: A Complete 2026 Guide
URL: https://scrappey.com/qa/web-scraping-languages/xpath-web-scraping
**XPath (XML Path Language) is a query language for selecting nodes in an HTML or XML document, widely used in web scraping to pinpoint the exact elements you want to extract.** Where CSS selectors target elements by tag, class, and id, XPath can also select by an element's text content and navigate in any direction — to parents, siblings, and ancestors. It is supported by Python's lxml and Scrapy, browser DevTools, Selenium, and Playwright.
### Quick facts
- **What it is:** A path language for selecting nodes in HTML/XML
- **Select by text:** //button[text()="Buy"] — CSS cannot do this
- **Navigate up:** Axes: parent::, ancestor::, following-sibling::
- **Python:** lxml: tree.xpath("//...") ; also Scrapy/parsel
- **Browser/JS:** Selenium, Playwright, DevTools $x("//...")
### Why use XPath for scraping?
HTML is a tree of nested elements, and XPath is a way to write a path to any node in that tree. You can test any XPath instantly in your browser: open DevTools, go to the Console, and run $x('//h1') to get an array of matching elements.
XPath earns its place alongside CSS selectors because it can do two things CSS cannot: **select an element by its visible text** (//a[text()="Next"]) and **walk back up the tree** to a parent or previous sibling. When a page has no helpful classes or ids, or you need "the price *next to* this label," XPath is often the only clean option.
### XPath syntax cheat sheet
| Expression | Selects |
| --- | --- |
| `//div` | All `<div>` elements anywhere in the document |
| `/html/body/div` | `<div>` that is a direct child of body |
| `//div/p` | `<p>` that is a direct child of any `<div>` |
| `//div//a` | `<a>` anywhere inside any `<div>` (descendant) |
| `//*[@id="main"]` | Any element with id="main" |
| `//a[@class="btn"]` | `<a>` with class exactly "btn" |
| `//a/@href` | The href attribute value of every `<a>` |
| `//h1/text()` | The text node inside each `<h1>` |
| `(//div[@class="p"])[1]` | The first matching `<div>` |
| `//li[last()]` | The last `<li>` in its list |
The two leading-slash forms are the ones you use constantly: // means "anywhere below," and / means "direct child." Attributes are matched in square brackets with @.
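The difference between / and // is easiest to see on a tiny document. This is a minimal sketch using Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath (no text() or attribute results, no axes) and evaluates paths relative to an element, so //div is written .//div; lxml accepts the full expressions from the table. The sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, well-formed so the XML parser accepts it.
root = ET.fromstring(
    "<html><body>"
    "<div id='main'><p>direct</p><section><p>nested</p></section></div>"
    "<ul><li>one</li><li>two</li><li>three</li></ul>"
    "</body></html>"
)

# div/p: only <p> elements that are direct children of a <div>.
direct = [p.text for p in root.findall(".//div/p")]

# div//p: <p> elements at any depth inside a <div>.
anywhere = [p.text for p in root.findall(".//div//p")]

# Attribute predicate and positional predicate.
main = root.find(".//*[@id='main']")
last_item = root.find(".//li[last()]")

print(direct)          # ['direct']
print(anywhere)        # ['direct', 'nested']
print(main.tag)        # div
print(last_item.text)  # three
```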
### Predicates, functions, and axes
**Predicates** (square brackets) filter matches, and XPath ships handy **functions** for partial and text matching:
| Expression | Selects |
| --- | --- |
| `//div[contains(@class,"product")]` | class contains "product" (partial match) |
| `//a[starts-with(@href,"/p/")]` | href begins with "/p/" |
| `//button[text()="Add to cart"]` | button whose text is exactly that |
| `//span[contains(text(),"in stock")]` | span whose text contains the phrase |
| `//input[@type="email" and @required]` | multiple conditions with and / or |
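One subtlety with contains(@class, ...): it is a substring test on the whole attribute, so contains(@class,"product") also matches class="byproduct". When that matters, a common XPath 1.0 idiom pads the attribute with spaces and looks for a space-delimited token. The sketch below mirrors the three matching strategies in plain Python so the difference is easy to check; the function names are illustrative, and each docstring shows the XPath it stands for.

```python
def exact(class_attr, name):
    """//div[@class="product"]: the whole attribute must equal the name."""
    return class_attr == name


def substring(class_attr, name):
    """//div[contains(@class,"product")]: plain substring test."""
    return name in class_attr


def token(class_attr, name):
    """//div[contains(concat(" ", normalize-space(@class), " "), " product ")]"""
    return name in class_attr.split()


for class_attr in ["product", "product featured", "byproduct"]:
    print(
        class_attr,
        exact(class_attr, "product"),
        substring(class_attr, "product"),
        token(class_attr, "product"),
    )
```

Exact matching misses elements that carry extra classes, substring matching over-matches, and the token form behaves like a CSS class selector.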
**Axes** are XPath's superpower — they let you move in any direction, which CSS cannot:
| Axis | Example | Meaning |
| --- | --- | --- |
| parent | `//span[@class="price"]/parent::div` | The div wrapping the price span |
| ancestor | `//a/ancestor::article` | The article a link sits inside |
| following-sibling | `//dt[text()="Price"]/following-sibling::dd[1]` | The value next to a label |
| preceding-sibling | `//dd/preceding-sibling::dt[1]` | The label before a value |
The following-sibling pattern — "find the label, then take the value beside it" — is one of the most useful real-world scraping tricks, and it is impossible with CSS selectors alone.
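One detail worth knowing: following-sibling::dd selects every later dd sibling, not just the adjacent one, so on a list with several label/value pairs you add [1] to keep only the nearest. The sketch below emulates the axis with Python's standard library to make that visible (xml.etree.ElementTree has no sibling axes; with lxml you would pass the XPath directly). The markup and the helper name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: two label/value pairs in one definition list.
dl = ET.fromstring(
    "<dl>"
    "<dt>Price</dt><dd>$10</dd>"
    "<dt>Stock</dt><dd>5 left</dd>"
    "</dl>"
)


def following_siblings(parent, label, label_tag="dt", value_tag="dd"):
    """Emulate //dt[text()=label]/following-sibling::dd within one parent."""
    children = list(parent)
    for index, child in enumerate(children):
        if child.tag == label_tag and child.text == label:
            return [c.text for c in children[index + 1:] if c.tag == value_tag]
    return []


all_after_price = following_siblings(dl, "Price")
nearest = all_after_price[0] if all_after_price else None

print(all_after_price)  # every <dd> after the label: ['$10', '5 left']
print(nearest)          # what following-sibling::dd[1] would select: $10
```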
### XPath in Python with lxml
The standard way to use XPath in Python is the lxml library (also the engine behind Scrapy and parsel). Install with pip install lxml requests.
```python
import requests
from lxml import html

resp = requests.get("https://books.toscrape.com/")
tree = html.fromstring(resp.content)

# Select titles and prices with XPath:
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')
prices = tree.xpath('//p[@class="price_color"]/text()')
for title, price in zip(titles, prices):
    print(title, "|", price)

# Relative query: the rating is encoded in the class of a <p> inside the article.
first = tree.xpath('(//article[@class="product_pod"])[1]')[0]
rating = first.xpath('.//p[contains(@class,"star-rating")]/@class')
print(rating)  # e.g. ['star-rating Three']
```
tree.xpath() returns a list of elements, strings (for text()), or attribute values (for @attr). Note the leading . in .// when running XPath relative to an element you already selected; without it, // searches the whole document again.
### XPath in Node.js with Playwright
Almost every XPath tutorial is Python-only, but XPath works just as well in JavaScript. Playwright supports XPath selectors directly with the xpath= prefix, and it runs the JavaScript first so XPath sees the fully rendered DOM:
```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');
  // Playwright accepts XPath with the xpath= prefix:
  const titles = await page
    .locator('xpath=//article[@class="product_pod"]//h3/a')
    .evaluateAll((els) => els.map((e) => e.getAttribute('title')));
  console.log(titles);
  await browser.close();
})();
```
For static HTML in Node without a browser, the xpath + @xmldom/xmldom packages evaluate XPath against a parsed document. But Playwright is the cleaner route when the page needs JavaScript anyway.
### XPath vs CSS selectors — and why scrapers fail on protected sites
| Need | XPath | CSS selector |
| --- | --- | --- |
| Select by tag/class/id | Yes | Yes (shorter) |
| Select by visible text | Yes | No |
| Navigate to parent/ancestor | Yes | No |
| Select previous sibling | Yes | No |
| Readability for simple cases | Verbose | Cleaner |
Use CSS selectors for everyday selecting and XPath when you need text matching or tree navigation. Many scrapers mix both.
One thing neither solves: a selector only works if you actually received the real HTML. If the page is JavaScript-rendered or the site blocks you, your perfect XPath matches nothing. A scraping API returns the fully rendered HTML so your XPath always has real markup to query.
### Example
```python
import requests
from lxml import html
# Get real, rendered, unblocked HTML, then query it with XPath.
resp = requests.post(
    'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY',
    json={'cmd': 'request.get', 'url': 'https://example.com/products'},
    timeout=120,
)
page = resp.json()['solution']['response']
tree = html.fromstring(page)
for row in tree.xpath('//article[@class="product_pod"]'):
    title = row.xpath('.//h3/a/@title')[0]
    price = row.xpath('.//p[@class="price_color"]/text()')[0]
    print(title, '|', price)
```
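The [0] indexing above raises IndexError whenever an expression matches nothing, which is what happens on a blocked or unrendered page. A small guard keeps the loop alive and makes missing fields explicit. This is a sketch: the helper name first is an assumption, and it works on any list that xpath() returns.

```python
def first(matches, default=None):
    """Return the first XPath match, or a default when nothing matched."""
    return matches[0] if matches else default


# Stand-ins for tree.xpath(...) results: one hit, and one empty result.
title = first(["A Light in the Attic"])
price = first([], default="n/a")
print(title, "|", price)
```

In the loop above this becomes title = first(row.xpath('.//h3/a/@title')), and a row with no match yields None instead of crashing the scrape.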
### FAQ
**Q: What is XPath in web scraping?**
XPath (XML Path Language) is a query language for selecting nodes in an HTML or XML document. In web scraping you use it to pinpoint the elements you want to extract — by tag, attribute, position, or text content. Unlike CSS selectors, XPath can also select elements by their visible text and navigate to parent, ancestor, and sibling nodes.
**Q: XPath vs CSS selectors — which is better for scraping?**
Use CSS selectors for everyday selection by tag, class, and id; they are shorter and very readable. Use XPath when you need something CSS cannot do: selecting by visible text (//button[text()="Buy"]), walking up to a parent or ancestor, or grabbing the value next to a label with following-sibling. Most experienced scrapers use both, picking whichever is cleaner for each case.
**Q: How do I test an XPath expression?**
In your browser, open DevTools, switch to the Console tab, and run $x("//your/xpath") — it returns an array of matching elements you can inspect immediately. In the Elements panel you can also right-click an element and choose Copy > Copy XPath, though hand-written XPaths are usually more robust than the auto-generated ones.
**Q: Can I use XPath in JavaScript and Node.js?**
Yes. Most tutorials are Python-focused, but Playwright supports XPath selectors directly with the xpath= prefix, and Selenium does too. For static HTML in Node without a browser, the xpath package combined with @xmldom/xmldom evaluates XPath against a parsed document. In the browser itself, document.evaluate and the DevTools $x() helper run XPath natively.
---