
How Do Websites Detect Web Scrapers?

Websites detect scrapers by collecting hundreds of signals across the network, transport, browser, and behavioral layers, then scoring the combination against models of known-good human traffic. No single signal blocks a scraper — anti-bot decisions are ensemble scores. The signal categories are stable even as the implementations rotate: IP reputation, TLS fingerprint, HTTP/2 frame ordering, header consistency, JavaScript runtime probes, canvas/WebGL/audio fingerprints, and mouse/timing behavior.

Quick facts

Network layer: IP reputation, ASN, geolocation, connection reuse
Transport layer: TLS JA3/JA4, HTTP/2 frame ordering, ALPN
Browser layer: Canvas, WebGL, audio context, font enumeration, navigator probes
Behavioral layer: Mouse movement, scroll velocity, dwell time, click timing
Decision model: Ensemble score across all signals, no single tell
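
To make the ensemble idea concrete, here is a minimal sketch that folds per-layer suspicion scores into one probability. The weights, the bias, and the function name are illustrative assumptions; real vendors do not publish their models.

```python
import math

# Hypothetical per-layer weights (assumed values, not any vendor's).
WEIGHTS = {"network": 1.5, "transport": 2.0, "browser": 1.0, "behavior": 1.0}
BIAS = -2.5  # assumed offset so that a clean session scores low


def bot_probability(signals: dict[str, float]) -> float:
    """Combine per-layer suspicion scores (0.0 = clean, 1.0 = suspicious)
    into a single probability using a logistic function."""
    z = BIAS + sum(WEIGHTS[layer] * signals.get(layer, 0.0) for layer in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


clean = bot_probability({"network": 0.1, "browser": 0.1, "behavior": 0.2})
one_tell = bot_probability({"transport": 1.0})
all_bad = bot_probability({layer: 1.0 for layer in WEIGHTS})
```

With these assumed weights, one bad layer raises the score but stays under 0.5, while every layer failing pushes it above 0.9. That is the sense in which the decision is a combination rather than a single tell.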

Network signals (the first filter)

Before any JavaScript runs, the site already knows your IP's ASN, reputation history, and geographic plausibility. Datacenter IPs (AWS, GCP, DigitalOcean) get near-zero trust by default. Residential and mobile IPs start neutral. Repeat-offender IPs are blacklisted at the edge. This filter alone handles ~70% of low-effort scraping traffic — no fingerprinting needed.
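
A first-filter check of this kind could look roughly like the sketch below. The address ranges are placeholders taken from the RFC 5737 documentation blocks, standing in for a real ASN and reputation database, and the trust values are assumptions chosen to mirror the paragraph above.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), not real provider data.
DATACENTER_RANGES = [ipaddress.ip_network("192.0.2.0/24")]
RESIDENTIAL_RANGES = [ipaddress.ip_network("198.51.100.0/24")]
BLOCKLIST = {ipaddress.ip_address("203.0.113.7")}


def initial_trust(ip: str) -> float:
    """Return an assumed starting trust score in [0, 1] from the IP alone."""
    addr = ipaddress.ip_address(ip)
    if addr in BLOCKLIST:
        return 0.0   # repeat offender: rejected at the edge
    if any(addr in net for net in DATACENTER_RANGES):
        return 0.05  # datacenter: near-zero trust by default
    if any(addr in net for net in RESIDENTIAL_RANGES):
        return 0.5   # residential or mobile: starts neutral
    return 0.3       # unknown range: assumed middle ground
```

Nothing here needs a request body or a fingerprint, which is why this filter is so cheap to run at the edge.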

Transport signals (TLS and HTTP/2)

Every TLS handshake exposes a JA3/JA4 fingerprint — cipher suites, extensions, elliptic curves, in the exact order your client advertises them. Python's requests library has a JA3 that screams "not a browser." HTTP/2 adds frame priorities and header ordering as additional signals. Real Chrome sends headers in a specific order; curl sends them differently. Anti-bot vendors maintain catalogs of known automation-tool fingerprints and block on match.
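
As an illustration of why ordering matters, the sketch below computes a JA3 hash the way the public JA3 definition describes: five ClientHello fields, decimal values joined with dashes, fields joined with commas, then MD5. The sample values are made up for the example and do not correspond to any particular client.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders that JA3 ignores.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from ClientHello fields, in advertised order."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join([
        str(version), join(ciphers), join(extensions),
        join(curves), "-".join(str(v) for v in point_formats),
    ])


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only (771 is TLS 1.2 on the wire).
sample = dict(version=771, ciphers=[4865, 4866, 4867],
              extensions=[0, 23, 65281, 10, 11],
              curves=[29, 23, 24], point_formats=[0])
reordered = dict(sample, extensions=[23, 0, 65281, 10, 11])

same_client = ja3_hash(**sample)
different_order = ja3_hash(**reordered)
```

Swapping two extensions yields a different hash even though the set of extensions is identical, so a client cannot match a browser's JA3 by offering the right features in the wrong order.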

Browser signals (JS-collected)

If you survive the network and transport filters, the page runs JavaScript that probes your browser environment: a deterministic hash of canvas rendering, the WebGL renderer string, the audio context fingerprint, installed fonts, screen geometry, timezone, languages, the navigator.webdriver flag, and dozens more. Each is cheap to spoof in isolation; making them mutually consistent is the hard problem. A spoofed canvas paired with a real WebGL renderer is a stronger bot signal than either value alone, because the mismatch itself is detectable.
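
The consistency problem can be sketched as a set of cross-checks over the collected values. The field names and the three rules below are hypothetical; production scripts check far more combinations.

```python
def inconsistencies(fp: dict) -> list[str]:
    """Return human-readable mismatches found in a collected fingerprint.
    The keys used here are assumed names for illustration."""
    problems = []
    ua = fp.get("user_agent", "")
    platform = fp.get("platform", "")
    renderer = fp.get("webgl_renderer", "")

    if fp.get("webdriver"):
        problems.append("navigator.webdriver is true")
    if "Windows" in ua and not platform.startswith("Win"):
        problems.append("User-Agent says Windows, navigator.platform disagrees")
    if "Windows" in ua and "Apple" in renderer:
        problems.append("Windows User-Agent with an Apple GPU renderer")
    if not fp.get("languages"):
        problems.append("navigator.languages is empty")
    return problems


spoofed = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0",
    "platform": "MacIntel",
    "webgl_renderer": "ANGLE (Apple, Apple M1 Pro, OpenGL 4.1)",
    "languages": ["en-US", "en"],
    "webdriver": False,
}
found = inconsistencies(spoofed)
```

Every individual value in `spoofed` is plausible on its own; only the combination gives it away.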

Behavioral signals (the last layer)

Once the page is loaded, the site records mouse movement, scroll patterns, dwell time before clicks, and form-fill cadence. Real users move in jittery non-linear arcs, scroll in bursts, and pause unpredictably. Scrapers either skip these interactions entirely (no mouse events ever fire) or emulate them in patterns that ML models classify with high confidence. This layer is what catches headless browsers that pass every static fingerprint check.
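
Two simple features hint at what a behavioral model might look at: how straight the pointer path is, and how regular the event timing is. This is a rough sketch with assumed feature definitions, not a reconstruction of any vendor's classifier.

```python
import math
import statistics


def path_straightness(points):
    """Straight-line distance divided by travelled distance.
    Close to 1.0 means a near-perfect line, typical of naive automation."""
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def timing_cv(timestamps_ms):
    """Coefficient of variation of gaps between events (0.0 = metronomic)."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if not gaps:
        return 0.0
    mean = statistics.fmean(gaps)
    return statistics.pstdev(gaps) / mean if mean else 0.0


scripted_path = [(0, 0), (10, 10), (20, 20), (30, 30)]
scripted_times = [0, 100, 200, 300]
human_path = [(0, 0), (8, 3), (11, 12), (25, 14), (30, 30)]
human_times = [0, 40, 190, 230, 520]
```

The scripted trace scores as a perfectly straight line with zero timing variance; the hand-drawn one wanders and pauses. A session with no pointer events at all produces no features, which is itself a signal.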

A worked example — what a single request reveals

Consider one GET against an Akamai-protected site from a vanilla Python requests script:

Layer | What's observed | Verdict
Transport (TLS) | JA4 hash matches Python urllib3, not Chrome | Bot
Transport (HTTP/2) | No HTTP/2; connection negotiates HTTP/1.1 | Bot
Headers | Accept-Encoding: gzip, no Accept-Language, User-Agent claims Chrome | Incoherent: bot
Network (IP) | AWS us-east-1 datacenter ASN | Bot
JavaScript | No script execution; sensor.js never ran | Bot or non-browser

Each layer independently classifies this as bot. Akamai returns 412 with the Pardon Our Interruption body, the _abck cookie stays at ~-1~, and protected XHR endpoints block on the cookie state. The bot was identified at the TLS handshake — every layer below confirmed it.
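
The header row in the table above can be sketched as a coherence check: if the User-Agent claims Chrome, the rest of the headers should look like Chrome's. The three rules are assumptions based on what current Chrome sends over HTTPS, and the function name is hypothetical.

```python
def header_problems(headers: dict[str, str]) -> list[str]:
    """List ways a request's headers contradict a Chrome User-Agent."""
    h = {name.lower(): value for name, value in headers.items()}
    if "chrome/" not in h.get("user-agent", "").lower():
        return []  # these rules only apply to clients claiming to be Chrome

    problems = []
    encodings = {e.strip() for e in h.get("accept-encoding", "").split(",")}
    if "accept-language" not in h:
        problems.append("no Accept-Language")
    if "br" not in encodings:
        problems.append("no Brotli in Accept-Encoding")
    if not any(name.startswith("sec-ch-ua") for name in h):
        problems.append("no sec-ch-ua client hints")
    return problems


# The request from the worked example: Chrome User-Agent, script-like headers.
scripted = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip",
}
issues = header_problems(scripted)
```

Changing only the User-Agent string makes the request less coherent rather than more, because it sets expectations the other headers do not meet.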

Now repeat with curl_cffi, Chrome impersonation, and an ISP residential proxy: JA4 matches, HTTP/2 works, headers are coherent, and the IP is residential. The same endpoint returns 200. Only the network and transport signals changed; there is still no browser and no JavaScript execution.

How this is shifting in 2026

Three trends changing the detection model:

  1. JA4 has largely replaced JA3 across major vendors, though many still compute both. A profile tuned only to match a JA3 hash can therefore present as a "wrong-shape Chrome". curl_cffi, utls, and tls-client impersonate the full browser ClientHello, which matches under either hash, so there is no reason to target JA3 alone in 2026.
  2. WASM challenges are universal at enterprise tier. DataDome's boring_challenge shipped in 2023; Akamai and PerimeterX added WASM probes through 2024. Defeating them at the JS layer is no longer possible (see the WASM fingerprinting entry); the bypass moved into the browser-engine layer (Camoufox, CloakBrowser).
  3. Behavioural signals are per-session, not per-request. Vendors aggregate clicks, scrolls, and timing across a session and score the trajectory. Single-request perfect fingerprints can still be flagged behaviourally on request 50. The mitigation is realistic pacing and warm-up, not perfect single-request fingerprints.
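
For the pacing point in item 3, one possible approach is to draw inter-request delays from a skewed distribution instead of sleeping a fixed interval. The log-normal shape and every parameter below are assumptions for illustration; nothing here is tuned against a real detector.

```python
import math
import random


def paced_delays(n, median=4.0, sigma=0.6, floor=0.8, ceiling=30.0, seed=None):
    """Return n delays in seconds: mostly near the median, occasionally long.
    Values are clamped to [floor, ceiling]."""
    rng = random.Random(seed)
    mu = math.log(median)  # the median of a log-normal is exp(mu)
    return [
        min(ceiling, max(floor, rng.lognormvariate(mu, sigma)))
        for _ in range(n)
    ]


delays = paced_delays(20, seed=1)
# In a scraper loop, each value would be passed to time.sleep() between requests.
```

A fixed `time.sleep(2)` produces a perfectly regular session trajectory; irregular gaps with an occasional long pause are closer to the bursty pattern described above.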

What hasn't changed: the relative cost ranking. Network-layer fixes are still the cheapest, behavioural fixes still the most expensive. Climb the layers only as the previous one stops working.

Code example

```python
# Each signal layer has to be addressed; a scraping API handles all of them.
import requests

resp = requests.post(
    'https://publisher.scrappey.com/api/v1',
    json={'cmd': 'request.get', 'url': 'https://hard-target.com'},
    headers={'Authorization': 'YOUR_API_KEY'},
    timeout=120,  # rendered requests can be slow; fail rather than hang
)
resp.raise_for_status()  # surface HTTP errors before parsing the body

html = resp.json()['solution']['response']
```

Related terms

What Is Anti-Bot Detection?
Anti-bot detection is the set of techniques websites use to distinguish automated traffic from human users — and to block, challenge, or thr…
What Is TLS Fingerprinting (JA3/JA4)?
TLS fingerprinting is a technique that identifies an HTTP client from its TLS handshake — before the server reads a single request byte. The…
What Is Browser Fingerprinting?
Browser fingerprinting is a technique that identifies and tracks a visitor by combining dozens of small, observable characteristics of their…
What Is Browser Fingerprinting Evasion?
Browser fingerprinting evasion is the practice of configuring an automated browser so that the combined fingerprint it presents — canvas, We…
What Is a 200 Status Code?
HTTP 200 OK is the standard success status code: the server received the request, processed it, and returned the expected response body. For…
What Is an Anti-Scraping Mechanism?
An anti-scraping mechanism is any technical control a website uses to detect, slow, or block automated requests. Modern sites stack multiple…
What Is Burp Suite MCP for Scraping Recon?
The Burp Suite MCP Server is an official PortSwigger extension (released 3 April 2025) that exposes Burp's HTTP history, Repeater, Intruder,…
Anti-Bot Vendor Detection Cheatsheet
The first step of any scrape against a protected site is identifying which anti-bot vendor is in front of it. The vendor determines almost e…
What Is the Web Scraping Decision Flow?
The web scraping decision flow is a six-step priority order experienced practitioners follow on any new target. Walk steps in order. Stop at…
What Is Function.toString() Inspection?
Function.prototype.toString() inspection is the technique anti-bot scripts use to detect runtime JavaScript patches. Every JS function expos…
What Is WebGL Fingerprinting?
WebGL fingerprinting reads identifying information directly from the GPU. The browser exposes the graphics card vendor and renderer string (…
What Is Scraper Data Poisoning?
Data poisoning is when a site detects a likely scraper and silently serves different data: fake prices, fabricated reviews, wrong stock coun…
What Is a DOM Honeypot?
A DOM honeypot is an invisible form field or link that humans never see but bots fill in or click. The moment you interact with it, the site…
What Is Behavioural Bot Detection?
Behavioural bot detection is the layer of anti-bot scoring that asks "how does this client act?" rather than "what is it?". It tracks mouse-…
What Is HTTP/2 Fingerprinting?
HTTP/2 fingerprinting identifies an HTTP client from its SETTINGS frame and frame-level behaviour, independent of the TLS layer. Every HTTP/…

Frequently asked questions

Can I scrape without triggering detection?

Not with a single trick. You can lower detection probability dramatically by combining good residential IPs, browser-matching TLS, realistic fingerprints, and unhurried request pacing. Perfect undetectability is not a real target — the goal is to look enough like real users that the cost of blocking you is higher than letting you through.

Which signal is most important to fix first?

IP. Datacenter IPs lose before any other signal is collected. Residential or mobile IPs give every other signal a chance to matter.

Why does my scraper work in a browser but fail headless?

Headless Chrome leaks a dozen tells (navigator.webdriver true, missing chrome.runtime, suspicious permissions API). Use a real browser with stealth patches (Camoufox, PatchRight) or a managed scraping API that handles the fingerprint surface for you.

Is "block at the TLS handshake" really one signal or four?

One. The TLS Client Hello is hashed into JA4 in microseconds and compared to known browsers. If it doesn't match any browser baseline, the connection is dropped before the HTTP layer parses a single byte. No User-Agent, IP, or header can recover from that — it never gets read.

Which signal is the most-checked across vendors?

navigator.webdriver. Every browser-automation framework sets it to true by default (Selenium, Playwright, Puppeteer) and every anti-bot script tests it. It is the cheapest detection and catches all unmodified scrapers. The fix is trivial (override the property) but the more interesting detection — Function.toString() inspection of the override — is what catches the modified ones.

Last updated: 2026-05-27