How Do Websites Detect Web Scrapers?

By the Scrappey Research Team

Websites spot scrapers by gathering hundreds of small clues about each visitor, then scoring how human the whole picture looks. No single clue usually gets you blocked: anti-bot systems add up many signals (an "ensemble" score) and decide based on the total. The clues come from four layers (network, transport, browser, and behavioral), and these layers stay the same even as the exact checks change. Typical signals include IP reputation, TLS fingerprint (TLS is the encryption behind https), HTTP/2 frame ordering, header consistency, JavaScript probes that run in the browser, canvas/WebGL/audio fingerprints (tiny rendering differences unique to your setup), and mouse/timing behavior.

Network layer	IP reputation, ASN, geolocation, connection reuse
Transport layer	TLS JA3/JA4, HTTP/2 frame ordering, ALPN
Browser layer	Canvas, WebGL, audio context, font enumeration, navigator probes
Behavioral layer	Mouse movement, scroll velocity, dwell time, click timing
Decision model	Ensemble score across all signals — no single tell
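
To make the ensemble idea concrete, here is a minimal sketch of a weighted score. The signal names, weights, and thresholds are all hypothetical, chosen only for illustration; real vendors do not publish theirs and generally use trained models rather than a fixed table.

```python
# Hypothetical weights for illustration only; not taken from any vendor.
WEIGHTS = {
    "datacenter_ip": 0.35,
    "non_browser_tls": 0.30,
    "header_mismatch": 0.15,
    "webdriver_flag": 0.10,
    "no_mouse_events": 0.10,
}


def ensemble_score(signals: dict[str, bool]) -> float:
    """Sum the weights of every signal that fired (0.0 = human-like, 1.0 = bot-like)."""
    total = sum(weight for name, weight in WEIGHTS.items() if signals.get(name, False))
    return round(total, 2)


def decide(score: float, challenge_at: float = 0.4, block_at: float = 0.7) -> str:
    """Map the total score to an action; thresholds are assumed values."""
    if score >= block_at:
        return "block"
    if score >= challenge_at:
        return "challenge"
    return "allow"


if __name__ == "__main__":
    visitor = {"datacenter_ip": True, "header_mismatch": True}
    score = ensemble_score(visitor)
    print(score, decide(score))
```

In this sketch a single weak signal, such as the webdriver flag alone, stays under the challenge threshold, which is the point of scoring the total rather than any one clue.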

Network signals (the first filter)

Before any JavaScript runs, the site already knows a lot from your IP address alone: its ASN (the network it belongs to — e.g. Amazon vs. a home ISP), its past reputation, and whether its location makes sense. Datacenter IPs (AWS, GCP, DigitalOcean) get almost no trust by default, because real users rarely browse from a server farm. Residential and mobile IPs start out neutral. IPs caught misbehaving before get blacklisted right at the edge. This one filter handles about 70% of low-effort scraping traffic before any fingerprinting is even needed.
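
The edge filter described above can be sketched as a lookup that runs before anything else. This is an illustration under stated assumptions: the CIDR ranges are reserved documentation ranges standing in for a real ASN database, and the blocklist, trust values, and function names are hypothetical.

```python
import ipaddress

# Hypothetical sample data. These are reserved documentation ranges (RFC 5737),
# used here as stand-ins for a real ASN/reputation database.
NETWORK_CLASSES = {
    "192.0.2.0/24": "datacenter",
    "198.51.100.0/24": "residential",
    "203.0.113.0/24": "mobile",
}
BLOCKLIST = {"198.51.100.7"}  # an IP assumed to have misbehaved before
STARTING_TRUST = {"datacenter": 0.05, "residential": 0.5, "mobile": 0.5, "unknown": 0.3}


def classify_ip(ip: str) -> str:
    """Return the network class for an IP, or "unknown" if no range matches."""
    addr = ipaddress.ip_address(ip)
    for cidr, kind in NETWORK_CLASSES.items():
        if addr in ipaddress.ip_network(cidr):
            return kind
    return "unknown"


def edge_filter(ip: str) -> tuple[str, float]:
    """Block listed IPs outright; otherwise pass along a starting trust value."""
    if ip in BLOCKLIST:
        return ("block", 0.0)
    return ("continue", STARTING_TRUST[classify_ip(ip)])


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.5"):
        print(ip, edge_filter(ip))
```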

Transport signals (TLS and HTTP/2)

Every https connection starts with a TLS handshake, the step where client and server agree on encryption. That handshake exposes a JA3/JA4 fingerprint: the list of cipher suites, extensions, and elliptic curves your client offers, in the exact order it offers them. Python's requests library has a JA3 that instantly says "not a browser." HTTP/2 adds more tells, such as the SETTINGS values a client sends, its stream priorities, and the order of its headers. Real Chrome sends headers in a particular order; curl sends them differently. Anti-bot vendors keep catalogs of known automation-tool fingerprints and block anything that matches.
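
As a rough illustration of why order matters, the sketch below builds a JA3-style string and hashes it. It follows the publicly documented JA3 layout (TLS version, ciphers, extensions, curves, point formats, each list joined by dashes, the whole string hashed with MD5, GREASE values skipped). The numeric values in the usage example are sample inputs rather than a capture from any real client, and the function names are hypothetical.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE placeholders look like 0x0A0A, 0x1A1A, ... 0xFAFA and are skipped."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version: int, ciphers: list[int], extensions: list[int],
               curves: list[int], point_formats: list[int]) -> str:
    """Join each field with '-', and the fields with ',', preserving offered order."""
    fields = [ciphers, extensions, curves, point_formats]
    joined = ["-".join(str(v) for v in field if not is_grease(v)) for field in fields]
    return ",".join([str(version)] + joined)


def ja3_hash(version: int, ciphers: list[int], extensions: list[int],
             curves: list[int], point_formats: list[int]) -> str:
    """MD5 of the JA3 string; the hash is an identifier, not a security measure."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Sample values for illustration only.
    a = ja3_hash(771, [0x0A0A, 4865, 4866], [0, 10, 11], [29, 23], [0])
    b = ja3_hash(771, [4866, 4865], [0, 10, 11], [29, 23], [0])
    print(a != b)  # same ciphers, different order, different fingerprint
```

Because the hash covers the order as well as the contents, two clients that support identical cipher suites can still be told apart.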

Browser signals (JS-collected)

If you make it past the network and transport filters, the page runs JavaScript that quietly inspects your browser. It checks things like the canvas rendering hash (the exact pixels your machine draws), the WebGL renderer string (your graphics hardware), an audio fingerprint, installed fonts, screen size, timezone, languages, the navigator.webdriver flag, and dozens more. Faking any one of these is easy; the hard part is making them all agree with each other. A spoofed canvas paired with a real WebGL value is actually a stronger bot signal than either one alone, because the mismatch gives you away.
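
The "make them all agree" problem can be sketched as a server-side consistency check over a collected fingerprint. The field names and the specific rules below are hypothetical examples, not any vendor's rule set; they only show how a mismatch between two individually plausible values becomes a signal.

```python
def consistency_flags(fp: dict) -> list[str]:
    """Return human-readable reasons a collected fingerprint contradicts itself."""
    flags = []
    user_agent = fp.get("user_agent", "")
    platform = fp.get("platform", "")
    renderer = fp.get("webgl_renderer", "")

    if fp.get("webdriver"):
        flags.append("navigator.webdriver is true")
    if "Windows" in user_agent and not platform.startswith("Win"):
        flags.append("User-Agent says Windows but navigator.platform disagrees")
    if "Windows" in user_agent and "Apple" in renderer:
        flags.append("User-Agent says Windows but WebGL renderer is Apple hardware")
    if not fp.get("languages"):
        flags.append("navigator.languages is empty")
    return flags


if __name__ == "__main__":
    # Hypothetical fingerprint: a spoofed User-Agent on top of real Mac values.
    spoofed = {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
        "platform": "MacIntel",
        "webgl_renderer": "Apple M1",
        "webdriver": False,
        "languages": ["en-US"],
    }
    for reason in consistency_flags(spoofed):
        print(reason)
```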

Behavioral signals (the last layer)

Once the page loads, the site watches how you act: mouse movement, scrolling, how long you wait before clicking, and how fast you fill in forms. Real people move the mouse in jittery, curved paths, scroll in bursts, and pause at random. Scrapers either skip all of this (no mouse event ever fires) or fake it in patterns that machine-learning models recognize with high confidence. This is the layer that catches headless browsers — automated browsers with no visible window — that pass every static fingerprint check.
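
Two of the simplest behavioral features can be sketched in a few lines: how straight a mouse path is, and how regular the timing between events is. Production systems use trained models over many more features; the thresholds and function names here are assumptions for illustration.

```python
import math
import statistics


def path_straightness(points: list[tuple[float, float]]) -> float:
    """Straight-line distance divided by travelled distance (1.0 = perfectly straight)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def timing_variation(timestamps: list[float]) -> float:
    """Coefficient of variation of gaps between events (near 0 = metronome-like)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return 0.0
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean else 0.0


def looks_scripted(points: list[tuple[float, float]], timestamps: list[float],
                   straight_above: float = 0.98, variation_below: float = 0.05) -> bool:
    """Flag sessions with no mouse events, or with straight paths and uniform timing."""
    if not points:
        return True
    return (path_straightness(points) > straight_above
            and timing_variation(timestamps) < variation_below)


if __name__ == "__main__":
    print(looks_scripted([(0, 0), (10, 10), (20, 20)], [0, 10, 20]))
    print(looks_scripted([(0, 0), (10, 0), (10, 10), (0, 10)], [0, 35, 48, 120]))
```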

A worked example — what a single request reveals

Take one GET request to an Akamai-protected site from a plain Python requests script. Here is what each layer sees:

Layer	What's observed	Verdict
Network (IP)	AWS us-east-1 datacenter ASN	Bot
Transport (TLS)	JA4 hash matches Python urllib3, not Chrome	Bot
Transport (HTTP)	No HTTP/2; connection negotiates HTTP/1.1	Bot
Headers	Accept-Encoding: gzip, no Accept-Language, User-Agent claims Chrome	Incoherent, so bot
Browser (JavaScript)	No script execution; sensor.js never ran	Bot or non-browser

Every layer independently flags this as a bot. Akamai returns a 412 status with the Pardon Our Interruption page, the _abck cookie stays stuck at ~-1~ (its "not verified" state), and any protected XHR endpoints refuse to work because of that cookie. The bot was already caught at the TLS handshake — every layer after that just confirmed it.
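
The header row in the table can be reproduced with a small coherence check. This is a sketch, assuming that a current desktop Chrome on https normally sends Accept-Language, advertises br in Accept-Encoding, and includes sec-ch-ua client hint headers; the function name and the exact rule list are hypothetical.

```python
def header_coherence_issues(headers: dict[str, str]) -> list[str]:
    """List ways the headers contradict a User-Agent that claims to be Chrome."""
    lowered = {name.lower(): value for name, value in headers.items()}
    issues = []
    claims_chrome = "chrome/" in lowered.get("user-agent", "").lower()
    if not claims_chrome:
        return issues  # no browser claim, so nothing to contradict

    if "accept-language" not in lowered:
        issues.append("missing Accept-Language")
    encodings = [e.strip() for e in lowered.get("accept-encoding", "").split(",")]
    if "br" not in encodings:
        issues.append("Accept-Encoding lacks br")
    if not any(name.startswith("sec-ch-ua") for name in lowered):
        issues.append("missing sec-ch-ua client hints")
    return issues


if __name__ == "__main__":
    # Headers resembling the worked example: a Chrome User-Agent pasted
    # into an otherwise default script.
    scripted = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36",
        "Accept-Encoding": "gzip",
    }
    print(header_coherence_issues(scripted))
```

A request that does not claim to be a browser returns no issues here, which mirrors the point that the mismatch, not the missing header itself, is what gives the client away.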

Now run the same request with curl_cffi + Chrome impersonation + an ISP residential proxy: the JA4 matches a real Chrome, HTTP/2 works, the headers line up, and the IP looks residential. The same endpoint now returns 200. Nothing changed except the network- and transport-layer signals: IP, TLS, HTTP/2, and headers.

How this is shifting in 2026

Three trends are reshaping how detection works:

JA4 has fully replaced JA3 across major vendors. Matching only an old JA3 profile now produces a "wrong-shape Chrome" signal, because vendors check both. curl_cffi, utls, and tls-client all support JA4 — there is no reason to be stuck on JA3 in 2026.
WASM challenges are now standard at the enterprise tier. WASM is compiled code that runs in the browser, harder to inspect or fake than plain JavaScript. DataDome's boring_challenge shipped in 2023; Akamai and PerimeterX added WASM probes through 2024. These can no longer be addressed at the JavaScript layer (see the WASM fingerprinting entry); handling them has moved down into the browser engine itself (Camoufox, CloakBrowser).
Behavioral signals are tracked per-session, not per-request. Vendors now collect clicks, scrolls, and timing across a whole session and score the overall pattern. A single request with a flawless fingerprint can still get flagged by your behavior on request 50. The fix is realistic pacing and warm-up over time, not a perfect one-off fingerprint.
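
Per-session scoring can be pictured as a running total that small per-request risks feed into. The class below is a hypothetical sketch: the decay factor, threshold, and per-request risk value are assumed numbers, chosen so that no single request trips the threshold but a long enough run of slightly suspicious ones does.

```python
class SessionRisk:
    """Accumulate per-request risk with decay, and flag the session past a threshold."""

    def __init__(self, threshold: float = 1.0, decay: float = 0.98) -> None:
        self.threshold = threshold
        self.decay = decay
        self.score = 0.0
        self.requests = 0

    def observe(self, request_risk: float) -> bool:
        """Fold one request's risk into the session score; True means flagged."""
        self.score = self.score * self.decay + request_risk
        self.requests += 1
        return self.score >= self.threshold


def first_flagged_request(risk_per_request: float, limit: int = 1000):
    """Return the request number at which the session is first flagged, or None."""
    session = SessionRisk()
    for _ in range(limit):
        if session.observe(risk_per_request):
            return session.requests
    return None


if __name__ == "__main__":
    print(first_flagged_request(0.03))   # flagged only after many requests
    print(first_flagged_request(0.005))  # low enough to never cross the threshold
```

With decay, a session whose per-request risk is low enough settles below the threshold and is never flagged, which is the effect realistic pacing aims for.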

What hasn't changed: the cost ranking of fixes. Network-layer fixes are still the cheapest, behavioral fixes still the most expensive. Move up a layer only when fixes at the layer below stop being enough.

Frequently asked questions

Can I scrape without triggering detection?

Not with one trick. You can lower the chance of detection a lot by combining good residential IPs, browser-matching TLS, realistic fingerprints, and unhurried request pacing. Perfect invisibility isn't a realistic goal — the aim is to look enough like a real user that blocking you costs the site more than letting you through.

Which signal is most important to fix first?

The IP. Datacenter IPs lose before any other signal is even collected. A residential or mobile IP is what gives every other signal a chance to matter.

Why does my scraper work in a browser but fail headless?

Headless Chrome (a browser with no visible window) leaks a dozen tells: navigator.webdriver is set to true, chrome.runtime is missing, the permissions API behaves oddly, and more. Use a real browser with a consistent, full-stack configuration (Camoufox, PatchRight), or a managed scraping API that handles the whole fingerprint surface for you on sites you are permitted to access.

Is "block at the TLS handshake" really one signal or four?

One. The TLS Client Hello (the opening message of an https connection) is hashed into a JA4 fingerprint in microseconds and compared against known browsers. If it doesn't match any browser baseline, a strict configuration can drop the connection before the HTTP layer reads a single byte. No User-Agent or other header can save you at that point, because none of them ever get read.

Which signal is the most-checked across vendors?

navigator.webdriver — a browser property that flags automation. Every automation framework sets it to true by default (Selenium, Playwright, Puppeteer), and nearly every anti-bot script tests for it. It is the cheapest possible check and catches all unmodified scrapers. Overriding the property is trivial, but the more interesting check — using Function.toString() to spot that the property has been tampered with — is what catches the modified ones.

Last updated: 2026-05-31