How DataDome detects bots and scrapers (2026)

DataDome fronts roughly 1,200 enterprise sites — Etsy, Hermès, Leboncoin, Tripadvisor, Reuters, MarketWatch, WSJ, Wellfound — and is known for catching automation that passes Cloudflare without issue. It's worth understanding on its own terms because its architecture is unusual: per-site ML models, application-layer scoring rather than CDN-edge, and a WebAssembly challenge that runs in the browser.

This is a reference on how DataDome is structured and what each detection layer measures.

Quick facts

Coverage: ~1,200 enterprise sites
Model: Per-site machine-learning
Cookie: datadome
Known for: Catching bots that pass Cloudflare
Best approach: Residential IPs + real-browser fingerprint

What DataDome is

DataDome is a reverse-proxy WAF that runs at the application server, not at the CDN edge. Every request is forwarded synchronously to DataDome's scoring service, which returns a verdict in roughly 2 ms. Scoring is per-customer: around 85,000 customer-specific ML models in total (many more than one per site, given roughly 1,200 protected sites), so the same TLS, browser and proxy combination can pass on one DataDome customer and fail on another.

Low-trust requests surface as one of:

  • A silent 403 with the x-datadome header set.
  • A GeeTest-style slider captcha served inline.
  • A block page with a Reference #.
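
To make the three outcomes concrete, here is a minimal sketch of how a client might label a response it received. The function name and the marker strings (`captcha-delivery.com` for the inline captcha, `reference #` for the block page) are assumptions for illustration, not an official list of DataDome markers.

```python
def classify_response(status, headers, body):
    """Label a response with the block type it most resembles.

    The marker strings below are illustrative assumptions.
    """
    # Header names are case-insensitive, so normalise before lookup.
    names = {name.lower() for name in headers}
    text = body.lower()

    if "captcha-delivery.com" in text:  # assumed marker for the inline captcha
        return "captcha"
    if "reference #" in text:  # block page carrying a reference ID
        return "block_page"
    if status == 403 and "x-datadome" in names:
        return "silent_403"
    return "ok"
```

Checking the body before the status code matters here because the captcha and block page may also arrive with a 403 and the same header.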

The four signal categories

1. IP address reputation

IP reputation accounts for roughly 25–30% of the score on its own — the heaviest single input.

  • Datacenter IPs (AWS, GCP, Azure, DigitalOcean, OVH…) — pre-scored low. DataDome maintains one of the more accurate datacenter-range databases in the industry; many of these ranges are blanket-blocked on Etsy and Leboncoin before any other check runs.
  • Residential IPs — assigned by ISPs to home connections, higher baseline trust.
  • Mobile IPs — cell tower and CGNAT pools, highest baseline trust.
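
As a rough illustration of how a category baseline and a 25–30% weight combine, the sketch below maps an address to a category and computes the IP share of a 0 to 1 trust score. Everything in it is hypothetical: the range table uses RFC 5737 documentation networks as stand-ins, and the baseline values and the 0.3 weight are made-up numbers, not DataDome's.

```python
import ipaddress

# Hypothetical range table: RFC 5737 documentation networks stand in for real ones.
RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "datacenter"),
    (ipaddress.ip_network("198.51.100.0/24"), "residential"),
    (ipaddress.ip_network("203.0.113.0/24"), "mobile"),
]

# Hypothetical baseline trust per category, from 0 (low) to 1 (high).
BASELINE_TRUST = {"datacenter": 0.1, "residential": 0.7, "mobile": 0.9, "unknown": 0.5}


def ip_category(ip):
    """Return the category of the first range containing the address."""
    addr = ipaddress.ip_address(ip)
    for network, category in RANGES:
        if addr in network:
            return category
    return "unknown"


def ip_contribution(ip, weight=0.3):
    """Share of a 0..1 trust score contributed by IP reputation alone."""
    return weight * BASELINE_TRUST[ip_category(ip)]
```

With these illustrative numbers, a datacenter address contributes 0.03 and a mobile one 0.27, which is the kind of gap that can decide a verdict before any other signal is considered.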

2. The WASM boring_challenge and the datadome cookie

DataDome's signature component is the WASM boring_challenge — a Rust-compiled state machine served as WebAssembly and executed in the browser. It produces a token that's POSTed to js.datadome.co, which then sets the datadome cookie that authorizes future requests.

Because the challenge is real WASM running against real browser APIs, it can't be solved without an actual browser execution context. The challenge also probes the CPU via SIMD timing, which exposes headless environments in a way that no stealth-browser JS patch covers. The sensor itself collects the usual fingerprint surface (canvas, WebGL, audio, fonts, screen metrics, timezone, navigator.webdriver, window.chrome) and feeds it into the WASM state.
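
The kind of cross-checking a scorer can do on that fingerprint surface can be sketched as follows. This is an illustration only: the dictionary keys (`webdriver`, `user_agent`, `has_window_chrome`, `timezone`, `ip_timezone`) are hypothetical names, and the three rules are common-sense consistency checks rather than DataDome's actual logic.

```python
def fingerprint_flags(fp):
    """Return the inconsistencies found in a collected fingerprint dict."""
    flags = []

    # navigator.webdriver is true in browsers driven by automation tooling.
    if fp.get("webdriver"):
        flags.append("navigator.webdriver is true")

    # A Chrome user agent normally comes with a window.chrome object.
    user_agent = fp.get("user_agent", "")
    if "Chrome/" in user_agent and not fp.get("has_window_chrome", False):
        flags.append("Chrome user agent without window.chrome")

    # The browser timezone should agree with the timezone implied by the IP.
    browser_tz, ip_tz = fp.get("timezone"), fp.get("ip_timezone")
    if browser_tz and ip_tz and browser_tz != ip_tz:
        flags.append("browser timezone differs from IP timezone")

    return flags
```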

3. HTTP and TLS fingerprinting

DataDome is one of the few WAFs that publicly markets HTTP/2 fingerprinting as a detection layer.

  • Most scraping libraries still default to HTTP/1.1. Real Chrome and Firefox haven't defaulted to it in years.
  • libcurl and Go's net/http produce JA3 signatures that don't match any real browser, even when they negotiate HTTP/2.
  • HTTP/2 fingerprinting tracks pseudo-header order, SETTINGS frame values, and window-update sizes.
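
A minimal sketch of how those three HTTP/2 properties can be folded into one comparable string is shown below. The layout is a simplification loosely modelled on publicly described HTTP/2 fingerprint formats; DataDome's own format is not public, and the function name and the sample values are illustrative assumptions.

```python
def http2_fingerprint(settings, window_update, pseudo_headers):
    """Join HTTP/2 connection parameters into one comparable string.

    settings: list of (id, value) pairs in the order the client sent them.
    window_update: increment from the first WINDOW_UPDATE frame (0 if none).
    pseudo_headers: pseudo-header names in the order sent, e.g. [":method", ...].
    """
    settings_part = ";".join(f"{sid}:{value}" for sid, value in settings)
    # Abbreviate each pseudo-header to the first letter after the colon.
    order_part = ",".join(name[1] for name in pseudo_headers)
    return f"{settings_part}|{window_update}|{order_part}"


# Illustrative values only; two clients differing in pseudo-header order
# produce different strings even with identical SETTINGS.
client_a = http2_fingerprint(
    [(1, 65536), (3, 1000), (4, 6291456), (6, 262144)],
    15663105,
    [":method", ":authority", ":scheme", ":path"],
)
client_b = http2_fingerprint(
    [(1, 65536), (3, 1000), (4, 6291456), (6, 262144)],
    15663105,
    [":method", ":path", ":authority", ":scheme"],
)
```

Because ordering is preserved rather than sorted, a library that sends the right values in the wrong order still ends up with a distinct fingerprint.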

4. Behavioural and pattern analysis

DataDome runs continuous ML pattern analysis on connection history:

  • The datadome cookie sent from a different IP than the one that minted it.
  • Reused sensor payloads across pages instead of fresh ones per navigation.
  • Honeypot link hits.
  • Bursty request timing.
  • Missing real-browser headers (Sec-Fetch-*, Accept-Language, sec-ch-ua).
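
Three of the signals above are simple enough to sketch from the defender's side. The helper names, the header subset in `REQUIRED_HEADERS`, and the burst thresholds are all assumptions chosen for illustration; a production system would learn such thresholds rather than hard-code them.

```python
# Assumed subset of headers a real browser sends on navigation requests.
REQUIRED_HEADERS = ("sec-fetch-site", "accept-language", "sec-ch-ua")


def missing_browser_headers(headers):
    """Return the expected header names absent from a request."""
    present = {name.lower() for name in headers}
    return [name for name in REQUIRED_HEADERS if name not in present]


def cookie_ip_mismatches(events):
    """events: iterable of (cookie, ip). Return cookies seen on more than one IP."""
    first_ip = {}
    mismatched = set()
    for cookie, ip in events:
        # setdefault records the first IP seen for a cookie and returns it.
        if first_ip.setdefault(cookie, ip) != ip:
            mismatched.add(cookie)
    return mismatched


def is_bursty(timestamps, max_requests=5, window=1.0):
    """True if more than max_requests fall inside any span of `window` seconds."""
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Slide the window start forward until it covers at most `window` seconds.
        while current - ordered[start] > window:
            start += 1
        if end - start + 1 > max_requests:
            return True
    return False
```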

What this means for developers

The per-site model architecture means there is no single "DataDome solution" — a setup that works on a news customer may fail on an e-commerce one with stricter scoring. Three patterns are common in production:

  1. Look in the initial HTML first. Many DataDome-protected Next.js sites embed full page state in a __NEXT_DATA__ script tag. If the data is in the first HTML response, the WASM challenge never runs because there is no XHR to gate. curl_cffi + a residential proxy is sufficient for those cases.
  2. Mobile or ISP residential proxies for XHR endpoints — IP weighting is so heavy that switching from datacenter to mobile-4G frequently flips a session from blocked to 200 OK with no other change.
  3. Real browser execution when the page actually runs the WASM challenge — Camoufox with aligned IP/timezone/locale, or a managed scraping API.
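
For the first pattern, pulling the embedded state out of an HTML document needs only the standard library. The sketch below assumes the page follows the usual Next.js convention of a `<script id="__NEXT_DATA__">` tag containing JSON; the class and function names are hypothetical, and the HTML string is assumed to have been fetched already.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    return json.loads(raw) if raw.strip() else None
```

If `extract_next_data` returns `None`, the page does not ship its state in the first response and one of the other two patterns applies.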

For reference, a minimal managed-API example:

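import requests

# Send one request.get command through the managed API.
response = requests.post(
    'https://publisher.scrappey.com/api/v1',
    json={
        'cmd': 'request.get',
        'url': 'https://example.com/search?q=...',
        'session': 'dd-session-1'
    },
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    timeout=120,  # requests has no default timeout; without one a stalled call hangs
)
response.raise_for_status()  # surface HTTP errors before reading the payload
print(response.json()['solution']['response'])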

DataDome is particularly sensitive to IP/cookie mismatches — the datadome cookie minted on one IP is treated with suspicion when sent from another, so a stable exit IP per session matters.

Sites commonly fronted by DataDome

E-commerce, classifieds, news and travel dominate: Etsy.com, Hermes.com, Leboncoin.fr, Marketwatch.com, Reuters.com, Tripadvisor.com, WSJ.com, Wellfound.com. Many of these rotate between DataDome, Cloudflare, Akamai and PerimeterX depending on conditions.

Summary

DataDome scores each request in ~2 ms against a per-site ML model using IP reputation (25–30% of the score), the WASM boring_challenge and datadome cookie, TLS and HTTP/2 fingerprints, and behavioural patterns. The per-customer architecture means detection behaviour varies between sites even when the underlying signals don't, which is the main reason setups that work on one DataDome target may not generalise to another.

Frequently asked questions

Why does DataDome catch bots that pass Cloudflare?

It trains per-site ML models on device, network, and behavioural signals, so it adapts to each target rather than applying one generic ruleset. Generic stealth setups that beat Cloudflare often still look anomalous to it.

What triggers a DataDome block?

Datacenter IPs, fingerprint inconsistencies, and behavioural anomalies push the score over the threshold. The result is a silent 403, an inline slider captcha, or a block page with a reference number.

Which sites use DataDome?

Etsy, Hermès, Leboncoin, Tripadvisor, Reuters, and the WSJ are among roughly 1,200 enterprise sites.

Last updated: 2026-05-28