Live verdict vs. inferred verdict
A fingerprint test page - the CreepJS / BrowserScan family - produces an inferred verdict: it inspects what the browser reports and predicts how identifiable it is, but no gate is actually deciding whether to let you through. A live benchmark instead drives a tool against real targets behind real anti-bot gates and records what each gate does: allowed, served a challenge, or blocked. Because the live test measures the decision a production system makes, it captures factors a client-side page never sees. The inferred test is repeatable and cheap, but blind to the network and server side. Good stealth benchmarking is explicit about which of the two it is reporting.
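One way to stay explicit about which verdict is being reported is to make the kind part of the record itself. The sketch below is illustrative only: the names `Kind`, `Outcome`, and `Verdict`, the 0-to-1 `score` field, and the example tool and site names are all assumptions, not the format of any particular benchmark.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    INFERRED = "inferred"  # predicted by a fingerprint test page
    LIVE = "live"          # decided by a real anti-bot gate


class Outcome(Enum):
    ALLOWED = "allowed"
    CHALLENGED = "challenged"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class Verdict:
    tool: str
    source: str                        # test page or target site (hypothetical names)
    kind: Kind
    outcome: Optional[Outcome] = None  # set only for live verdicts
    score: Optional[float] = None      # set only for inferred verdicts (assumed 0..1)

    def __post_init__(self) -> None:
        # Refuse records that mix the two kinds of evidence.
        if self.kind is Kind.LIVE and (self.outcome is None or self.score is not None):
            raise ValueError("a live verdict carries a gate outcome and no score")
        if self.kind is Kind.INFERRED and (self.score is None or self.outcome is not None):
            raise ValueError("an inferred verdict carries a score and no gate outcome")

    def report(self) -> str:
        # Always print the kind first so the two are never confused in a table.
        if self.kind is Kind.LIVE:
            return f"[live] {self.tool} @ {self.source}: {self.outcome.value}"
        return f"[inferred] {self.tool} @ {self.source}: score {self.score:.2f}"


inferred = Verdict("tool-a", "fingerprint-page", Kind.INFERRED, score=0.82)
live = Verdict("tool-a", "shop.example", Kind.LIVE, outcome=Outcome.BLOCKED)
print(inferred.report())
print(live.report())
```

Keeping the score and the gate outcome in separate fields means a high inferred score can never be silently read as "allowed".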
Why identical fingerprints get opposite verdicts
The same client fingerprint can pass on one run and be blocked on another because production detectors weigh signals the fingerprint does not contain: IP reputation (a residential address versus a datacenter ASN), the TLS handshake fingerprint, HTTP/2 frame ordering, the shape of the automation protocol driving the browser, and session or behavioural history. Gates also cross-check these layers for agreement. A Linux server that presents itself as a desktop browser and exits through a residential proxy sends signals that disagree with one another, and a gate can flag that contradiction. An identical browser fingerprint can therefore pass in one network context and fail in another. That is why an inferred score and a live verdict are not interchangeable, and why calibration between them is an open question rather than a fixed number.
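The cross-checking idea can be illustrated with a toy consistency check. This is a sketch under loud assumptions: real gates do not publish their rules, and the field names (`claimed_os`, `tls_browser`, `host_os`, `ip_type`), the flag names, and the three rules below are invented for illustration. The point is only that the same `BrowserFingerprint` yields different flags in different network contexts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BrowserFingerprint:
    claimed_os: str       # OS the browser reports, e.g. via the user agent
    claimed_browser: str  # browser family the browser reports


@dataclass(frozen=True)
class NetworkContext:
    ip_type: str      # "residential" or "datacenter"
    tls_browser: str  # browser family the TLS handshake resembles
    host_os: str      # OS observed from outside the browser (hypothetical input)


def cross_check(fp: BrowserFingerprint, net: NetworkContext) -> List[str]:
    """Return the disagreements between what the browser claims and what
    the surrounding layers suggest. The rules are illustrative only."""
    flags = []
    if fp.claimed_os != net.host_os:
        flags.append("os_mismatch")
    if fp.claimed_browser != net.tls_browser:
        flags.append("tls_mismatch")
    if net.ip_type == "datacenter":
        flags.append("datacenter_ip")
    return flags


# One fingerprint, two network contexts.
fp = BrowserFingerprint(claimed_os="windows", claimed_browser="chrome")
desktop = NetworkContext(ip_type="residential", tls_browser="chrome", host_os="windows")
server = NetworkContext(ip_type="residential", tls_browser="chrome", host_os="linux")

print(cross_check(fp, desktop))  # no disagreement between layers
print(cross_check(fp, server))   # the Linux-server case from the text
```

A client-side test page would see the same `fp` in both cases, so it has no way to tell these two runs apart.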
Why benchmarks are snapshots
Public stealth benchmarks are best read as dated snapshots, not standings. They are typically one author's run, on one operating system and often from a single IP, against targets that swap anti-bot vendors without notice, using tools that change fast - many are pre-alpha and shift behaviour between versions. Rotating the proxy can reorder the results; a browser-version difference can be mistaken for a tool difference. This is distinct from a browser-automation engine benchmark, which compares tools on memory, CPU, and speed; stealth benchmarking measures detectability itself. Treat either kind of benchmark as a directional reference, and re-test against your own targets, from your own IPs, on your own dates, rather than quoting a leaderboard as settled fact.
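When re-testing, it helps to record the context of each run so that a difference in results is not attributed to the tool when something else also changed. The following is a minimal sketch; `RunContext`, its fields, `confounders`, and the example tool names are hypothetical, and the IPs come from the reserved documentation ranges.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List


@dataclass(frozen=True)
class RunContext:
    tool: str
    tool_version: str
    browser_version: str
    os: str
    exit_ip: str
    run_date: date


def confounders(a: RunContext, b: RunContext) -> List[str]:
    """List every field other than the tool itself that differs between
    two runs. A non-empty result means the runs are not a clean
    tool-versus-tool comparison."""
    tool_fields = {"tool", "tool_version"}
    return [
        f.name
        for f in fields(RunContext)
        if f.name not in tool_fields and getattr(a, f.name) != getattr(b, f.name)
    ]


run_a = RunContext("tool-a", "0.3.1", "124", "linux", "203.0.113.10", date(2024, 5, 1))
run_b = RunContext("tool-b", "0.1.0", "126", "linux", "198.51.100.7", date(2024, 5, 1))

print(confounders(run_a, run_b))
```

Here the two runs differ in browser version and exit IP as well as in tool, so any gap in their results cannot be pinned on the tool alone.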
