What Is a 404 Not Found Error?

By the Scrappey Research Team

HTTP 404 Not Found is the server's way of saying "I understood your request, but there is nothing at this address." The server is working fine - it just has no page, file, or data at the URL you asked for. On the normal web a 404 is straightforward: the page is gone or never existed. In scraping it is trickier: some anti-bot systems (tools that detect and block automated traffic) send a fake 404 to hide the fact they are blocking you, and JavaScript-heavy sites can show a 404-looking page that is actually fine once the browser runs its scripts.

Status family	4xx — client error
Honest meaning	URL does not exist on this server
Suspicious meaning	Anti-bot system returning 404 instead of 403 to obscure the block
Retry safe?	Usually no — but worth trying with a different IP or fingerprint if you suspect cloaking
Detection trick	Compare response from a browser vs your scraper; if browser works, it is a block

When 404 is honest

Most 404s are real: a typo in the URL, a product that has been delisted, an old article taken down, or a path that never existed. When this happens, record the 404, mark the URL as dead in your work queue, and move on. Repeatedly hitting a dead URL just wastes requests and pushes the target site's rate limiter (the system that throttles clients sending too many requests) to flag your IP.
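
The bookkeeping described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a production queue: the `CrawlQueue` class and its method names are hypothetical and do not come from any library.

```python
from collections import deque


class CrawlQueue:
    """Hypothetical in-memory work queue that remembers dead URLs."""

    def __init__(self, urls):
        self.pending = deque(urls)
        self.dead = set()

    def next_url(self):
        # Skip anything already marked dead so it is never requested again.
        while self.pending:
            url = self.pending.popleft()
            if url not in self.dead:
                return url
        return None

    def record_status(self, url, status_code):
        # Treat a 404 as final: record it and do not re-queue the URL.
        if status_code == 404:
            self.dead.add(url)

    def add(self, url):
        # Ignore URLs that are already known to be dead.
        if url not in self.dead:
            self.pending.append(url)
```

Keeping the dead set separate from the pending queue means a URL rediscovered later through link extraction is dropped before it costs another request.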

When 404 is a block

Some anti-bot stacks deliberately return 404 to scrapers instead of 403, on the theory that "page not found" is less useful to you than "you are blocked" - it gives you less to react to. Cloudflare, DataDome, and a handful of in-house systems do this. The giveaway: the page loads fine in a real browser on your machine but consistently 404s from your scraper. The fix is the same as for any block - a cleaner IP reputation, a more realistic browser fingerprint (the set of signals that make your traffic look like a normal browser), and a slower request rate.

When 404 is a rendering problem

Single-page apps (sites that load one HTML page and then build every view with JavaScript) often serve the same 404-shaped HTML shell for every URL, with the real content filled in by the browser after a follow-up fetch. If you scrape the raw HTML you see "404" or an empty body; if you actually run the JavaScript, the page loads normally. The clue is a mismatched content-type or a near-empty response body - switch to a JS-rendering API (one that runs the page's scripts for you) or grab the underlying XHR endpoint (the background data request the page makes) directly.
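
One way to turn those two clues into a check is sketched below. It is a heuristic under stated assumptions: the function name `looks_like_spa_shell`, the `expected_type` default, and the 200-character threshold are all illustrative choices, and a site with an unusual shell could fool it in either direction.

```python
import re


def looks_like_spa_shell(content_type, body, expected_type="text/html",
                         max_visible_chars=200):
    """Guess whether a 404 response is an unrendered SPA shell (heuristic only)."""
    # Clue 1: the Content-Type header is not what we expected for this URL.
    mismatched = expected_type not in content_type.lower()

    # Clue 2: the body loads scripts but has almost no visible text
    # once script/style blocks and the remaining tags are stripped.
    has_script = "<script" in body.lower()
    without_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", body)
    visible = " ".join(re.sub(r"(?s)<[^>]+>", " ", without_scripts).split())
    near_empty_shell = has_script and len(visible) < max_visible_chars

    return mismatched or near_empty_shell
```

A plain server-generated 404 page usually has no script tags, which is why the sketch requires one before treating a short body as a shell.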

Code example

python

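import requests

def diagnose_404(url, timeout=10):
    """Rough check: does a browser-like User-Agent get a different answer than a bare client?"""
    # Placeholder UA; substitute a full User-Agent string copied from a real browser.
    headers = {'User-Agent': 'Mozilla/5.0 (real browser UA)'}
    browser_like = requests.get(url, headers=headers, timeout=timeout)
    bare = requests.get(url, timeout=timeout)
    # Browser-like request succeeds where the bare client 404s: likely a cloaked block.
    if browser_like.status_code == 200 and bare.status_code == 404:
        return 'cloaked_block'
    # Both 404: probably a real 404, though an IP- or fingerprint-based block looks the same.
    if browser_like.status_code == 404 and bare.status_code == 404:
        return 'real_404'
    return 'inconclusive'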

Frequently asked questions

Should I retry 404s in a crawl?

Usually no - mark the URL dead and move on. It is worth one retry through a different IP or with a real browser fingerprint if you suspect the site is disguising blocks as 404s.
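
A sketch of that single-retry policy, assuming you supply the two fetchers yourself: `fetch` and `fetch_alternate` are hypothetical callables that take a URL and return an HTTP status code, with the second one going through a different IP or a real browser fingerprint.

```python
def fetch_with_one_retry(url, fetch, fetch_alternate):
    """Try once normally; on a 404, retry exactly once through the alternate route."""
    status = fetch(url)
    if status != 404:
        return status, "primary"

    # One retry only: a second 404 through a different route means the URL is dead.
    retry_status = fetch_alternate(url)
    if retry_status == 404:
        return 404, "dead"

    # The alternate route got a different answer, so the first 404 was likely a disguised block.
    return retry_status, "suspected_cloaking"
```

Returning a label alongside the status lets the crawler mark the URL dead in one case and switch its fetch strategy in the other.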

Why would a site return 404 instead of 403?

To hide that it is blocking you. A 403 tells the scraper "you are detected, try harder." A 404 tells it "nothing here, give up." It is a deliberate tactic, not a bug.

How do I crawl an SPA that returns 404 for the raw HTML?

Either render the JavaScript (with Playwright or a JS-rendering scraping API) or figure out the XHR endpoint the SPA calls to load its data and request that directly - usually cheaper and faster than full rendering.

Last updated: 2026-05-31