When 404 is honest
Most 404s are real: a typo in the URL, a product that has been delisted, an old article taken down, or a path that never existed. Cache the 404, mark the URL dead in your work queue, and move on. Hammering a dead URL just wastes requests and trains the target site's rate limiter on your IP.
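A minimal sketch of the "cache the 404 and mark the URL dead" step, using only the standard library. The names `DeadUrlCache` and `record_response` are hypothetical. The expiry window is an added assumption, not part of the advice above; it is there so a URL that comes back later is eventually retried.

```python
import time


class DeadUrlCache:
    """Remembers URLs that returned 404 so the crawler does not refetch them."""

    def __init__(self, ttl_seconds=7 * 24 * 3600, clock=time.time):
        # ttl_seconds is an assumed default; tune it to how often the site changes.
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._dead = {}  # url -> timestamp at which the 404 was seen

    def mark_dead(self, url):
        self._dead[url] = self._clock()

    def is_dead(self, url):
        seen_at = self._dead.get(url)
        if seen_at is None:
            return False
        if self._clock() - seen_at > self.ttl_seconds:
            # Entry expired: forget it so the URL can be tried again.
            del self._dead[url]
            return False
        return True


def record_response(cache, url, status_code):
    """Return True if the URL should stay in the work queue."""
    if status_code == 404:
        cache.mark_dead(url)
        return False
    return True
```

Before each fetch, check `cache.is_dead(url)` and skip the request when it returns True.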
When 404 is a block
Some anti-bot stacks deliberately return 404 to scrapers instead of 403, on the theory that "page not found" is less actionable than "you are blocked." Cloudflare, DataDome, and a handful of in-house systems can be configured to do this. The tell: the page works in a real browser from your machine but consistently 404s from your scraper. The fix is the same as for any block: better IP reputation, a more realistic fingerprint, and a slower request rate.
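The tell described above can be written down as a small decision rule. This is a sketch under assumptions: `classify_404` is a hypothetical helper, `scraper_statuses` is a list of status codes your scraper saw for one URL across several attempts, and `browser_status` is what you observed when loading the same URL in a real browser from the same machine.

```python
def classify_404(scraper_statuses, browser_status):
    """Classify a 404 as 'honest', 'suspected_block', or 'inconclusive'.

    scraper_statuses: status codes from repeated scraper requests to one URL.
    browser_status: status code seen in a real browser from the same machine.
    """
    if not scraper_statuses:
        raise ValueError("need at least one scraper observation")

    # "Consistently 404s" means every scraper attempt failed the same way.
    if not all(status == 404 for status in scraper_statuses):
        return "inconclusive"

    if browser_status == 404:
        # The browser agrees with the scraper: the page is really gone.
        return "honest"
    if 200 <= browser_status < 300:
        # Works in the browser, always 404 for the scraper: likely a block.
        return "suspected_block"
    return "inconclusive"
```

A `suspected_block` result is a hint to change IP, fingerprint, or request rate, not proof; a single attempt is weak evidence, so collect a few before acting on it.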
When 404 is a rendering problem
Single-page apps often serve the same 404-shaped HTML shell for every URL, with the real content rendered client-side after a fetch call. If you scrape the raw HTML you see "404" or an empty body; if you execute the JS, the page loads normally. The signal is a Content-Type that does not match what you expected, or a response body with almost no visible text. Switch to a JS-rendering API or capture the underlying XHR endpoint directly.
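One way to check for this signal on a response you have already fetched is sketched below with the standard library's `html.parser`. The helper names, the `text/html` expectation, and the 200-character threshold are all assumptions for illustration; calibrate them against pages you know render correctly.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with scripts and styles removed, whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_like_js_shell(content_type, body, expected_type="text/html",
                        min_text_chars=200):
    """Heuristic: True if the response probably needs JS rendering.

    min_text_chars is an assumed threshold, not a standard value.
    """
    if expected_type not in content_type.lower():
        return True  # Content-Type does not match what was expected
    return len(visible_text(body)) < min_text_chars
```

When this returns True, retry the URL through a JS-rendering path or look for the XHR endpoint the page calls. A short but legitimate page can also trip the threshold, so treat the result as a prompt to investigate.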
