Why polite wins
Sites notice aggressive crawlers within minutes: high concurrency, ignored robots.txt rules, and sustained requests into 5xx errors without backoff all stand out in logs. The response is a block at the IP, ASN, or fingerprint level. A polite crawler operates below the noise threshold and runs for hours or days without intervention. The math favors politeness: 10 requests/second sustained over an hour is 36,000 pages; 100 requests/second blocked after five minutes is 30,000, and you have burned the IP.
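The arithmetic above can be checked directly. This is a minimal sketch: the function name `pages_fetched` is illustrative, and the rates and durations are the ones quoted in the paragraph.

```python
def pages_fetched(rate_per_s: int, duration_s: int) -> int:
    """Pages fetched at a constant request rate before the crawl ends."""
    return rate_per_s * duration_s


polite = pages_fetched(10, 60 * 60)      # one hour at 10 req/s
aggressive = pages_fetched(100, 5 * 60)  # blocked after five minutes at 100 req/s

# Seconds the polite crawler needs to match the aggressive crawler's total.
break_even_s = aggressive // 10

print(polite)        # 36000
print(aggressive)    # 30000
print(break_even_s)  # 3000, i.e. 50 minutes
```

Under these numbers the polite crawler pulls ahead after 50 minutes and keeps going, while the aggressive one is finished.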
The practical recipe
Per host: cap concurrency at 1-5 connections and insert a 100-1000 ms delay between requests. If robots.txt sets a Crawl-delay, treat it as the minimum delay. On 429 or 503, back off exponentially (1s, 2s, 4s, 8s, 16s) and stop after 5 attempts. If the response carries a Retry-After header, honor it exactly: wait the stated time rather than the computed backoff. Set a User-Agent that names your crawler and includes a URL or email address where the site owner can reach you. Rotate IPs across crawls, not within a session.
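The delay and backoff rules above can be sketched in a few lines of Python. Everything named here (`effective_delay`, `backoff_delay`, `fetch_politely`, the `send` callback) is hypothetical and exists only for illustration. The sketch assumes `send(url)` returns a `(status, retry_after_seconds)` pair with Retry-After already parsed into seconds, and it reads "stop after 5 attempts" as one initial request plus at most five retries, so that all five delays are used. The HTTP client and the per-host concurrency cap are left out.

```python
import time
from typing import Callable, Optional, Tuple

MAX_RETRIES = 5            # assumption: five retries after the first request
BASE_BACKOFF_S = 1.0       # yields 1s, 2s, 4s, 8s, 16s
RETRY_STATUSES = {429, 503}

# (HTTP status, Retry-After in seconds or None if the header is absent)
Response = Tuple[int, Optional[float]]


def effective_delay(default_s: float, crawl_delay_s: Optional[float]) -> float:
    """Per-request delay, using Crawl-delay as a floor when one is set."""
    if crawl_delay_s is None:
        return default_s
    return max(default_s, crawl_delay_s)


def backoff_delay(attempt: int, retry_after_s: Optional[float] = None) -> float:
    """Wait before retry number `attempt` (0-based); Retry-After wins if present."""
    if retry_after_s is not None:
        return retry_after_s
    return BASE_BACKOFF_S * (2 ** attempt)


def fetch_politely(
    url: str,
    send: Callable[[str], Response],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return the final status, or None if the host kept answering 429/503."""
    status, retry_after_s = send(url)
    for attempt in range(MAX_RETRIES):
        if status not in RETRY_STATUSES:
            return status
        sleep(backoff_delay(attempt, retry_after_s))
        status, retry_after_s = send(url)
    return status if status not in RETRY_STATUSES else None
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. In a real crawler, `send` would wrap whatever HTTP client is in use; note that Retry-After can also arrive as an HTTP date, which this sketch does not parse.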
Why identification matters
A User-Agent like "MyCrawler/1.0 (+https://example.com/crawler)" signals good faith. Site owners who see it in their logs and have a concern can contact you instead of just blocking. An anonymous crawler with a forged browser UA looks like an attack and is treated like one. The reputational cost of being honest is zero; the reputational benefit when something goes wrong is significant.
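As a small illustration, the User-Agent format shown above can be built and attached to a request with Python's standard library. The helper names `build_user_agent` and `build_request` are hypothetical, and the crawler name and contact URL are the placeholders from the text; nothing is sent over the network here.

```python
import urllib.request


def build_user_agent(name: str, version: str, contact: str) -> str:
    """Format 'Name/Version (+contact)', where contact is a URL or email."""
    return f"{name}/{version} (+{contact})"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Prepare a request that identifies the crawler; it is not sent here."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


ua = build_user_agent("MyCrawler", "1.0", "https://example.com/crawler")
req = build_request("https://example.com/page", ua)

print(ua)  # MyCrawler/1.0 (+https://example.com/crawler)
```

Ideally the contact URL leads to a page that explains what the crawler does and how to ask it to stop.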
