Why polite wins
Sites notice aggressive crawlers within minutes: too many simultaneous connections, ignored robots.txt rules, repeated 5xx server errors hammered through without a pause. The response is a block at the IP level, the ASN level (the autonomous system, meaning the operator network whose address ranges contain that IP), or the fingerprint level. A polite crawler stays quiet enough to slip under that radar and can run for hours or days without being stopped. The math favors politeness: 10 requests per second sustained over an hour gets you 36,000 pages; 100 requests per second that gets blocked after five minutes gets you 30,000, and you have burned the IP.
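The arithmetic is easy to check. The snippet below is only an illustration: the function name pages_fetched is made up for this example, and it assumes a constant request rate right up to the moment the crawl stops.

```python
def pages_fetched(requests_per_second: float, seconds_until_stopped: float) -> int:
    """Pages retrieved at a constant rate before the crawl ends or is blocked."""
    return int(requests_per_second * seconds_until_stopped)


# Polite: 10 requests/second sustained for a full hour.
polite = pages_fetched(10, 60 * 60)

# Aggressive: 100 requests/second, blocked after five minutes.
aggressive = pages_fetched(100, 5 * 60)

print(f"polite: {polite} pages, aggressive: {aggressive} pages")
```

The polite run keeps going after the first hour, while the aggressive run ends at the block, so the gap only widens with time.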
The practical recipe
Per host: keep no more than 1 to 5 connections open at once. Insert a gap of 100 to 1000 ms between requests. Treat the Crawl-delay value in robots.txt as a minimum wait. When you hit a 429 ("too many requests") or 503 ("service unavailable"), back off exponentially: wait 1s, then 2s, 4s, 8s, 16s, and give up if the fifth retry also fails. If the server sends a Retry-After header telling you exactly how long to wait, honor it. Set a User-Agent (the line every request sends to identify itself) that names your crawler and includes a URL or email so the site owner can reach you. Rotate IPs between crawls, not in the middle of a single session.
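The timing rules in the recipe can be sketched as follows. This is a rough illustration under stated assumptions, not a complete crawler: the names request_gap, backoff_delay, and fetch_politely are hypothetical, the 0.5 s default gap is just one value inside the 100 to 1000 ms range, and fetch stands in for whatever HTTP client you use (assumed here to return a status code and a plain dict of headers). The sketch reads the backoff rule as five retries with waits of 1, 2, 4, 8, and 16 seconds, and it only handles Retry-After given as a number of seconds.

```python
import time
from typing import Callable, Dict, Optional, Tuple

RETRY_STATUSES = {429, 503}   # statuses that call for backing off
MAX_RETRIES = 5               # waits of 1s, 2s, 4s, 8s, 16s, then give up
BASE_DELAY_S = 1.0

Response = Tuple[int, Dict[str, str]]  # (status code, headers); assumed shape


def request_gap(crawl_delay: Optional[float], default_gap: float = 0.5) -> float:
    """Gap between requests: Crawl-delay is a floor, never a ceiling."""
    return max(default_gap, crawl_delay or 0.0)


def backoff_delay(retry: int, retry_after: Optional[str] = None) -> float:
    """Seconds to wait before retry number `retry` (1-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # the server told us how long
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch; fall through
    return BASE_DELAY_S * 2 ** (retry - 1)


def fetch_politely(
    fetch: Callable[[str], Response],
    url: str,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Fetch `url`, backing off on 429/503. Returns the last status seen."""
    status, headers = fetch(url)
    for retry in range(1, MAX_RETRIES + 1):
        if status not in RETRY_STATUSES:
            return status
        sleep(backoff_delay(retry, headers.get("Retry-After")))
        status, headers = fetch(url)
    return status  # may still be 429/503: the caller should give up on this URL
```

Passing sleep as a parameter keeps the function testable without real waiting. In a real crawler the value for crawl_delay could come from the standard library's urllib.robotparser.RobotFileParser.crawl_delay, and the connection cap would be enforced separately, for example with a per-host semaphore.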
Why identification matters
A User-Agent like "MyCrawler/1.0 (+https://example.com/crawler)" signals good faith. A site owner who spots it in their logs and has a concern can reach out to you instead of simply blocking. An anonymous crawler wearing a faked browser User-Agent looks like an attack and gets treated like one. Being honest costs you nothing; the goodwill it buys when something goes wrong is significant.
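Setting such a header takes one line in most HTTP clients. The sketch below uses Python's standard urllib.request; the helper name build_request is hypothetical, and the User-Agent string is the example from the text, not a real crawler.

```python
import urllib.request

# Example identity from the text: crawler name, version, and a contact URL.
CRAWLER_UA = "MyCrawler/1.0 (+https://example.com/crawler)"


def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the crawler honestly."""
    return urllib.request.Request(url, headers={"User-Agent": CRAWLER_UA})


req = build_request("https://example.com/page")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

No network call is made here; the request object would be passed to urllib.request.urlopen when the crawler actually fetches the page.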
