Crawling

What Is Polite Crawling?

What Is Polite Crawling? — conceptual illustration
On this page

Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits. In practice that means obeying robots.txt (the file where a site lists which pages bots may touch), opening only a few connections per site at a time, honoring any rate limits, saying honestly who your crawler is, and slowing down when the server returns errors. It is partly good manners and partly self-interest: polite crawlers get blocked less, get more cooperation when they reach out, and avoid the legal and reputational risk of being seen as abusive.

Quick facts

Concurrency per host1-5 connections; not 100
Request rate1-10 requests/second/host, lower for small sites
IdentificationReal User-Agent + contact URL or email
Backoff429 and 503 → exponential backoff, not retry hammer
Honorrobots.txt, Crawl-delay, Retry-After

Why polite wins

Sites notice aggressive crawlers within minutes — too many simultaneous connections, ignoring robots.txt, hammering on through repeated 5xx server errors without pausing. The response is a block at the IP, the ASN (the network block an IP belongs to), or the fingerprint level. A polite crawler stays quiet enough to slip under that radar and can run for hours or days without being stopped. The math favors politeness: 10 requests per second sustained over an hour gets you 36,000 pages; 100 requests per second that gets blocked after five minutes gets you 30,000 — and you have burned the IP.

The practical recipe

Per host: keep no more than 1-5 connections open at once. Insert a 100-1000ms gap between requests. Treat the Crawl-delay value in robots.txt as a minimum wait. When you hit a 429 ("too many requests") or 503 ("service unavailable"), back off exponentially — wait 1s, then 2s, 4s, 8s, 16s — and give up after 5 attempts. If the server sends a Retry-After header telling you exactly how long to wait, honor it. Set a User-Agent (the line every request sends to identify itself) that names your crawler and includes a URL or email so the site owner can reach you. Rotate IPs between crawls, not in the middle of a single session.

Why identification matters

A User-Agent like "MyCrawler/1.0 (+https://example.com/crawler)" signals good faith. A site owner who spots it in their logs and has a concern can reach out to you instead of simply blocking. An anonymous crawler wearing a faked browser User-Agent looks like an attack and gets treated like one. Being honest costs you nothing; the goodwill it buys when something goes wrong is significant.

Code example

python
import time, requests

class PoliteCrawler:
    def __init__(self, user_agent='MyCrawler/1.0 (+https://example.com)'):
        self.ua = user_agent
        self.last_request = {}
    def fetch(self, url, host, delay=1.0):
        gap = time.time() - self.last_request.get(host, 0)
        if gap < delay: time.sleep(delay - gap)
        self.last_request[host] = time.time()
        return requests.get(url, headers={'User-Agent': self.ua}, timeout=30)

Related terms

Concept map

How Polite Crawling connects

The terms most directly tied to this one. Hover a node to see its neighbours, click to preview, drag to rearrange.

0 terms · 0 connections
You are here · Crawling
Building map…

Tools & solutions for this topic

Frequently asked questions

How slow is polite enough?

1-2 requests per second per host is safe for almost any site. Large content sites can absorb 10 per second without blinking; small ones may not. When in doubt, slower is better.

Should I use a real browser UA for polite crawling?

No — that is dishonest. Use a User-Agent that clearly identifies your crawler. Real-browser User-Agents are for scraping past anti-bot defenses, which is a different problem entirely.

Is polite crawling slower in total?

Per host, yes. But polite crawlers keep running without getting blocked, so total throughput often ends up higher — and you keep the IP usable for the next run.

Last updated: 2026-05-31