Crawling

What Is Polite Crawling?

What Is Polite Crawling? — conceptual illustration
On this page

Polite crawling is the practice of fetching a site at a pace and pattern that does not burden its infrastructure — respecting robots.txt, limiting concurrency per host, honoring rate limits, identifying the crawler honestly, and backing off on errors. It is partly an ethical norm, partly a practical strategy: polite crawlers get blocked less, get more cooperation when they reach out, and avoid the legal and reputational risk of being seen as abusive.

Quick facts

Concurrency per host1-5 connections; not 100
Request rate1-10 requests/second/host, lower for small sites
IdentificationReal User-Agent + contact URL or email
Backoff429 and 503 → exponential backoff, not retry hammer
Honorrobots.txt, Crawl-delay, Retry-After

Why polite wins

Sites notice aggressive crawlers within minutes — high concurrency, ignoring robots.txt, sustained 5xx-without-backoff. The response is a block at the IP, ASN, or fingerprint level. A polite crawler operates below the noise threshold and runs for hours or days without intervention. The math favors politeness: 10 requests/second sustained over an hour is 36,000 pages; 100 requests/second blocked after five minutes is 30,000 — and you have burned the IP.

The practical recipe

Per host: cap concurrency at 1-5 connections. Insert 100-1000ms delay between requests. Respect Crawl-delay in robots.txt as a minimum. On 429 or 503, exponential backoff (1s, 2s, 4s, 8s, 16s) and stop after 5 attempts. Honor Retry-After headers exactly. Set a User-Agent that names your crawler and includes a URL or email for the site owner to reach you. Rotate IPs across crawls, not within a session.

Why identification matters

A User-Agent like "MyCrawler/1.0 (+https://example.com/crawler)" signals good faith. Site owners who see it in logs and have a concern reach out instead of just blocking. An anonymous crawler with a forged browser UA looks like an attack and is treated like one. The reputational cost of being honest is zero; the reputational benefit when something goes wrong is significant.

Code example

python
import time, requests

class PoliteCrawler:
    def __init__(self, user_agent='MyCrawler/1.0 (+https://example.com)'):
        self.ua = user_agent
        self.last_request = {}
    def fetch(self, url, host, delay=1.0):
        gap = time.time() - self.last_request.get(host, 0)
        if gap < delay: time.sleep(delay - gap)
        self.last_request[host] = time.time()
        return requests.get(url, headers={'User-Agent': self.ua}, timeout=30)

Related terms

Concept map

How Polite Crawling connects

The terms most directly tied to this one. Hover a node to see its neighbours, click to preview, drag to rearrange.

0 terms · 0 connections
You are here · Crawling
Building map…

Tools & solutions for this topic

Frequently asked questions

How slow is polite enough?

1-2 requests per second per host is safe for almost any site. Large content sites can handle 10/sec without blinking; small ones may not. When in doubt, slower is better.

Should I use a real browser UA for polite crawling?

No — that is dishonest. Use a UA that identifies your crawler. Real-browser UAs are for scraping past anti-bot defenses, which is a different problem.

Is polite crawling slower in total?

Per-host, yes. But polite crawlers run longer without being blocked, so total throughput often comes out higher — and you keep the IP usable for the next run.

Last updated: 2026-05-26