Crawling

What Is the robots.txt Protocol?

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch. It is a voluntary convention, standardized as the Robots Exclusion Protocol in RFC 9309: the protocol itself has no enforcement, no authentication, and no penalty for ignoring it. Reputable crawlers (Googlebot, Bingbot, Common Crawl) and well-behaved custom crawlers respect it. Ignoring it for public-facing crawling is a fast way to get IP-blocked and, in some jurisdictions, sued.

Quick facts

Location: /robots.txt at the root of every domain
Format: Plain text, User-agent + Allow/Disallow rules
Enforcement: Voluntary (convention only, no protocol enforcement)
Common directives: Disallow, Allow, Crawl-delay, Sitemap
Does NOT cover: Authentication, content restrictions, rate limiting

The basic format

A robots.txt file is a list of User-agent blocks. Each block names which crawlers it applies to (or * for all) and lists Disallow and Allow rules. Sitemap directives at the top level point to sitemap XML files. Crawl-delay (where honored) requests a minimum gap between requests. Modern additions like ai.txt extend the same convention for AI training opt-outs.
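
To make the format concrete, here is a minimal sketch that parses a small, made-up robots.txt from a string with Python's standard urllib.robotparser, so it runs without network access. The file contents, the ExampleBot and MyCrawler agent names, the parse_robots helper, and the example.com URLs are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string, with no network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_robots(ROBOTS_TXT)

# MyCrawler has no block of its own, so the "*" block applies to it.
allowed_public = rules.can_fetch("MyCrawler/1.0", "https://example.com/blog/")
allowed_private = rules.can_fetch("MyCrawler/1.0", "https://example.com/private/page")

# ExampleBot has its own block, which disallows the whole site.
allowed_examplebot = rules.can_fetch("ExampleBot", "https://example.com/blog/")

delay = rules.crawl_delay("MyCrawler/1.0")  # taken from the "*" block
sitemaps = rules.site_maps()                # requires Python 3.8+
```

A crawler with its own named block uses that block instead of the `*` block, which is why ExampleBot is blocked everywhere while MyCrawler is only kept out of /private/.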

What it does NOT do

robots.txt is not authentication, not access control, and not rate limiting. A disallowed path is still publicly reachable: anyone can fetch it directly. If you want to prevent access to content, use authentication. If you want to prevent indexing, use X-Robots-Tag headers or <meta name="robots"> tags on the page itself. A crawler has to fetch the page to see those signals, so a page that is disallowed in robots.txt cannot reliably communicate noindex this way. robots.txt only tells well-behaved crawlers what to skip.
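
Because indexing signals live on the response rather than in robots.txt, a crawler that builds an index can check them after fetching a page. The sketch below is one hedged way to do that with the standard library. The helper names (header_forbids_indexing, meta_forbids_indexing) are hypothetical, and it only looks for the noindex and none values, ignoring bot-specific prefixes and all other directives.

```python
from html.parser import HTMLParser

# "none" is shorthand for "noindex, nofollow".
BLOCKING = {"noindex", "none"}


def _directives(value: str) -> set[str]:
    """Split a comma-separated directive list into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def header_forbids_indexing(headers: dict[str, str]) -> bool:
    """Return True if an X-Robots-Tag response header asks not to be indexed."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and _directives(value) & BLOCKING:
            return True
    return False


class _RobotsMetaFinder(HTMLParser):
    """Records whether any <meta name="robots"> tag contains a blocking value."""

    def __init__(self) -> None:
        super().__init__()
        self.forbidden = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            if _directives(attr.get("content", "")) & BLOCKING:
                self.forbidden = True


def meta_forbids_indexing(html: str) -> bool:
    """Return True if the page's robots meta tag asks not to be indexed."""
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return finder.forbidden
```

For example, `header_forbids_indexing({"X-Robots-Tag": "noindex, nofollow"})` returns True, while a page whose meta tag says `index, follow` passes both checks.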

How custom crawlers should handle it

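Fetch /robots.txt once per host at the start of a crawl, parse it (Python's urllib.robotparser or a third-party library), and check every candidate URL against it before fetching. Cache the parsed rules for the duration of the crawl, refreshing them on long-running crawls (RFC 9309 advises against relying on a cached copy for more than 24 hours). Treat Crawl-delay as a minimum gap between requests to that host. If the file is missing (a 4xx response), the convention is to default to "allow all". If it is unreachable (a 5xx response or a network error), RFC 9309 says to assume everything is disallowed until the file can be fetched again.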

Code example

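```python
from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt for this host once, then reuse the parser.
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # performs the network request

# can_fetch() matches on the product token ("MyCrawler"), ignoring the version.
if rp.can_fetch('MyCrawler/1.0', 'https://example.com/private/'):
    pass  # allowed: safe to fetch this URL

# Crawl-delay in seconds for this agent, or None if the file does not set one.
delay = rp.crawl_delay('MyCrawler/1.0')
```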
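
Extending the example above to a multi-host crawl, one possible shape is a small per-host cache that parses each host's rules once and tracks when the next request to that host is allowed. This is a sketch under stated assumptions, not a definitive implementation: RobotsCache, fetch_text, and default_delay are hypothetical names, and the fetch function is injected so that handling of missing or unreachable files stays the caller's decision.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches parsed robots.txt rules per host and tracks a per-host request gap."""

    def __init__(self, user_agent: str, fetch_text: Callable[[str], str],
                 default_delay: float = 1.0) -> None:
        self.user_agent = user_agent
        self.fetch_text = fetch_text        # hypothetical: returns the robots.txt body for a URL
        self.default_delay = default_delay  # assumed fallback when no Crawl-delay is set
        self._parsers: dict[str, RobotFileParser] = {}
        self._next_ok: dict[str, float] = {}

    def _parser_for(self, url: str) -> tuple[str, RobotFileParser]:
        parts = urlsplit(url)
        host = f"{parts.scheme}://{parts.netloc}"
        if host not in self._parsers:
            # First URL seen for this host: fetch and parse its rules once.
            parser = RobotFileParser()
            parser.parse(self.fetch_text(f"{host}/robots.txt").splitlines())
            self._parsers[host] = parser
        return host, self._parsers[host]

    def allowed(self, url: str) -> bool:
        """Return True if the cached rules permit fetching this URL."""
        _, parser = self._parser_for(url)
        return parser.can_fetch(self.user_agent, url)

    def wait_time(self, url: str, now: Optional[float] = None) -> float:
        """Seconds to sleep before fetching url; also reserves the next slot."""
        host, parser = self._parser_for(url)
        now = time.monotonic() if now is None else now
        delay = parser.crawl_delay(self.user_agent)
        gap = float(delay) if delay is not None else self.default_delay
        ready_at = self._next_ok.get(host, now)
        wait = max(0.0, ready_at - now)
        self._next_ok[host] = max(ready_at, now) + gap
        return wait
```

In a crawl loop, the caller would skip any URL for which `allowed(url)` is False and otherwise call `time.sleep(cache.wait_time(url))` before the request. Passing `now` explicitly is only there to make the timing logic easy to test.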

Frequently asked questions

Is robots.txt legally binding?

In most jurisdictions, no — but ignoring it has been argued in court as evidence of bad faith. Reputable scraping projects respect it for both ethical and risk-management reasons.

What if robots.txt blocks everything I want?

Reach out to the site owner. Most are open to negotiated access — a license, an API key, or a polite scrape schedule — for legitimate use cases.

Does robots.txt block AI training?

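robots.txt has no directive specific to AI training, but a site can disallow individual AI crawlers by user-agent token (for example GPTBot or Google-Extended), and those rules apply like any other. The ai.txt convention and noai meta directives are emerging for the same purpose, though support for them is not yet uniform. Respect all of these if your pipeline feeds an LLM.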

Last updated: 2026-05-26