What Is the robots.txt Protocol?

By the Scrappey Research Team

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch. Think of it as a "please don't go in here" note posted at a site's front door: it states the owner's wishes but locks nothing. The format was standardized as the Robots Exclusion Protocol in RFC 9309, but it remains a voluntary convention: there is no enforcement, no authentication (no login check), and no penalty in the protocol itself for ignoring it. Reputable crawlers (Googlebot, Bingbot, Common Crawl's CCBot) and well-behaved custom crawlers respect it anyway. Ignoring it when crawling public sites is a fast way to get IP-blocked and, in some jurisdictions, can expose you to legal claims.

Quick facts

Location: /robots.txt at the root of every domain
Format: Plain text, User-agent + Allow/Disallow rules
Enforcement: Voluntary (convention only, no protocol enforcement)
Common directives: Disallow, Allow, Crawl-delay, Sitemap
Does NOT cover: Authentication, content restrictions, rate limiting

The basic format

A robots.txt file is a list of User-agent blocks. A user-agent is the name a crawler sends to identify itself, so each block names which crawlers its rules apply to (or * for all of them). Inside a block, Disallow lists path prefixes to skip and Allow lists exceptions that are okay to fetch; when both match a URL, RFC 9309 gives precedence to the most specific (longest) rule. Sitemap directives sit outside any block and point to sitemap XML files (machine-readable lists of a site's URLs). Crawl-delay requests a minimum gap, in seconds, between requests so you don't hammer the server; it is a non-standard extension that some crawlers honor and others (Googlebot, for example) ignore. Newer proposals such as ai.txt, a separate file modeled on the same convention, aim to let sites opt out of AI training.
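To make the format concrete, here is a minimal sketch that parses a small robots.txt body in memory with Python's urllib.robotparser. The file contents, the example.com URLs, and the crawler names MyCrawler and BadBot are made up for illustration. One caveat: Python's parser applies rules in file order (the first matching rule wins) rather than by longest match, so this example lists the Allow exception before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in memory, no network request


def check(agent, path):
    """Return True if `agent` may fetch `path` on the example host."""
    return rp.can_fetch(agent, "https://example.com" + path)


results = {
    "press_page": check("MyCrawler", "/private/press/kit.html"),  # Allow exception
    "private_page": check("MyCrawler", "/private/data.html"),     # Disallow prefix
    "home_page": check("MyCrawler", "/index.html"),               # no rule matches
    "badbot_home": check("BadBot", "/index.html"),                # BadBot block
}
delay = rp.crawl_delay("MyCrawler")  # value from the * block
sitemaps = rp.site_maps()            # requires Python 3.8+

for name, allowed in results.items():
    print(name, "allowed" if allowed else "disallowed")
```

A crawler that is not named in any block falls back to the * block, which is why MyCrawler picks up both the path rules and the Crawl-delay value.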

What it does NOT do

robots.txt is not authentication, not access control, and not rate limiting. A disallowed path is still publicly reachable: anyone can type the URL and fetch it directly. If you want to actually prevent access to content, put it behind a login (authentication). If you only want to keep a page out of search results, use X-Robots-Tag headers or <meta name="robots"> tags on the page itself. A crawler has to fetch the page to see those signals, so a page you want de-indexed should not also be disallowed in robots.txt. robots.txt only tells well-behaved crawlers what to skip; it changes behavior, not permissions.
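Seen from the crawler's side, the indexing signals mentioned above live in the response, not in robots.txt. The sketch below collects them from a response's headers and HTML using only the standard library. The function name indexing_directives and the sample inputs are hypothetical, and the parsing is simplified: it ignores the optional per-crawler prefix that X-Robots-Tag values can carry (for example "googlebot: noindex") and crawler-specific meta names.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def indexing_directives(headers, html):
    """Return the set of robots directives found in headers and HTML."""
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found.update(t.strip().lower() for t in value.split(",") if t.strip())
    parser = _RobotsMetaParser()
    parser.feed(html)
    return found | parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(indexing_directives({"X-Robots-Tag": "noarchive"}, page)))
```

An indexing pipeline could then skip storing any page whose directive set contains noindex.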

How custom crawlers should handle it

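The basic etiquette: fetch /robots.txt once per host at the start of a crawl, parse it (Python's urllib.robotparser or a third-party library does this for you), and check every candidate URL against the rules before fetching it. Cache the parsed rules so you don't re-download them on every request; for long-running crawls, refresh the cached copy roughly once a day. Treat Crawl-delay as a minimum gap between requests to that host. If the file is missing (a 404 or other 4xx response), default to "allow all"; that is the agreed-on convention. If it can't be fetched because of a server error (5xx) or a network failure, RFC 9309 says to treat the whole site as disallowed until the file is reachable again, so back off and retry later.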
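The etiquette above can be sketched as a small per-host cache. This is one possible shape, not a reference implementation: the RobotsCache class and its method names are hypothetical, and fetch_text is an assumed callable (supplied by the caller) that takes a robots.txt URL and returns a (status_code, body_text) pair. The sketch treats a 4xx response as "no rules", and it takes the conservative route for server errors, since RFC 9309 asks crawlers to assume a full disallow when the file is unreachable.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keeps one parsed robots.txt per origin for the life of a crawl."""

    def __init__(self, user_agent, fetch_text):
        # fetch_text is a hypothetical callable supplied by the caller:
        # it takes a robots.txt URL and returns (status_code, body_text).
        self.user_agent = user_agent
        self.fetch_text = fetch_text
        self._parsers = {}     # origin -> RobotFileParser
        self._last_fetch = {}  # origin -> time of the last page request

    def _load(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self._parsers:
            status, body = self.fetch_text(origin + "/robots.txt")
            rp = RobotFileParser()
            if status == 200:
                rp.parse(body.splitlines())
            elif 400 <= status < 500:
                rp.parse([])  # file missing: no rules, so everything is allowed
            else:
                # Server error: stay out until the file is reachable again.
                rp.parse(["User-agent: *", "Disallow: /"])
            self._parsers[origin] = rp
        return origin, self._parsers[origin]

    def allowed(self, url):
        """Check a candidate URL against the cached rules for its host."""
        _, rp = self._load(url)
        return rp.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url, now=None):
        """How long to sleep before the next request to this URL's host."""
        origin, rp = self._load(url)
        delay = rp.crawl_delay(self.user_agent) or 0
        last = self._last_fetch.get(origin)
        now = time.monotonic() if now is None else now
        return 0.0 if last is None else max(0.0, last + delay - now)

    def mark_fetched(self, url, now=None):
        """Record that a page on this URL's host was just requested."""
        origin, _ = self._load(url)
        self._last_fetch[origin] = time.monotonic() if now is None else now
```

In a crawl loop you would call allowed(url) first, sleep for seconds_to_wait(url), fetch the page, then call mark_fetched(url). Passing the fetcher in keeps the class independent of any particular HTTP client and easy to test.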

Code example

python
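from urllib.robotparser import RobotFileParser

USER_AGENT = 'MyCrawler/1.0'  # example crawler name

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # downloads and parses the file (network call)

# Check each candidate URL before fetching it.
if rp.can_fetch(USER_AGENT, 'https://example.com/private/'):
    print('allowed')
else:
    print('disallowed, skip this URL')

# crawl_delay() returns None when no Crawl-delay applies to this agent.
delay = rp.crawl_delay(USER_AGENT) or 0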

Frequently asked questions

Is robots.txt legally binding?

In most jurisdictions, no — but ignoring it has been argued in court as evidence of bad faith. Reputable scraping projects respect it for both ethical and risk-management reasons.

What if robots.txt blocks everything I want?

Reach out to the site owner. Most are open to negotiated access — a license, an API key, or a polite scrape schedule — for legitimate use cases.

Does robots.txt block AI training?

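Standard robots.txt has no dedicated AI-training directive, but it can still block AI crawlers that identify themselves: several AI companies publish user-agent tokens (OpenAI's GPTBot and Google's Google-Extended, for example) that a site can name in a Disallow block. Beyond that, the ai.txt convention and noai directives (a signal that says "don't use this for AI") are emerging for the same purpose, though neither is standardized and support varies. Respect all of them if your pipeline feeds an LLM.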

Last updated: 2026-05-31