What Is the robots.txt Protocol?

By the Scrappey Research Team

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch. Think of it as a "please don't go in here" note posted at a site's front door: it states the owner's wishes but locks nothing. The format is standardized as the Robots Exclusion Protocol (RFC 9309), but compliance is voluntary: there is no enforcement, no authentication (no login check), and no penalty in the protocol itself for ignoring it. Reputable crawlers (Googlebot, Bingbot, Common Crawl's CCBot) and well-behaved custom crawlers respect it anyway. Ignoring it when crawling public sites is a fast way to get IP-blocked and, in some jurisdictions, can expose you to legal claims.

Location	/robots.txt at the root of every domain
Format	Plain text, User-agent + Allow/Disallow rules
Enforcement	Voluntary — convention only, no protocol enforcement
Common directives	Disallow, Allow, Crawl-delay, Sitemap
Does NOT cover	Authentication, content restrictions, rate limiting

The basic format

A robots.txt file is a list of groups, each opened by one or more User-agent lines. A user-agent is the name a crawler sends to identify itself, so each group names which crawlers its rules apply to (or * for all of them). Inside a group, Disallow lists path prefixes to skip and Allow lists exceptions that are okay to fetch. Sitemap directives stand outside any group and point to sitemap XML files (machine-readable lists of a site's URLs). Crawl-delay is a non-standard directive written inside a group that requests a minimum gap between requests so you don't hammer the server; some crawlers honor it and others, including Googlebot, ignore it. Newer proposals such as ai.txt borrow the same convention to let sites opt out of AI training, though adoption is still limited.
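To make the format concrete, here is a minimal sketch that parses a small robots.txt with Python's standard-library urllib.robotparser and queries it. The file contents, the crawler names (MyCrawler, BadBot) and the example.com URLs are invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
# Allow is listed before Disallow because urllib.robotparser applies the
# first matching rule; parsers that follow RFC 9309 pick the longest match,
# so this ordering gives the same answer under both behaviours.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit/
Disallow: /private/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if `agent` may fetch `path` on the example host."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("MyCrawler", "/index.html"))                   # no rule matches
print(allowed("MyCrawler", "/private/data"))                 # hits Disallow
print(allowed("MyCrawler", "/private/press-kit/logo.png"))   # hits Allow
print(allowed("BadBot", "/index.html"))                      # BadBot's own group
print(parser.crawl_delay("MyCrawler"))                       # from the * group
print(parser.site_maps())                                    # Python 3.8+
```

Rule order matters with this parser: swapping the Allow and Disallow lines would make urllib.robotparser reject the press-kit path, even though a longest-match parser would still allow it.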

What it does NOT do

robots.txt is not authentication, not access control, and not rate limiting. A disallowed path is still publicly reachable: anyone can type the URL and fetch it directly. If you want to actually prevent access to content, put it behind a login (authentication). If you only want to keep a page out of search results, use X-Robots-Tag headers or <meta name="robots"> tags on the page itself, and do not also disallow that page in robots.txt, because a crawler has to fetch the page to see the noindex signal. robots.txt only tells well-behaved crawlers what to skip. It changes behavior, not permissions.
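Seen from the crawler's side, these page-level signals arrive with the response rather than in robots.txt. The sketch below checks a response for a noindex directive in either place. The function name is_noindex and the plain-dict headers argument are assumptions made for this example, and the check is deliberately simplified: it ignores crawler-specific prefixes in X-Robots-Tag (such as "googlebot: noindex") and crawler-specific meta names.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def is_noindex(headers, html):
    """Return True if the header or the page markup carries `noindex`.

    `headers` is assumed to be a plain dict keyed by the exact header
    name "X-Robots-Tag"; real HTTP clients usually offer a
    case-insensitive mapping instead.
    """
    header_value = headers.get("X-Robots-Tag", "")
    directives = [d.strip().lower() for d in header_value.split(",")]

    meta = MetaRobotsParser()
    meta.feed(html)
    directives += meta.directives

    return "noindex" in directives
```

A crawler that builds a search index would call a check like this after fetching a page and before storing it.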

How custom crawlers should handle it

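The basic etiquette: fetch /robots.txt once per host at the start of a crawl, parse it (Python's urllib.robotparser or a third-party library does this for you), and check every candidate URL against the rules before fetching it. Cache the parsed rules so you don't re-download them for every URL; RFC 9309 recommends refreshing a cached copy after at most 24 hours on long crawls. Treat Crawl-delay as a minimum gap between requests to that host. How to handle a failed fetch depends on the failure: if the file does not exist (a 4xx response such as 404), RFC 9309 lets you treat the site as "allow all", but if the server errors out (5xx) or cannot be reached, treat the site as "disallow all" until you can fetch the file again.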

Code example

python

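from urllib.robotparser import RobotFileParser

USER_AGENT = 'MyCrawler/1.0'  # the name this crawler identifies itself with

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # downloads and parses the file; raises URLError if the host is unreachable

# Check each candidate URL before requesting it.
if rp.can_fetch(USER_AGENT, 'https://example.com/private/'):
    pass  # allowed: fetch the page here

# Crawl-delay in seconds for this user agent, or None if the file sets none.
delay = rp.crawl_delay(USER_AGENT)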
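
The etiquette described above (one fetch per host, cached rules, Crawl-delay as a minimum gap) can be sketched as a small helper class. This is an illustrative outline rather than a production crawler: the class name RobotsCache, its method names, and the injected fetch_lines callable are all hypothetical. Passing the downloader, clock, and sleep function in as arguments keeps the sketch free of network access and easy to test; a real fetch_lines would download the file and apply the error handling appropriate to your crawler.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches one parsed robots.txt per host and spaces out requests."""

    def __init__(self, user_agent, fetch_lines,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.fetch_lines = fetch_lines  # hypothetical: robots_url -> list of lines
        self.clock = clock
        self.sleep = sleep
        self._parsers = {}       # host -> RobotFileParser
        self._last_request = {}  # host -> clock value of the last request

    def _parser_for(self, url):
        parts = urlsplit(url)
        host = parts.netloc
        if host not in self._parsers:
            # First URL seen for this host: fetch and parse robots.txt once.
            robots_url = f"{parts.scheme}://{host}/robots.txt"
            parser = RobotFileParser(robots_url)
            parser.parse(self.fetch_lines(robots_url))
            self._parsers[host] = parser
        return host, self._parsers[host]

    def allowed(self, url):
        """Return True if the cached rules permit fetching `url`."""
        _, parser = self._parser_for(url)
        return parser.can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        """Sleep until the host's Crawl-delay has elapsed, then record the request."""
        host, parser = self._parser_for(url)
        delay = parser.crawl_delay(self.user_agent) or 0
        last = self._last_request.get(host)
        if last is not None:
            remaining = delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self._last_request[host] = self.clock()
```

A crawl loop would call allowed(url) first, skip the URL if it returns False, and otherwise call wait_turn(url) immediately before sending the request.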

Related terms

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed …

What Is Polite Crawling?

Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits. In practice that means obeying ro…

What Is a Sitemap?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, chan…

What Is Crawl Budget?

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time…

What Is Throttling?

Throttling means deliberately slowing down how fast requests are sent or handled. A website throttles incoming traffic so it doesn't get ove…

Frequently asked questions

Is robots.txt legally binding?

In most jurisdictions, no — but ignoring it has been argued in court as evidence of bad faith. Reputable scraping projects respect it for both ethical and risk-management reasons.

What if robots.txt blocks everything I want?

Reach out to the site owner. Most are open to negotiated access — a license, an API key, or a polite scrape schedule — for legitimate use cases.

Does robots.txt block AI training?

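Not by purpose: standard robots.txt has no directive that means "don't use this for AI training". What a site can do is disallow the user-agent tokens that AI crawlers announce, such as GPTBot, CCBot, or Google-Extended. Beyond that, the ai.txt proposal and noai directives (a page-level signal that says "don't use this for AI") are emerging for this purpose, but they are not standardized and support varies. If your pipeline feeds an LLM, respect all of these signals where you find them.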
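
One approach sites use in practice is to list AI crawler tokens in their own robots.txt group. The sketch below shows what that looks like and how a parser reads it. The file contents and the MyCrawler name are invented for illustration; GPTBot and CCBot are the tokens published by OpenAI and Common Crawl, and the check uses Python's standard-library urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI crawler tokens share one group that
# blocks the whole site, while all other crawlers only lose /admin/.
AI_OPT_OUT_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rules = RobotFileParser()
rules.parse(AI_OPT_OUT_ROBOTS_TXT.splitlines())

article = "https://example.com/articles/robots-txt"

for agent in ("GPTBot", "CCBot/2.0", "MyCrawler/1.0"):
    verdict = "allowed" if rules.can_fetch(agent, article) else "blocked"
    print(f"{agent}: {verdict}")
```

If your crawler collects data for model training, checking the rules under the token you actually announce is the step that makes an opt-out like this effective.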

Last updated: 2026-05-31