The basic format
A robots.txt file is a plain-text list of User-agent blocks. A user-agent is the name a crawler sends to identify itself, so each block names which crawlers its rules apply to (or * for any crawler that has no more specific block). Inside a block, Disallow lists path prefixes to skip and Allow lists exceptions that are okay to fetch. Sitemap directives sit outside any block and point to sitemap XML files (machine-readable lists of a site's URLs). Crawl-delay is a nonstandard extension that some crawlers honor and others ignore; where honored, it requests a minimum number of seconds between requests so the crawler doesn't hammer the server. Newer proposals such as ai.txt apply the same convention in a separate file to let sites signal that they opt out of AI training; they are not part of robots.txt itself, and support varies.
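As a rough illustration of the format, the sketch below feeds a small made-up robots.txt to Python's standard urllib.robotparser and asks it a few questions. The file contents, the example.com URLs, and the crawler names SomeBot and ExampleBot are all hypothetical. One caveat: urllib.robotparser applies the first matching rule in file order, whereas RFC 9309 specifies that the longest matching path wins, so the sample lists Allow before Disallow to get the same answer either way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# SomeBot has no block of its own, so the "*" block applies to it.
blocked = parser.can_fetch("SomeBot", "https://example.com/private/report.html")
exception = parser.can_fetch("SomeBot", "https://example.com/private/press/kit.html")
unlisted = parser.can_fetch("SomeBot", "https://example.com/index.html")

# ExampleBot has its own block, which disallows everything.
example_bot = parser.can_fetch("ExampleBot/1.0", "https://example.com/index.html")

delay = parser.crawl_delay("SomeBot")  # seconds, or None if not specified
sitemaps = parser.site_maps()          # list of URLs, or None (Python 3.8+)

print(blocked, exception, unlisted, example_bot)  # False True True False
print(delay, sitemaps)  # 5 ['https://example.com/sitemap.xml']
```

Paths that match no rule are allowed, which is why /index.html is fetchable for SomeBot even though no Allow line mentions it.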
What it does NOT do
robots.txt is not authentication, not access control, and not rate limiting. A disallowed path is still publicly reachable: anyone can type the URL and fetch it directly. If you want to actually prevent access to content, put it behind a login (authentication). If you only want to keep a page out of search results, send a noindex directive in an X-Robots-Tag response header or a <meta name="robots"> tag on the page itself, and leave the page crawlable, because a crawler that robots.txt blocks from the page never sees the directive. robots.txt only tells well-behaved crawlers what to skip; it changes behavior, not permissions.
How custom crawlers should handle it
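The basic etiquette: fetch /robots.txt once per origin (scheme, host, and port) before crawling it, parse it (Python's urllib.robotparser or a third-party library does this for you), and check every candidate URL against the rules before fetching it. Cache the parsed rules so you don't re-download them for every URL, but refresh them on a long crawl; RFC 9309 recommends not relying on a cached copy for more than 24 hours. Treat Crawl-delay as a minimum gap between requests to that host. If the file does not exist (a 404 or other 4xx response), the convention is to default to "allow all". If it can't be fetched because of a server error (5xx) or a network failure, do not assume permission: RFC 9309 says to treat the site as fully disallowed until the file can be retrieved.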
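The etiquette above can be sketched as a small per-origin cache. This is a minimal sketch under stated assumptions, not a complete crawler: RobotsCache, rules_from_response, stub_fetch, the ExampleCrawler user-agent, and the one-second default delay are all hypothetical names and values chosen for illustration. The HTTP fetch is passed in as a function returning (status_code, body_text) so the logic can run without a network. The status handling follows the RFC 9309 reading (any 4xx means allow all; 5xx or a network error means disallow all), which differs from RobotFileParser.read(), which treats 401 and 403 as disallow all.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name
MAX_AGE = 24 * 60 * 60         # refresh cached rules after 24 hours


def rules_from_response(status, body):
    """Turn an HTTP status code and body text into a rule set."""
    parser = RobotFileParser()
    if 200 <= status < 300:
        parser.parse(body.splitlines())
    elif 400 <= status < 500:
        parser.allow_all = True      # file does not exist: no restrictions
        parser.modified()
    else:
        parser.disallow_all = True   # server error: assume nothing is allowed
        parser.modified()
    return parser


class RobotsCache:
    """Caches parsed robots.txt rules per origin (scheme + host + port)."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable: url -> (status_code, body_text)
        self._clock = clock
        self._entries = {}     # origin -> (parser, time fetched)

    def _rules_for(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        now = self._clock()
        cached = self._entries.get(origin)
        if cached and now - cached[1] < MAX_AGE:
            return cached[0]
        try:
            status, body = self._fetch(origin + "/robots.txt")
        except OSError:
            # Network failure: treat like a server error. A real crawler
            # would likely retry sooner than MAX_AGE in this case.
            status, body = 503, ""
        parser = rules_from_response(status, body)
        self._entries[origin] = (parser, now)
        return parser

    def allowed(self, url):
        """True if USER_AGENT may fetch this URL."""
        return self._rules_for(url).can_fetch(USER_AGENT, url)

    def delay_for(self, url, default=1.0):
        """Minimum seconds between requests to this URL's origin."""
        delay = self._rules_for(url).crawl_delay(USER_AGENT)
        return delay if delay is not None else default


def stub_fetch(url):
    """Hypothetical stand-in for an HTTP client."""
    if url == "https://example.com/robots.txt":
        return 200, "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
    return 404, ""


cache = RobotsCache(stub_fetch)
print(cache.allowed("https://example.com/private/x"))     # False
print(cache.allowed("https://example.com/index.html"))    # True
print(cache.delay_for("https://example.com/index.html"))  # 2
```

In a real crawler, the fetch function would wrap whatever HTTP client is in use, and the crawl loop would call allowed() before every request and sleep for at least delay_for() seconds between requests to the same origin.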
