The basic format
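A robots.txt file is a plain-text file served at /robots.txt on a host, made up of User-agent groups. Each group names the crawlers it applies to (or * for all crawlers) and lists Disallow and Allow rules, which are matched against the start of the URL path. Sitemap directives sit outside any group and point to sitemap XML files. Crawl-delay requests a minimum gap between requests, but it is a nonstandard extension and some crawlers ignore it. AI training opt-outs are usually expressed by naming an AI crawler's user-agent token in an ordinary group; proposals such as ai.txt are separate files modeled on the same convention, not part of robots.txt itself.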
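As a rough illustration of the format, the sketch below parses a small made-up robots.txt with Python's urllib.robotparser and queries it. The file contents, the crawler name "examplebot", and the example.com URLs are all placeholders chosen for the example. One caveat: urllib.robotparser applies the first matching rule in file order, so the narrower Allow line is deliberately placed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "examplebot" and example.com are placeholders.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: examplebot
Allow: /private/press/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def check(agent, path):
    """Return True if `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)


# examplebot has its own group, so the "*" group does not apply to it.
print(check("examplebot", "/private/press/launch.html"))
print(check("examplebot", "/private/accounts.html"))

# Any other crawler falls back to the "*" group.
print(check("otherbot", "/private/accounts.html"))
print(check("otherbot", "/index.html"))

# Crawl-delay is per group; site_maps() needs Python 3.8 or later.
print(parser.crawl_delay("otherbot"))
print(parser.site_maps())
```

The Sitemap line is reported regardless of which group a crawler matches, since it sits outside the groups.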
What it does NOT do
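robots.txt is not authentication, not access control, and not rate limiting. A disallowed path is still publicly reachable: anyone can fetch it directly, and listing a path in robots.txt also advertises that it exists. If you want to prevent access to content, use authentication. If you want to prevent indexing, use X-Robots-Tag headers or <meta name="robots"> tags with a noindex value on the page itself, and do not also disallow that page in robots.txt, because a crawler that never fetches the page never sees the noindex directive. robots.txt only tells well-behaved crawlers what to skip.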
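To make the indexing point concrete, here is a minimal sketch of sending an X-Robots-Tag header from a Python WSGI application. The names noindex_middleware and demo_app are hypothetical, and a real deployment would more likely set the header in the web server or framework configuration; the only fixed part is the header itself.

```python
def noindex_middleware(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the response has exactly one.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical page that should stay out of search indexes."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<!doctype html><title>Draft</title>"]


# The wrapped app can be served with any WSGI server,
# for example wsgiref.simple_server.make_server("", 8000, app).
app = noindex_middleware(demo_app)
```

The page stays fetchable by anyone; the header only asks compliant indexers not to list it.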
How custom crawlers should handle it
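Fetch /robots.txt once per host (strictly, once per scheme, host, and port) at the start of a crawl, parse it (Python's urllib.robotparser or a third-party library), and check every candidate URL against it before fetching. Cache the parsed rules for the duration of the crawl, refreshing them roughly once a day on long crawls. Treat Crawl-delay as a minimum gap between requests to that host. If the file is missing (a 4xx response such as 404), default to "allow all"; that is the convention. If it is unreachable (a 5xx response or a network error), the convention in RFC 9309 is the opposite: assume everything is disallowed until the file can be fetched again.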
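The handling described above might be sketched as follows. This is one possible shape, not a reference implementation: the class name RobotsGate, the "examplebot" user agent, and the injected fetch callable (which is assumed to follow redirects and return a status code and body text) are all hypothetical. Missing files (4xx) are treated as "allow all", while server errors and network failures are treated as "disallow all", following RFC 9309.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Per-origin robots.txt cache plus a simple politeness delay."""

    def __init__(self, fetch, user_agent="examplebot", max_age=24 * 3600,
                 clock=time.monotonic, sleep=time.sleep):
        # fetch: url -> (status_code, body_text); may raise OSError on
        # network failure. It is assumed to follow redirects itself.
        self.fetch = fetch
        self.user_agent = user_agent
        self.max_age = max_age      # seconds before rules are re-fetched
        self.clock = clock
        self.sleep = sleep
        self._rules = {}            # origin -> (parser, fetched_at)
        self._last_hit = {}         # origin -> time of last request

    def _load(self, origin):
        parser = RobotFileParser()
        try:
            status, body = self.fetch(origin + "/robots.txt")
        except OSError:
            status, body = None, ""
        if status is not None and 200 <= status < 300:
            parser.parse(body.splitlines())
        elif status is not None and 400 <= status < 500:
            parser.allow_all = True      # missing file: no restrictions
        else:
            parser.disallow_all = True   # unreachable: stay out for now
        return parser

    def _rules_for(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        cached = self._rules.get(origin)
        now = self.clock()
        if cached is None or now - cached[1] > self.max_age:
            cached = (self._load(origin), now)
            self._rules[origin] = cached
        return origin, cached[0]

    def allowed(self, url):
        """Check a candidate URL against the cached rules for its origin."""
        _, parser = self._rules_for(url)
        return parser.can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        """Sleep until Crawl-delay (if any) has elapsed for this origin."""
        origin, parser = self._rules_for(url)
        delay = parser.crawl_delay(self.user_agent) or 0
        last = self._last_hit.get(origin)
        now = self.clock()
        if last is not None and now - last < delay:
            self.sleep(delay - (now - last))
        self._last_hit[origin] = self.clock()
```

A crawl loop would call allowed(url) for each candidate and, when it returns True, wait_turn(url) immediately before the request. Injecting fetch, clock, and sleep keeps the class independent of any particular HTTP library and makes it testable without network access.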
