Why crawlers should check the sitemap first
If your goal is to grab content pages (not literally every link on a site), the sitemap is usually a more complete and more efficient source than link-following. Picture a site with 100,000 articles buried behind faceted navigation (filter menus by date, category, tag): walking it link by link is a nightmare. The sitemap lists every article as a flat list of URLs, in one file or a small set of files. Fetch /sitemap.xml (or whichever location the site declares in a Sitemap: line in robots.txt), parse it, and you have the URL list. Then scrape each URL directly. On a site like this, that can cut crawl time by a factor of 10 to 100, because you never have to load the navigation pages.
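The fetch-and-parse step can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a definitive implementation: the function names, the User-Agent string, and the example.com URLs are illustrative assumptions, and gzip-compressed sitemaps are not handled.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_bytes(url, timeout=10.0):
    """Download a URL and return the raw body (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}  # assumed UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_sitemap_urls(xml_bytes):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls


# Offline demonstration with an inline document (illustrative URLs).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/articles/1</loc></url>"
    b"<url><loc>https://example.com/articles/2</loc></url>"
    b"</urlset>"
)
print(parse_sitemap_urls(SAMPLE))
```

Against a live site the two helpers chain together as parse_sitemap_urls(fetch_bytes("https://example.com/sitemap.xml")), where the URL is a placeholder for the real sitemap location.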
Sitemap index files
Large sites split their sitemap into several files tied together by an index: a sitemap whose only job is to point to other sitemaps. The split is often forced, because the sitemap protocol caps a single file at 50,000 URLs. For example, /sitemap_index.xml points to /sitemap-articles-1.xml, /sitemap-articles-2.xml, and so on. Your crawler should handle this: fetch the index, fetch each child sitemap it lists, then join the URL lists together. Site owners often split files by content type (articles, products, categories), so you can fetch only the child sitemaps for the section you care about.
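One way to handle both plain sitemaps and index files is to look at the root element and queue up any child sitemaps. The sketch below assumes the caller supplies a fetch function that returns the XML body for a URL; collect_urls, the only filter, and the max_files safety cap are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's "{uri}tag" notation (sitemaps.org protocol).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, only=None, max_files=1000):
    """Return page URLs from a sitemap or a sitemap index.

    fetch:     callable taking a URL and returning the XML body (assumed).
    only:      optional substring; child sitemaps whose URL lacks it are skipped.
    max_files: cap on how many sitemap files are fetched, as a safety limit.
    """
    pages = []
    seen_pages = set()
    seen_files = set()
    queue = [sitemap_url]
    while queue and len(seen_files) < max_files:
        current = queue.pop(0)
        if current in seen_files:
            continue  # guards against an index that lists the same file twice
        seen_files.add(current)
        root = ET.fromstring(fetch(current))
        if root.tag == NS + "sitemapindex":
            # Index file: queue each child sitemap that passes the filter.
            for loc in root.findall(f"{NS}sitemap/{NS}loc"):
                child = (loc.text or "").strip()
                if child and (only is None or only in child):
                    queue.append(child)
        else:
            # Regular sitemap: collect page URLs, dropping duplicates.
            for loc in root.findall(f"{NS}url/{NS}loc"):
                page = (loc.text or "").strip()
                if page and page not in seen_pages:
                    seen_pages.add(page)
                    pages.append(page)
    return pages
```

Calling collect_urls(index_url, fetch, only="articles") would then restrict the crawl to child sitemaps whose file name contains "articles", which matches the split-by-content-type layout described above. Passing fetch in as a parameter keeps the function testable without network access.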
When the sitemap is missing or stale
Many smaller sites either have no sitemap or have a stale one that has not been regenerated in months. If there is no sitemap, fall back to link-following or use the site's news/RSS feeds (auto-updating lists of recent posts). If there is a sitemap but it is stale, combine the two: use the sitemap for the bulk of the URLs, and add a quick recent-changes crawl (the homepage plus the first few category pages) to catch new pages the sitemap missed.
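The combined approach for a stale sitemap can be sketched as a merge: start from the sitemap URLs, then add any same-site links found on a handful of seed pages. This is a sketch under stated assumptions: merge_with_recent, links_on_page, and LinkCollector are hypothetical names, the caller picks the seed pages (homepage plus a few category pages), and fetch is an assumed callable that returns a page's HTML as text.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_on_page(page_url, html):
    """Return absolute, fragment-free, same-host links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            found.append(absolute)
    return found


def merge_with_recent(sitemap_urls, seed_pages, fetch):
    """Sitemap URLs first, then links from seed pages the sitemap lacks."""
    merged = list(dict.fromkeys(sitemap_urls))  # de-duplicate, keep order
    known = set(merged)
    for seed in seed_pages:
        for link in links_on_page(seed, fetch(seed)):
            if link not in known:
                known.add(link)
                merged.append(link)
    return merged
```

Note that this picks up every same-host link on the seed pages, including navigation links, so a real crawler would likely add a URL-pattern filter for the content section it cares about.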
