Why crawlers should check the sitemap first
For any crawl that targets content URLs (rather than every link on the site), the sitemap is usually a more complete and more efficient source than link-following. A site with 100,000 articles linked behind faceted navigation is a nightmare to crawl by walking links; the sitemap lists them flat. Pull /sitemap.xml, parse it, and you have the URL list; then scrape each URL directly. On sites like this, that can cut crawl time by 10-100x.
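As a rough illustration of the "pull, parse, scrape" step, here is a minimal sketch using only the Python standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema; the function names and the example.com URLs are placeholders, not part of any particular crawler.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace declared by sitemaps that follow the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_sitemap(url, timeout=10):
    """Download a sitemap and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls


# Placeholder document standing in for a real /sitemap.xml response.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/articles/1</loc></url>
  <url><loc>https://example.com/articles/2</loc></url>
</urlset>"""

article_urls = parse_sitemap_urls(EXAMPLE_SITEMAP)
```

In a real crawl, the bytes returned by fetch_sitemap("https://example.com/sitemap.xml") would be passed to parse_sitemap_urls in place of the inline example, and each resulting URL would then be scraped directly.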
Sitemap index files
Large sites split their sitemap into multiple files referenced by an index: /sitemap_index.xml points to /sitemap-articles-1.xml, /sitemap-articles-2.xml, and so on. Your crawler should handle this case: fetch the index, fetch each child sitemap, and concatenate the URL lists. Site owners often partition by content type (articles, products, categories), so you can target just the section you care about.
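The index case might be handled along these lines. This is a sketch under a few assumptions: the files use the sitemaps.org 0.9 namespace, the caller supplies a fetch callable that returns the XML for a URL, and child sitemaps are selected by a simple substring match on their URL. The name collect_urls, the name_filter parameter, and the depth limit are illustrative choices.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, name_filter=None, max_depth=3):
    """Expand a sitemap or sitemap index into a flat list of page URLs.

    fetch: callable that takes a URL and returns the XML as str or bytes.
    name_filter: optional substring; child sitemaps without it are skipped.
    max_depth: guard against index files that reference each other.
    """
    root = ET.fromstring(fetch(sitemap_url))
    locs = [(loc.text or "").strip() for loc in root.iter(NS + "loc")]
    locs = [loc for loc in locs if loc]

    if root.tag != NS + "sitemapindex":
        return locs  # plain <urlset>: the <loc> values are the page URLs

    if max_depth <= 0:
        return []
    urls = []
    for child in locs:
        if name_filter and name_filter not in child:
            continue
        urls.extend(collect_urls(child, fetch, name_filter, max_depth - 1))
    return urls


def _urlset(*urls):
    """Build a tiny <urlset> document for the demo below."""
    body = "".join(f"<url><loc>{u}</loc></url>" for u in urls)
    return f'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">{body}</urlset>'


# Placeholder site: an index pointing at two child sitemaps.
BASE = "https://example.com"
PAGES = {
    f"{BASE}/sitemap_index.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        f"<sitemap><loc>{BASE}/sitemap-articles-1.xml</loc></sitemap>"
        f"<sitemap><loc>{BASE}/sitemap-products-1.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    f"{BASE}/sitemap-articles-1.xml": _urlset(f"{BASE}/articles/1", f"{BASE}/articles/2"),
    f"{BASE}/sitemap-products-1.xml": _urlset(f"{BASE}/products/1"),
}

all_urls = collect_urls(f"{BASE}/sitemap_index.xml", PAGES.__getitem__)
only_articles = collect_urls(f"{BASE}/sitemap_index.xml", PAGES.__getitem__, name_filter="articles")
```

Passing fetch in as a parameter keeps the traversal independent of the HTTP client, so the same function can be exercised against an in-memory dictionary, as here, or against live requests.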
When the sitemap is missing or stale
Many smaller sites either lack a sitemap or have one that has not been regenerated in months. If there is no sitemap, fall back to link-following or use the news/RSS feeds. If the site publishes a sitemap but it is stale, combine the sitemap (for the bulk) with a recent-changes crawl (homepage plus the first few category pages) to catch additions the sitemap missed.
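One possible way to decide when the recent-changes crawl is needed is to look at the newest <lastmod> value in the sitemap. The sketch below rests on assumptions that are not in the text above: the site actually fills in <lastmod> (the field is optional), a 30-day threshold counts as stale, and a sitemap with no usable <lastmod> is treated as stale to be safe. All function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def newest_lastmod(xml_text):
    """Return the most recent <lastmod> as an aware datetime, or None."""
    newest = None
    for node in ET.fromstring(xml_text).iter(NS + "lastmod"):
        raw = (node.text or "").strip()
        if raw.endswith("Z"):
            # Older Python versions reject a trailing "Z" in fromisoformat.
            raw = raw[:-1] + "+00:00"
        try:
            stamp = datetime.fromisoformat(raw)
        except ValueError:
            continue  # skip empty or unparseable values
        if stamp.tzinfo is None:
            # Assumption: values without an offset are treated as UTC.
            stamp = stamp.replace(tzinfo=timezone.utc)
        if newest is None or stamp > newest:
            newest = stamp
    return newest


def is_stale(xml_text, now, max_age_days=30):
    """True if the newest <lastmod> is older than max_age_days, or absent."""
    newest = newest_lastmod(xml_text)
    if newest is None:
        return True
    return now - newest > timedelta(days=max_age_days)


def merge_urls(sitemap_urls, recent_urls):
    """Sitemap URLs first, then recent-crawl additions, without duplicates."""
    return list(dict.fromkeys([*sitemap_urls, *recent_urls]))


# Placeholder sitemap whose newest entry is dated 1 March 2024.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-03-01T12:00:00Z</lastmod></url>
</urlset>"""

stale_in_june = is_stale(EXAMPLE, now=datetime(2024, 6, 1, tzinfo=timezone.utc))
combined = merge_urls(
    ["https://example.com/a", "https://example.com/b"],
    ["https://example.com/b", "https://example.com/c"],
)
```

When is_stale returns True, the URLs found on the homepage and the first few category pages would be passed as recent_urls, so the sitemap still supplies the bulk and the crawl only contributes what is missing.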
