What Is a Sitemap in Web Crawling?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority. Sitemaps were designed for search engines, but they are gold for custom crawlers — pulling the sitemap gives you the site's preferred URL list without crawling at all. For most content sites this is faster, cheaper, and more complete than link-following.

Quick facts

Locations: /sitemap.xml, /sitemap_index.xml, or listed in robots.txt
Format: XML following the sitemaps.org schema
Size limit: 50,000 URLs / 50 MB per file; large sites use index files
Per-URL metadata: loc, lastmod, changefreq, priority
Crawl benefit: skip link traversal entirely and go straight to known URLs
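
As a rough illustration of the per-URL metadata listed above, the sketch below parses a small sitemap with Python's standard library and reads each field. The SAMPLE document and the parse_entries helper are made up for this example; real sitemaps often omit everything except loc, so missing fields come back as None.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org schema
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample document, invented for this example
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/articles/1</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/articles/2</loc>
  </url>
</urlset>"""

def parse_entries(xml_text):
    """Return one dict per <url> entry; optional fields absent from the XML are None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("s:url", NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            text = url.findtext(f"s:{field}", default=None, namespaces=NS)
            entry[field] = text.strip() if text else None
        entries.append(entry)
    return entries

entries = parse_entries(SAMPLE)
for e in entries:
    print(e["loc"], e["lastmod"], e["changefreq"], e["priority"])
```

The values are kept as strings here; converting lastmod to a date or priority to a float is left to the caller, since sites format these fields inconsistently.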

Why crawlers should check the sitemap first

For any crawl that targets content URLs (not "every link on the site"), the sitemap is usually a more complete and more efficient source than link-following. A site with 100,000 articles hidden behind faceted navigation is a nightmare to crawl by walking links; the sitemap lists them flat. Pull /sitemap.xml, parse it, and you have the URL list, then scrape each URL directly. On sites like this, skipping link traversal can cut crawl time by 10-100x.

Sitemap index files

Large sites split their sitemap into multiple files referenced by an index. /sitemap_index.xml points to /sitemap-articles-1.xml, /sitemap-articles-2.xml, etc. Your crawler should handle this case — fetch the index, fetch each child sitemap, concatenate the URL lists. Site owners often partition by content type (articles, products, categories) so you can target just the section you care about.
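
To target a single section, one option is to filter the child sitemap URLs by name before fetching them. The sketch below assumes the site encodes the content type in the child file name, as in the /sitemap-articles-1.xml pattern above; the child_sitemaps helper and the INDEX sample are hypothetical.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org schema
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical index file, invented for this example
INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-articles-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-articles-2.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products-1.xml</loc></sitemap>
</sitemapindex>"""

def child_sitemaps(index_xml, keyword=None):
    """List child sitemap URLs from an index, optionally keeping only those containing keyword."""
    root = ET.fromstring(index_xml)
    locs = [loc.text.strip() for loc in root.findall("s:sitemap/s:loc", NS) if loc.text]
    if keyword is None:
        return locs
    return [u for u in locs if keyword in u]

article_sitemaps = child_sitemaps(INDEX, keyword="articles")
print(article_sitemaps)
```

A substring match is a blunt filter; it only works when the naming scheme is consistent, so it is worth printing the full list once before relying on it.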

When the sitemap is missing or stale

Many smaller sites either lack a sitemap or have one that has not been regenerated in months. When there is no sitemap, fall back to link-following or use the site's news/RSS feeds. When the sitemap exists but is stale, combine the sitemap (for the bulk) with a recent-changes crawl (homepage plus the first few category pages) to catch additions the sitemap missed.
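
Combining the two sources mostly comes down to merging URL lists without duplicates. A minimal sketch, assuming both lists have already been collected elsewhere; merge_url_lists and the sample URLs are hypothetical.

```python
def merge_url_lists(sitemap_urls, recent_urls):
    """Sitemap URLs first, then recent-crawl URLs the sitemap lacks, with duplicates dropped."""
    seen = set()
    merged = []
    for url in list(sitemap_urls) + list(recent_urls):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged

# Hypothetical inputs: a stale sitemap plus links found on the homepage
sitemap_urls = ["https://example.com/a", "https://example.com/b"]
recent_urls = ["https://example.com/b", "https://example.com/c"]

merged = merge_url_lists(sitemap_urls, recent_urls)
print(merged)
```

URLs are compared as exact strings here, so variants such as a trailing slash or a tracking parameter would count as different pages unless they are normalized first.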

Code example

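python
import requests
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org schema
NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap_urls(url):
    """Return every page URL in a sitemap, following index files recursively."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on 404s and server errors
    # Parse the raw bytes so the parser honours the XML encoding declaration
    root = ET.fromstring(r.content)
    if root.tag.endswith('sitemapindex'):
        # Index file: each <sitemap><loc> points at a child sitemap
        urls = []
        for sm in root.findall('s:sitemap/s:loc', NS):
            if sm.text:
                urls.extend(fetch_sitemap_urls(sm.text.strip()))
        return urls
    # Regular sitemap: each <url><loc> is a page URL
    return [loc.text.strip() for loc in root.findall('s:url/s:loc', NS) if loc.text]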

Frequently asked questions

Where do I find the sitemap?

/sitemap.xml is the conventional location, but the authoritative answer is in /robots.txt: look for a Sitemap: directive. Large sites have multiple sitemaps; check robots.txt for the full list.
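
Reading the directive out of robots.txt takes only a few lines. The sketch below works on text that has already been downloaded; sitemaps_from_robots and the sample file are hypothetical, and the field name is matched case-insensitively since sites write it both ways.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs named in Sitemap: lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # whole-line comment
        # Split on the first colon only; the URL itself contains colons
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

# Hypothetical robots.txt content, invented for this example
ROBOTS = """User-agent: *
Disallow: /admin/
# sitemaps
Sitemap: https://example.com/sitemap_index.xml
sitemap: https://example.com/news-sitemap.xml
"""

found = sitemaps_from_robots(ROBOTS)
print(found)
```

On Python 3.8 or later, the standard library's urllib.robotparser.RobotFileParser also exposes a site_maps() method that returns the same information after the file has been read.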

Can I trust the sitemap to be complete?

Mostly — it represents the site owner's canonical URL set. Pages they do not want indexed will not appear. For a complete crawl, combine sitemap with link-following.

Does the lastmod field actually update?

Sometimes. Treat it as a hint, not ground truth. For change tracking, hash the page content yourself rather than relying on the sitemap.
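
One way to do that hashing is sketched below, under the assumption that the previous fingerprint for each URL is stored somewhere between crawls. The content_fingerprint and has_changed helpers are hypothetical, and collapsing whitespace before hashing is a design choice for this example, not a standard.

```python
import hashlib

def content_fingerprint(html):
    """SHA-256 hex digest of the page text with runs of whitespace collapsed."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(html, previous_fingerprint):
    """True when the page's fingerprint differs from the one stored on the last crawl."""
    return content_fingerprint(html) != previous_fingerprint

# Hypothetical page bodies from two crawls of the same URL
first_crawl = "<h1>Title</h1>\n<p>Body text</p>"
second_crawl = "<h1>Title</h1>   <p>Body text</p>"   # whitespace-only difference
third_crawl = "<h1>Title</h1>\n<p>Updated body text</p>"

stored = content_fingerprint(first_crawl)
print(has_changed(second_crawl, stored), has_changed(third_crawl, stored))
```

Hashing the full HTML will also flag changes in rotating elements such as timestamps or ad markup, so in practice it may be better to hash only the extracted main content.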

Last updated: 2026-05-26