What Is a Sitemap in Web Crawling?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, and priority. XML is a tag-based text format, and a canonical URL is the single "official" address a site picks for a page. Sitemaps were designed for search engines, but they are gold for custom crawlers: fetching the sitemap gives you the site's preferred URL list without following links page by page. For most content sites this is faster, cheaper, and more complete than link-following.

Quick facts

Locations: /sitemap.xml, /sitemap_index.xml, or listed in robots.txt
Format: XML following the sitemaps.org schema
Size limit: 50,000 URLs or 50 MB (uncompressed) per file; large sites use index files
Per-URL metadata: loc, lastmod, changefreq, priority
Crawl benefit: skip link traversal entirely and go straight to known URLs
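
To make the per-URL fields concrete, here is a minimal sketch that reads them from a small hand-written sitemap using only Python's standard library. The sample XML, the example.com URLs, and the function name parse_sitemap_entries are illustrative assumptions, not taken from any real site.

```python
from xml.etree import ElementTree as ET

# Namespace used by the sitemaps.org schema
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample; the URLs and values are made up for illustration
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/articles/first</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/articles/second</loc>
  </url>
</urlset>"""


def parse_sitemap_entries(xml_bytes):
    """Return one dict per <url>, with None for any omitted optional field."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url_el in root.findall("s:url", NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            text = url_el.findtext(f"s:{field}", default=None, namespaces=NS)
            entry[field] = text.strip() if text else None
        entries.append(entry)
    return entries


for entry in parse_sitemap_entries(SAMPLE_SITEMAP):
    print(entry)
```

Only loc is required by the schema, so the sketch returns None for the other three fields when a site leaves them out.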

Why crawlers should check the sitemap first

If your goal is to grab content pages (not literally every link on a site), the sitemap is usually a more complete and more efficient source than link-following. Picture a site with 100,000 articles buried behind faceted navigation (filter menus by date, category, tag): walking it link by link is a nightmare. The sitemap lists every article as a flat list, split across a few files at that size because of the 50,000-URL limit. Fetch /sitemap.xml, parse it, and you have the URL list; then scrape each URL directly. Depending on the site, this can cut crawl time by 10-100x.

Sitemap index files

Large sites split their sitemap into several files tied together by an index — a sitemap whose only job is to point to other sitemaps. For example, /sitemap_index.xml points to /sitemap-articles-1.xml, /sitemap-articles-2.xml, and so on. Your crawler should handle this: fetch the index, fetch each child sitemap it lists, then join the URL lists together. Site owners often split files by content type (articles, products, categories), so you can target just the section you care about.
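
Targeting one section of an index can be sketched as below. The sample index, the example.com URLs, and the helper name child_sitemaps are assumptions for illustration. Filtering by a keyword in the file name also assumes the site names its child sitemaps by content type, which is a convention rather than a guarantee.

```python
from xml.etree import ElementTree as ET

# Namespace used by the sitemaps.org schema
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample index; file names are hypothetical
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-articles-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-articles-2.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products-1.xml</loc></sitemap>
</sitemapindex>"""


def child_sitemaps(index_bytes, keyword=None):
    """List child sitemap URLs, optionally keeping only those containing keyword."""
    root = ET.fromstring(index_bytes)
    locs = [
        el.text.strip()
        for el in root.findall("s:sitemap/s:loc", NS)
        if el.text
    ]
    if keyword is None:
        return locs
    return [loc for loc in locs if keyword in loc]


# Keep only the article sitemaps and ignore products
print(child_sitemaps(SAMPLE_INDEX, keyword="articles"))
```

Each URL returned here would then be fetched and parsed as an ordinary sitemap.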

When the sitemap is missing or stale

Many smaller sites either have no sitemap or one that has not been regenerated in months (a stale sitemap). When that happens, fall back to link-following or use the site's news/RSS feeds (auto-updating lists of recent posts). If a site has a sitemap but it is stale, combine the two: use the sitemap for the bulk of the URLs, and add a quick recent-changes crawl (the homepage plus the first few category pages) to catch new pages the sitemap missed.
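
One rough way to decide whether a sitemap looks stale, and to merge it with a recent-changes crawl, is sketched below. The 90-day threshold, the function names, and the choice to treat date-only lastmod values as UTC are all assumptions made for this example. The parser covers full dates and date-times but not the year-only or year-month forms that the W3C datetime format also allows.

```python
from datetime import datetime, timedelta, timezone


def parse_lastmod(value):
    """Parse a lastmod string into a timezone-aware UTC datetime."""
    text = value.strip()
    # fromisoformat rejects a trailing "Z" before Python 3.11
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Date-only values carry no zone; treating them as UTC is an assumption
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def sitemap_is_stale(lastmods, now, max_age_days=90):
    """True if the newest lastmod is older than max_age_days, or none exist."""
    parsed = [parse_lastmod(v) for v in lastmods if v]
    if not parsed:
        return True
    return now - max(parsed) > timedelta(days=max_age_days)


def merge_urls(sitemap_urls, recent_urls):
    """Combine both sources, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys([*sitemap_urls, *recent_urls]))


now = datetime.now(timezone.utc)
if sitemap_is_stale(["2024-01-15", "2023-12-01T08:00:00Z"], now):
    # Hypothetical URLs: sitemap results plus pages found on the homepage
    urls = merge_urls(
        ["https://example.com/a", "https://example.com/b"],
        ["https://example.com/b", "https://example.com/c"],
    )
    print(urls)
```

Because lastmod is often unreliable, a check like this is only a hint; a sitemap with no lastmod values at all is not necessarily out of date.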

Code example

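python
import requests
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org schema
NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap_urls(url):
    """Return all page URLs from a sitemap, recursing through index files."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on 404s and server errors
    # Parse the raw bytes so the encoding in the XML declaration is honoured
    root = ET.fromstring(r.content)
    if root.tag.endswith('sitemapindex'):
        # Index file: each <sitemap><loc> points to a child sitemap
        urls = []
        for sm in root.findall('s:sitemap/s:loc', NS):
            if sm.text:
                urls.extend(fetch_sitemap_urls(sm.text.strip()))
        return urls
    # Regular sitemap: each <url><loc> is a page URL
    return [loc.text.strip() for loc in root.findall('s:url/s:loc', NS) if loc.text]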

Frequently asked questions

Where do I find the sitemap?

/sitemap.xml is the conventional location, but the authoritative answer is in /robots.txt, the text file that tells crawlers what they may access. Look for a Sitemap: directive (line) there. Large sites often have multiple sitemaps, and robots.txt is usually where you'll find the full list.
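
Pulling the Sitemap: lines out of a robots.txt body takes only a few lines of Python. This is a minimal sketch: the sample file, the example.com URLs, and the function name sitemaps_from_robots are assumptions for illustration, and fetching the file is left out.

```python
# Hand-written robots.txt body; the paths and URLs are made up
SAMPLE_ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap_index.xml
sitemap: https://example.com/news-sitemap.xml  # key is case-insensitive
"""


def sitemaps_from_robots(robots_text):
    """Return the URLs named by Sitemap: lines in a robots.txt body."""
    found = []
    for line in robots_text.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # since the URL itself contains colons
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


print(sitemaps_from_robots(SAMPLE_ROBOTS))
```

On Python 3.8 and later, urllib.robotparser.RobotFileParser also exposes a site_maps() method that returns the same information after the file has been read.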

Can I trust the sitemap to be complete?

Mostly — it represents the site owner's canonical URL set, meaning the pages they want crawlers to find. Pages they do not want indexed (showing up in search results) won't appear. For a truly complete crawl, combine the sitemap with link-following.

Does the lastmod field actually update?

Sometimes. lastmod is supposed to show when a page last changed, but many sites don't keep it accurate — treat it as a hint, not ground truth. To reliably detect changes, hash the page content yourself (compute a short fingerprint and compare it over time) rather than relying on the sitemap.
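
The fingerprint comparison could look something like the sketch below. SHA-256, the in-memory dict used as a store, and the function names are assumptions chosen for the example; a real crawler would persist the fingerprints between runs.

```python
import hashlib


def content_fingerprint(body):
    """Return a SHA-256 hex digest of the page body (str or bytes)."""
    if isinstance(body, str):
        body = body.encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def has_changed(url, body, seen):
    """Compare against the stored fingerprint for url, then update the store.

    A URL that has never been seen counts as changed.
    """
    new = content_fingerprint(body)
    changed = seen.get(url) != new
    seen[url] = new
    return changed


seen = {}  # maps URL -> last fingerprint; in-memory only for this sketch
print(has_changed("https://example.com/a", "<p>hello</p>", seen))
print(has_changed("https://example.com/a", "<p>hello</p>", seen))
print(has_changed("https://example.com/a", "<p>hello again</p>", seen))
```

Hashing the raw HTML will also flag changes in ads, timestamps, or session tokens, so it may be worth hashing only the extracted main content instead.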

Last updated: 2026-05-31