What Is a Sitemap in Web Crawling?

By the Scrappey Research Team

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority. XML is a tag-based text format, and a canonical URL is the single "official" address a site picks for a page. Sitemaps were designed for search engines, but they are just as valuable for custom crawlers: pulling the sitemap gives you the site's preferred URL list without following links page by page. For most content sites this is faster, cheaper, and usually more complete than link-following.

Locations	/sitemap.xml, /sitemap_index.xml, or listed in robots.txt
Format	XML following the sitemaps.org schema
Size limit	50,000 URLs / 50 MB (uncompressed) per file; larger sites use index files
Per-URL metadata	loc, lastmod, changefreq, priority
Crawl benefit	Skip link traversal entirely — go straight to known URLs
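
To make the per-URL metadata concrete, here is a minimal sketch that reads the four fields from a sitemap document. The `SitemapEntry` class and the `parse_entries` helper are hypothetical names chosen for illustration, and the sketch assumes the document is a standard `<urlset>` in the sitemaps.org 0.9 namespace.

```python
from dataclasses import dataclass
from typing import List, Optional
from xml.etree import ElementTree as ET

# Namespace from the sitemaps.org 0.9 schema.
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class SitemapEntry:  # hypothetical container, not part of any library
    loc: str
    lastmod: Optional[str] = None      # kept as the raw string from the file
    changefreq: Optional[str] = None
    priority: Optional[float] = None


def _text(node, tag):
    """Return the stripped text of a child element, or None if absent or empty."""
    value = node.findtext(f"s:{tag}", default="", namespaces=NS)
    return value.strip() or None


def parse_entries(sitemap_xml) -> List[SitemapEntry]:
    """Parse a <urlset> document (str or bytes) into SitemapEntry records."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("s:url", NS):
        loc = _text(url, "loc")
        if loc is None:
            continue  # an entry without <loc> is unusable
        raw_priority = _text(url, "priority")
        try:
            priority = float(raw_priority) if raw_priority else None
        except ValueError:
            priority = None  # tolerate malformed values
        entries.append(
            SitemapEntry(loc, _text(url, "lastmod"), _text(url, "changefreq"), priority)
        )
    return entries
```

Only `loc` is required; the other three fields are optional, so the sketch leaves them as `None` when a site omits them.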

Why crawlers should check the sitemap first

If your goal is to grab content pages (not literally every link on a site), the sitemap is usually a more complete and more efficient source than link-following. Picture a site with 100,000 articles buried behind faceted navigation (filter menus by date, category, tag): walking it link by link is a nightmare. The sitemap lists every article flat, although at 100,000 URLs it has to be split across at least two files under an index because of the 50,000-URL limit. Fetch /sitemap.xml, parse it, and you have the URL list, then scrape each URL directly. On a site like this, that can cut crawl time by roughly 10-100x.

Sitemap index files

Large sites split their sitemap into several files tied together by an index — a sitemap whose only job is to point to other sitemaps. For example, /sitemap_index.xml points to /sitemap-articles-1.xml, /sitemap-articles-2.xml, and so on. Your crawler should handle this: fetch the index, fetch each child sitemap it lists, then join the URL lists together. Site owners often split files by content type (articles, products, categories), so you can target just the section you care about.
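
Targeting one section can be sketched as a filter over the index's child entries. This is an illustrative helper, not a standard API: `child_sitemaps` and its `keyword` parameter are made-up names, and matching on a substring of the file name assumes the site names its files by content type, which is common but not guaranteed.

```python
from xml.etree import ElementTree as ET

# Namespace from the sitemaps.org 0.9 schema.
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml, keyword=None):
    """List the child sitemap URLs in a <sitemapindex> document (str or bytes).

    If keyword is given, keep only the children whose URL contains it.
    """
    root = ET.fromstring(index_xml)
    locs = [
        loc.text.strip()
        for loc in root.findall("s:sitemap/s:loc", NS)
        if loc.text
    ]
    if keyword is None:
        return locs
    return [url for url in locs if keyword in url]
```

Calling `child_sitemaps(index_xml, "articles")` would return only the article sitemaps, which you can then fetch and parse one by one.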

When the sitemap is missing or stale

Many smaller sites either have no sitemap or one that has not been regenerated in months (a stale sitemap). When that happens, fall back to link-following or use the site's news/RSS feeds (auto-updating lists of recent posts). If a site has a sitemap but it is stale, combine the two: use the sitemap for the bulk of the URLs, and add a quick recent-changes crawl (the homepage plus the first few category pages) to catch new pages the sitemap missed.
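
One rough way to act on this is sketched below. It guesses staleness from the newest `lastmod` value and merges the sitemap's URLs with those from a recent-changes crawl. The function names, the 90-day default, and the choice to treat a sitemap with no usable `lastmod` as stale are all assumptions made for illustration; `lastmod` itself is only a hint, so treat the result as a heuristic.

```python
from datetime import datetime, timedelta, timezone
from xml.etree import ElementTree as ET

# Namespace from the sitemaps.org 0.9 schema.
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def newest_lastmod(sitemap_xml):
    """Return the most recent parseable <lastmod> as an aware datetime, or None."""
    root = ET.fromstring(sitemap_xml)
    stamps = []
    for node in root.findall(".//s:lastmod", NS):
        # Older Python versions do not accept a trailing "Z", so normalise it.
        text = (node.text or "").strip().replace("Z", "+00:00")
        try:
            stamp = datetime.fromisoformat(text)
        except ValueError:
            continue  # skip empty or unsupported date formats
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: treat as UTC
        stamps.append(stamp)
    return max(stamps) if stamps else None


def looks_stale(sitemap_xml, max_age_days=90, now=None):
    """Heuristic: stale if the newest lastmod is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    newest = newest_lastmod(sitemap_xml)
    if newest is None:
        return True  # assumption: no usable lastmod means we cannot trust freshness
    return now - newest > timedelta(days=max_age_days)


def merge_urls(sitemap_urls, recent_urls):
    """Join both lists, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys([*sitemap_urls, *recent_urls]))
```

When `looks_stale` returns `True`, run the quick recent-changes crawl and pass its URLs to `merge_urls` alongside the sitemap's list.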

Code example

python

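import requests
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap_urls(url):
    """Return every page URL in a sitemap, following index files recursively."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on 404/5xx instead of parsing an error page
    # Parse the raw bytes so the XML declaration decides the encoding.
    root = ET.fromstring(r.content)
    if root.tag.endswith('sitemapindex'):
        # An index lists child sitemaps: fetch each one and join the results.
        urls = []
        for sm in root.findall('s:sitemap/s:loc', NS):
            if sm.text:
                urls.extend(fetch_sitemap_urls(sm.text.strip()))
        return urls
    # A regular sitemap: collect the <loc> of every <url> entry.
    return [loc.text.strip() for loc in root.findall('s:url/s:loc', NS) if loc.text]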

Related terms

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed …

What Is the robots.txt Protocol?

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch. Thi…

What Is Crawl Budget?

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time…

What Is Link Extraction?

Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit …

Best Web Scraping API for SEO Audits

The best web scraping API for SEO audits combines reliable SERP scraping (Google, Bing, regional engines) with on-page extraction — title, m…

What Is a 404 Error?

HTTP 404 Not Found is the server's way of saying "I understood your request, but there is nothing at this address." The server is working fi…

Frequently asked questions

Where do I find the sitemap?

/sitemap.xml is the conventional location, but the authoritative answer is in /robots.txt — the text file that tells crawlers what they may access. Look for a Sitemap: directive (line) there. Large sites have multiple sitemaps, so robots.txt is where you'll find the full list.
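
The lookup described above can be sketched as a small parser over the robots.txt text. The helper name `sitemaps_from_robots` and its `base_url` parameter are hypothetical, and falling back to the two conventional paths when no directive is present is an assumption, not a rule.

```python
from urllib.parse import urljoin


def sitemaps_from_robots(robots_txt, base_url):
    """Collect the URLs named by Sitemap: lines in a robots.txt body.

    Falls back to the conventional locations when no directive is present.
    """
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(urljoin(base_url, value.strip()))
    if found:
        return found
    # Assumption: try the usual paths when robots.txt lists nothing.
    return [
        urljoin(base_url, "/sitemap.xml"),
        urljoin(base_url, "/sitemap_index.xml"),
    ]
```

The directive name is matched case-insensitively, and commented-out lines are skipped because their key starts with `#`. The fallback URLs are only candidates, so expect some of them to return 404.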

Can I trust the sitemap to be complete?

Mostly — it represents the site owner's canonical URL set, meaning the pages they want crawlers to find. Pages they do not want indexed (showing up in search results) won't appear. For a truly complete crawl, combine the sitemap with link-following.

Does the lastmod field actually update?

Sometimes. lastmod is supposed to show when a page last changed, but many sites don't keep it accurate — treat it as a hint, not ground truth. To reliably detect changes, hash the page content yourself (compute a short fingerprint and compare it over time) rather than relying on the sitemap.
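
A minimal sketch of the hashing approach, assuming you keep a dictionary of fingerprints between runs. The names `content_fingerprint`, `has_changed`, and `seen` are illustrative. Collapsing whitespace before hashing is one possible normalisation; real pages often need more, such as stripping timestamps or ads, to avoid false positives.

```python
import hashlib


def content_fingerprint(body):
    """Return a SHA-256 hex digest of the page text with whitespace collapsed."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(url, body, seen):
    """Compare the page against the stored fingerprint and update the store.

    `seen` maps URL -> fingerprint from the previous run. A URL that has
    never been seen counts as changed.
    """
    new = content_fingerprint(body)
    changed = seen.get(url) != new
    seen[url] = new
    return changed
```

Persist `seen` between crawls (for example as a JSON file or a database table) so that each run compares against the previous one.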

Last updated: 2026-05-31