What Is Crawl Budget?


Crawl budget is the upper limit on how much of a site a crawler will fetch in a given run — measured in pages, requests, or wall-clock time. The term originated in SEO (Google's crawl budget for a given site) but applies equally to any custom crawler. Without a budget, a crawler can run indefinitely on a large site; with a poorly-sized budget, you miss the pages you actually care about. The skill is spending the budget on the URLs that matter.

Quick facts

Units: pages, requests, wall-clock time, or all three
Per-host caps: avoid being abusive; the per-host limit is its own budget
Spend on: content URLs, not pagination/sort/filter combinations
Common waste: faceted navigation, search results, infinite calendar URLs
SEO equivalent: Google's crawl budget; same concept, but Google sets the limit rather than you

Why budgets exist

Real sites have effectively infinite URLs: pagination, sorting, filtering, search, calendar pages. A naive crawl walks every combination and hits millions of low-value pages before reaching the content you wanted. Setting a budget forces you to think about priority: which URL patterns are worth crawling, in what order, and where to stop.
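To make the growth concrete, here is a rough count for a single listing page. The facet names and sizes are invented for illustration and are not taken from any real site.

```python
import math

# Hypothetical facets for one category listing (illustrative values only).
# An empty string stands for "facet not applied".
facets = {
    "sort": ["price", "name", "newest", "rating"],
    "color": ["", "red", "blue", "green", "black"],
    "size": ["", "s", "m", "l", "xl"],
    "page": [str(n) for n in range(1, 51)],
}

def count_variants(facets):
    """Number of distinct query-string combinations for one listing page."""
    return math.prod(len(values) for values in facets.values())

variants_per_category = count_variants(facets)  # 4 * 5 * 5 * 50 = 5000
assumed_categories = 200                        # hypothetical site size
total_listing_urls = variants_per_category * assumed_categories  # 1,000,000
```

Under these assumptions, a site with a few thousand real products exposes a million listing URLs, almost all of them reorderings of the same content.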

Spending the budget well

The standard playbook: pull the sitemap first to get the canonical list of content URLs, then crawl by section in priority order, depth-limit at 3-5 hops from each seed, and exclude URL patterns that explode combinatorially (faceted filters, sort variants, session IDs). When the budget is hit, log what was reached and what was missed — the next run can pick up where this one stopped.
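The filtering and ordering parts of that playbook can be sketched as below. This is a minimal illustration, not a complete crawler: the excluded parameter names, the path pattern, the depth limit of 4, and the `Frontier` class are all assumptions chosen for the example and should be tuned per site.

```python
import heapq
import re
from urllib.parse import urlparse, parse_qs

# Assumed markers of low-value URLs; adjust for the site being crawled.
EXCLUDED_PARAMS = {"sort", "order", "filter", "sessionid", "sid"}
EXCLUDED_PATHS = re.compile(r"^/(search|calendar)(/|$)")
MAX_DEPTH = 4  # hops from a seed; the text suggests 3-5

def is_worth_crawling(url, depth):
    """Reject URLs that are too deep or match a combinatorial pattern."""
    if depth > MAX_DEPTH:
        return False
    parts = urlparse(url)
    if EXCLUDED_PATHS.match(parts.path):
        return False
    params = {name.lower() for name in parse_qs(parts.query, keep_blank_values=True)}
    return not (params & EXCLUDED_PARAMS)

class Frontier:
    """Priority queue of URLs; a lower priority number is fetched first."""

    def __init__(self):
        self._heap = []
        self._queued = set()

    def push(self, url, depth, priority):
        """Queue a URL; return False if it is a duplicate or filtered out."""
        if url in self._queued or not is_worth_crawling(url, depth):
            return False
        self._queued.add(url)
        heapq.heappush(self._heap, (priority, depth, url))
        return True

    def pop(self):
        """Return the (url, depth) pair with the best priority."""
        _priority, depth, url = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)

frontier = Frontier()
frontier.push("https://example.com/docs/intro", depth=0, priority=0)       # sitemap seed: queued
frontier.push("https://example.com/blog/?sort=date", depth=1, priority=1)  # sort variant: rejected
```

Whatever is still in the frontier when the budget runs out is the "missed" list, so logging its contents gives the next run its starting point.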

SEO crawl budget

In SEO, "crawl budget" refers to how many of your site's URLs Googlebot can and wants to crawl. Google sets it, based on how much crawling your server can handle (fast, healthy responses raise the limit) and how much demand there is for your content (popularity and how stale the indexed copies are). You spend it well by exposing fast, canonical URLs and not wasting it on duplicate content. The principle is the same as for a custom crawler: spend the budget on URLs that matter and prevent waste on URLs that do not.

Code example

python
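import time
from urllib.parse import urlparse

class BudgetedCrawler:
    """Tracks a global page cap, a wall-clock deadline, and a per-host cap."""

    def __init__(self, max_pages, max_seconds, max_per_host):
        self.max_pages = max_pages
        # monotonic() is unaffected by system clock adjustments
        self.deadline = time.monotonic() + max_seconds
        self.per_host_cap = max_per_host
        self.host_counts = {}  # host -> pages fetched so far
        self.seen = set()      # URLs already fetched

    def can_fetch(self, url):
        if url in self.seen:  # never spend budget twice on one URL
            return False
        if len(self.seen) >= self.max_pages:
            return False
        if time.monotonic() > self.deadline:
            return False
        host = urlparse(url).netloc
        return self.host_counts.get(host, 0) < self.per_host_cap

    def record_fetch(self, url):
        # Call after each fetch; otherwise the counters never advance.
        host = urlparse(url).netloc
        self.seen.add(url)
        self.host_counts[host] = self.host_counts.get(host, 0) + 1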

Frequently asked questions

How big should my crawl budget be?

Start with enough pages to cover the content you care about, plus a 20% buffer. For unknown sites, run a small recon crawl (~500 pages) to estimate the site's shape, then size the real crawl accordingly.
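The arithmetic is simple enough to sketch. The helper names here are hypothetical, and `content_share` is just one assumed way to summarize a recon sample: it reports what fraction of the sampled URLs a caller-supplied predicate classifies as content.

```python
import math

def content_share(sample_urls, is_content):
    """Fraction of a recon sample that the caller's predicate marks as content."""
    if not sample_urls:
        return 0.0
    hits = sum(1 for url in sample_urls if is_content(url))
    return hits / len(sample_urls)

def size_budget(content_pages, buffer_percent=20):
    """Page budget: the content you want plus a percentage buffer, rounded up."""
    return content_pages + math.ceil(content_pages * buffer_percent / 100)

# Example: 1,000 content pages with the default 20% buffer.
budget = size_budget(1000)  # 1200
```

A low content share in the recon sample is a sign that the exclusion patterns need tightening before the full crawl, not that the budget should grow.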

Should I budget per host or globally?

Both. A global cap prevents runaway jobs; a per-host cap keeps you from overloading any one site within an otherwise reasonable global crawl.

What if my budget runs out mid-crawl?

Persist the frontier and seen-set to disk. The next run loads them and resumes. Re-running from scratch wastes both your budget and the target site's bandwidth.
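A minimal sketch of that persistence, assuming the frontier is a list of URL strings and the state fits comfortably in one JSON file. The function names and the file layout are illustrative choices, not a standard format.

```python
import json
import os
import tempfile

def save_state(path, frontier, seen):
    """Write the frontier and seen-set to disk, replacing any previous file atomically."""
    state = {"frontier": list(frontier), "seen": sorted(seen)}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file first so a crash mid-write cannot corrupt the old state.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)

def load_state(path):
    """Return (frontier, seen); both are empty if no saved state exists yet."""
    if not os.path.exists(path):
        return [], set()
    with open(path, encoding="utf-8") as handle:
        state = json.load(handle)
    return state["frontier"], set(state["seen"])
```

Saving periodically during the crawl, rather than only when the budget is hit, also protects against losing progress if the process is killed.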

Last updated: 2026-05-26