What Is Crawl Budget?

By the Scrappey Research Team

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run, measured in pages, requests, or wall-clock time. The term started in SEO (Google's crawl budget for a given site), but the idea applies to any custom crawler you write. Without a budget, a crawler can run indefinitely on a large site; with a budget that is too small, it misses the pages you actually wanted. The real skill is spending the budget on the URLs that matter.

Quick facts

Units: Pages, requests, wall time, or all three
Per-host caps: Avoid being abusive; per-host limit is its own budget
Spend on: Content URLs, not pagination/sort/filter combinations
Common waste: Faceted nav, search results, infinite calendar URLs
SEO equivalent: Google's crawl budget; same concept, server-controlled

Why budgets exist

Real sites have a near-endless supply of URLs: pagination, sorting, filtering, search results, calendar pages. A naive crawl follows every combination and hits millions of low-value pages before it ever reaches the content you came for. A budget forces you to set priorities: which URL patterns are worth crawling, in what order, and where to stop.

Spending the budget well

The standard playbook: grab the sitemap first to get the canonical list of content URLs, then crawl section by section in priority order. Limit depth to 3-5 hops (link clicks) from each seed URL, and skip patterns that explode into endless combinations, such as faceted filters, sort variants, and session IDs (per-visit identifiers embedded in the URL). When you hit the budget, log what you reached and what you missed, so the next run can pick up where this one stopped.
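The depth limit and skip rules above can be sketched as a single filter that runs before a URL enters the queue. This is a minimal illustration, not a definitive implementation: the function name `should_crawl`, the parameter names in `SKIP_PARAMS`, and the default depth of 4 are all assumptions chosen for the example, and real sites will need their own patterns.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical query parameters that signal sort, filter, or session variants.
# These names are assumptions; inspect the target site to find the real ones.
SKIP_PARAMS = {"sort", "order", "filter", "sessionid", "sid"}

# Assumed default, inside the 3-5 hop range mentioned above.
MAX_DEPTH = 4


def should_crawl(url, depth, max_depth=MAX_DEPTH, skip_params=SKIP_PARAMS):
    """Return True if the URL is worth spending budget on.

    depth is the number of link clicks from the seed URL.
    """
    if depth > max_depth:
        return False  # too far from the seed
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    # Reject the URL if any query parameter is on the skip list.
    return not any(name.lower() in skip_params for name in query)
```

Filtering before enqueueing, rather than after fetching, means a rejected URL costs no request at all.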

SEO crawl budget

In SEO, "crawl budget" means how many of your URLs Googlebot can and wants to crawl, a limit Google sets based on how fast and reliably your server responds and on crawl demand (how popular your pages are and how often they change). You spend it wisely by exposing fast, canonical (single official version) URLs and not wasting it on duplicate content. The principle matches a custom crawler exactly: spend the budget on URLs that matter, and prevent waste on URLs that do not.
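On the custom-crawler side, one way to avoid paying twice for duplicate content is to reduce every URL to a canonical form before checking whether it has been seen. The sketch below is one possible approach under stated assumptions: `canonicalize` and the `NOISE_PARAMS` list are hypothetical names for this example, and which parameters are safe to drop depends on the site.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical parameters assumed not to change page content.
NOISE_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url, noise_params=NOISE_PARAMS):
    """Return a normalized URL so duplicate variants compare equal."""
    parts = urlparse(url)
    # Drop noise parameters and sort the rest so parameter order is irrelevant.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in noise_params
    )
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),   # host names are case-insensitive
        parts.path or "/",      # treat an empty path as the root
        parts.params,
        urlencode(kept),
        "",                     # fragments never reach the server
    ))
```

Storing `canonicalize(url)` in the seen-set instead of the raw URL means that tracking or session variants of a page count against the budget only once.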

Code example

python
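import time
from urllib.parse import urlparse

class BudgetedCrawler:
    def __init__(self, max_pages, max_seconds, max_per_host):
        self.max_pages = max_pages
        # monotonic() is unaffected by system clock adjustments
        self.deadline = time.monotonic() + max_seconds
        self.per_host_cap = max_per_host
        self.host_counts = {}  # host -> pages fetched from that host
        self.seen = set()      # URLs already fetched

    def can_fetch(self, url):
        """Return True if fetching url stays within every budget."""
        if url in self.seen:
            return False  # never spend budget on a repeat
        if len(self.seen) >= self.max_pages:
            return False  # global page budget spent
        if time.monotonic() > self.deadline:
            return False  # wall-clock budget spent
        host = urlparse(url).netloc
        return self.host_counts.get(host, 0) < self.per_host_cap

    def record(self, url):
        """Call after each fetch so the counters reflect real spend."""
        self.seen.add(url)
        host = urlparse(url).netloc
        self.host_counts[host] = self.host_counts.get(host, 0) + 1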

Frequently asked questions

How big should my crawl budget be?

Start with enough pages to cover the content you care about, plus a 20% buffer. For a site you do not know well, run a small recon crawl first (around 500 pages) to estimate its shape and size, then set the real crawl's budget accordingly.
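As a worked example of the buffer rule, the helper below adds a percentage on top of the number of content pages you expect. The function name `page_budget` and its parameters are assumptions for illustration; the 20% default comes from the guidance above, and the page count itself would come from your own estimate or recon crawl.

```python
def page_budget(content_pages, buffer_percent=20):
    """Return the content page count plus a buffer, rounded up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    buffer_pages = (content_pages * buffer_percent + 99) // 100
    return content_pages + buffer_pages
```

For an estimated 10,000 content pages, `page_budget(10000)` gives a budget of 12,000 pages.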

Should I budget per host or globally?

Both. A global cap stops a single runaway job from spiraling out of control, while a per-host cap stops you from hammering any one site even when your overall crawl is reasonable.

What if my budget runs out mid-crawl?

Save the frontier (the queue of URLs still to visit) and the seen-set (URLs already visited) to disk. The next run loads them and resumes where you left off. Starting over from scratch wastes both your budget and the target site's bandwidth.
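A minimal sketch of that checkpointing, assuming the frontier is a queue of URL strings and the seen-set is a set of URL strings. The function names `save_state` and `load_state` and the JSON file layout are assumptions for this example; a large crawl would likely want a database or an append-only log instead of one JSON file.

```python
import json
from collections import deque
from pathlib import Path


def save_state(path, frontier, seen):
    """Write the pending queue and the visited set to a JSON file."""
    state = {"frontier": list(frontier), "seen": sorted(seen)}
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    # Rename over the old file only after the write finished, so a crash
    # mid-write leaves the previous checkpoint in place.
    tmp.replace(path)


def load_state(path):
    """Return (frontier, seen); both are empty if no checkpoint exists."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return deque(), set()
    state = json.loads(checkpoint.read_text(encoding="utf-8"))
    return deque(state["frontier"]), set(state["seen"])
```

Calling `save_state` when the budget runs out, and `load_state` at the start of the next run, lets the crawl resume with the same queue order and without refetching visited pages.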

Last updated: 2026-05-31