What Is Crawl Budget?

By the Scrappey Research Team

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run, measured in pages, requests, or wall-clock time. The term started in SEO (Google's crawl budget for a given site), but the idea applies to any custom crawler you write. Without a budget, a crawler can run indefinitely on a large site; with a budget that is too small, you miss the pages you actually wanted. The real skill is spending the budget on the URLs that matter.

Units	Pages, requests, wall time, or all three
Per-host caps	Avoid being abusive; per-host limit is its own budget
Spend on	Content URLs, not pagination/sort/filter combinations
Common waste	Faceted nav, search results, infinite calendar URLs
SEO equivalent	Google's crawl budget: same concept, but Google sets the limit, not you

Why budgets exist

Real sites have a near-endless supply of URLs: pagination, sorting, filtering, search results, calendar pages. A naive crawl follows every combination and hits millions of low-value pages before it ever reaches the content you came for. A budget forces you to set priorities: which URL patterns are worth crawling, in what order, and where to stop.
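A rough back-of-the-envelope calculation shows how quickly these combinations multiply. The facet counts below are hypothetical, chosen only to illustrate the arithmetic; they are not measured from any real site.

```python
# Hypothetical facet counts for ONE category listing (illustrative only).
sort_orders = 4          # e.g. price, name, date, rating
filters = 8              # independent on/off filters (colour, size, ...)
pages_per_listing = 20   # pagination depth of each variant

filter_combinations = 2 ** filters                  # each filter is on or off
listing_variants = sort_orders * filter_combinations
total_urls = listing_variants * pages_per_listing

print(total_urls)  # 4 * 256 * 20 = 20480 URLs for a single category
```

Under these assumptions, a site with 50 such categories would expose over a million listing URLs, almost all of them rearrangements of the same items.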

Spending the budget well

The standard playbook: grab the sitemap first to get the canonical list of content URLs, then crawl section by section in priority order. Limit depth to 3-5 hops (link clicks) from each seed URL, and skip patterns that explode into endless combinations - faceted filters, sort variants, and session IDs (per-visit identifiers stuck in the URL). When you hit the budget, log what you reached and what you missed, so the next run can pick up where this one stopped.
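The playbook above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the parameter names in `WASTEFUL_PARAMS`, the `MAX_DEPTH` value, and the `plan_crawl` and `get_links` names are all assumptions made for the example, and the seeds are assumed to come from the sitemap in priority order.

```python
from collections import deque
from urllib.parse import urlparse, parse_qs

# Assumed, illustrative parameter names; tune these per site.
WASTEFUL_PARAMS = {"sort", "order", "filter", "color", "size", "sessionid", "sid"}
MAX_DEPTH = 3  # maximum hops from a seed URL


def is_wasteful(url):
    """True if the URL carries a query parameter that tends to multiply URLs."""
    params = parse_qs(urlparse(url).query, keep_blank_values=True)
    return any(name.lower() in WASTEFUL_PARAMS for name in params)


def plan_crawl(seeds, get_links, max_pages):
    """Breadth-first walk that skips wasteful URLs and stops at MAX_DEPTH.

    get_links(url) is a caller-supplied function returning the links on a page.
    Returns (visited, skipped) so the next run knows what was missed.
    """
    frontier = deque((url, 0) for url in seeds)
    seen = set(seeds)
    visited, skipped = [], []
    while frontier:
        url, depth = frontier.popleft()
        if len(visited) >= max_pages:
            skipped.append(url)  # budget exhausted: record it for the next run
            continue
        visited.append(url)
        if depth >= MAX_DEPTH:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link in seen or is_wasteful(link):
                continue
            seen.add(link)
            frontier.append((link, depth + 1))
    return visited, skipped
```

Returning the skipped URLs alongside the visited ones is what lets a later run resume rather than start over.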

SEO crawl budget

In SEO, "crawl budget" means how much of your site Googlebot can and wants to crawl - a limit Google sets based on how quickly and reliably your server responds and on how much demand there is for your pages (their popularity and how often they change). You spend it wisely by exposing fast, canonical (single official version) URLs and not wasting it on duplicate content. The principle matches a custom crawler exactly: spend the budget on URLs that matter, and prevent waste on URLs that do not.

Code example

python

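import time
from urllib.parse import urlparse

class BudgetedCrawler:
    """Tracks a global page cap, a wall-clock deadline, and a per-host cap."""

    def __init__(self, max_pages, max_seconds, max_per_host):
        self.max_pages = max_pages
        # monotonic() is not affected by system clock adjustments
        self.deadline = time.monotonic() + max_seconds
        self.per_host_cap = max_per_host
        self.host_counts = {}   # host -> pages fetched so far
        self.seen = set()       # URLs already fetched

    def can_fetch(self, url):
        if len(self.seen) >= self.max_pages:
            return False        # global page budget spent
        if time.monotonic() > self.deadline:
            return False        # wall-clock budget spent
        host = urlparse(url).netloc
        return self.host_counts.get(host, 0) < self.per_host_cap

    def record(self, url):
        """Call after each fetch; without it the caps never take effect."""
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlparse(url).netloc
        self.host_counts[host] = self.host_counts.get(host, 0) + 1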

Related terms

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed …

What Is Crawl Depth Limit?

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL. A "hop" is one click along a link. The page you …

What Is a Sitemap?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, chan…

What Is Polite Crawling?

Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits. In practice that means obeying ro…

Breadth-First vs Depth-First Crawling

Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of…

What Is Throttling?

Throttling means deliberately slowing down how fast requests are sent or handled. A website throttles incoming traffic so it doesn't get ove…

List Crawling in Web Scraping

List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching…

Frequently asked questions

How big should my crawl budget be?

Start with enough pages to cover the content you care about, plus a 20% buffer. For a site you do not know well, run a small recon crawl first (around 500 pages) to estimate its shape and size, then set the real crawl's budget accordingly.
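As a small worked example of the sizing rule, the helper below pads a content-page estimate by a percentage buffer. The function name `page_budget` and the figure of 10,000 content pages are hypothetical; in practice the estimate would come from the sitemap or the recon crawl.

```python
def page_budget(content_pages, buffer_percent=20):
    """Pages needed to cover the content, padded by a buffer and rounded up."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (content_pages * (100 + buffer_percent) + 99) // 100


print(page_budget(10_000))  # 12000
```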

Should I budget per host or globally?

Both. A global cap stops a single runaway job from spiraling out of control, while a per-host cap stops you from hammering any one site even when your overall crawl is reasonable.

What if my budget runs out mid-crawl?

Save the frontier (the queue of URLs still to visit) and the seen-set (URLs already visited) to disk. The next run loads them and resumes where you left off. Starting over from scratch wastes both your budget and the target site's bandwidth.
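One simple way to do this, assuming the frontier and seen-set fit in memory and a single JSON file is an acceptable format, is sketched below. The function names and file layout are illustrative choices, not a standard.

```python
import json
from pathlib import Path


def save_state(path, frontier, seen):
    """Write the frontier and seen-set to disk as JSON."""
    state = {"frontier": list(frontier), "seen": sorted(seen)}
    tmp = Path(path).with_suffix(".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)  # rename over the old file so a crash cannot leave it half-written


def load_state(path):
    """Return (frontier, seen); both are empty when no saved state exists."""
    p = Path(path)
    if not p.exists():
        return [], set()
    state = json.loads(p.read_text(encoding="utf-8"))
    return state["frontier"], set(state["seen"])
```

Call `save_state` when the budget runs out (and periodically during long runs), then call `load_state` at startup to seed the next crawl.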

Last updated: 2026-05-31