What Is Crawl Depth Limit?

By the Scrappey Research Team

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL. A "hop" is one click along a link. The page you start on (the seed) is depth 0; depth 1 is the seed plus everything linked from it; depth 2 follows the links on those pages, and so on. Combined with a budget (a cap on total pages), depth shapes which parts of a site get reached. Most content lives within 2-4 hops of the homepage; beyond that you mostly find pagination, filters, and tag pages.

Depth 0	Seed page only
Depth 1	Seed + direct links — usually category pages
Depth 2-3	Most content on well-structured sites
Depth 4+	Diminishing returns; mostly filters and tags
Combined with	Crawl budget (pages cap) and scope (domain/path filter)

Where content lives

Most sites follow a simple layering. The homepage links to category pages (depth 1), category pages link to listings (depth 2), and listings link to the actual detail pages you want (depth 3). Going deeper rarely reveals anything new — you start hitting pagination, sort variants, and tag clouds instead. Setting depth at 3-4 captures the bulk of meaningful URLs without spending your budget on this combinatorial junk (the explosion of near-duplicate URLs created by filters and sort options).
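The layering described above can be checked on a small link graph. The sketch below is only an illustration: the `SITE` dictionary and its URLs are hypothetical stand-ins for a real site, and `depth_of_each_url` is a name chosen for this example. It computes the minimum hop count for every URL and tallies how many URLs sit at each depth.

```python
from collections import Counter, deque

# Hypothetical toy site: each URL maps to the URLs it links to.
SITE = {
    "/": ["/shoes", "/hats"],
    "/shoes": ["/shoes/list", "/"],
    "/hats": ["/hats/list", "/"],
    "/shoes/list": ["/item/1", "/item/2", "/shoes/list?sort=price"],
    "/hats/list": ["/item/3", "/hats/list?sort=price"],
    "/shoes/list?sort=price": ["/item/1", "/item/2", "/shoes/list?sort=price&page=2"],
    "/hats/list?sort=price": ["/item/3"],
}

def depth_of_each_url(graph, seed):
    """Return {url: minimum number of link hops from seed} using BFS."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for link in graph.get(url, []):
            if link not in depths:  # in BFS the first visit is the shortest path
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths

depths = depth_of_each_url(SITE, "/")
pages_per_depth = Counter(depths.values())
for depth in sorted(pages_per_depth):
    print(f"depth {depth}: {pages_per_depth[depth]} URLs")
```

In this toy graph the item pages all sit at depth 3, next to the sort variants, and the only URL at depth 4 is a paginated sort variant.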

Depth vs budget interaction

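Depth and budget pull on each other. A high depth limit with a small budget runs out partway through and stops mid-traversal; a low depth limit with a large budget leaves capacity unused. The rule of thumb: set depth to match the natural shape of the site (3-4 hops for most), then size the budget to the number of pages you expect to find within that depth, plus a safety margin. As a rough upper bound, the page count grows like the average fanout (how many links a typical page has) raised to the power of the depth, not like depth × fanout, so estimate it level by level where you can. For example, a site with 20 categories and 200 items each has about 4,000 item pages plus its category and listing pages, so it fits in about 5,000 pages at depth 3.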
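The sizing arithmetic for the 20-category example can be written out level by level. This is a sketch under stated assumptions: the 20 items per listing page and the 20% safety margin are hypothetical values chosen for illustration, and `estimate_budget` is not part of any library.

```python
import math

def estimate_budget(pages_per_level, margin_percent=20):
    """Sum the expected page count at each depth, then add a safety margin."""
    total = sum(pages_per_level)
    budget = math.ceil(total * (100 + margin_percent) / 100)
    return total, budget

categories = 20
items_per_category = 200
items_per_listing_page = 20  # assumption: 20 items per listing page
listing_pages = categories * math.ceil(items_per_category / items_per_listing_page)

levels = [
    1,                                # depth 0: homepage
    categories,                       # depth 1: category pages
    listing_pages,                    # depth 2: paginated listings
    categories * items_per_category,  # depth 3: item detail pages
]
total, budget = estimate_budget(levels)
print(total, budget)
```

Under these assumptions the expected total is 4,221 pages, and the padded budget lands a little above 5,000.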

Per-pattern depth

Advanced crawlers set depth differently depending on the type of URL, rather than using one number everywhere. Detail pages get depth 0 (fetch them, but do not follow their links). Category pages get a high depth, since that is where you discover items. Pagination links (the "next page" links) get capped at 50-100 pages, which also guards against infinite traps such as a calendar's "next month" link that goes on forever. This takes more setup than a single global limit, but on large sites it spends far less of the budget on pages you do not want.
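One way to express per-pattern rules is a pair of small predicates that the crawl loop consults. The sketch below is illustrative only: the URL shapes (`/item/<id>`, `?page=<n>`), the function names, and the cap of 50 are assumptions, and a real site needs its own patterns.

```python
import re

# Hypothetical URL patterns; adjust them to the site being crawled.
DETAIL = re.compile(r"^/item/\d+$")
PAGINATION = re.compile(r"[?&]page=(\d+)")
MAX_PAGE = 50  # assumed cap, taken from the 50-100 range mentioned above

def should_follow_links(url):
    """Detail pages are fetched but treated as leaves (depth 0 for their links)."""
    return DETAIL.search(url) is None

def should_enqueue(url):
    """Drop pagination links past the cap; accept everything else."""
    match = PAGINATION.search(url)
    if match and int(match.group(1)) > MAX_PAGE:
        return False
    return True

print(should_follow_links("/item/42"))     # False: fetch it, ignore its links
print(should_enqueue("/shoes?page=51"))    # False: past the pagination cap
```

A crawl loop would call `should_enqueue` before adding a link to the frontier and `should_follow_links` before extracting links from a fetched page.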

Code example

python

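from collections import deque

def crawl_with_depth(seed, extract_links, max_depth=3):
    """Breadth-first crawl that follows links at most max_depth hops from seed.

    extract_links(url) is supplied by the caller: it fetches url and
    returns an iterable of the link URLs found on that page.
    """
    frontier = deque([(seed, 0)])
    seen = {seed}  # mark URLs when queued so each one is queued only once
    while frontier:
        url, depth = frontier.popleft()
        links = extract_links(url)  # fetch the page and collect its links
        if depth >= max_depth:
            continue  # page is fetched, but its links would be too deep
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen  # every URL that was fetched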

Related terms

What Is Crawl Budget?

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time…

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed …

Breadth-First vs Depth-First Crawling

Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of…

What Is Polite Crawling?

Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits. In practice that means obeying ro…

What Is the robots.txt Protocol?

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch. Thi…

What Is Throttling?

Throttling means deliberately slowing down how fast requests are sent or handled. A website throttles incoming traffic so it doesn't get ove…

Frequently asked questions

What depth should I start with?

Start with 3 for content sites and 2 for e-commerce listings (homepage → category → item is exactly 2 hops). Then adjust after looking at what you actually reached in the first run.

Does depth limit help with infinite-link traps?

Partially. A depth limit caps how long any single chain of links can get, but it does not cap how many URLs a bad pattern produces within that limit. For instance, a calendar or filter page that generates hundreds of distinct URLs can still burn through your budget at shallow depth. Combine depth limits with URL pattern exclusions (rules that skip URLs matching a pattern) for real protection.
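A URL pattern exclusion can be as simple as a list of regular expressions checked before a link is queued. The patterns and example URLs below are hypothetical and only show the shape of the rule set.

```python
import re

# Hypothetical exclusion rules; real rules depend on the site's URL scheme.
EXCLUDE_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),  # month-by-month calendar pages
    re.compile(r"[?&](sort|filter)="),     # sort and filter variants
]

def is_excluded(url):
    """Return True if the URL matches any exclusion pattern."""
    return any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)

candidates = [
    "/events",
    "/calendar/2031/07",
    "/shoes?sort=price",
    "/shoes?page=2",
]
allowed = [url for url in candidates if not is_excluded(url)]
print(allowed)
```

Here the calendar and sort URLs are dropped before they reach the frontier, so they never consume budget, whatever depth they appear at.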

What is the difference between depth and crawl budget?

Depth limits how far you walk from each seed; budget limits the total amount of crawling overall. They are separate controls, and you need both.
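To see the two controls acting separately, here is a minimal sketch of a breadth-first crawl that takes both limits. The `SITE` dictionary and `get_links` stand in for real HTTP fetching and are hypothetical, as are the function and parameter names.

```python
from collections import deque

def crawl(seed, get_links, max_depth, max_pages):
    """BFS that stops following links past max_depth and stops fetching at max_pages."""
    frontier = deque([(seed, 0)])
    queued = {seed}
    fetched = []
    while frontier and len(fetched) < max_pages:
        url, depth = frontier.popleft()
        links = get_links(url)  # stands in for fetching the page
        fetched.append(url)
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in links:
            if link not in queued:
                queued.add(link)
                frontier.append((link, depth + 1))
    return fetched

# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {"/": ["/a", "/b"], "/a": ["/a/1", "/a/2"], "/b": ["/b/1"], "/a/1": ["/a/1/x"]}

def get_links(url):
    return SITE.get(url, [])

depth_limited = crawl("/", get_links, max_depth=1, max_pages=100)
budget_limited = crawl("/", get_links, max_depth=10, max_pages=4)
print(depth_limited)
print(budget_limited)
```

The first run is stopped by depth with budget to spare; the second is stopped by budget even though deeper pages were still allowed.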

Last updated: 2026-05-31