What Is Crawl Depth Limit?

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL. A "hop" is one click along a link. The page you start on (the seed) is depth 0; depth 1 is the seed plus everything linked from it; depth 2 follows the links on those pages, and so on. Combined with a budget (a cap on total pages), depth shapes which parts of a site get reached. Most content lives within 2-4 hops of the homepage; beyond that you mostly find pagination, filters, and tag pages.

Quick facts

Depth 0: Seed page only
Depth 1: Seed + direct links, usually category pages
Depth 2-3: Most content on well-structured sites
Depth 4+: Diminishing returns; mostly filters and tags
Combined with: Crawl budget (pages cap) and scope (domain/path filter)

Where content lives

Most sites follow a simple layering. The homepage links to category pages (depth 1), category pages link to listings (depth 2), and listings link to the actual detail pages you want (depth 3). Going deeper rarely reveals anything new — you start hitting pagination, sort variants, and tag clouds instead. Setting depth at 3-4 captures the bulk of meaningful URLs without spending your budget on this combinatorial junk (the explosion of near-duplicate URLs created by filters and sort options).
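
One way to check this layering on a given site is to count how many URLs first appear at each depth. The sketch below does that over a small in-memory link graph. The `SITE` dictionary, its URLs, and the function name `pages_per_depth` are made up for illustration; a real crawler would fetch and parse pages instead of reading a dictionary.

```python
from collections import Counter, deque

# Hypothetical link graph: each URL maps to the URLs it links to.
SITE = {
    "/": ["/shoes", "/hats"],
    "/shoes": ["/shoes/list", "/"],
    "/hats": ["/hats/list", "/"],
    "/shoes/list": ["/item/1", "/item/2", "/shoes/list?sort=price"],
    "/hats/list": ["/item/3", "/hats/list?sort=price"],
    "/shoes/list?sort=price": ["/item/2", "/item/1", "/shoes/list?sort=price&dir=desc"],
    "/hats/list?sort=price": ["/item/3"],
}

def pages_per_depth(graph, seed):
    """Return a Counter mapping depth -> number of URLs first reached there."""
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for link in graph.get(url, []):
            if link not in depth_of:  # BFS: the first visit is the shortest path
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return Counter(depth_of.values())

histogram = pages_per_depth(SITE, "/")
for depth in sorted(histogram):
    print(depth, histogram[depth])
```

On this toy graph the item pages all appear by depth 3, and the only URL at depth 4 is another sort variant.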

Depth vs budget interaction

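Depth and budget pull on each other. A high depth limit with a small budget runs out partway through and stops mid-traversal; a low depth limit with a large budget leaves capacity unused. The rule of thumb: set depth to match the natural shape of the site (3-4 hops for most), then size the budget to the number of pages you expect at each level, summed up to the depth limit, plus a safety margin. Fanout, meaning how many new links a typical page has, drives that count: each extra level multiplies the page count by roughly the fanout, so the total grows much faster than depth × fanout. For example, a site with 20 categories and 200 items each has 1 homepage, 20 category pages, and 4,000 item pages, so it fits in about 5,000 pages at depth 3 with margin to spare.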
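
The sizing arithmetic can be sketched in a few lines. The function names and the 20% margin are illustrative assumptions, not part of any crawler's API. The first function sums the pages expected at each level; the second gives a worst-case bound for a site where every page links to `fanout` pages not seen before.

```python
def budget_from_levels(pages_per_level, margin_pct=20):
    """Sum the expected pages per depth level and add a safety margin."""
    total = sum(pages_per_level)
    margin = (total * margin_pct + 99) // 100  # integer math, margin rounded up
    return total + margin

def fanout_upper_bound(fanout, max_depth):
    """Worst case: every page at every level links to `fanout` new pages."""
    return sum(fanout ** d for d in range(max_depth + 1))

# 1 homepage, 20 category pages, 20 * 200 item pages
print(budget_from_levels([1, 20, 20 * 200]))  # 4826
print(fanout_upper_bound(20, 3))              # 8421
```

Real sites usually land well below the fanout bound, because most links on a page point to pages that are already known.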

Per-pattern depth

Advanced crawlers set depth differently depending on the type of URL, rather than using one number everywhere. Detail pages get depth 0 (fetch them, but do not follow their links). Category pages get a high depth, since that is where you discover items. Pagination links (the "next page" links) get capped at 50-100 to avoid infinite-calendar traps — pages like a calendar's "next month" link that go on forever. This takes more setup than a single global limit, but it dramatically improves how efficiently you spend your budget on large sites.
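
A minimal sketch of per-pattern limits follows, under some loud assumptions: the URL patterns in `RULES`, the limits, and the "consecutive hops within one URL type" interpretation are all illustrative choices, and `calendar_trap` is a toy stand-in for a real link extractor.

```python
import re
from collections import deque

# Hypothetical rules: (type name, URL pattern, limit). The limit is how many
# consecutive hops the crawler follows within that URL type. First match wins.
RULES = [
    ("pagination", re.compile(r"[?&]page=\d+"), 50),  # cap "next page" chains
    ("detail", re.compile(r"/item/"), 0),             # fetch, never expand
    ("category", re.compile(r"/category/"), 10),      # discovery happens here
]
DEFAULT_RULE = ("other", 3)

def classify(url, rules=RULES):
    """Return (type name, limit) for the first rule whose pattern matches."""
    for name, pattern, limit in rules:
        if pattern.search(url):
            return name, limit
    return DEFAULT_RULE

def crawl_per_pattern(seed, extract_links, rules=RULES, max_pages=1000):
    """Breadth-first crawl where each URL type has its own hop limit."""
    frontier = deque([(seed, 0)])  # (url, consecutive hops within its type)
    seen = {seed}
    fetched = []
    while frontier and len(fetched) < max_pages:
        url, run = frontier.popleft()
        fetched.append(url)
        url_type, limit = classify(url, rules)
        if run >= limit:
            continue  # limit reached: keep the page, ignore its links
        for link in extract_links(url):
            if link in seen:
                continue
            seen.add(link)
            link_type, _ = classify(link, rules)
            # Staying in the same type extends the run; a new type resets it.
            frontier.append((link, run + 1 if link_type == url_type else 0))
    return fetched

def calendar_trap(url):
    """Toy link extractor: every page links to a next page, forever."""
    match = re.search(r"page=(\d+)", url)
    n = int(match.group(1)) if match else 0
    return [f"/events?page={n + 1}", f"/item/{n}"]

pages = crawl_per_pattern("/events?page=1", calendar_trap)
print(len(pages))  # 101: 51 pagination pages plus 50 item pages
```

Because a change of URL type resets the counter, the `max_pages` budget remains the backstop against loops that alternate between types.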

Code example

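```python
from collections import deque

def crawl_with_depth(seed, extract_links, max_depth=3):
    # extract_links(url) is supplied by the caller: it fetches the page
    # and returns the URLs that page links to.
    frontier = deque([(seed, 0)])  # FIFO queue of (url, depth): breadth-first
    seen = set()
    while frontier:
        url, depth = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        links = extract_links(url)
        if depth < max_depth:  # pages at the limit are fetched, not expanded
            for link in links:
                if link not in seen:
                    frontier.append((link, depth + 1))
    return seen  # every URL visited within max_depth hops of the seed
```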

Frequently asked questions

What depth should I start with?

Start with 3 for content sites and 2 for e-commerce listings (homepage → category → item is exactly 2 hops). Then adjust after looking at what you actually reached in the first run.

Does depth limit help with infinite-link traps?

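Partially. A depth limit stops chains that go on forever, such as calendar URLs that link to next-month indefinitely, but only after the crawler has followed them all the way to the limit. A wide trap, for instance a filter page that links to thousands of parameter combinations, can still burn through your budget at depth 1. Combine depth limits with URL pattern exclusions (rules that skip URLs matching a pattern) for real protection.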
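
As a rough sketch of pattern exclusions, the filter below drops links before they reach the frontier. The patterns in `EXCLUDE_PATTERNS` are hypothetical examples and would need tuning to the site being crawled.

```python
import re

# Hypothetical exclusion patterns; adjust them to the target site.
EXCLUDE_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),           # endless month-by-month pages
    re.compile(r"[?&](sort|order|filter)="),        # sort and filter variants
    re.compile(r"[?&]sessionid=", re.IGNORECASE),   # session IDs create duplicates
]

def is_excluded(url):
    """True if the URL matches any exclusion pattern."""
    return any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)

def filter_links(links):
    """Keep only the links that no exclusion pattern matches."""
    return [url for url in links if not is_excluded(url)]

links = [
    "/category/shoes",
    "/category/shoes?sort=price",
    "/calendar/2031/07",
    "/item/42",
]
print(filter_links(links))  # ['/category/shoes', '/item/42']
```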

What is the difference between depth and crawl budget?

Depth limits how far you walk from each seed; budget limits the total amount of crawling overall. They are separate controls, and you need both.
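
To see the two controls side by side, here is a minimal sketch of a crawl loop bounded by both. The function `crawl`, the helper `toy_links`, and the example.com seed are illustrative assumptions; `toy_links` simulates a site where every page links to three new pages, so only the limits stop the crawl.

```python
from collections import deque

def crawl(seed, extract_links, max_depth=3, max_pages=100):
    """Breadth-first crawl bounded by a depth limit and a page budget."""
    frontier = deque([(seed, 0)])
    seen = {seed}
    fetched = []
    while frontier and len(fetched) < max_pages:  # budget: total pages
        url, depth = frontier.popleft()
        fetched.append(url)
        if depth == max_depth:                    # depth: hops from the seed
            continue
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return fetched

def toy_links(url):
    """Toy link extractor: every page links to three child pages."""
    return [f"{url}/{i}" for i in range(3)]

seed = "https://example.com"
print(len(crawl(seed, toy_links, max_depth=2, max_pages=100)))  # 13: depth stops it
print(len(crawl(seed, toy_links, max_depth=5, max_pages=100)))  # 100: budget stops it
```

In the first call the depth limit ends the crawl with budget left over; in the second the budget runs out partway through depth 4.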

Last updated: 2026-05-31