Breadth-First vs Depth-First Crawling

Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of links as far as it goes, then backtracks. A crawler is a program that follows links from page to page to collect data, and these two strategies just decide the order it visits them. BFS surfaces the broad shape of a site quickly and is the default for general-purpose crawlers. DFS reaches deep content faster but risks getting lost in one section of a large site. Most real crawlers use BFS with a depth limit — the simplicity wins.

Quick facts

BFS shape: Wide and shallow first; depth grows over time
DFS shape: Narrow and deep; one section explored completely before the next
Data structure: BFS = queue; DFS = stack
Memory: BFS uses more memory for the frontier on wide sites
Most crawlers use: BFS with a depth cap, the safe default

Why BFS is the default

BFS works level by level: it gives you the homepage, then every category page, then every listing page, before drilling into individual items. So for a site whose content sits three clicks deep (depth-3), BFS surfaces a meaningful map of the site within the first few hundred requests. DFS, by contrast, might burn the first thousand requests inside one tag's pagination before it ever touches another category. Most crawl goals — "get a sample of every section" — are a better fit for BFS.
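As a rough illustration of that difference, the sketch below crawls a made-up site under a fixed request budget. The `SITE` dictionary, its URLs, and the `crawl` helper are all hypothetical stand-ins for real fetching and link extraction, not part of any library.

```python
from collections import deque

# Hypothetical toy site: a homepage, three categories, and a pagination chain under "/a".
SITE = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a?page=2"],
    "/a?page=2": ["/a?page=3"],
    "/a?page=3": ["/a?page=4"],
    "/a?page=4": [],
    "/b": ["/b/item"],
    "/c": ["/c/item"],
    "/b/item": [],
    "/c/item": [],
}

def crawl(seed, budget, breadth_first):
    """Visit at most `budget` pages and return them in visit order."""
    frontier = deque([seed])
    seen, order = set(), []
    while frontier and len(order) < budget:
        # Queue behaviour (oldest first) for BFS, stack behaviour (newest first) for DFS.
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        links = SITE.get(url, [])
        # Reverse for DFS so the first link on the page is followed first.
        frontier.extend(links if breadth_first else reversed(links))
    return order

bfs_pages = crawl("/", budget=4, breadth_first=True)
dfs_pages = crawl("/", budget=4, breadth_first=False)
print(bfs_pages)  # ['/', '/a', '/b', '/c']
print(dfs_pages)  # ['/', '/a', '/a?page=2', '/a?page=3']
```

With the same four requests, the queue-based crawl has touched every category, while the stack-based crawl is still paging through "/a".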

When DFS makes sense

DFS wins when you have one specific deep target and broad coverage does not matter: scraping every product in a single category, say, or every page in one documentation section. It also uses less memory on very wide sites. The crawler holds a waiting list (the frontier) of links it has found but not yet visited. DFS keeps that list as a stack (it always takes the newest link next), so the list grows roughly with fanout × depth: only the unvisited siblings along the current path are stored, which stays small. BFS keeps it as a queue (oldest link next), so it has to store every link at the current level, and a level can hold up to fanout^depth links, which can get huge.
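To make the memory difference concrete, here is a minimal sketch that walks a synthetic complete tree and records the largest frontier it ever holds. The tree is an assumption made for illustration: every page has exactly `fanout` links, no two pages share a link, and `peak_frontier` is a hypothetical helper name.

```python
from collections import deque

def peak_frontier(fanout, depth, breadth_first):
    """Crawl a synthetic complete tree; return the largest frontier size seen."""
    frontier = deque([0])  # each entry is just the depth of a pending page
    peak = 1
    while frontier:
        # Queue behaviour for BFS, stack behaviour for DFS.
        d = frontier.popleft() if breadth_first else frontier.pop()
        if d < depth:
            # A page above the depth limit contributes `fanout` new links.
            frontier.extend([d + 1] * fanout)
        peak = max(peak, len(frontier))
    return peak

print(peak_frontier(10, 3, breadth_first=True))   # 1000
print(peak_frontier(10, 3, breadth_first=False))  # 28
```

With a fanout of 10 and a depth of 3, the queue peaks at 1000 pending links and the stack at 28. Real sites are not complete trees, so treat these numbers as an upper-bound intuition rather than a prediction.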

The hybrid that wins in practice

The strategy most production crawlers actually use is a mix: BFS with a depth limit, then targeted DFS for extraction. The first pass discovers the site's structure (BFS, depth 3). The second pass dives into the specific subtrees you identified (DFS, with no depth limit but kept within that scope). This gives you both a broad lay of the land and the deep coverage your pipeline needs.
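The two passes could be sketched roughly as follows. The `SITE` dictionary stands in for real fetching, and `discover`, `extract`, and `scope_prefix` are hypothetical names chosen for this example; a simple URL-prefix check is assumed as the scope rule.

```python
from collections import deque

# Hypothetical toy site used only for illustration.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/docs/api": ["/docs/api/v1"],
    "/docs/api/v1": ["/docs/api/v1/auth", "/blog"],
    "/docs/api/v1/auth": [],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
}

def discover(seed, max_depth):
    """Pass 1: BFS with a depth cap to map the top levels of the site."""
    frontier, seen, order = deque([(seed, 0)]), set(), []
    while frontier:
        url, d = frontier.popleft()
        if url in seen or d > max_depth:
            continue
        seen.add(url)
        order.append(url)
        frontier.extend((link, d + 1) for link in SITE.get(url, []))
    return order

def extract(root, scope_prefix):
    """Pass 2: DFS with no depth cap, restricted to URLs under scope_prefix."""
    stack, seen, order = [root], set(), []
    while stack:
        url = stack.pop()
        if url in seen or not url.startswith(scope_prefix):
            continue  # already visited, or outside the chosen subtree
        seen.add(url)
        order.append(url)
        stack.extend(reversed(SITE.get(url, [])))
    return order

overview = discover("/", max_depth=1)
deep_pages = extract("/docs", scope_prefix="/docs")
print(overview)    # ['/', '/docs', '/blog']
print(deep_pages)  # ['/docs', '/docs/api', '/docs/api/v1', '/docs/api/v1/auth']
```

The scope check is what keeps the uncapped DFS from wandering: the link from "/docs/api/v1" back to "/blog" is found but never followed.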

Code example

python
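from collections import deque

def bfs_crawl(seed, max_depth, get_links):
    # get_links(url) -> iterable of URLs; a hypothetical placeholder
    # for your own fetch-and-parse step.
    frontier = deque([(seed, 0)])  # queue → BFS
    seen = set()
    order = []
    while frontier:
        url, d = frontier.popleft()  # oldest link first (FIFO)
        if url in seen or d > max_depth:
            continue
        seen.add(url)
        order.append(url)
        for link in get_links(url):
            frontier.append((link, d + 1))
    return order  # pages in the order they were visited

def dfs_crawl(seed, max_depth, get_links):
    stack = [(seed, 0)]  # stack → DFS
    seen = set()
    order = []
    while stack:
        url, d = stack.pop()  # newest link first (LIFO)
        if url in seen or d > max_depth:
            continue
        seen.add(url)
        order.append(url)
        for link in get_links(url):
            stack.append((link, d + 1))
    return order  # pages in the order they were visited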

Frequently asked questions

Which is faster?

Neither is faster by nature — total wall-clock time depends on the network, not the strategy. The two simply reach pages in a different order. That ordering matters if you plan to stop early once you have enough data, but not if you intend to crawl every page anyway.

What about priority queues?

Both BFS and DFS can be generalized by swapping the queue or stack for a priority queue: instead of strict oldest-first (FIFO) or newest-first (LIFO) ordering, you visit pages in order of an importance score, such as sitemap priority, link count, or freshness. This is called "best-first" crawling, and large-scale search crawlers are commonly described as prioritizing URLs in this way.
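A minimal sketch of score-ordered crawling, using Python's standard `heapq` module as the priority queue. The `SITE` graph and the `SCORE` values are invented for illustration; in a real crawler the score would come from whatever signal you trust.

```python
import heapq

# Hypothetical link graph and importance scores (higher = more important).
SITE = {
    "/": ["/news", "/about", "/products"],
    "/news": ["/news/today"],
    "/about": [],
    "/products": ["/products/widget"],
    "/news/today": [],
    "/products/widget": [],
}
SCORE = {"/": 1.0, "/news": 0.9, "/news/today": 0.8,
         "/products": 0.5, "/products/widget": 0.4, "/about": 0.1}

def best_first_crawl(seed, budget):
    """Visit up to `budget` pages, always taking the highest-scored pending URL."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-SCORE.get(seed, 0.0), seed)]
    seen, order = set(), []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                heapq.heappush(frontier, (-SCORE.get(link, 0.0), link))
    return order

pages = best_first_crawl("/", budget=4)
print(pages)  # ['/', '/news', '/news/today', '/products']
```

Here the crawl follows "/news" down to "/news/today" before touching "/products", and never spends budget on the low-scored "/about". The visiting order is neither strictly level by level nor strictly depth first.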

Does it matter for a single-section crawl?

Less so — inside a tight scope the two strategies end up visiting roughly the same pages in roughly the same time. The choice matters most for general-purpose crawls across an unfamiliar site, where the visiting order shapes how quickly you understand its structure.

Last updated: 2026-05-31