Crawling

Breadth-First vs Depth-First Crawling

Breadth-First vs Depth-First Crawling — conceptual illustration
On this page

Breadth-first crawling (BFS) explores all links at the current depth before going deeper; depth-first crawling (DFS) follows a single link chain as far as it goes, then backtracks. BFS surfaces broad site structure quickly and is the default for general-purpose crawlers. DFS reaches deep content faster but risks getting lost in one section of a large site. Most real crawlers use BFS with bounded depth — the simplicity wins.

Quick facts

BFS shapeWide and shallow first; depth grows over time
DFS shapeNarrow and deep; one section explored completely before the next
Data structureBFS = queue; DFS = stack
MemoryBFS uses more memory for the frontier on wide sites
Most crawlers useBFS with depth cap — the safe default

Why BFS is the default

BFS gives you the homepage, then every category page, then every listing page, before drilling into individual items. For a site with depth-3 content, BFS surfaces meaningful structure within the first few hundred requests. DFS might spend the first thousand requests inside one tag's pagination before touching another category. Most crawl objectives ("get a sample of every section") favor BFS.

When DFS makes sense

DFS wins when you have a specific deep target and broad coverage does not matter — scraping every product in a single category, or extracting every page in one documentation section. It also has a memory advantage on very wide sites: DFS keeps a stack proportional to depth (small); BFS keeps a queue proportional to fanout × depth (potentially huge).

The hybrid that wins in practice

The strategy most production crawlers actually use: BFS with bounded depth, then per-pattern DFS for targeted extraction. The first pass discovers the site's structure (BFS, depth 3). The second pass targets specific subtrees you identified (DFS, unbounded depth within scope). This gives you both broad situational awareness and the deep coverage your pipeline needs.

Code example

python
from collections import deque

def bfs_crawl(seed, max_depth):
    frontier = deque([(seed, 0)])  # queue → BFS
    seen = set()
    while frontier:
        url, d = frontier.popleft()
        if url in seen or d > max_depth: continue
        seen.add(url)

def dfs_crawl(seed, max_depth):
    stack = [(seed, 0)]  # stack → DFS
    seen = set()
    while stack:
        url, d = stack.pop()
        if url in seen or d > max_depth: continue
        seen.add(url)

Related terms

Concept map

How Breadth-First vs Depth-First Crawling connects

The terms most directly tied to this one. Hover a node to see its neighbours, click to preview, drag to rearrange.

0 terms · 0 connections
You are here · Crawling
Building map…

Tools & solutions for this topic

Frequently asked questions

Which is faster?

Neither is intrinsically faster — wall-clock time depends on the network. They reach pages in different orders, which matters for early-stopping budgets but not for completeness.

What about priority queues?

Both BFS and DFS can be generalized to priority queues — order pages by an importance score (sitemap priority, link count, freshness) instead of pure FIFO/LIFO. This is "best-first" crawling and is what Google's crawler does.

Does it matter for a single-section crawl?

Less so — within a tight scope the two converge. The choice matters most for general-purpose crawls across an unfamiliar site.

Last updated: 2026-05-26