Crawling

What Is a Web Crawler?

A web crawler is software that systematically discovers and fetches web pages by following links from a starting set (seed URLs), building up a corpus of URLs and their contents. Crawlers are the discovery layer of a data pipeline; scrapers are the extraction layer. Googlebot, the crawler that feeds Google's search index, is the best-known example. In the scraping context, a crawler walks a target site to find URLs worth scraping, while the scraper extracts structured data from each URL the crawler discovers.

Quick facts

Job: Discover URLs by following links from seeds
Output: A set of (URL, fetched content) pairs
vs Scraper: Crawler discovers; scraper extracts
Politeness: Respect robots.txt, rate limits, crawl-delay
Budget: Bounded by depth, page count, or domain whitelist

Crawling vs scraping

The terms are often conflated. A crawler's job is breadth: find URLs by following links. A scraper's job is depth: extract structured fields from a known URL. A full pipeline usually does both, crawling to enumerate the URLs of interest and then scraping each one. If you already have the list of URLs (an export, a sitemap, an API), no crawling is needed and you can go straight to scraping.
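
As an illustration of the "known list of URLs" case, here is a minimal sketch that reads page URLs out of a sitemap that has already been downloaded. It assumes the standard sitemaps.org XML namespace; the function name `urls_from_sitemap` is made up for this example, and sitemap index files (which point to other sitemaps) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> element in a sitemap document.

    Hypothetical helper for illustration: it expects the sitemap XML as a
    string and does no fetching of its own.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The resulting list can be handed straight to a scraper, with no link-following step in between.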

How crawlers work

A crawler keeps a queue of URLs to visit, called the frontier, which starts out holding the seed URLs. Each iteration pops a URL, fetches it, extracts every link, normalizes and deduplicates those links, filters them against scope rules (same domain, allowed paths), and enqueues whatever is new. The loop repeats until the frontier empties or the budget is hit. Production crawlers add robots.txt handling, per-host rate limiting, URL canonicalization so that equivalent URLs deduplicate to a single entry, and an incremental mode that re-crawls only URLs whose content might have changed.
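
The normalize and scope-filter steps are where many home-grown crawlers go wrong, so a small sketch may help. It uses only Python's standard `urllib.parse`; the helper names `normalize` and `in_scope` and the specific normalization rules (drop the fragment, lowercase scheme and host, treat an empty path as `/`) are assumptions chosen for illustration, and real crawlers often apply more rules than these.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize(base_url, href):
    """Resolve href against base_url and return a canonical URL string.

    Returns None for non-HTTP(S) links such as mailto: or javascript:.
    """
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None
    path = parts.path or "/"
    # Host names are case-insensitive; paths and queries may not be,
    # so those are left untouched.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def in_scope(url, allowed_host, allowed_prefix="/"):
    """Keep only URLs on the allowed host and under the allowed path prefix."""
    parts = urlsplit(url)
    return parts.hostname == allowed_host and parts.path.startswith(allowed_prefix)
```

Deduplication then falls out of storing the normalized strings in a set: two links that differ only by fragment or host casing collapse to the same entry.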

Politeness and limits

A crawler that ignores robots.txt or hammers a host at 100 requests/second is hostile and gets blocked at the first opportunity. Polite crawling means: respect Disallow directives, honor Crawl-delay if present, cap per-host concurrency (1-5 connections), back off on 429/503 responses, and identify yourself with a real User-Agent and a contact URL so site owners can reach you. Polite crawlers get a lot further than aggressive ones.
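
A minimal sketch of two of these rules, using Python's standard `urllib.robotparser`. The User-Agent string, the helper names, and the backoff constants are placeholders invented for this example; the robots.txt body is passed in as text so that fetching it stays the caller's job.

```python
import urllib.robotparser

# Hypothetical identity: a real crawler would use its own name and contact URL.
USER_AGENT = "example-crawler/0.1 (+https://example.com/bot-info)"


def build_robots(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_back_off(status_code):
    """429 (Too Many Requests) and 503 (Service Unavailable) signal overload."""
    return status_code in (429, 503)


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff in seconds: base, 2*base, 4*base, ... up to cap."""
    return min(cap, base * 2 ** attempt)
```

Before each fetch the crawler would call `can_fetch(USER_AGENT, url)` on the parser and skip the URL if it returns False; `crawl_delay(USER_AGENT)` returns the Crawl-delay value if the file sets one, or None otherwise.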

Code example

python
from collections import deque

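def crawl(seed, fetch_links, max_pages=1000):
    """Breadth-first crawl from seed, visiting at most max_pages URLs.

    fetch_links is a caller-supplied function (an assumption of this sketch)
    that fetches a URL and returns the links found on that page.
    """
    frontier = deque([seed])  # URLs waiting to be visited
    seen = set()              # URLs already visited
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        # Fetch the page, extract its links, and enqueue the unvisited ones.
        frontier.extend(link for link in fetch_links(url) if link not in seen)
    return seen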
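
The link-extraction half of the fetch step in the loop above could look something like the following. This is a sketch built on the standard library's `html.parser`; the class and function names are invented for the example, it only looks at `<a href>` attributes, and downloading the HTML is left to whatever HTTP client you use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```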

Frequently asked questions

When do I need a crawler vs a scraper?

Use a crawler when you do not yet have the URL list, and a scraper when you do. Most projects need both.

Should I respect robots.txt?

For public-facing crawls, yes — it is the industry norm and ignoring it invites IP blocks and legal pushback. For internal use against sites you own, it is your call.

What is the best crawler library?

Scrapy for Python, Crawlee (the crawling library that grew out of the Apify SDK) for Node, or a managed crawl endpoint if you do not want to operate it yourself. For small one-off crawls, writing a few hundred lines of custom code is faster than learning a framework.

Last updated: 2026-05-26