What Is a Web Crawler?

A web crawler is a program that discovers and downloads web pages automatically by following links. It starts from a few given pages (called seed URLs), reads the links on them, visits those, and keeps going, collecting URLs and their contents along the way. Think of a crawler as the part that discovers pages, and a scraper as the part that extracts data from them. Googlebot, the crawler that feeds Google's search index, is the best-known example. In a scraping setup, the crawler walks a target site to find URLs worth scraping, while the scraper pulls structured data out of each URL the crawler finds.

Quick facts

Job: Discover URLs by following links from seeds
Output: A set of (URL, fetched content) pairs
vs Scraper: Crawler discovers; scraper extracts
Politeness: Respect robots.txt, rate limits, crawl-delay
Budget: Bounded by depth, page count, or domain whitelist

Crawling vs scraping

People mix up these two words. A crawler works in breadth - its job is to find URLs by following links. A scraper works in depth - its job is to extract specific fields (a price, a title) from a URL you already have. A full pipeline usually does both: crawl to list the URLs you care about, then scrape each one. If you already have the list of URLs - from an export, a sitemap, or an API - there's nothing to discover, so you skip crawling and go straight to scraping.
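When the URL list comes from a sitemap, the discovery step reduces to parsing a file. A minimal sketch, assuming a standard sitemap in the sitemaps.org 0.9 namespace that has already been downloaded as text. The function name `urls_from_sitemap` is illustrative, and sitemap index files (which point to other sitemaps) are not handled:

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text):
    """Return the <loc> URLs listed in a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc> elements
    ]
```

The returned list can go straight to the scraper, with no frontier or link-following involved.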

How crawlers work

You start with seed URLs in a queue (this to-do list of pages to visit is called the frontier). The crawler then repeats one loop: take a URL off the queue, fetch the page, pull out all its links, normalize them and drop duplicates, keep only the ones that fit your rules (same domain, allowed paths), and add the new ones back to the queue. It repeats until the queue is empty or it hits the limit you set. Real crawlers add a few things on top: they respect robots.txt, limit how fast they hit each host, remove duplicate URLs by canonicalizing them (reducing different-looking URLs that point to the same page down to one standard form), and support an incremental mode (re-crawling only URLs whose content might have changed).
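The normalize, de-duplicate, and filter steps of that loop can be sketched with the standard library. This is one possible set of canonicalization rules (lowercase scheme and host, drop default ports and fragments, treat an empty path as `/`), not a definitive list. The function names `canonicalize`, `in_scope`, and `new_links` are illustrative, and IPv6 host literals are not handled:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(base_url, href):
    """Resolve href against base_url and reduce it to one standard form."""
    parts = urlsplit(urljoin(base_url, href))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))

def in_scope(url, allowed_host):
    """Keep only http(s) URLs on the allowed host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname == allowed_host

def new_links(base_url, hrefs, allowed_host, seen):
    """Return canonical, in-scope links not seen before, in page order."""
    fresh = []
    for href in hrefs:
        url = canonicalize(base_url, href)
        if in_scope(url, allowed_host) and url not in seen and url not in fresh:
            fresh.append(url)
    return fresh
```

Under these rules `/about` and `/about#team` collapse to the same URL, so the page is queued once. Whether query strings or trailing slashes should also be normalized depends on the target site, so this sketch leaves them alone.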

Politeness and limits

A crawler that ignores robots.txt or hammers a host at 100 requests/second looks hostile and tends to get blocked quickly. Polite crawling means: respect Disallow directives (the robots.txt rules that say which paths are off-limits), honor Crawl-delay if present (a requested wait between requests; it is a non-standard directive that some crawlers, including Googlebot, ignore), cap per-host concurrency (1-5 connections at a time), back off when you get 429 or 503 responses (the server telling you to slow down or that it is overloaded), and identify yourself with a descriptive User-Agent that includes a contact URL so site owners can reach you. Polite crawlers get a lot further than aggressive ones.
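Some of these rules can be sketched with Python's standard `urllib.robotparser`. The example below assumes the robots.txt body has already been downloaded as text. The bot name `example-crawler`, the helper names, and the default delays are all hypothetical choices. The backoff helper honors a numeric `Retry-After` header value when the server sends one and otherwise doubles the wait on each attempt:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical bot name

def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def wait_before_next(rules, default_delay=1.0):
    """Seconds to pause between requests: Crawl-delay if set, else a default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_delay

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after a 429/503 response, capped at `cap`."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back below
    return min(base * 2 ** attempt, cap)
```

A crawler would call `rules.can_fetch(USER_AGENT, url)` before each request and skip the URL when it returns `False`. Per-host concurrency caps are not shown here, since they depend on how the fetcher is structured.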

Code example

python
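from collections import deque

def crawl(seed, get_links, max_pages=1000):
    """Breadth-first crawl from seed; returns the set of URLs visited.

    get_links(url) is a caller-supplied function (an assumption of this
    sketch) that fetches url and returns the links found on it.
    """
    frontier = deque([seed])  # URLs waiting to be visited
    seen = set()              # URLs already visited
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        for link in get_links(url):
            if link not in seen:
                frontier.append(link)
    return seen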
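The loop above leaves fetching and link extraction open. One way to fill that step using only the standard library is sketched below. The names `LinkParser`, `extract_links`, and `fetch_html`, and the default User-Agent string, are illustrative assumptions. A production crawler would also need error handling, content-type checks, and the politeness rules described earlier:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(page_url, html):
    """Return absolute URLs for all links in html, resolved against page_url."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.hrefs]

def fetch_html(url, user_agent="example-crawler/0.1 (+https://example.com/bot)", timeout=10):
    """Download a page and decode it to text (hypothetical User-Agent string)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Combining the two, `extract_links(url, fetch_html(url))` yields the links to feed back into the frontier.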

Related terms

What Is Crawl Budget?
Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time…
What Is Crawl Depth Limit?
Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL. A "hop" is one click along a link. The page you …
What Is the robots.txt Protocol?
robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch. Thi…
What Is a Sitemap?
A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, chan…
What Is a Web Scraping API?
A web scraping API is a hosted HTTP service that visits a web page for you and hands back the result — rendered HTML, JSON, or already-parse…
Best Web Scraping API for SEO Audits
The best web scraping API for SEO audits combines reliable SERP scraping (Google, Bing, regional engines) with on-page extraction — title, m…
Breadth-First vs Depth-First Crawling
Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of…
What Is a 404 Error?
HTTP 404 Not Found is the server's way of saying "I understood your request, but there is nothing at this address." The server is working fi…
List Crawling in Web Scraping
List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching…

Frequently asked questions

When do I need a crawler vs a scraper?

Use a crawler when you do not have the list of URLs yet and need to discover them. Use a scraper when you already have the URLs and just need to pull data out of them. Most projects need both.

Should I respect robots.txt?

For public-facing crawls, yes - it is the industry norm, and ignoring it invites IP blocks and legal pushback. For internal use against sites you own, it is your call.

What is the best crawler library?

Scrapy for Python, Crawlee (the successor to the Apify SDK's crawling tools) for Node, or a managed crawl endpoint if you do not want to operate it yourself. For small one-off crawls, writing a few hundred lines of custom code is often faster than learning a framework.

Last updated: 2026-05-31