Crawling Glossary

Discovering and fetching pages at scale — crawl scope, politeness, sitemaps, and how scrapers traverse links without getting blocked or wasting budget.

What Is a Web Crawler?

A web crawler is software that systematically discovers and fetches web pages by following links from a starting set (seed URLs), building up a corpus of URLs and their contents.

What Is Crawl Budget?

Crawl budget is the upper limit on how much of a site a crawler will fetch in a given run — measured in pages, requests, or wall-clock time.
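As a rough illustration, a budget can be tracked with a small counter object that the crawl loop consults before each fetch. The class name `CrawlBudget`, its parameters, and the injectable `clock` are hypothetical choices for this sketch, not part of any particular crawler.

```python
import time


class CrawlBudget:
    """Hypothetical helper: a run ends after max_pages fetches or max_seconds elapsed."""

    def __init__(self, max_pages, max_seconds, clock=time.monotonic):
        self.max_pages = max_pages
        self.max_seconds = max_seconds
        self.clock = clock              # injectable so the limit can be tested without waiting
        self.pages_used = 0
        self.started_at = clock()

    def spend(self, pages=1):
        """Record that some pages were fetched."""
        self.pages_used += pages

    def exhausted(self):
        """True once either the page limit or the time limit has been reached."""
        elapsed = self.clock() - self.started_at
        return self.pages_used >= self.max_pages or elapsed >= self.max_seconds


def crawl_within_budget(urls, budget):
    """Return the URLs that would be fetched before the budget runs out."""
    fetched = []
    for url in urls:
        if budget.exhausted():
            break
        fetched.append(url)   # a real crawler would issue the request here
        budget.spend()
    return fetched
```

Calling `crawl_within_budget(["/a", "/b", "/c"], CrawlBudget(max_pages=2, max_seconds=60))` stops after the first two URLs because the page limit is hit first.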

What Is Crawl Depth Limit?

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL.
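One way to picture a depth limit is to carry the hop count alongside each queued URL and stop expanding links once the limit is reached. The sketch below runs on a hypothetical in-memory link graph (`LINKS`) rather than real pages, and the function name is made up for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages: URL -> outgoing links.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/"],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl_with_depth_limit(seed, links, max_depth):
    """Visit pages reachable from seed in at most max_depth link hops."""
    seen = {seed}
    queue = deque([(seed, 0)])      # each entry is (url, hops from the seed)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue                # page is at the limit: fetch it, but do not follow its links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

With `max_depth=1` the call `crawl_with_depth_limit("/", LINKS, 1)` returns `["/", "/a", "/b"]`; raising the limit to 2 also reaches `/a/1`.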

What Is the robots.txt Protocol?

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch.
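For a sense of how a crawler consults these rules, Python's standard library ships `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt instead of fetching one; the user agent `example-bot` and the `example.com` URLs are placeholders. `Crawl-delay` is a nonstandard directive that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse() takes an iterable of lines

# Check individual URLs against the rules for a placeholder user agent.
public_ok = parser.can_fetch("example-bot", "https://example.com/public/page")
private_ok = parser.can_fetch("example-bot", "https://example.com/private/data")
delay = parser.crawl_delay("example-bot")

print(public_ok, private_ok, delay)
```

In a live crawl, `set_url()` followed by `read()` would fetch the file from the site instead of parsing a string.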

What Is a Sitemap?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority.
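A minimal sketch of reading an XML sitemap with the standard library follows. The sample document and the `parse_sitemap` helper are illustrative assumptions; the namespace URI is the one used by the sitemaps.org schema.

```python
import xml.etree.ElementTree as ET

# Made-up sitemap for illustration; the second entry omits the optional metadata.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing optional fields come back as None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


entries = parse_sitemap(SITEMAP_XML)
```

The sketch does not handle sitemap index files or the plain-text variant, which would need separate handling.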

What Is Polite Crawling?

Polite crawling is the practice of fetching a site at a pace and pattern that does not burden its infrastructure — respecting robots.txt, limiting concurrency per host, honoring ra.

Breadth-First vs Depth-First Crawling

Breadth-first crawling (BFS) explores all links at the current depth before going deeper; depth-first crawling (DFS) follows a single link chain as far as it goes, then backtracks.
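The difference comes down to which end of the frontier the next URL is taken from. The sketch below uses a hypothetical link graph (`SITE`) and a made-up `traverse` function: taking from the front of the frontier gives breadth-first order, taking from the back gives a stack-based depth-first order.

```python
from collections import deque

# Hypothetical site structure: URL -> outgoing links.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post"],
    "/docs/api": [],
    "/blog/post": [],
}


def traverse(seed, links, strategy):
    """Return the visit order for strategy 'bfs' (queue) or 'dfs' (stack)."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # BFS takes the oldest URL in the frontier; DFS takes the newest.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


bfs_order = traverse("/", SITE, "bfs")
dfs_order = traverse("/", SITE, "dfs")
```

Here `bfs_order` visits both top-level sections before any page beneath them, while `dfs_order` goes from `/` to `/blog` to `/blog/post` before it backtracks to `/docs`. This stack variant follows the last-listed link first; a recursive DFS would follow the first-listed link instead.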

What Is Link Extraction?

Link extraction is the step in a crawl where you pull every URL out of a fetched page so you can decide which to follow next.
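A small sketch of this step using only the standard library is shown below. It collects `href` values from `<a>` tags only, resolves them against the page URL, drops fragments, and keeps http(s) links; the class and function names are hypothetical, and a fuller extractor would likely also look at other tags and attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)   # resolve relative links
                absolute, _fragment = urldefrag(absolute)  # drop "#section" parts
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)            # skips mailto:, javascript:, etc.


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


PAGE = """
<a href="/docs">Docs</a>
<a href="page2.html#top">Next</a>
<a href="mailto:hi@example.com">Mail</a>
<a href="https://other.example/x">External</a>
"""

links = extract_links(PAGE, "https://example.com/blog/index.html")
```

The resulting list still contains the off-site link; deciding which URLs to follow is a separate filtering step.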

What Is Throttling?

Throttling is the deliberate limiting of how fast requests are made or processed.
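One simple form of throttling is to enforce a minimum interval between consecutive requests. The sketch below is an assumption-laden example, not a reference implementation: the `Throttle` class is hypothetical, and its `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Hypothetical helper: blocks so that calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)   # pause until the interval has passed
                now = self.clock()
        self._last = now
```

A crawl loop would call `throttle.wait()` immediately before each request, for example with `Throttle(min_interval=1.0)` to stay at or below roughly one request per second.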