Crawling Glossary

Discovering and fetching pages at scale — crawl scope, politeness, sitemaps, and how scrapers traverse links without getting blocked or wasting budget.

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed URLs), reads the links on them, and visits those pages in turn.

What Is Crawl Budget?

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time.
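
One way to enforce such a limit is a small counter object that the crawl loop consults before each fetch. The sketch below is illustrative only: the CrawlBudget class, its parameter names, and the injectable clock are assumptions made for this example, not part of any particular crawler.

```python
import time


class CrawlBudget:
    """Hypothetical helper: caps a run by page count and wall-clock time."""

    def __init__(self, max_pages, max_seconds, clock=time.monotonic):
        self.max_pages = max_pages
        self.max_seconds = max_seconds
        self._clock = clock          # injectable so the limit can be tested
        self._start = clock()
        self.pages = 0

    def spend(self):
        """Record one fetched page against the budget."""
        self.pages += 1

    def exhausted(self):
        """True once either the page limit or the time limit is reached."""
        elapsed = self._clock() - self._start
        return self.pages >= self.max_pages or elapsed >= self.max_seconds


# Usage: stop after 2 pages or 60 seconds, whichever comes first.
budget = CrawlBudget(max_pages=2, max_seconds=60)
fetched = []
for url in ["/a", "/b", "/c"]:
    if budget.exhausted():
        break
    fetched.append(url)   # a real crawler would download the page here
    budget.spend()
```

A request-count limit could be tracked the same way by calling spend() once per request instead of once per page.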

What Is Crawl Depth Limit?

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL.
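
A depth limit is usually tracked by storing each URL's hop count next to it in the frontier. The following is a minimal sketch under stated assumptions: the function name, the get_links callback, and the in-memory LINKS graph stand in for real fetching and parsing.

```python
from collections import deque


def crawl_with_depth_limit(seeds, get_links, max_depth):
    """Return {url: depth} for every URL within max_depth hops of a seed."""
    depth_of = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # at the limit: keep the page, but do not follow its links
        for link in get_links(url):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Toy link graph standing in for a real site (an assumption for this example).
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}
reached = crawl_with_depth_limit(["/"], lambda u: LINKS.get(u, []), max_depth=2)
```

With max_depth=2 the page "/a/1/x" is never queued, because it sits three hops from the seed.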

What Is the robots.txt Protocol?

robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch.
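
Python's standard library includes a parser for this file, urllib.robotparser. The sketch below checks two URLs against a made-up robots.txt; the rules, the "example-bot" user agent, and the example.com URLs are assumptions for illustration. The text is parsed from a string here so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

blocked = parser.can_fetch("example-bot", "https://example.com/private/report.html")
allowed = parser.can_fetch("example-bot", "https://example.com/blog/")
```

Against a live site, set_url() followed by read() would load the file from the site's /robots.txt instead of a string.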

What Is a Sitemap?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority.
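
An XML sitemap can be read with the standard library's xml.etree.ElementTree, as long as the sitemap namespace is passed to the lookups. This is a minimal sketch: the parse_sitemap function and the sample document are assumptions for this example, and it handles only a plain urlset, not a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of {'loc', 'lastmod'} dicts, one per <url> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
        })
    return entries


# Sample sitemap, invented for this sketch.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

entries = parse_sitemap(SAMPLE)
```

The optional fields are simply None when a url element omits them, as the second entry shows.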

What Is Polite Crawling?

Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits.
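
A common way to approximate this is a minimum delay between consecutive requests to the same host. The sketch below is one possible shape, not a standard API: the PerHostDelay class and its parameters are hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit


class PerHostDelay:
    """Hypothetical helper: spaces out requests to the same host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the URL's host may be contacted; return seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        waited = max(0.0, ready_at - now)
        if waited:
            self._sleep(waited)
        # Schedule the next slot for this host only; other hosts are unaffected.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return waited
```

Calling limiter.wait(url) immediately before each fetch keeps requests to one host at least min_delay seconds apart while leaving requests to other hosts unaffected.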

Breadth-First vs Depth-First Crawling

Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of links as far as it goes, then backtracks.
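
The two strategies differ only in which end of the frontier the next URL is taken from: a queue gives breadth-first order, a stack gives depth-first order. The sketch below illustrates this on a small in-memory link graph; the crawl_order function, the get_links callback, and the LINKS data are assumptions for this example.

```python
from collections import deque


def crawl_order(seed, get_links, strategy="bfs"):
    """Return the order in which URLs are visited under the given strategy."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # Queue (oldest first) for BFS, stack (newest first) for DFS.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        links = list(get_links(url))
        if strategy == "dfs":
            links.reverse()  # so the first link on a page is explored first
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Toy site: two sections, each with its own sub-pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}
get_links = lambda url: LINKS.get(url, [])

bfs = crawl_order("/", get_links, "bfs")
dfs = crawl_order("/", get_links, "dfs")
# bfs: /, /a, /b, /a/1, /a/2, /b/1
# dfs: /, /a, /a/1, /a/2, /b, /b/1
```

Breadth-first order reaches both top-level sections before any sub-page, while depth-first order finishes everything under "/a" before touching "/b".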

What Is Link Extraction?

Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit next.
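
In practice this step also has to resolve relative links against the page URL and discard non-web schemes. The sketch below uses only the standard library (html.parser and urllib.parse); the LinkExtractor class, the extract_links function, and the sample HTML are assumptions for this example, and it looks only at href attributes on a tags.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative URLs and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Sample page, invented for this sketch.
PAGE = (
    '<a href="/about">About</a> '
    '<a href="post.html#top">Post</a> '
    '<a href="mailto:team@example.com">Mail</a> '
    '<a href="/about">About again</a>'
)
links = extract_links(PAGE, "https://example.com/blog/")
```

Here the mailto link is dropped, the duplicate "/about" is kept once, and "post.html#top" is resolved relative to the blog directory with its fragment removed.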

What Is Throttling?

Throttling means deliberately slowing down how fast requests are sent or handled.
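
One widely used throttling scheme is the token bucket: tokens refill at a fixed rate, and each request must take one. The sketch below shows that technique as an illustration, not as the only way to throttle; the TokenBucket class, its parameters, and the injectable clock are assumptions for this example.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full, allowing an initial burst
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller must wait."""
        now = self._clock()
        refill = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A caller that receives False can sleep briefly and retry, or push the request back onto its queue.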

List Crawling in Web Scraping

List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a second phase.
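
The two phases can be sketched as a pagination loop followed by a fetch over the collected URLs. Everything named below is an assumption for this example: fetch_list_page is a hypothetical callback returning the item URLs on a list page plus the next page (or None), fetch_item stands in for downloading a detail page, and the PAGES data replaces a real site.

```python
def collect_item_urls(first_page, fetch_list_page, max_pages=100):
    """Phase 1: walk the paginated list and collect item URLs in order."""
    item_urls, seen = [], set()
    page, visited = first_page, 0
    while page is not None and visited < max_pages:
        items, next_page = fetch_list_page(page)
        for url in items:
            if url not in seen:  # items can repeat across pages
                seen.add(url)
                item_urls.append(url)
        page = next_page
        visited += 1             # guards against endless pagination
    return item_urls


def crawl_list(first_page, fetch_list_page, fetch_item):
    """Phase 2: fetch the detail page for every item found in phase 1."""
    item_urls = collect_item_urls(first_page, fetch_list_page)
    return {url: fetch_item(url) for url in item_urls}


# Toy paginated category: page -> (item URLs, next page or None).
PAGES = {
    "/shoes?page=1": (["/item/1", "/item/2"], "/shoes?page=2"),
    "/shoes?page=2": (["/item/2", "/item/3"], None),
}
results = crawl_list(
    "/shoes?page=1",
    fetch_list_page=lambda page: PAGES[page],
    fetch_item=lambda url: f"detail for {url}",
)
```

Keeping the phases separate means the list of item URLs can be stored and deduplicated before any detail page is requested.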