Discovering and fetching pages at scale — crawl scope, politeness, sitemaps, and how scrapers traverse links without getting blocked or wasting budget.
A web crawler is a program that finds and downloads web pages on its own by following links: it starts from a few given pages (called seed URLs), reads the links on them, visits the linked pages, and repeats.
Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time.
Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL.
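As a rough illustration of how a page budget and a depth limit interact, here is a minimal sketch. The `LINKS` graph and the `crawl` function are hypothetical; the dictionary lookup stands in for a real download, and the budget is counted in pages only.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
}


def crawl(seed, max_pages, max_depth):
    """Return the pages fetched, honouring a page budget and a depth limit."""
    fetched = []
    seen = {seed}
    queue = deque([(seed, 0)])  # (url, link hops from the seed)
    while queue and len(fetched) < max_pages:  # stop once the budget is spent
        url, depth = queue.popleft()
        fetched.append(url)  # stands in for the real download
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched
```

With this graph, `crawl("/", max_pages=100, max_depth=1)` stops at the seed's direct links, while `crawl("/", max_pages=2, max_depth=5)` stops after two pages even though deeper links remain.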
robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch.
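A minimal sketch of checking a URL against robots.txt rules with Python's standard `urllib.robotparser`. The rules, the `example-bot` user agent, and the `build_parser` helper are made up for illustration; a real crawler would download the file from the site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Paths under /private/ are disallowed; everything else is allowed.
print(parser.can_fetch("example-bot", "https://example.com/index.html"))
print(parser.can_fetch("example-bot", "https://example.com/private/report"))
```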
A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority.
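For illustration, an XML sitemap can be read with the standard library's `xml.etree.ElementTree`. The sitemap content below is invented, and `parse_sitemap` is a hypothetical helper; the namespace URI is the one used by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, inlined so the example needs no network.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; optional fields are None if absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries
```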
Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits.
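One simple way to approximate this is a fixed minimum gap between requests to the same host. The sketch below assumes that policy; `PoliteScheduler` and its parameters are hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_turn(self, url):
        """Block until the URL's host may be contacted; return seconds waited."""
        host = urlsplit(url).netloc
        now = self.clock()
        wait = max(0.0, self.next_allowed.get(host, now) - now)
        if wait:
            self.sleep(wait)
        self.next_allowed[host] = now + wait + self.delay
        return wait
```

A crawler would call `wait_turn(url)` just before each download. Requests to different hosts do not delay each other, so the crawler stays busy while each individual site sees a steady, slow rhythm.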
Breadth-first crawling (BFS) visits every link at the current depth before going any deeper; depth-first crawling (DFS) follows one chain of links as far as it goes, then backtracks.
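The difference comes down to which end of the frontier the next URL is taken from. A minimal sketch, using a hypothetical tree-shaped link graph `LINKS` and a hypothetical `traverse` function:

```python
from collections import deque

# Hypothetical link graph: A links to B and C, B to D and E, C to F.
LINKS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}


def traverse(seed, strategy):
    """Return the visit order for strategy 'bfs' (queue) or 'dfs' (stack)."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # BFS takes the oldest URL in the frontier, DFS the newest.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        links = LINKS.get(url, [])
        if strategy == "dfs":
            links = reversed(links)  # so the first link on the page is popped first
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(traverse("A", "bfs"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(traverse("A", "dfs"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```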
Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit next.
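A minimal sketch of link extraction using only the standard library (`html.parser` and `urllib.parse`). The `LinkExtractor` class and `extract_links` helper are hypothetical, and the choices to keep only `<a href>` links, drop fragments, and skip non-HTTP schemes are assumptions that a real crawler may want to change.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute, de-duplicated http(s) links from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Resolving each `href` against the page's own URL matters because most links on real pages are relative; without that step the crawler would queue unusable paths.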
Throttling means deliberately slowing down how fast requests are sent or handled.
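One common way to implement throttling is a token bucket, sketched below. This is one technique among several, not the only definition of throttling; the `TokenBucket` class and its parameters are hypothetical, and the clock is injectable so the behaviour can be checked deterministically.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, then `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A sender would call `try_acquire()` before each request and delay or queue the request whenever it returns `False`.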
List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a second phase.
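The two phases can be sketched as follows. The site layout, the `LIST_PAGES` and `DETAIL_PAGES` dictionaries, and both functions are hypothetical; dictionary lookups stand in for downloading and parsing real pages.

```python
# Hypothetical site: each list page names its items and the next list page.
LIST_PAGES = {
    "/products?page=1": {"items": ["/p/1", "/p/2"], "next": "/products?page=2"},
    "/products?page=2": {"items": ["/p/3"], "next": None},
}
DETAIL_PAGES = {"/p/1": "Widget", "/p/2": "Gadget", "/p/3": "Gizmo"}


def enumerate_items(first_page):
    """Phase 1: walk the pagination chain and collect item URLs."""
    item_urls = []
    seen = set()  # guards against pagination loops
    page = first_page
    while page is not None and page not in seen:
        seen.add(page)
        listing = LIST_PAGES[page]  # stands in for fetching and parsing the list page
        item_urls.extend(listing["items"])
        page = listing["next"]
    return item_urls


def fetch_details(item_urls):
    """Phase 2: fetch each item detail page."""
    return {url: DETAIL_PAGES[url] for url in item_urls}
```

Keeping the phases separate means the full set of item URLs is known before any detail page is fetched, which makes it easier to de-duplicate, resume, or spread the second phase over time.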