Discovering and fetching pages at scale — crawl scope, politeness, sitemaps, and how scrapers traverse links without getting blocked or wasting budget.
A web crawler is software that systematically discovers and fetches web pages by following links from a starting set (seed URLs), building up a corpus of URLs and their contents.
Crawl budget is the upper limit on how much of a site a crawler will fetch in a given run — measured in pages, requests, or wall-clock time.
Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL.
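As a rough illustration of how a page budget and a depth limit interact, the sketch below walks a small in-memory link graph instead of fetching real pages. The LINKS table, the crawl function, and the max_pages and max_depth parameters are all hypothetical names chosen for this example.

```python
from collections import deque

# Hypothetical link graph standing in for real fetches: page -> links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
}

def crawl(seed, max_pages, max_depth):
    """Visit pages from seed, stopping at the page budget or the depth limit."""
    seen = {seed}
    queue = deque([(seed, 0)])  # (url, hops from the seed)
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)  # a real crawler would issue the request here
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched

print(crawl("/", max_pages=100, max_depth=1))
print(crawl("/", max_pages=4, max_depth=5))
```

In the first call the depth limit is what stops the crawl; in the second, the page budget runs out first.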
robots.txt is a plain-text file at the root of a website (/robots.txt) that tells crawlers which paths they should and should not fetch.
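A minimal sketch of checking URLs against robots.txt rules, using Python's standard urllib.robotparser module. The rules, the example.com URLs, and the "example-bot" user agent are made up for illustration; a real crawler would download the file from the site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; a real crawler would fetch /robots.txt from the host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

print(allowed("https://example.com/public/page"))
print(allowed("https://example.com/private/report"))
```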
A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, change frequency, priority.
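To make the sitemap structure concrete, here is a small sketch that reads the URL and optional last-modified date from an XML sitemap with the standard xml.etree.ElementTree module. The sample document and the parse_sitemap helper are assumptions for this example; the namespace is the one used by the sitemaps.org 0.9 schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample; a real crawler would download this from the site.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""

def parse_sitemap(xml_text):
    """Return (loc, lastmod) pairs; lastmod is None when the entry omits it."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc:
            entries.append((loc.strip(), lastmod))
    return entries

for loc, lastmod in parse_sitemap(SITEMAP_XML):
    print(loc, lastmod)
```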
Polite crawling is the practice of fetching a site at a pace and pattern that does not burden its infrastructure: respecting robots.txt, limiting concurrency per host, honoring rate limits.
Breadth-first crawling (BFS) explores all links at the current depth before going deeper; depth-first crawling (DFS) follows a single link chain as far as it goes, then backtracks.
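The difference in visit order can be seen in a short sketch that traverses the same small link graph both ways. The only change between the two strategies is which end of the frontier the next URL is taken from. GRAPH and traverse are hypothetical names, and the graph stands in for pages and the links found on them.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page, in page order.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}

def traverse(seed, breadth_first):
    """Return the visit order for BFS (queue) or DFS (stack) from seed."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # BFS takes the oldest entry; DFS takes the newest.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        links = GRAPH.get(url, [])
        # For DFS, push in reverse so the first link on the page is followed first.
        for link in (links if breadth_first else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(traverse("seed", breadth_first=True))   # ['seed', 'a', 'b', 'a1', 'a2', 'b1']
print(traverse("seed", breadth_first=False))  # ['seed', 'a', 'a1', 'a2', 'b', 'b1']
```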
Link extraction is the step in a crawl where you pull every URL out of a fetched page so you can decide which to follow next.
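One way to sketch link extraction with only the standard library is to subclass html.parser.HTMLParser, collect href values from anchor tags, and resolve them against the page URL. The LinkExtractor class and extract_links helper are illustrative names, and the choices to drop fragments, skip non-HTTP schemes, and de-duplicate are assumptions of this example rather than requirements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, then drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_http = absolute.startswith(("http://", "https://"))
                if is_http and absolute not in self.links:
                    self.links.append(absolute)

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = (
    '<a href="/about">About</a> '
    '<a href="page2.html#top">Next</a> '
    '<a href="mailto:someone@example.com">Mail</a> '
    '<a href="/about">About again</a>'
)
print(extract_links(page, "https://example.com/docs/index.html"))
```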
Throttling is the deliberate limiting of how fast requests are made or processed.
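A simple form of throttling is a minimum delay between requests to the same host. The sketch below is one possible shape for that, not a standard API: HostThrottle, min_delay, and the injectable clock and sleep parameters are all assumptions made so that the logic can be exercised without real waiting.

```python
import time
from urllib.parse import urlsplit

class HostThrottle:
    """Enforce a minimum delay between consecutive requests to the same host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host may be hit again; return the seconds slept."""
        host = urlsplit(url).netloc
        last = self.last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited

# Usage with the real clock: call wait() just before each request.
throttle = HostThrottle(min_delay=0.1)
for url in ["https://example.com/a", "https://example.com/b"]:
    slept = throttle.wait(url)
    print(url, "slept" if slept > 0 else "no wait")
```

Keeping the delay per host means a slow pace on one site does not hold back requests to unrelated sites.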