Crawling vs scraping
The terms get conflated. A crawler's job is breadth — find URLs by following links. A scraper's job is depth — extract structured fields from a known URL. A full pipeline usually does both: crawl to enumerate the URLs of interest, scrape each one. For a known list of URLs (an export, a sitemap, an API), no crawling is needed — go straight to scraping.
How crawlers work
Start with seed URLs in a queue (the frontier). Pop a URL, fetch it, extract all links, normalize and dedupe them, filter against scope rules (same domain, allowed paths), and enqueue whatever is new. Repeat until the frontier empties or the crawl budget (a cap on pages, depth, or time) is exhausted. Real crawlers add: respect for robots.txt, per-host rate limiting, stricter URL canonicalization so that equivalent URLs dedupe to a single entry, and an incremental mode that re-crawls only URLs whose content might have changed.
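The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler: the names `crawl`, `normalize`, and `LinkExtractor` are hypothetical, the page fetcher is passed in as a function so the loop can be exercised without network access, normalization only resolves relative links and drops fragments, and the budget is a simple page cap.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(base, href):
    """Resolve href against the page URL and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(base, href))
    return url


def crawl(seeds, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # every URL ever enqueued (dedupe)
    visited = []              # URLs fetched successfully, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = normalize(url, href)
            parts = urlsplit(link)
            # Scope rules (assumed here): http(s) only, single host.
            if parts.scheme not in ("http", "https"):
                continue
            if parts.hostname != allowed_host:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

For a dry run, `fetch` can be the `get` method of a dict that maps URLs to HTML strings; in real use it would wrap an HTTP client and return None on errors. Keeping the fetcher separate also leaves a natural place to add the robots.txt checks and rate limiting that a real crawler needs.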
Politeness and limits
A crawler that ignores robots.txt or hammers a host at 100 requests/second is hostile and gets blocked at the first opportunity. Polite crawling means: respect Disallow directives, honor Crawl-delay if present, cap per-host concurrency (1-5 connections), back off on 429/503 responses and wait at least as long as any Retry-After header asks, and identify yourself with a real User-Agent and a contact URL so site owners can reach you. Polite crawlers get a lot further than aggressive ones, because they stay unblocked.
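A few of these rules can be sketched with the standard library. The sketch below assumes a hypothetical bot named ExampleBot with a placeholder contact URL; `load_rules`, `backoff_delay`, and `HostThrottle` are illustrative names, and the base delay and cap values are arbitrary. It uses `urllib.robotparser` for Disallow and Crawl-delay, and only handles the seconds form of Retry-After.

```python
import urllib.robotparser

# Hypothetical identity: a product token plus a contact URL for site owners.
BOT_NAME = "ExampleBot"
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after a 429/503 response.

    Uses Retry-After when it is given in seconds; otherwise doubles the
    delay on each attempt (attempt 0, 1, 2, ...) up to the cap.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; not handled here
    return min(cap, base * 2 ** attempt)


class HostThrottle:
    """Tracks the earliest time each host may be requested again."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.next_allowed = {}

    def wait_time(self, host, now):
        """Seconds the caller should still wait before hitting host."""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record(self, host, now, delay=None):
        """Note a request made at `now`; delay may come from Crawl-delay."""
        gap = self.default_delay if delay is None else delay
        self.next_allowed[host] = now + gap
```

In use, a crawler would check `rules.can_fetch(BOT_NAME, url)` before each request, send `USER_AGENT` in the request headers, pass `rules.crawl_delay(BOT_NAME)` to `HostThrottle.record` when the site sets one, and sleep for `backoff_delay(...)` before retrying a 429 or 503. Capping concurrent connections per host is not shown; one common approach is a per-host semaphore.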
