Crawling vs scraping
People mix up these two words. A crawler works in breadth - its job is to find URLs by following links. A scraper works in depth - its job is to extract specific fields (a price, a title) from a URL you already have. A full pipeline usually does both: crawl to list the URLs you care about, then scrape each one. If you already have the list of URLs - from an export, a sitemap, or an API - there's nothing to discover, so you skip crawling and go straight to scraping.
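To make the "depth" side concrete, here is a minimal sketch of a scraper using only Python's standard library. It assumes the page has a <title> element and an element with class="price"; that markup, the class name `FieldExtractor`, and the function `scrape` are illustrative assumptions, not something any real site guarantees.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of <title> and of the first element with class="price".

    The "price" class name is a hypothetical example of site-specific markup.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field whose text we expect next

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._current = "title"
        elif "price" in classes and "price" not in self.fields:
            self._current = "price"

    def handle_data(self, data):
        # Store the first non-blank text chunk after a matching start tag.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def scrape(html):
    """Return a dict of the fields found in one already-fetched page."""
    parser = FieldExtractor()
    parser.feed(html)
    return parser.fields
```

Calling `scrape(page_html)` on a page you have already downloaded returns a dict such as `{"title": ..., "price": ...}`, with a key missing whenever the corresponding element is absent. A real scraper would more likely use a dedicated HTML library and per-site selectors; the point here is only that the input is a known URL and the output is a handful of fields.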
How crawlers work
You start with seed URLs in a queue (this to-do list of pages to visit is called the frontier). The crawler then repeats one loop: take a URL off the queue, fetch the page, pull out all its links, tidy them up and drop duplicates, keep only the ones that fit your rules (same domain, allowed paths), and add the new ones back to the queue. It repeats until the queue is empty or it hits the limit you set. Real crawlers add a few things on top: respecting robots.txt, limiting how fast they hit each host, removing duplicate URLs by canonicalizing them (reducing different-looking URLs that point to the same page down to one standard form), and running in an incremental mode (re-crawling only URLs whose content might have changed).
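The loop described above can be sketched in a few dozen lines of Python. This is a simplified illustration, not a production crawler: `crawl`, `canonicalize`, and `LinkParser` are names made up for this example, the `fetch` argument is a function you supply (so the sketch makes no network calls itself), and the canonicalization covers only a few easy cases (lowercasing scheme and host, dropping the fragment, treating an empty path as "/").

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def canonicalize(url):
    """Reduce a URL to one standard form (deliberately simplified).

    Lowercases the scheme and host, drops the fragment, and uses "/" for an
    empty path. Real canonicalization usually handles more cases.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def crawl(seeds, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl; `fetch(url)` must return the page's HTML as a string."""
    frontier = deque(canonicalize(u) for u in seeds)  # the to-do list
    seen = set(frontier)  # every URL ever queued, for deduplication
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = canonicalize(urljoin(url, href))  # resolve relative links
            parts = urlsplit(link)
            # Rule: stay on http(s) and on the one allowed host.
            if parts.scheme not in ("http", "https") or parts.hostname != allowed_host:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the loop itself independent of any HTTP library and makes it easy to try against a dict of canned pages. The robots.txt checks, per-host rate limits, and incremental re-crawling mentioned above are intentionally left out of this sketch.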
Politeness and limits
A crawler that ignores robots.txt or hammers a host at 100 requests/second is hostile and tends to get blocked quickly. Polite crawling means: respect Disallow directives (the robots.txt rules that say which paths are off-limits), honor Crawl-delay if present (a requested wait between requests; it is a non-standard extension that not every crawler supports, but following it costs you little), cap per-host concurrency (1-5 connections at a time), back off when you get 429/503 responses (the server telling you to slow down or that it's overloaded), and identify yourself with a real User-Agent and a contact URL so site owners can reach you. Polite crawlers get a lot further than aggressive ones.
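A few of these rules can be sketched with Python's standard `urllib.robotparser` module. The helper names (`load_rules`, `request_delay`, `backoff_delay`), the User-Agent string, and the default delay, base, and cap values are assumptions chosen for illustration. The sketch takes the robots.txt text as a string rather than downloading it, and it assumes any Retry-After value is given in seconds (the header may also carry a date, which is not handled here).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a product name plus a contact URL for site owners.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/crawler-info)"


def load_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def request_delay(rules, default=1.0):
    """Seconds to wait between requests: Crawl-delay if set, else a default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def backoff_delay(status, attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying, or None if no backoff is called for.

    Uses Retry-After (assumed to be in seconds) when the server sent one,
    otherwise exponential backoff: base, 2*base, 4*base, ... up to `cap`.
    """
    if status not in (429, 503):
        return None
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(base * 2 ** attempt, cap)
```

Before each fetch, a crawler built this way would call `rules.can_fetch(USER_AGENT, url)` and skip the URL when it returns False, sleep for `request_delay(rules)` between requests to the same host, and on a 429 or 503 wait `backoff_delay(status, attempt)` seconds before trying again. Capping per-host concurrency is not shown; it depends on how the crawler schedules its requests.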
