The two-phase architecture
List crawling separates discovery from extraction into two phases that run independently. Phase one crawls list pages - a category index, search results, or an archive - and pulls out the link for every item shown, collecting them into a deduplicated set of detail URLs. Phase two takes that set and fetches each detail page on its own, parsing the fields you actually want. Keeping the phases apart has real payoffs: you can checkpoint the URL list to disk and resume detail fetching after a crash, you can rate-limit each phase differently, and you can re-run extraction without re-crawling the lists. This is the same discover-then-extract split that separates a web crawler from a scraper - the list phase is the crawl, the detail phase is the scrape. Extracting item links from a list page is just link extraction scoped to the item-card selector rather than every anchor on the page.
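The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the `item-card` class name, the two JSON checkpoint files, and the `fetch` and `parse` callables are all hypothetical stand-ins for whatever HTTP client and field parser you actually use.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Iterable
from urllib.parse import urljoin


class ItemLinkParser(HTMLParser):
    """Collect hrefs only from anchors inside elements carrying card_class."""

    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, card_class: str) -> None:
        super().__init__()
        self.card_class = card_class
        self.depth = 0  # greater than 0 while we are inside an item card
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void elements never get a closing tag, so do not count them
        attr = dict(attrs)
        if self.depth > 0:
            self.depth += 1
        elif self.card_class in (attr.get("class") or "").split():
            self.depth = 1
        if self.depth > 0 and tag == "a" and attr.get("href"):
            self.links.append(attr["href"])

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.depth > 0:
            self.depth -= 1


def run_discovery(list_urls: Iterable[str], fetch: Callable[[str], str],
                  url_file: Path, card_class: str = "item-card") -> list:
    """Phase one: crawl list pages, dedupe detail URLs, checkpoint them to disk."""
    seen = {}  # a dict keeps insertion order and behaves like a set here
    for list_url in list_urls:
        parser = ItemLinkParser(card_class)
        parser.feed(fetch(list_url))
        for href in parser.links:
            seen.setdefault(urljoin(list_url, href), None)
    urls = list(seen)
    url_file.write_text(json.dumps(urls))
    return urls


def run_extraction(url_file: Path, results_file: Path,
                   fetch: Callable[[str], str], parse: Callable[[str], object]) -> dict:
    """Phase two: fetch each detail page, skipping URLs already extracted."""
    urls = json.loads(url_file.read_text())
    results = json.loads(results_file.read_text()) if results_file.exists() else {}
    for url in urls:
        if url in results:
            continue  # done in an earlier run, e.g. before a crash
        results[url] = parse(fetch(url))
        results_file.write_text(json.dumps(results))  # checkpoint after each page
    return results
```

Because `run_extraction` reads only the checkpointed URL file, it can be re-run with a different `parse` function (after deleting the results file) without touching the list pages again.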
Pagination patterns you will meet
Crawling list pages comes down to recognizing how the site advances to the next page, and there are three common patterns.
- Page parameters. The URL carries the page number or offset, e.g. ?page=2 or ?offset=40. You loop, incrementing the parameter, and stop when a page returns no item links or repeats the previous page.
- Cursor / API pagination. The page (or an XHR call behind it) returns a nextCursor or next token. You pass that token to the next request and stop when it is null. This is the cleanest pattern - see how REST APIs work for the request shape.
- Infinite scroll. New items load over JavaScript as you scroll. The list page has no static next link, so you either drive a real browser to scroll and render (handled for you when you fetch with a full-browser request) or call the underlying JSON endpoint the page itself uses. See dynamic content scraping for why this matters.
Always set a hard ceiling on pages crawled so a broken stop-condition cannot loop forever.
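The first two patterns, together with the hard ceiling, might look like the sketch below. It makes several assumptions: `fetch_links` and `fetch_page` are hypothetical callables you would supply around your own HTTP client, the JSON payload is assumed to hold its records under an `items` key, and the ceiling of 50 pages is an arbitrary illustrative value.

```python
from typing import Callable, Optional

MAX_PAGES = 50  # hard ceiling; an assumed value, tune it per site


def crawl_page_param(fetch_links: Callable[[int], list],
                     max_pages: int = MAX_PAGES) -> list:
    """Page-parameter pattern: request ?page=1, ?page=2, ... until a stop condition."""
    collected = []
    previous = None
    for page in range(1, max_pages + 1):  # the range itself enforces the ceiling
        links = fetch_links(page)
        if not links or links == previous:
            break  # empty page, or the site is repeating the previous page
        collected.extend(links)
        previous = links
    return collected


def crawl_cursor(fetch_page: Callable[[Optional[str]], dict],
                 max_pages: int = MAX_PAGES) -> list:
    """Cursor pattern: pass each nextCursor to the next request until it is null."""
    items = []
    cursor = None  # the first request carries no cursor
    for _ in range(max_pages):
        payload = fetch_page(cursor)
        items.extend(payload.get("items", []))  # "items" key is an assumption
        cursor = payload.get("nextCursor")
        if cursor is None:
            break
    return items
```

Using a bounded `for` loop rather than `while True` means a site that never returns an empty page or a null cursor still stops after `max_pages` requests.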
Dedup, polite crawling, and budget
Robust list crawling needs deduplication, polite pacing, and an explicit budget, or it either wastes work or overloads the target. The same item often appears on multiple list pages (sorting changes, overlapping filters), so canonicalize each detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set so you fetch each detail page exactly once. For pacing, polite crawling means capping per-host concurrency, adding a small delay between requests, and backing off on 429/503 responses; respecting the robots.txt protocol keeps you on the right side of site rules. For budget, bound the crawl by total list pages, by crawl depth, and by a per-domain page cap so crawl budget stays predictable. A web scraping API that rotates residential proxies and handles browser verification lets the list and detail phases run at steady concurrency without each phase managing its own proxy pool.
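As a rough sketch of the canonicalize-and-dedupe step, a page cap, and backoff on 429/503, consider the following. The tracking-parameter list, the `fetch` callable returning a `(status, body)` pair, and the default delay and retry values are assumptions made for illustration. Per-host concurrency limits and robots.txt checks are left out to keep the example short.

```python
import time
from typing import Callable, Iterable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of tracking parameters; extend it for the site you are crawling.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "gclid", "fbclid"}


def canonicalize(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(kept), ""))


def dedupe(urls: Iterable[str], page_cap: int) -> list:
    """Return canonical URLs in first-seen order, at most page_cap of them."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key in seen:
            continue
        if len(unique) >= page_cap:
            break  # budget exhausted
        seen.add(key)
        unique.append(key)
    return unique


def polite_fetch(url: str, fetch: Callable[[str], tuple],
                 delay: float = 1.0, max_retries: int = 3,
                 sleep: Callable[[float], None] = time.sleep) -> tuple:
    """Wait before every request and double the wait after each 429/503."""
    for attempt in range(max_retries + 1):
        sleep(delay * 2 ** attempt)  # delay, 2*delay, 4*delay, ...
        status, body = fetch(url)
        if status not in (429, 503):
            break
    return status, body
```

For example, `canonicalize("https://Example.com/item/42/?utm_source=x&ref=1#reviews")` yields `https://example.com/item/42?ref=1`, so that URL and the plain `https://example.com/item/42?ref=1` collapse to a single entry in the seen set.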