List Crawling in Web Scraping

List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a second phase. Instead of guessing item URLs, you walk the site the way a human browses a catalog: open a list page, read every item link on it, advance to the next page, and repeat until you have collected the full set of item URLs. Once enumeration is complete, a separate detail phase visits each URL and extracts structured fields. This two-phase split - crawling list pages to find what exists, then scraping detail pages to get the data - keeps each phase simple, resumable, and easy to rate-limit.

Quick facts

Phase 1: Crawl list pages to enumerate item URLs
Phase 2: Fetch each detail page, extract fields
Pagination: Page params, cursor/API, or infinite scroll
Dedup: Canonicalize URLs into a seen set
Budget: Cap page count, depth, and per-host rate

The two-phase architecture

List crawling separates discovery from extraction into two phases that run independently. Phase one crawls list pages - a category index, search results, or an archive - and pulls out the link for every item shown, collecting them into a deduplicated set of detail URLs. Phase two takes that set and fetches each detail page on its own, parsing the fields you actually want. Keeping the phases apart has real payoffs: you can checkpoint the URL list to disk and resume detail fetching after a crash, you can rate-limit each phase differently, and you can re-run extraction without re-crawling the lists. This is the same discover-then-extract split that separates a web crawler from a scraper - the list phase is the crawl, the detail phase is the scrape. Extracting item links from a list page is just link extraction scoped to the item-card selector rather than every anchor on the page.
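As a rough illustration of link extraction scoped to the item card, here is a minimal sketch using only the Python standard library. The class name `item-card`, the helper names, and the extra `base_url` argument are assumptions made for the example, not part of any particular site or library, and the depth tracking assumes reasonably well-formed HTML (every non-void tag is closed).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ItemLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that sit inside an item-card element."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, card_class):
        super().__init__()
        self.card_class = card_class   # hypothetical class marking one item
        self.depth = 0                 # > 0 while we are inside a card
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth or self.card_class in classes:
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1


def extract_item_links(html, base_url, card_class="item-card"):
    """Return absolute item URLs found inside card elements, in page order."""
    parser = ItemLinkParser(card_class)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]
```

Navigation and footer anchors are ignored because they never appear inside a card element, which is the point of scoping the selector. On messy real-world markup a dedicated HTML parsing library with CSS selectors is usually the more robust choice.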

Pagination patterns you will meet

Crawling list pages comes down to recognizing how the site advances to the next page, and there are three common patterns.

  • Page parameters. The URL carries the page number or offset, e.g. ?page=2 or ?offset=40. You loop, incrementing the parameter, and stop when a page returns no item links or repeats the previous page.
  • Cursor / API pagination. The page (or an XHR call behind it) returns a nextCursor or next token. You pass that token to the next request and stop when it is null. This is the cleanest pattern - see how REST APIs work for the request shape.
  • Infinite scroll. New items load via JavaScript as you scroll. The list page has no static next link, so you either drive a real browser to scroll and render (handled for you when you fetch with a full-browser request) or call the underlying JSON endpoint the page itself uses. See dynamic content scraping for why this matters.

Always set a hard ceiling on pages crawled so a broken stop-condition cannot loop forever.
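The cursor pattern and the hard ceiling can be sketched together as below. This is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical callable you supply, and the response shape (an `items` list plus a `nextCursor` field) is assumed for illustration; real APIs name these fields differently.

```python
def crawl_cursor(fetch_page, max_pages=50):
    """Follow cursor tokens until there is no next token or the ceiling is hit.

    fetch_page(cursor) is a caller-supplied function (an assumption here) that
    returns a dict with "items" (list) and "nextCursor" (str or None).
    Returns (items, pages_fetched).
    """
    items, cursor, pages = [], None, 0
    seen_cursors = set()
    while pages < max_pages:               # hard ceiling on pages crawled
        data = fetch_page(cursor)
        pages += 1
        items.extend(data.get("items", []))
        cursor = data.get("nextCursor")
        # Stop on a null token, or on a repeated token that would loop forever.
        if cursor is None or cursor in seen_cursors:
            break
        seen_cursors.add(cursor)
    return items, pages
```

The same loop shape works for page parameters: replace the cursor with an incrementing page number and stop when a page yields no new item links.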

Dedup, polite crawling, and budget

Robust list crawling needs deduplication, polite pacing, and an explicit budget, or it either wastes work or overloads the target. The same item often appears on multiple list pages (sorting changes, overlapping filters), so canonicalize each detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set so you fetch each detail page exactly once. For pacing, polite crawling means capping per-host concurrency, adding a small delay between requests, and backing off on 429/503 responses; respecting the robots.txt protocol keeps you on the right side of site rules. For budget, bound the crawl by total list pages, by crawl depth, and by a per-domain page cap so crawl budget stays predictable. A web scraping API that rotates residential proxies and handles browser verification lets the list and detail phases run at steady concurrency without each phase managing its own proxy pool.
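A minimal sketch of the canonicalize-then-dedup step, using only `urllib.parse`. The list of tracking parameters and the choice to drop a leading `www.` are assumptions for illustration; tune both per site, since some sites serve different content on different hosts or treat a query parameter as meaningful.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend or trim for the site being crawled.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def canonicalize(url):
    """Normalize a detail URL so equivalent links compare equal."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if netloc.startswith("www."):          # assumption: www and bare host match
        netloc = netloc[4:]
    # Drop tracking parameters and sort the rest for a stable order.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"   # normalize the trailing slash
    # The empty last element discards the fragment.
    return urlunsplit((parts.scheme, netloc, path, urlencode(query), ""))


def dedupe(urls, seen=None):
    """Return canonical URLs not seen before, updating the seen set in place."""
    seen = set() if seen is None else seen
    fresh = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh
```

Passing the same `seen` set to `dedupe` for every list page means an item that shows up again under a different sort or filter is queued for the detail phase only once.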

Code example

python
import requests

API = "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY"

def fetch(url, browser=False):
    # The same command serves both modes; browser=True only adds a scroll
    # action so JS-rendered (infinite-scroll) lists load before parsing.
    body = {"cmd": "request.get", "url": url, "proxyCountry": "UnitedStates",
            "session": "list-crawl", "autoparse": True}
    if browser:
        body["browser"] = [{"action": "scroll"}]
    r = requests.post(API, json=body, timeout=180)
    r.raise_for_status()                       # surface HTTP errors early
    return r.json()["solution"]["response"]

def enumerate_items(base, max_pages=50):
    # phase 1: crawl paginated list pages, collect detail URLs
    seen, page = set(), 1
    while page <= max_pages:
        html = fetch(f"{base}?page={page}")
        # extract_item_links and canonicalize are helpers you supply
        links = extract_item_links(html)       # your CSS/regex selector
        new = [u for u in (canonicalize(l) for l in links) if u not in seen]
        if not new:
            break                              # no new items: stop
        seen.update(new)
        page += 1
    return seen

def crawl_list(base):
    # phase 2: fetch each detail page once
    return {url: fetch(url) for url in enumerate_items(base)}
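The resumability that the two-phase split allows could look roughly like the sketch below: results are written to a checkpoint file after each detail fetch, so a restart skips URLs that are already done. The function name, the JSON checkpoint format, and the `fetch_detail` callable are all assumptions for illustration, and the fetched results are assumed to be JSON-serializable.

```python
import json
from pathlib import Path


def run_detail_phase(urls, fetch_detail, checkpoint_path):
    """Fetch each URL once, persisting results so a crashed run can resume.

    fetch_detail(url) is a caller-supplied function (hypothetical here) that
    returns JSON-serializable data for one detail page.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for url in sorted(urls):               # stable order across runs
        if url in done:
            continue                       # already fetched in an earlier run
        done[url] = fetch_detail(url)
        # Write to a temp file, then swap it in, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(path)
    return done
```

Rewriting the whole file after every fetch is simple but grows costly for large crawls; an append-only log or a small database is a reasonable swap once the URL set gets big.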

Frequently asked questions

What is list crawling in web scraping?

List crawling is crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a separate phase to extract its data. It splits discovery from extraction so each phase stays simple and resumable.

How do I handle pagination when crawling list pages?

Match the pattern: increment a page or offset parameter until a page returns no new item links, follow a nextCursor token until it is null, or drive a full browser to scroll infinite-scroll lists. Always set a maximum page count so a broken stop-condition cannot loop forever.

How do I avoid fetching the same item twice?

Canonicalize every detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set. Items often repeat across list pages when sorts or filters overlap, so dedup before queuing the detail phase.

Last updated: 2026-06-08