List Crawling in Web Scraping

By the Scrappey Research Team

List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a second phase. Instead of guessing item URLs, you walk the site the way a human browses a catalog: open a list page, read every item link on it, advance to the next page, and repeat until you have collected the full set of item URLs. Once enumeration is complete, a separate detail phase visits each URL and extracts structured fields. This two-phase split - crawling list pages to find what exists, then scraping detail pages to get the data - keeps each phase simple, resumable, and easy to rate-limit.

Phase 1: Crawl list pages to enumerate item URLs
Phase 2: Fetch each detail page, extract fields
Pagination: Page params, cursor/API, or infinite scroll
Dedup: Canonicalize URLs into a seen set
Budget: Cap page count, depth, and per-host rate

The two-phase architecture

List crawling separates discovery from extraction into two phases that run independently. Phase one crawls list pages - a category index, search results, or an archive - and pulls out the link for every item shown, collecting them into a deduplicated set of detail URLs. Phase two takes that set and fetches each detail page on its own, parsing the fields you actually want. Keeping the phases apart has real payoffs: you can checkpoint the URL list to disk and resume detail fetching after a crash, you can rate-limit each phase differently, and you can re-run extraction without re-crawling the lists. This is the same discover-then-extract split that separates a web crawler from a scraper - the list phase is the crawl, the detail phase is the scrape. Extracting item links from a list page is just link extraction scoped to the item-card selector rather than every anchor on the page.
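The checkpoint-and-resume idea can be sketched with a small JSON state file. This is a minimal illustration, not a prescribed implementation: the function names, the file layout, and the fetch_detail callback are all assumptions introduced here.

```python
import json
from pathlib import Path

def save_checkpoint(path, urls, done):
    """Write the enumerated URLs and the finished subset to disk."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps({"urls": sorted(urls), "done": sorted(done)}))
    tmp.replace(path)  # rename last so a crash never leaves a half-written file

def load_checkpoint(path):
    """Return (urls, done) as sets; both are empty if no checkpoint exists."""
    p = Path(path)
    if not p.exists():
        return set(), set()
    state = json.loads(p.read_text())
    return set(state["urls"]), set(state["done"])

def run_detail_phase(path, fetch_detail):
    """Fetch every URL not yet marked done, checkpointing after each one.

    fetch_detail(url) is a caller-supplied function (hypothetical here).
    """
    urls, done = load_checkpoint(path)
    results = {}
    for url in sorted(urls - done):
        results[url] = fetch_detail(url)
        done.add(url)
        save_checkpoint(path, urls, done)
    return results
```

After phase one, call save_checkpoint(path, urls, set()) once. If the detail phase crashes, running run_detail_phase again picks up only the URLs that were not finished.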

Pagination patterns you will meet

Crawling list pages comes down to recognizing how the site advances to the next page, and there are three common patterns.

Page parameters. The URL carries the page number or offset, e.g. ?page=2 or ?offset=40. You loop, incrementing the parameter, and stop when a page returns no item links or repeats the previous page.
Cursor / API pagination. The page (or an XHR call behind it) returns a nextCursor or next token. You pass that token to the next request and stop when it is null. This is the cleanest pattern - see how REST APIs work for the request shape.
Infinite scroll. New items load via JavaScript as you scroll. The list page has no static next link, so you either drive a real browser to scroll and render (handled for you when you fetch with a full-browser request) or call the underlying JSON endpoint the page itself uses. See dynamic content scraping for why this matters.

Always set a hard ceiling on pages crawled so a broken stop-condition cannot loop forever.
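As a rough sketch of the cursor pattern combined with a hard page ceiling: the loop below assumes a caller-supplied fetch_page(cursor) function that returns a dict with "items" and "nextCursor" keys. Both the function and the response shape are assumptions for illustration; real endpoints name these fields differently.

```python
def crawl_cursor(fetch_page, max_pages=50):
    """Follow next-cursor tokens until the cursor is null or the cap is hit.

    fetch_page(cursor) is hypothetical: it should return a dict shaped like
    {"items": [...], "nextCursor": "token" or None}.
    """
    items, cursor, seen_cursors = [], None, set()
    for _ in range(max_pages):              # hard ceiling on pages crawled
        page = fetch_page(cursor)
        items.extend(page.get("items", []))
        cursor = page.get("nextCursor")
        if cursor is None or cursor in seen_cursors:
            break                           # finished, or the server is looping
        seen_cursors.add(cursor)
    return items
```

The seen_cursors check is a second safety net: if a misbehaving endpoint keeps returning the same token, the loop stops before max_pages is reached.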

Dedup, polite crawling, and budget

Robust list crawling needs deduplication, polite pacing, and an explicit budget, or it either wastes work or overloads the target. The same item often appears on multiple list pages (sorting changes, overlapping filters), so canonicalize each detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set so you fetch each detail page exactly once. For pacing, polite crawling means capping per-host concurrency, adding a small delay between requests, and backing off on 429/503 responses; respecting the robots.txt protocol keeps you on the right side of site rules. For budget, bound the crawl by total list pages, by crawl depth, and by a per-domain page cap so crawl budget stays predictable. A web scraping API that rotates residential proxies and handles browser verification lets the list and detail phases run at steady concurrency without each phase managing its own proxy pool.
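One possible shape for the canonicalization step and the backoff delay, using only the Python standard library. The tracking-parameter list, the choice to sort query parameters, and the backoff constants are assumptions chosen for illustration; tune them to the site you crawl.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend for the target site.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}

def canonicalize(url):
    """Return a normalized URL suitable for a seen-set lookup."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"    # normalize the trailing slash
    # Lowercase the host, drop the fragment, keep the remaining parameters.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       urlencode(query), ""))

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429/503."""
    return min(cap, base * 2 ** attempt)
```

With this sketch, https://Shop.Example.com/item/42/?utm_source=x&color=red#reviews and https://shop.example.com/item/42?color=red&gclid=abc both reduce to the same key, so the item is queued once.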

Code example

python

import requests

API = "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY"

def fetch(url, browser=False):
    # browser=True adds a scroll action for JS-rendered (infinite-scroll) lists
    body = {"cmd": "request.get", "url": url, "proxyCountry": "UnitedStates",
            "session": "list-crawl", "autoparse": True}
    if browser:
        body["browser"] = [{"action": "scroll"}]
    r = requests.post(API, json=body, timeout=180)
    r.raise_for_status()                       # surface HTTP errors early
    return r.json()["solution"]["response"]

def enumerate_items(base, max_pages=50):
    # phase 1: crawl paginated list pages, collect detail URLs
    seen, page = set(), 1
    while page <= max_pages:
        html = fetch(f"{base}?page={page}")
        links = extract_item_links(html)       # your CSS/regex selector
        new = [u for u in (canonicalize(l) for l in links) if u not in seen]
        if not new:
            break                              # no new items: stop
        seen.update(new)
        page += 1
    return seen

def crawl_list(base):
    # phase 2: fetch each detail page once
    return {url: fetch(url) for url in enumerate_items(base)}
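The example above calls extract_item_links without defining it. One way to fill that gap with the standard library is sketched below. It assumes the fetched response is an HTML string and that item anchors carry a class named item-link; both are hypothetical and will differ per site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _ItemLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.link_class in classes and attr.get("href"):
            self.hrefs.append(attr["href"])

def extract_item_links(html, base_url="", link_class="item-link"):
    """Return item URLs found in html, resolved against base_url if given."""
    parser = _ItemLinkParser(link_class)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]
```

Passing the list page URL as base_url turns relative links such as /item/1 into absolute ones before they are canonicalized and added to the seen set.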

Related terms

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed …

What Is Link Extraction?

Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit …

What Is Crawl Budget?

Crawl budget is the upper limit on how much of a site a crawler will fetch in a single run - measured in pages, requests, or wall-clock time…

What Is Crawl Depth Limit?

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL. A "hop" is one click along a link. The page you …

What Is Polite Crawling?

Polite crawling means running your crawler at a speed and rhythm that won't strain the websites it visits. In practice that means obeying ro…

What Is a Web Scraping API?

A web scraping API is a hosted HTTP service that visits a web page for you and hands back the result — rendered HTML, JSON, or already-parse…

How to Export Scraped Data to CSV and JSON (Python)

Export scraped data to CSV when you need flat, spreadsheet-ready rows, and to JSON when you need to preserve nested structure. In Python, th…

Is Web Scraping Legal?

Scraping publicly available data is generally legal, but legality depends on what you collect, how you collect it, and what you do with it —…

Frequently asked questions

What is list crawling in web scraping?

List crawling is crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching each item detail page in a separate phase to extract its data. It splits discovery from extraction so each phase stays simple and resumable.

How do I handle pagination when crawling list pages?

Match the pattern: increment a page or offset parameter until a page returns no new item links, follow a nextCursor token until it is null, or drive a full browser to scroll infinite-scroll lists. Always set a maximum page count so a broken stop-condition cannot loop forever.

How do I avoid fetching the same item twice?

Canonicalize every detail URL - strip tracking parameters and fragments, normalize the host and trailing slash - and keep a seen set. Items often repeat across list pages when sorts or filters overlap, so dedup before queuing the detail phase.

Last updated: 2026-06-08