What Is Batch Web Scraping?

By the Scrappey Research Team

Batch web scraping means handing a whole list of URLs to a service as one job, letting it work through them in the background, and collecting the results once they are ready — instead of firing each request one at a time and waiting for each reply. The batch service handles the hard plumbing for you: running many requests at once (concurrency), retrying failures, setting aside URLs that keep failing (a "dead-letter queue"), and making sure a URL is not processed twice (idempotency). The trade-off is that any single result takes longer to come back. Batch is the right choice when you have thousands or millions of URLs and do not need any one result instantly.

Job size	100 to 1,000,000+ URLs per batch
Latency	Minutes to hours; not for real-time pipelines
Use cases	Crawl ingestion, dataset building, bulk monitoring
Trade-off	Higher throughput and reliability; higher per-request latency
Idempotency	Same job ID + same URL list → same result; safe to retry

When to batch

Reach for batch when three things are true: (a) you have a large list of URLs, (b) you do not need the results in real time, and (c) you would rather the service handle retries and concurrency than write that code yourself. Building a batch processor that does the job well can take months of engineering effort: it has to control how many requests run at once, retry failures, park the URLs that never succeed, avoid duplicate work, and track progress. A managed batch endpoint (a ready-made API that runs the job for you) spreads that engineering cost across all its customers, so you do not bear it alone.

How to size batches

Most batch APIs let you submit up to about 1 million URLs in a single job, but the smart size is smaller: 1,000–10,000 URLs per batch. Smaller batches pay off in three ways. You get faster feedback: a broken configuration shows up in minutes instead of hours. You can run several batches at the same time under different job IDs to balance the load. And if one batch goes wrong, you re-run only that batch, not the whole crawl. At that size, a 1-million-URL crawl splits into 100 to 1,000 batches.
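Splitting a URL list into fixed-size batches can be sketched in a few lines. The helper name `split_into_batches` and the example URLs are illustrative assumptions; how each batch is then submitted depends on the API you use.

```python
def split_into_batches(urls, batch_size=10_000):
    """Return consecutive slices of urls, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Illustrative input: 25,000 placeholder URLs.
urls = [f"https://example.com/p/{n}" for n in range(25_000)]
batches = split_into_batches(urls, batch_size=10_000)
print([len(batch) for batch in batches])  # [10000, 10000, 5000]
```

The last batch is simply whatever remains, so no URL is dropped or duplicated by the split.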

Synchronous fallback

Sometimes a job is batch-sized overall but a few results need to come back right away — for example, a content-monitoring pipeline that pulls 99% of its data from a nightly batch but needs to check a breaking-news URL the moment it appears. Most scraping APIs offer both endpoints: a batch one and a per-request (synchronous) one that replies immediately. Send the urgent work to the sync endpoint and everything else to batch. Just do not call the sync endpoint in a tight loop hoping to match batch throughput — you will get rate-limited (temporarily blocked for sending too many requests too fast).
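One way to picture the split between the two endpoints is a small router that sends a capped number of urgent URLs to the sync path and everything else to batch. This is a sketch under assumptions: `route`, `is_urgent`, and the `max_sync` cap are hypothetical, and the cap is just one possible guard against leaning on the sync endpoint too hard.

```python
def route(urls, is_urgent, max_sync=10):
    """Split urls into (sync_urls, batch_urls).

    Urgent URLs go to the sync list until max_sync is reached; any
    further urgent URLs spill over into the batch list.
    """
    sync_urls, batch_urls = [], []
    for url in urls:
        if is_urgent(url) and len(sync_urls) < max_sync:
            sync_urls.append(url)
        else:
            batch_urls.append(url)
    return sync_urls, batch_urls


urls = [
    "https://example.com/breaking/1",
    "https://example.com/archive/1",
    "https://example.com/archive/2",
]
sync_urls, batch_urls = route(urls, lambda u: "/breaking/" in u)
print(sync_urls)   # ['https://example.com/breaking/1']
print(batch_urls)  # the two archive URLs
```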

Code example

python

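import requests
from concurrent.futures import ThreadPoolExecutor

# Note: this example parallelizes per-request (synchronous) calls on the
# client side; it does not submit a single batch job.
ENDPOINT = 'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY'
urls = ['https://example.com/p/1', 'https://example.com/p/2']

def fetch(url):
    """Fetch one URL through the API and return the parsed JSON reply."""
    response = requests.post(ENDPOINT, json={
        'cmd': 'request.get',
        'url': url,
        'markdown': True
    }, timeout=120)  # illustrative value; stops a stalled request hanging forever
    return response.json()

# Fire requests in parallel; max_workers caps how many run at once
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))  # results come back in the order of urls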

Frequently asked questions

How is batch different from running async requests in parallel?

Batch APIs handle concurrency, retries, dead-letter queues (a holding spot for URLs that keep failing), and idempotency (not processing the same URL twice) for you. With plain parallel async, all of that is your code's job. For under 10k URLs, doing it yourself is fine; above that, batch is dramatically less work.

How long does a batch job take?

Estimate it as the number of URLs times a few seconds of real work per URL, divided by how many requests the API runs at once (its parallelism). A 10k-URL batch on a typical API finishes in 10–30 minutes. Sites with tough anti-bot defenses take longer, since each request needs more effort to get through.
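As a rough back-of-the-envelope check, the estimate can be written as a one-line function. The inputs below (3 seconds per URL, 50 or 20 parallel requests) are assumed values chosen for illustration, not figures published by any API.

```python
def estimate_minutes(url_count, seconds_per_url, parallelism):
    """Rough job duration: total work divided by how much runs at once."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return url_count * seconds_per_url / parallelism / 60


print(estimate_minutes(10_000, 3, 50))  # 10.0
print(estimate_minutes(10_000, 3, 20))  # 25.0
```

Both assumed settings land inside the 10–30 minute range; real jobs will also vary with retries and with how hard each site is to reach.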

What happens if the batch contains bad URLs?

Good batch APIs skip URLs they cannot reach (a 404 "not found", or a DNS failure where the domain name will not resolve), retry temporary glitches, and give you a per-URL status in the results. That way you can re-queue just the specific URLs that failed instead of re-running the whole batch.
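Re-queueing only the failures might look like the sketch below. The result shape is an assumption made for illustration: the `url` and `status` fields and the status names are hypothetical, so check your API's actual response format before adapting it.

```python
# Hypothetical status names for failures that a retry will not fix.
PERMANENT_FAILURES = {"not_found", "dns_failure"}


def split_failures(results):
    """Return (retry_urls, dropped_urls) from a list of per-URL results."""
    retry_urls, dropped_urls = [], []
    for item in results:
        status = item["status"]
        if status == "ok":
            continue
        if status in PERMANENT_FAILURES:
            dropped_urls.append(item["url"])  # unreachable; do not retry
        else:
            retry_urls.append(item["url"])    # temporary; worth re-queueing
    return retry_urls, dropped_urls


batch_results = [
    {"url": "https://example.com/p/1", "status": "ok"},
    {"url": "https://example.com/p/2", "status": "timeout"},
    {"url": "https://example.com/p/3", "status": "not_found"},
]
retry_urls, dropped_urls = split_failures(batch_results)
print(retry_urls)    # ['https://example.com/p/2']
print(dropped_urls)  # ['https://example.com/p/3']
```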

Last updated: 2026-05-31