What Is Batch Web Scraping?

By the Scrappey Research Team

Batch web scraping means handing a whole list of URLs to a service as one job, letting it work through them in the background, and collecting the results once they are ready — instead of firing each request one at a time and waiting for each reply. The batch service handles the hard plumbing for you: running many requests at once (concurrency), retrying failures, setting aside URLs that keep failing (a "dead-letter queue"), and making sure a URL is not processed twice (idempotency). The trade-off is that any single result takes longer to come back. Batch is the right choice when you have thousands or millions of URLs and do not need any one result instantly.

Quick facts

Job size: 100 to 1,000,000+ URLs per batch
Latency: Minutes to hours; not for real-time pipelines
Use cases: Crawl ingestion, dataset building, bulk monitoring
Trade-off: Throughput and reliability up; per-request latency up
Idempotency: Same job ID + same URL list → same result; safe to retry

When to batch

Reach for batch when three things are true: (a) you have a large list of URLs, (b) you do not need the results in real time, and (c) you would rather the service handle retries and concurrency than write that code yourself. Building your own batch processor that does the job well is months of effort — controlling how many requests run at once, retrying failures, parking the URLs that never succeed, avoiding duplicate work, and tracking progress. A managed batch endpoint (a ready-made API that runs the job for you) spreads that work across all its customers, so you do not pay for it alone.
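To give a sense of what "writing that code yourself" involves, here is a minimal sketch of a do-it-yourself batch runner. It is an illustration under stated assumptions, not how any particular service works: the function name `run_batch`, its parameters, and the injected `fetch` callable are all hypothetical, and a production version would also need persistent progress tracking and smarter error classification.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch(urls, fetch, max_workers=10, max_attempts=3, backoff_s=1.0):
    """Process urls with bounded concurrency, retries, and a dead-letter list.

    fetch is any callable that takes a URL and returns a result or raises.
    All names and defaults here are illustrative assumptions.
    """
    unique_urls = list(dict.fromkeys(urls))  # avoid duplicate work, keep order

    def attempt(url):
        last_error = None
        for n in range(max_attempts):
            try:
                return url, fetch(url), None
            except Exception as exc:  # a real runner would separate permanent from temporary errors
                last_error = exc
                if n < max_attempts - 1:
                    time.sleep(backoff_s * 2 ** n)  # exponential backoff between tries
        return url, None, last_error

    results, dead_letter = {}, {}
    # max_workers caps how many requests run at once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, value, error in pool.map(attempt, unique_urls):
            if error is None:
                results[url] = value
            else:
                dead_letter[url] = repr(error)  # parked for later inspection
    return results, dead_letter
```

Even this short version leaves out progress reporting, resuming after a crash, and per-site rate limits, which is where most of the real effort goes.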

How to size batches

Most batch APIs let you submit up to about 1 million URLs in a single job, but the smart size is smaller: 1,000–10,000 URLs per batch. Smaller batches pay off in three ways. You get faster feedback: a broken configuration shows up in minutes instead of hours. You can run several batches at the same time under different job IDs to balance the load. And if one batch goes wrong, you only re-run that batch, not the whole crawl. At that size, a 1-million-URL crawl splits into 100 to 1,000 batches.
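Splitting a URL list into fixed-size batches takes only a few lines. The sketch below assumes a batch size of 5,000, which sits inside the range suggested above; the helper name `split_into_batches` is hypothetical.

```python
def split_into_batches(urls, batch_size=5_000):
    """Return consecutive slices of urls, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Example: 12,500 URLs become two full batches and one partial batch.
urls = [f"https://example.com/p/{i}" for i in range(12_500)]
batches = split_into_batches(urls)
print(len(batches))      # 3
print(len(batches[-1]))  # 2500
```

Each slice can then be submitted under its own job ID, so a failed batch can be re-run without touching the others.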

Synchronous fallback

Sometimes a job is batch-sized overall but a few results need to come back right away — for example, a content-monitoring pipeline that pulls 99% of its data from a nightly batch but needs to check a breaking-news URL the moment it appears. Most scraping APIs offer both endpoints: a batch one and a per-request (synchronous) one that replies immediately. Send the urgent work to the sync endpoint and everything else to batch. Just do not call the sync endpoint in a tight loop hoping to match batch throughput — you will get rate-limited (temporarily blocked for sending too many requests too fast).
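One way to picture the split is a small router that sends a capped number of urgent URLs to the sync endpoint and everything else to batch. This is a sketch of one possible policy, not a prescribed one: the `route` function, the `is_urgent` predicate, and the `max_sync` cap are all assumptions made for illustration.

```python
def route(urls, is_urgent, max_sync=20):
    """Split urls into (sync_now, batch_later).

    is_urgent is a caller-supplied predicate. max_sync caps how many URLs
    go to the sync endpoint so it is never used as a substitute for batch.
    """
    sync_now, batch_later = [], []
    for url in urls:
        if is_urgent(url) and len(sync_now) < max_sync:
            sync_now.append(url)
        else:
            batch_later.append(url)  # includes urgent overflow beyond the cap
    return sync_now, batch_later


incoming = [
    "https://example.com/breaking/1",
    "https://example.com/archive/1",
    "https://example.com/archive/2",
]
sync_now, batch_later = route(incoming, lambda u: "/breaking/" in u)
```

Here `sync_now` holds the single breaking-news URL and `batch_later` holds the two archive URLs for the nightly job.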

Code example

python
import requests
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = 'https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY'
urls = ['https://example.com/p/1', 'https://example.com/p/2']

def fetch(url):
    """Fetch one URL through the API and return the parsed JSON reply."""
    return requests.post(ENDPOINT, json={
        'cmd': 'request.get',
        'url': url,
        'markdown': True
    }, timeout=120).json()  # timeout value is an assumption; requests has no default

# Client-side parallelism: up to 10 requests in flight, one per worker thread.
# pool.map returns results in the same order as urls.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))

Frequently asked questions

How is batch different from running async requests in parallel?

Batch APIs handle concurrency, retries, dead-letter queues (a holding spot for URLs that keep failing), and idempotency (not processing the same URL twice) for you. With plain parallel async, all of that is your code's job. For under 10k URLs, doing it yourself is fine; above that, batch is dramatically less work.

How long does a batch job take?

Figure a few seconds of real work per URL, divided by how many the API runs at once (its parallelism). A 10k-URL batch on a typical API finishes in 10–30 minutes. Sites with tough anti-bot defenses take longer, since each request needs more effort to get through.
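The estimate above is simple arithmetic: total work divided by parallelism. The sketch below assumes 3 seconds per URL and a parallelism of 20 to 50; both are illustrative numbers chosen to land in the 10–30 minute range, not measurements of any specific API.

```python
def estimate_minutes(url_count, seconds_per_url, parallelism):
    """Rough batch duration: total seconds of work spread over parallel workers."""
    return url_count * seconds_per_url / parallelism / 60


print(estimate_minutes(10_000, 3, 50))  # 10.0
print(estimate_minutes(10_000, 3, 20))  # 25.0
```

Retries and anti-bot challenges raise the effective seconds per URL, so treat the result as a lower bound.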

What happens if the batch contains bad URLs?

Good batch APIs skip URLs they cannot reach (a 404 "not found", or a DNS failure where the domain name will not resolve), retry temporary glitches, and give you a per-URL status in the results. That way you can re-queue just the specific URLs that failed instead of re-running the whole batch.
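Re-queuing only the failures might look like the sketch below. The result shape (a list of dicts with `url` and `status` keys) and the status labels are hypothetical; check your API's documentation for the fields it really returns.

```python
# Assumed status labels for failures worth another try (hypothetical names).
RETRYABLE = {"timeout", "rate_limited", "server_error"}


def urls_to_requeue(results, retryable=RETRYABLE):
    """Return the URLs whose status suggests a retry could succeed."""
    return [r["url"] for r in results if r["status"] in retryable]


results = [
    {"url": "https://example.com/p/1", "status": "ok"},
    {"url": "https://example.com/p/2", "status": "timeout"},
    {"url": "https://example.com/gone", "status": "not_found"},
]
retry_list = urls_to_requeue(results)
```

In this example `retry_list` contains only the timed-out URL; the `not_found` URL is left out because retrying a permanent failure wastes requests.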

Last updated: 2026-05-31