What Is Batch Web Scraping?

Batch web scraping submits a large list of URLs as a single job that is processed asynchronously; you retrieve the results when they are ready instead of issuing each request synchronously from the caller. Batch APIs handle concurrency, retries, dead-letter queueing, and idempotency for you, in exchange for higher latency on individual results. They are the right pattern when you have thousands or millions of URLs and do not need any single result in real time.

Quick facts

Job size: 100 to 1,000,000+ URLs per batch
Latency: Minutes to hours; not for real-time pipelines
Use cases: Crawl ingestion, dataset building, bulk monitoring
Tradeoff: Throughput and reliability up; per-request latency up
Idempotency: Same job ID + same URL list → same result; safe to retry

When to batch

Batch is the right pattern when (a) you have a large URL list, (b) you do not need results in real time, and (c) you want the API to handle retries and concurrency for you. Building a custom batch processor in-house takes months of work to do well: it needs concurrency control, retry logic, dead-letter handling, idempotency, and progress reporting. A managed batch endpoint amortizes that cost across all customers.
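To give a sense of what an in-house processor has to cover, here is a minimal sketch of only two of those pieces, retry logic and dead-letter handling. The `TransientError` class, the `process_batch` function, and the injected `fetch` callable are hypothetical names chosen for illustration, not part of any scraping API, and a production version would also need concurrency control and progress reporting.

```python
import time
from typing import Callable, Dict, List, Tuple


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


def process_batch(
    urls: List[str],
    fetch: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch every URL once, retrying transient failures with backoff.

    Returns (results, dead_letter): results maps URL -> body, and
    dead_letter lists URLs that still failed after max_attempts.
    """
    results: Dict[str, str] = {}
    dead_letter: List[str] = []
    # dict.fromkeys drops duplicate URLs while preserving order.
    for url in dict.fromkeys(urls):
        for attempt in range(1, max_attempts + 1):
            try:
                results[url] = fetch(url)
                break
            except TransientError:
                if attempt == max_attempts:
                    dead_letter.append(url)
                else:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return results, dead_letter
```

Even this stripped-down version has to decide what counts as retryable, how long to back off, and where exhausted URLs go, which is the work a managed endpoint takes off your hands.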

How to size batches

Most batch APIs accept up to ~1M URLs per job, but the right size is smaller: 1,000-10,000 URLs per batch. Smaller batches give you faster feedback (failed configurations surface in minutes, not hours), let you balance load by running batches in parallel under separate job IDs, and limit recovery from a bad batch to that batch alone rather than the whole crawl. Split a 1M-URL crawl into 100-200 batches of 5,000-10,000 URLs each.
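Splitting a URL list into fixed-size batches can be sketched in a few lines. The function name `split_into_batches` and the default of 5,000 URLs per batch are illustrative choices within the range suggested above, not values required by any API.

```python
from typing import Iterator, List, Sequence


def split_into_batches(
    urls: Sequence[str], batch_size: int = 5_000
) -> Iterator[List[str]]:
    """Yield consecutive slices of urls, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])
```

At 5,000 URLs per batch, a 1M-URL crawl yields 200 batches, each of which can be submitted as its own job and retried independently.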

Synchronous fallback

Sometimes you have a batch-sized job but need a few results in real time. For example, a content monitoring pipeline might get 99% of its data from a nightly batch but need to check a breaking-news URL immediately. Most scraping APIs offer both batch and per-request endpoints; route the real-time work to the sync endpoint and everything else to batch. Do not run a sync request in a tight loop expecting batch performance, because you will get rate-limited.
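One way to express that routing rule is a small function that sends only flagged URLs to the sync path and caps how many it will accept. The name `route_urls`, the `urgent` set, and the `max_sync` cap are assumptions made for this sketch; how you flag urgent URLs and what cap is safe depend on your pipeline and your provider's rate limits.

```python
from typing import Iterable, List, Set, Tuple


def route_urls(
    urls: Iterable[str],
    urgent: Set[str],
    max_sync: int = 10,
) -> Tuple[List[str], List[str]]:
    """Return (sync_urls, batch_urls).

    URLs flagged urgent go to the per-request endpoint, capped at
    max_sync so the sync path is never used as a tight loop. Everything
    else, including urgent overflow, goes to the batch job.
    """
    sync_urls: List[str] = []
    batch_urls: List[str] = []
    for url in urls:
        if url in urgent and len(sync_urls) < max_sync:
            sync_urls.append(url)
        else:
            batch_urls.append(url)
    return sync_urls, batch_urls
```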

Code example

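python
import requests

API_KEY = 'YOUR_API_KEY'
BASE_URL = 'https://publisher.scrappey.com/api/v1/batch'
headers = {'Authorization': API_KEY}

# Submit the URL list as one asynchronous job.
response = requests.post(BASE_URL, json={
    'urls': ['https://example.com/p/1', 'https://example.com/p/2'],
    'config': {'render_js': True, 'output': 'markdown'}
}, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly on a rejected submission
job = response.json()

# Fetch the job by ID. The job runs asynchronously, so repeat this
# request until the response reports the job as finished.
response = requests.get(f'{BASE_URL}/{job["id"]}', headers=headers, timeout=30)
response.raise_for_status()
results = response.json()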
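Because a batch job takes minutes to hours, the retrieval request normally has to be repeated until the job completes. The sketch below shows one generic way to poll with exponential backoff. It makes no assumption about the response format: `get_job` and `is_done` are hypothetical callables you would supply, for example a function that performs the GET request and a predicate that inspects whatever completion field your API returns.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    get_job: Callable[[], Dict[str, Any]],
    is_done: Callable[[Dict[str, Any]], bool],
    timeout: float = 3600.0,
    initial_delay: float = 5.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_job() until is_done(job) is true or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        job = get_job()
        if is_done(job):
            return job
        # Give up rather than sleep past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError("batch job did not finish before the timeout")
        sleep(delay)
        # Double the wait each round, capped at max_delay.
        delay = min(delay * 2, max_delay)
```

Backing off keeps the polling itself from becoming a source of rate limiting on long-running jobs.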

Frequently asked questions

How is batch different from running async requests in parallel?

Batch APIs handle concurrency, retries, dead-letter queues, and idempotency for you. Parallel async puts all that on your code. For under 10k URLs, async is fine; above that, batch is dramatically less work.
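For comparison, this is roughly what the do-it-yourself parallel approach looks like with `asyncio`. It is a sketch under stated assumptions: `fetch_all` is an illustrative name, `fetch` is a coroutine you would supply (for instance a wrapper around an async HTTP client), and the code only bounds concurrency and captures errors. Retries, dead-lettering, and idempotency would still be yours to add.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple, Union


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 20,
) -> Dict[str, Union[str, Exception]]:
    """Fetch urls concurrently, never running more than max_concurrency at once.

    Returns a dict mapping each URL to its body, or to the exception
    raised while fetching it.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> Tuple[str, Union[str, Exception]]:
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # record the failure, keep going
                return url, exc

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

You would run it with `asyncio.run(fetch_all(urls, fetch))`. This is manageable for a few thousand URLs, but every reliability feature beyond the semaphore is additional code to write and operate.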

How long does a batch job take?

Estimate a few seconds of effective wall time per URL, multiplied by the number of URLs and divided by the API's parallelism. A 10k-URL batch on a typical API completes in 10-30 minutes. Hard anti-bot targets are slower.
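That rule of thumb is simple arithmetic, sketched below. The 3 seconds per URL and parallelism of 25 used in the usage note are assumed values picked for illustration, not figures published by any provider.

```python
def estimate_batch_minutes(
    url_count: int, seconds_per_url: float, parallelism: int
) -> float:
    """Estimate wall-clock minutes: total fetch time spread over parallel workers."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return url_count * seconds_per_url / parallelism / 60
```

With those assumed values, `estimate_batch_minutes(10_000, 3.0, 25)` gives 20 minutes, which sits inside the 10-30 minute range quoted above.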

What happens if the batch contains bad URLs?

Good batch APIs do not retry permanently unreachable URLs (404, DNS failure), do retry transient failures, and report a per-URL status in the result so you can re-queue specific URLs without re-running the whole batch.
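Given per-URL statuses, picking out what to re-queue is a simple filter. The result shape (a list of dicts with `url` and `status` keys) and the status labels below are hypothetical; a real API defines its own fields and values, so treat this as a sketch of the idea rather than a parser for any specific response.

```python
from typing import Dict, Iterable, List

# Hypothetical status labels; substitute the values your API reports.
RETRYABLE = {"timeout", "rate_limited", "server_error"}
PERMANENT = {"not_found", "dns_failure"}


def partition_results(results: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group URLs into succeeded, requeue (transient), and dropped (permanent)."""
    groups: Dict[str, List[str]] = {"succeeded": [], "requeue": [], "dropped": []}
    for item in results:
        status = item["status"]
        if status in RETRYABLE:
            groups["requeue"].append(item["url"])
        elif status in PERMANENT:
            groups["dropped"].append(item["url"])
        else:
            groups["succeeded"].append(item["url"])
    return groups
```

The `requeue` list can be submitted as a new, much smaller batch, while `dropped` URLs are worth logging so that dead links are removed from future crawls.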

Last updated: 2026-05-26