Web Scraping APIs Glossary

Core concepts behind modern web scraping APIs — what they do, how they handle hard sites, and where they fit in a data pipeline.

What Is a CAPTCHA Solver?

A CAPTCHA solver is software that automatically completes CAPTCHA challenges on behalf of an automated client.

What Is Web Scraping?

Web scraping is the automated extraction of structured data from websites.

What Is a Web Scraping API?

A web scraping API is a managed HTTP service that fetches a target URL on your behalf and returns the rendered HTML, JSON, or parsed data.

What Is a Headless Browser?

A headless browser is a real web browser — Chrome, Firefox, or WebKit — that runs without a visible graphical interface, controlled entirely through code.

What Is Browser Fingerprinting?

Browser fingerprinting is a technique that identifies and tracks a visitor by combining dozens of small, observable characteristics of their browser and device into a single distin...

What Is curl_cffi?

curl_cffi is a Python HTTP client that produces TLS fingerprints identical to real Chrome, Firefox, or Safari.

What Is Camoufox?

Camoufox is a stealth-focused fork of Firefox with anti-fingerprinting patches applied at the C++ build level.

What Is AI Web Scraping?

AI web scraping is an approach that replaces CSS selectors with natural-language prompts, LLM-based extraction, and Markdown-first output.

What Is Mobile API Scraping?

Mobile API scraping is the technique of intercepting traffic between a vendor's mobile app and its backend, then replicating those API calls directly from Python or another HTTP client.

What Is CloakBrowser?

CloakBrowser is a stealth Chromium build with 49 C++ binary patches.

What Is PatchRight?

PatchRight is a stealth library that patches the Playwright Python source itself before Chrome starts, rather than injecting JavaScript at runtime.

What Is Firecrawl?

Firecrawl is an AI-native scraping API that takes a URL and returns clean Markdown or JSON — no CSS selectors, no XPath, no page parsing.

What Is Schema-Validated LLM Extraction?

Schema-validated LLM extraction is the production pattern for AI scraping: define a Pydantic schema for what you want, pass it to the LLM via the Instructor library, and get back a ...

What Is Botasaurus?

Botasaurus is an MIT-licensed Python scraping framework with three top-level decorators (@browser, @request, @task) and built-in Bezier-curve mouse movements designed to pass beh...

What Is Crawl4AI?

Crawl4AI is the most-starred open-source LLM-friendly web crawler on GitHub — 66.3k stars under Apache 2.0 license, maintained by UncleCode.

What Is Burp Suite MCP for Scraping Recon?

The Burp Suite MCP Server is an official PortSwigger extension (released 3 April 2025) that exposes Burp's HTTP history, Repeater, Intruder, Collaborator, and proxy controls as Mod...

What Is the Web Scraping Decision Flow?

The web scraping decision flow is a six-step priority order experienced practitioners follow on any new target.

What Is a Computer Use Agent?

A Computer Use Agent (CUA) is an AI agent that logs into a portal as the user, navigates the UI, handles MFA and CAPTCHAs, and returns structured data.

What Is a Self-Healing Scraper?

A self-healing scraper detects mid-run that its selectors stopped working, sends the broken page HTML to an LLM (typically Claude Haiku or GPT-4o-mini for cost), and writes correct...

Best Web Scraping API for JavaScript-Rendered Sites

The best web scraping API for JavaScript-rendered sites runs a real headless browser per request, executes the page's JavaScript, waits for dynamic content to load, and returns the ...

Best Web Scraping API for Price Scraping & E-commerce Price Monitoring

The best web scraping API for e-commerce price monitoring delivers reliable, geo-targeted product data across major retailers (Amazon, Walmart, eBay, Target, Shopify stores) at the ...

Best Web Scraping API for SEO Audits

The best web scraping API for SEO audits combines reliable SERP scraping (Google, Bing, regional engines) with on-page extraction: title, meta, headings, schema, internal links, r...

Best Web Scraping API for LLM Training Data

The best web scraping API for LLM training data delivers clean, deduplicated, license-aware text at the scale training pipelines need — boilerplate stripped, main content extracted.

Best Web Scraping API for Competitor Research

The best web scraping API for competitor research covers the full surface a strategy team needs to monitor: pricing pages, product detail, content marketing, ad copy, review platf...

How to Get All Links From a Webpage

Getting all links from a webpage means fetching the page, parsing every <a href> attribute, resolving relative URLs against the base, normalizing for fragments and query order, and ...
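
The parse, resolve, and normalize steps named above can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: it assumes the HTML has already been fetched, and the helper names (`LinkCollector`, `normalize`, `extract_links`), the sample markup, and the choice to de-duplicate and to keep only http(s) links are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urldefrag, urlencode, urljoin, urlsplit, urlunsplit


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(url):
    """Drop the fragment and sort query parameters so equivalent URLs compare equal."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def extract_links(html, base_url):
    """Return unique, absolute, normalized http(s) links in first-seen order."""
    collector = LinkCollector()
    collector.feed(html)
    seen = {}  # dict preserves insertion order and gives cheap de-duplication
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative URLs against the base
        if urlsplit(absolute).scheme in ("http", "https"):  # skip mailto:, javascript:, etc.
            seen.setdefault(normalize(absolute), None)
    return list(seen)


# Hypothetical sample page: two spellings of the same link plus a mailto: link.
sample = (
    '<a href="/a?b=2&a=1#top">A</a>'
    '<a href="/a?a=1&b=2">same page</a>'
    '<a href="mailto:x@example.com">mail</a>'
)
links = extract_links(sample, "https://example.com/index.html")
print(links)
```

Sorting query parameters is a judgment call: it collapses URLs that differ only in parameter order, which is usually wanted for crawling but can be wrong for the rare site where order matters.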

How to Scrape Infinite-Scroll Pages

Scraping infinite-scroll pages means programmatically triggering the scroll events that load new content, waiting for that content to render, collecting it, and detecting when the ...
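
The scroll, collect, and stop-detection loop can be sketched independently of any browser library. In the hedged example below, `scroll_once` and `read_items` are hypothetical callables that a real scraper would back with a headless browser (including the wait for rendering); `FakeFeed` is an in-memory stand-in used only so the loop can run here, and the stopping rule (no new items for `patience` rounds) is one common heuristic rather than the only option.

```python
def scrape_infinite_scroll(scroll_once, read_items, max_rounds=50, patience=2):
    """Scroll until no new items appear for `patience` consecutive rounds."""
    collected = []
    seen = set()
    idle_rounds = 0
    for _ in range(max_rounds):
        found_new = False
        for item in read_items():
            if item not in seen:
                seen.add(item)
                collected.append(item)
                found_new = True
        if found_new:
            idle_rounds = 0
        else:
            idle_rounds += 1
            if idle_rounds >= patience:
                break  # assume the feed is exhausted
        # In a real browser this would scroll and then wait for new content to render.
        scroll_once()
    return collected


class FakeFeed:
    """Hypothetical stand-in for a page that reveals `page_size` more items per scroll."""

    def __init__(self, total, page_size):
        self.total = total
        self.page_size = page_size
        self.visible = min(page_size, total)

    def scroll(self):
        self.visible = min(self.visible + self.page_size, self.total)

    def items(self):
        return [f"item-{n}" for n in range(self.visible)]


feed = FakeFeed(total=7, page_size=3)
items = scrape_infinite_scroll(feed.scroll, feed.items)
print(len(items), "items collected")
```

The `max_rounds` cap matters in practice: some feeds never end, so the loop needs a hard limit in addition to the idle check.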

How to Reverse-Engineer API Requests for Scraping

Reverse-engineering API requests for scraping means inspecting the network traffic a website generates, identifying the JSON endpoints behind the rendered UI, and calling those end...

Synchronous vs Asynchronous Web Scraping

Synchronous web scraping makes one request at a time and blocks until each completes; asynchronous scraping issues many concurrent requests using an event loop or worker pool.
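
The difference is easiest to see side by side. The sketch below uses `asyncio` with a simulated fetch (`fake_fetch` sleeps instead of touching the network, so the URLs and the 50 ms delay are placeholders). The one-at-a-time version models synchronous scraping by awaiting each request before starting the next; the concurrent version starts all requests at once and uses a semaphore as an assumed concurrency limit.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    """Stand-in for an HTTP request: waits `delay` seconds, then returns a body."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_sequential(urls):
    """One request at a time: each fetch finishes before the next one starts."""
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrent(urls, limit=5):
    """Many requests in flight at once, capped by a semaphore."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(bounded(url) for url in urls))


def timed(coroutine_fn, urls):
    """Run a coroutine function to completion and report elapsed wall-clock time."""
    start = time.perf_counter()
    result = asyncio.run(coroutine_fn(urls))
    return result, time.perf_counter() - start


urls = [f"https://example.com/page/{n}" for n in range(5)]
seq_result, seq_seconds = timed(fetch_sequential, urls)
con_result, con_seconds = timed(fetch_concurrent, urls)
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

Both versions return the same bodies in the same order; only the elapsed time differs, because the concurrent run overlaps the waiting.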

What Is Batch Web Scraping?

Batch web scraping submits a large list of URLs as a single job to be processed asynchronously, then retrieves the results when ready, instead of issuing each request synchronously.
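
The submit, poll, and retrieve cycle looks roughly like the sketch below. Everything here is hypothetical: `FakeBatchService` is an in-memory stand-in that processes one URL per status check, and its method names (`submit`, `status`, `results`) do not correspond to any particular vendor's API. A real client would make HTTP calls and sleep with backoff between polls.

```python
import itertools


class FakeBatchService:
    """In-memory stand-in for a batch endpoint; processes one URL per status poll."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}

    def submit(self, urls):
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"pending": list(urls), "results": {}}
        return job_id

    def status(self, job_id):
        job = self._jobs[job_id]
        if job["pending"]:
            url = job["pending"].pop(0)
            job["results"][url] = f"<html>{url}</html>"  # pretend the page was fetched
        return "running" if job["pending"] else "done"

    def results(self, job_id):
        return dict(self._jobs[job_id]["results"])


def run_batch(service, urls, max_polls=100):
    """Submit one job for all URLs, poll until it finishes, then fetch the results."""
    job_id = service.submit(urls)
    for _ in range(max_polls):
        if service.status(job_id) == "done":
            return service.results(job_id)
        # A real client would wait here (for example, time.sleep with backoff).
    raise TimeoutError(f"{job_id} did not finish within {max_polls} polls")


urls = [f"https://example.com/p/{n}" for n in range(3)]
pages = run_batch(FakeBatchService(), urls)
print(sorted(pages))
```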

What Is Stateful Web Scraping?

Stateful web scraping preserves cookies, session tokens, browser fingerprint, and proxy IP across multiple requests so the target site sees a single coherent user across the session.

What Is the Chrome DevTools Protocol (CDP)?

The Chrome DevTools Protocol (CDP) is the low-level interface for instrumenting and controlling Chromium-based browsers.
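
On the wire, CDP is JSON exchanged over a WebSocket: a command carries an `id`, a `method`, and `params`; the reply echoes the same `id` with either a `result` or an `error`; events arrive with a `method` and no `id`. The sketch below only builds and classifies those messages and opens no connection. The helper names (`cdp_command`, `classify`) and the sample reply and event payloads are assumptions made for illustration.

```python
import itertools
import json

_next_id = itertools.count(1)


def cdp_command(method, **params):
    """Serialize a CDP command; each command gets a fresh id to match its reply."""
    return json.dumps({"id": next(_next_id), "method": method, "params": params})


def classify(raw_message):
    """Tell replies (which carry an id) apart from events (which do not)."""
    message = json.loads(raw_message)
    if "id" in message:
        kind = "error" if "error" in message else "result"
        return kind, message["id"]
    return "event", message.get("method")


# Two commands a scraper commonly sends: load a page, then evaluate JavaScript in it.
navigate = cdp_command("Page.navigate", url="https://example.com")
evaluate = cdp_command("Runtime.evaluate", expression="document.title", returnByValue=True)

# Illustrative incoming messages, shaped like a reply and an event.
reply = '{"id": 1, "result": {"frameId": "ABC"}}'
event = '{"method": "Page.loadEventFired", "params": {"timestamp": 1.0}}'
print(classify(reply), classify(event))
```

Libraries such as Puppeteer and Playwright hide this layer, but the same message shapes are what they send to Chromium underneath.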

What Is an MCP Server for Scraping?

An MCP server for scraping is a Model Context Protocol endpoint that exposes scraping tools (fetch, screenshot, parse, search) as callable functions an AI agent can invoke.

What Is the Scrapy + Go TLS Sidecar Architecture?

The Scrapy + Go TLS sidecar architecture is the most common production pattern for scraping Akamai- and Cloudflare-protected sites at scale.

Web Scraping Tools 2026 — A Comparison

The web-scraping toolbox in 2026 is large but well-stratified.

What Is Playwright?

Playwright is a cross-browser automation framework from Microsoft that drives Chromium, Firefox, and WebKit through a single API.

What Is Puppeteer?

Puppeteer is Google's Node.js library for controlling Chromium via the Chrome DevTools Protocol (CDP).

What Is Selenium?

Selenium is the original cross-browser automation framework; its WebDriver protocol, now a W3C standard, predates Puppeteer by roughly a decade.

What Is Scrapy?

Scrapy is the industry-default crawler framework for Python.

What Is mitmproxy?

mitmproxy is a Python-scriptable HTTPS intercepting proxy used for mobile API discovery, request replay, and reverse engineering authentication flows.

What Is SeleniumBase?

SeleniumBase is a Python automation and testing framework built on Selenium 4 whose UC Mode and CDP Mode make it one of the most effective Python tools for bypassing bot detection.

What Is XDriver?

XDriver is a Playwright stealth patcher that replaces Playwright's driver files in place with hardened versions, activated by a single command.

What Is Scrapling?

Scrapling is an all-in-one Python scraping framework that bundles fetching, parsing, anti-detection, and crawling behind one API; it is a layer above the other tools, not a compet...

What Is Obscura?

Obscura is an open-source headless browser engine written from scratch in Rust — not a fork or patch of Chrome or Firefox.

Anti-Detect Browser Tools Compared

Anti-detect browser tools defeat bot detection by spoofing the signals that distinguish automation from a real user, but they work at very different layers, and none is truly undetectable.

What Is jsoup?

jsoup is an open-source Java library for parsing and extracting data from HTML.

What Is Data Parsing?

Data parsing is the process of taking raw, unstructured or semi-structured data and converting it into a structured, usable format.
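
As a small illustration, the sketch below turns loosely formatted text lines into typed records and skips lines that do not fit. The input format, the `Product` fields, and the regular expression are all assumptions made for this example; real scraped data usually needs a more forgiving parser.

```python
import re
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float
    currency: str


# Matches lines such as "Widget A - $1,299.00" or "Gadget B | €15".
LINE = re.compile(
    r"^\s*(?P<name>.+?)\s*[-|]\s*(?P<currency>[$€£])\s*(?P<amount>[\d,]+(?:\.\d+)?)\s*$"
)


def parse_products(raw_text):
    """Convert semi-structured lines into Product records, ignoring unparseable lines."""
    products = []
    for line in raw_text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # not a product line; a stricter pipeline might log it instead
        amount = float(match["amount"].replace(",", ""))  # strip thousands separators
        products.append(Product(match["name"], amount, match["currency"]))
    return products


raw = "Widget A - $1,299.00\nGadget B | €15\nfree shipping on all orders"
products = parse_products(raw)
for product in products:
    print(product)
```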

What Is Web Scraping as a Service?

Web scraping as a service (WSaaS) is a managed, cloud-based offering that handles web data extraction for you through an API or dashboard, including the proxies, browsers, and ant...

What Is PyQuery?

PyQuery is a Python library for parsing and manipulating HTML and XML using a jQuery-like syntax.