The comparison table
| Tool | Lang | Role | Strength |
|---|---|---|---|
| HTTP / TLS impersonation | |||
| curl_cffi | Python | HTTP client with Chrome TLS | Default for most scraping today; wraps a forked curl |
| tls-client | Go / Python wrapper | JA3/JA4 fingerprint matching | Used inside Python via Go shim; flexible profile config |
| utls / azuretls | Go | Low-level Chrome TLS | Tracks Chrome master closer than anything else; sidecar-of-choice |
| cycle-tls | Node.js | Browser TLS in JS | Bundles Go under the hood; only solid Node option |
| noble-tls | Python | Pure-Python JA3/JA4 | No native deps — easier deploy, slightly behind on profile freshness |
| hrequests | Python | requests-compatible stealth client | Drop-in for legacy requests-based code |
| Scrapling | Python | High-level scraping client | Built-in Turnstile solve, auto-retry, content fingerprinting |
| webclaw | Rust | MCP-native scraping | 10 MCP tools, sub-second cold start, AI-extraction first |
| Browser automation | |||
| Playwright | Python / Node / .NET / Java | CDP-based browser driver | Multi-language, auto-wait, parallel contexts; default browser tool |
| Puppeteer | Node.js | CDP browser driver (Chrome-first; Firefox via WebDriver BiDi) | Google's original; smaller surface, mature ecosystem |
| Selenium | Python / Java / many | Legacy WebDriver browser driver | Widest browser support; oldest detection surface (navigator.webdriver) |
| SeleniumBase UC | Python | Selenium + undetected-chromedriver | Quick on/off CDP stealth, pytest integration |
| undetected-chromedriver | Python | Patched Chrome driver | Patches CDP fingerprint at runtime; handles simple checks |
| nodriver | Python | Raw CDP async, no WebDriver | Asyncio-native; no WebDriver fingerprint at all |
| pydoll | Python | Pure-Python CDP | No native deps; lightweight CDP wrapper |
| Camoufox | Python | Stealth Firefox fork (Juggler protocol) | No CDP leaks; passes most Cloudflare deployments by default |
| CloakBrowser | Python / Node | Patched Chromium with C++ stealth | 49 documented C++ patches; high reCAPTCHA v3 scores |
| PatchRight | Python | Playwright source-patching | Patches Playwright source so toString() inspection passes; holds up against Kasada |
| Botasaurus | Python | High-level scraping framework | Gaussian mouse curves, profile management, deployable as API |
| Botright | Python | CAPTCHA-focused browser automation | Built-in solvers for hCaptcha, FunCaptcha, GeeTest |
| Frameworks & orchestration | |||
| Scrapy | Python | Crawler framework + pipelines | Industry default for large crawls; built-in queue, retries, deduplication |
| Crawlee | Node / Python | Apify's unified scraping framework | Switches between HTTP, Cheerio, Playwright behind one API |
| Colly | Go | Go crawler framework | Fastest framework option; ideal for pure-HTTP heavy-volume jobs |
| Katana | Go | Security-oriented crawler | Recon tool that doubles as a crawler; headless-mode flag |
| Scrapyd / scrapy-redis | Python | Scrapy deployment & distribution | Daemon (Scrapyd) and Redis-backed queue (scrapy-redis) for scaling Scrapy |
| AI / LLM scraping | |||
| Firecrawl | Hosted + open-source | Managed AI-scraping API | Markdown output, MCP server, FIRE-1 extraction agent |
| Crawl4AI | Python | Self-hosted LLM scraping | MIT licensed; Ollama-compatible local extraction |
| ScrapeGraphAI | Python | NL-to-graph scraping pipelines | Self-healing extraction when target schema drifts |
| Jina Reader | Hosted API | One-endpoint AI scrape | r.jina.ai/{url} — simplest possible interface, generous free tier |
| Browserbase | Hosted | Managed cloud browsers for agents | Stagehand integration, MCP server, persistent sessions for AI agents |
| Steel | Self-hosted | Open-source cloud browser | Self-hosted alternative to Browserbase; MCP server included |
| HTML / data parsing | |||
| BeautifulSoup4 | Python | Beginner-friendly HTML parser | Easiest API; slow on large documents |
| lxml | Python | Fast XML/HTML parser | C-backed; the engine behind Parsel and an optional parser backend for BeautifulSoup |
| selectolax | Python | Ultra-fast HTML parsing | C-based; 10–100× faster than BeautifulSoup, CSS selectors only |
| Parsel | Python | Scrapy's selector library | XPath + CSS, drop-in outside Scrapy too |
| chompjs | Python | JavaScript object literal parser | Extracts JS-embedded data without running a JS runtime |
| Reverse engineering | |||
| mitmproxy | Python | HTTPS intercepting proxy | CLI / web UI / scriptable; mobile API discovery default |
| HTTP Toolkit | App | GUI HTTPS interception | One-click iOS / Android device intercept, friendly UI |
| Frida | Multi-lang | Runtime instrumentation | Certificate-pinning handling, function hooking on mobile |
| Burp Suite | App / Pro | Commercial intercepting proxy + MCP | PortSwigger's pen-test workbench; MCP server for AI-driven recon |
| CAPTCHA solving | |||
| CapSolver | API | AI-powered solver | Sub-10s solves on most CAPTCHA types; Turnstile, reCAPTCHA, hCaptcha |
| 2Captcha | API | Human + AI hybrid | Oldest service in the category; falls back to humans on novel CAPTCHAs |
| Anti-Captcha | API | Human + AI hybrid | Similar to 2Captcha; some teams prefer for hCaptcha accuracy |
| Managed scraping APIs | |||
| Scrappey | API | Full-stack managed scraping | Handles authorized verification workflows, residential proxies, and rendering in one call |
| Bright Data | API + proxy | Largest proxy network + scraping APIs | 100M+ residential IPs; covers F5 Shape targets others can't |
| Oxylabs | API + proxy | Enterprise scraping APIs | SERP, e-commerce, real-estate verticals + OxyCopilot AI assistant |
| Zyte | API | Smart Proxy Manager + Scrapy Cloud | Built by the Scrapy team; deepest Scrapy integration |
| Apify | Platform | Pre-built scraper marketplace | 10k+ ready-made Actors; built-in scheduling and storage |
| ScrapingBee | API | Simple managed scraping | Generous free tier; easy onboarding for one-off jobs |
| Decodo (Smartproxy) | API + proxy | Mid-market proxy + scraping | Renamed from Smartproxy in 2025; balanced price/performance |
How to read this table
Focus on the role groupings, not the individual tool names. Most scraping failures come from picking the wrong role — reaching for a browser automation tool when a plain HTTP client would have worked, or paying for a managed API when a 30-line script would have done the job. The safe approach is to work top-down and stop at the first role that works:
- Try HTTP/TLS first. If curl_cffi impersonating Chrome gets you the page, stop there. Every role after it costs more compute or money.
- Move on to browser automation only when the page needs JavaScript to run. Most product pages, search results, and API endpoints don't. Infinite scroll, OAuth login flows, and single-page apps (sites that build their content in the browser) do.
- Add a framework once the crawl grows past ~1000 URLs. Below that, a script is fine. Above it, Scrapy or Crawlee earn their keep by handling retries and data pipelines for you.
- Reach for AI scraping when the data layout is fuzzy or keeps changing. Firecrawl, Crawl4AI, and ScrapeGraphAI let an LLM (large language model) pull out the fields, so you stop hand-maintaining a parser for every site.
- Use managed APIs for the hard, low-volume cases. When the protection is Akamai, F5 Shape, or Cloudflare Bot Management Enterprise and your volume doesn't justify running your own infrastructure, a managed API costs less than the engineering time to handle it yourself.
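The "stop at the first role that works" rule can be sketched as a small escalation loop. This is only an illustration under stated assumptions: the `Role` class, `first_working_role`, and the three stub fetchers are hypothetical names invented here, and the stubs return canned strings so the sketch runs offline. In a real script the first fetcher might wrap curl_cffi and the second a Playwright page load. It covers only the HTTP, browser, and managed-API steps, not the framework or AI-extraction decisions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Role:
    """One rung of the escalation ladder (hypothetical helper type)."""
    name: str                               # role label, e.g. "http_tls"
    fetch: Callable[[str], Optional[str]]   # returns HTML, or None on failure


def first_working_role(url: str, roles: Sequence[Role]) -> Tuple[str, str]:
    """Try roles in the order given (cheapest first); stop at the first success."""
    for role in roles:
        try:
            html = role.fetch(url)
        except Exception:
            # For this sketch, any error just means "this role did not work".
            html = None
        if html:
            return role.name, html
    raise RuntimeError(f"no role could fetch {url}")


# --- Hypothetical stand-ins so the example runs without network access ---

def http_tls_fetch(url: str) -> Optional[str]:
    # Pretend the target is a single-page app: plain HTTP yields no content.
    return None


def browser_fetch(url: str) -> Optional[str]:
    # Pretend a real browser rendered the JavaScript-built page.
    return "<html><body>rendered</body></html>"


def managed_api_fetch(url: str) -> Optional[str]:
    # Most expensive fallback; never reached if an earlier role succeeds.
    return "<html><body>from managed API</body></html>"


LADDER = [
    Role("http_tls", http_tls_fetch),
    Role("browser", browser_fetch),
    Role("managed_api", managed_api_fetch),
]

if __name__ == "__main__":
    name, _html = first_working_role("https://example.com/app", LADDER)
    print(f"cheapest working role: {name}")
```

Ordering the ladder by cost is the whole design: the loop never evaluates a more expensive role once a cheaper one has returned content.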
What is and isn't in this list
This list sticks to tools that are actively maintained and used in production today. A few categories exist but were left out on purpose:
- Legacy HTTP clients (plain requests, aiohttp): fine for unprotected sites but beaten by any modern anti-bot system, so they're folded into the curl_cffi entry rather than listed on their own.
- Browser fingerprint databases (Multilogin, GoLogin): handy related tooling, but they aren't scraping tools themselves; they're covered in the browser-fingerprint entries.
- Proxy aggregators (SwiftShadow, Scrapoxy): covered in the proxies category instead of the tools category.
- Generic JavaScript runtimes (Node, Bun): not scraping tools, but every Node-based tool above runs on one.
If a tool you rely on is missing, it's most likely because it doesn't add anything new to the comparison — it usually fills the same niche as a listed one, and you'll learn it fastest through the listed equivalent.
