The comparison table
| Tool | Lang | Role | Strength |
|---|---|---|---|
| HTTP / TLS impersonation | |||
| curl_cffi | Python | HTTP client with Chrome TLS | Default for most scraping today; wraps a forked curl |
| tls-client | Go / Python wrapper | JA3/JA4 spoofing | Used inside Python via Go shim; flexible profile config |
| utls / azuretls | Go | Low-level Chrome TLS | Tracks Chrome master closer than anything else; sidecar-of-choice |
| cycle-tls | Node.js | Browser TLS in JS | Bundles Go under the hood; only solid Node option |
| noble-tls | Python | Pure-Python JA3/JA4 | No native deps — easier deploy, slightly behind on profile freshness |
| hrequests | Python | requests-compatible stealth client | Drop-in for legacy requests-based code |
| Scrapling | Python | High-level scraping client | Built-in Turnstile solve, auto-retry, content fingerprinting |
| webclaw | Rust | MCP-native scraping | 10 MCP tools, sub-second cold start, AI-extraction first |
| Browser automation | |||
| Playwright | Python / Node / .NET / Java | CDP-based browser driver | Multi-language, auto-wait, parallel contexts; default browser tool |
| Puppeteer | Node.js | CDP browser driver (Chrome, plus Firefox via WebDriver BiDi) | Google's original; smaller surface, mature ecosystem |
| Selenium | Python / Java / many | Legacy WebDriver browser driver | Widest browser support; oldest detection surface (navigator.webdriver) |
| SeleniumBase UC | Python | Selenium + undetected-chromedriver | Quick on/off CDP stealth, pytest integration |
| undetected-chromedriver | Python | Patched Chrome driver | Patches CDP fingerprint at runtime; defeats simple checks |
| nodriver | Python | Raw CDP async, no WebDriver | Asyncio-native; no WebDriver fingerprint at all |
| pydoll | Python | Pure-Python CDP | No native deps; lightweight CDP wrapper |
| Camoufox | Python | Stealth Firefox fork (Juggler protocol) | No CDP leaks; passes most Cloudflare deployments by default |
| CloakBrowser | Python / Node | Patched Chromium with C++ stealth | 49 documented C++ patches; high reCAPTCHA v3 scores |
| PatchRight | Python | Playwright source-patching | Patches Playwright source so toString() inspection passes; defeats Kasada |
| Botasaurus | Python | High-level scraping framework | Gaussian mouse curves, profile management, deployable as API |
| Botright | Python | CAPTCHA-focused browser automation | Built-in solvers for hCaptcha, FunCaptcha, GeeTest |
| Frameworks & orchestration | |||
| Scrapy | Python | Crawler framework + pipelines | Industry default for large crawls; built-in queue, retries, deduplication |
| Crawlee | Node / Python | Apify's unified scraping framework | Switches between HTTP, Cheerio, Playwright behind one API |
| Colly | Go | Go crawler framework | Fastest framework option; ideal for pure-HTTP heavy-volume jobs |
| Katana | Go | Security-oriented crawler | Recon tool that doubles as a crawler; headless-mode flag |
| Scrapyd / scrapy-redis | Python | Scrapy deployment & distribution | Daemon (Scrapyd) and Redis-backed queue (scrapy-redis) for scaling Scrapy |
| AI / LLM scraping | |||
| Firecrawl | Hosted + open-source | Managed AI-scraping API | Markdown output, MCP server, FIRE-1 extraction agent |
| Crawl4AI | Python | Self-hosted LLM scraping | MIT licensed; Ollama-compatible local extraction |
| ScrapeGraphAI | Python | NL-to-graph scraping pipelines | Self-healing extraction when target schema drifts |
| Jina Reader | Hosted API | One-endpoint AI scrape | r.jina.ai/{url} — simplest possible interface, generous free tier |
| Browserbase | Hosted | Managed cloud browsers for agents | Stagehand integration, MCP server, persistent sessions for AI agents |
| Steel | Self-hosted | Open-source cloud browser | Self-hosted alternative to Browserbase; MCP server included |
| HTML / data parsing | |||
| BeautifulSoup4 | Python | Beginner-friendly HTML parser | Easiest API; slow on large documents |
| lxml | Python | Fast XML/HTML parser | C-backed; the engine behind Parsel and an optional parser backend for BeautifulSoup |
| selectolax | Python | Ultra-fast HTML parsing | C-based; 10–100× faster than BeautifulSoup, CSS selectors only |
| Parsel | Python | Scrapy's selector library | XPath + CSS, drop-in outside Scrapy too |
| chompjs | Python | JavaScript object literal parser | Extracts JS-embedded data without running a JS runtime |
| Reverse engineering | |||
| mitmproxy | Python | HTTPS intercepting proxy | CLI / web UI / scriptable; mobile API discovery default |
| HTTP Toolkit | App | GUI HTTPS interception | One-click iOS / Android device intercept, friendly UI |
| Frida | Multi-lang | Runtime instrumentation | Certificate-pinning bypass, function hooking on mobile |
| Burp Suite | App / Pro | Commercial intercepting proxy + MCP | PortSwigger's pen-test workbench; MCP server for AI-driven recon |
| CAPTCHA solving | |||
| CapSolver | API | AI-powered solver | Sub-10s solves on most CAPTCHA types; Turnstile, reCAPTCHA, hCaptcha |
| 2Captcha | API | Human + AI hybrid | Oldest service in the category; falls back to humans on novel CAPTCHAs |
| Anti-Captcha | API | Human + AI hybrid | Similar to 2Captcha; some teams prefer for hCaptcha accuracy |
| Managed scraping APIs | |||
| Scrappey | API | Full-stack managed scraping | Anti-bot bypass, residential proxies, CAPTCHA solve in one call |
| Bright Data | API + proxy | Largest proxy network + scraping APIs | 100M+ residential IPs; covers F5 Shape targets others can't |
| Oxylabs | API + proxy | Enterprise scraping APIs | SERP, e-commerce, real-estate verticals + OxyCopilot AI assistant |
| Zyte | API | Smart Proxy Manager + Scrapy Cloud | Built by the Scrapy team; deepest Scrapy integration |
| Apify | Platform | Pre-built scraper marketplace | 10k+ ready-made Actors; built-in scheduling and storage |
| ScrapingBee | API | Simple managed scraping | Generous free tier; easy onboarding for one-off jobs |
| Decodo (Smartproxy) | API + proxy | Mid-market proxy + scraping | Renamed from Smartproxy in 2024; balanced price/performance |
How to read this table
The role groupings matter more than the tool names. Most scraping failures come from picking the wrong role: using a browser automation tool when an HTTP client would have worked, or running a managed API when a 30-line script would have done it. Work top-down:
- Try HTTP/TLS first. If `curl_cffi` with Chrome impersonation gets the page, stop. Everything below in the table costs more compute or money.
- Escalate to browser automation only if JS execution is required. Most product pages, search results, and API endpoints don't need it. Infinite scroll, OAuth flows, and client-rendered SPAs do.
- Add a framework when the crawl is bigger than ~1000 URLs. Below that, a script is fine. Above it, Scrapy or Crawlee pay for themselves in retries and pipelines.
- Reach for AI scraping when the schema is fuzzy or shifts. Firecrawl, Crawl4AI, and ScrapeGraphAI replace per-site parser maintenance with LLM extraction.
- Use managed APIs for the long tail. When the protection is Akamai, F5 Shape, or Bot Management Enterprise and the volume doesn't justify in-house infra, a managed API is cheaper than the engineering time.
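The top-down order above can be sketched as a small decision helper. This is a minimal illustration, not a definitive policy: the `Job` fields, the role labels, and the `pick_roles` and `fetch_http_tier` names are all hypothetical, and treating an HTTP 200 as "got the page" is a simplifying assumption, since some block pages also return 200. The `curl_cffi` import is done lazily so the decision logic runs without the third-party dependency installed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a scraping job; every field is an assumption."""
    url_count: int
    needs_js: bool = False                # infinite scroll, OAuth flow, client-rendered SPA
    fuzzy_schema: bool = False            # target layout is loose or shifts often
    enterprise_antibot: bool = False      # e.g. Akamai or F5 Shape in front of the site
    volume_justifies_infra: bool = False  # enough volume to build in-house tooling


def pick_roles(job: Job) -> list[str]:
    """Walk the table top-down and return the role groups the job needs."""
    # Long tail: heavy protection without the volume to justify in-house infra.
    if job.enterprise_antibot and not job.volume_justifies_infra:
        return ["managed-api"]

    # Start with the cheapest fetcher; escalate only when JS must execute.
    roles = ["browser-automation" if job.needs_js else "http-tls"]

    # A framework pays for itself in retries and pipelines above ~1000 URLs.
    if job.url_count > 1000:
        roles.append("framework")

    # LLM extraction replaces per-site parsers when the schema is fuzzy.
    if job.fuzzy_schema:
        roles.append("ai-extraction")
    return roles


def fetch_http_tier(url: str) -> str | None:
    """First rung: curl_cffi with Chrome impersonation. Returns HTML or None."""
    # Third-party dependency (pip install curl_cffi), imported lazily on purpose.
    from curl_cffi import requests

    resp = requests.get(url, impersonate="chrome", timeout=30)
    # Assumption: a 200 means the page was served; real checks may inspect the body.
    return resp.text if resp.status_code == 200 else None
```

For example, `pick_roles(Job(url_count=5000, needs_js=True))` returns the browser-automation and framework roles, while a 50-URL job with no special needs stays on the HTTP/TLS rung alone.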
What is and isn't in this list
The list focuses on tools actively maintained and used in production today. Some categories that exist but are excluded:
- Legacy HTTP clients (vanilla `requests`, `aiohttp`): fine for unprotected sites but defeated by any modern anti-bot; rolled into the curl_cffi entry rather than listed separately.
- Browser fingerprint databases (Multilogin, GoLogin): useful adjacent tooling but not scraping tools themselves; covered in browser-fingerprint entries.
- Proxy aggregators (SwiftShadow, Scrapoxy): covered in the proxies category rather than the tools category.
- Generic JavaScript runtimes (Node, Bun): not scraping tools, but every Node-based tool above runs on one.
If a tool you depend on is missing, the most likely reason is that it doesn't add to the comparison — most omitted tools occupy the same niche as a listed one and are best learned through the listed equivalent.
