What it gives you out of the box
- URL → Markdown. The default output is clean Markdown with structural formatting preserved. The library prunes navigation, ads, and boilerplate before conversion, so you ingest substance, not chrome.
- Adaptive crawling. A built-in heuristic decides when enough information has been gathered for a query, so the crawler stops itself instead of running the full BFS to depth-N. This is useful for RAG ingestion, where "good enough" beats exhaustive.
- Schema-based extraction. Define a CSS schema or pass a Pydantic class plus a prompt and an LLM provider; Crawl4AI handles the call. Works with OpenAI, Anthropic, Gemini, and any local model accessible via LiteLLM (Ollama, vLLM).
- Browser primitives. Built on Playwright, so you get JS execution, session management, custom JS injection, lazy-load handling, and proxy support. Full-page scanning simulates scrolling to load dynamic content.
- Anti-bot handling (0.8.x). Recent versions escalate to a proxy when a request trips bot detection, and flatten the Shadow DOM so that content inside web components, which many crawlers miss, becomes readable.
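As a rough illustration of the Markdown output and the CSS-schema path described above, here is a minimal sketch. The URL, the schema name, and the selectors are hypothetical, and the import paths and result fields (`arun`, `result.markdown`, `result.extracted_content`) reflect recent Crawl4AI releases as far as I know; check them against the version you have installed.

```python
import asyncio
import json

# Hypothetical target and selectors, chosen only for illustration.
URL = "https://example.com/blog"
ARTICLE_SCHEMA = {
    "name": "articles",
    "baseSelector": "article",  # one record per matching element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}


async def crawl(url: str, schema: dict) -> tuple[str, list]:
    """Fetch one page and return (markdown, extracted records)."""
    # Imported lazily so the schema can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

    # result.markdown is string-like; extracted_content is a JSON string or None.
    markdown = str(result.markdown)
    records = json.loads(result.extracted_content or "[]")
    return markdown, records
```

To try it, call `asyncio.run(crawl(URL, ARTICLE_SCHEMA))` against a page you are allowed to crawl. The CSS path involves no LLM call, so it is the cheaper option when the page layout is stable.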
Crawl4AI vs Firecrawl
The two are often compared head-to-head, but they solve overlapping problems differently. Crawl4AI is self-hosted by default: you run the Python library or Docker image, your data never leaves your infrastructure, you bring your own LLM (including local models via Ollama), and you pay nothing per scrape. Firecrawl is managed-cloud-first: you call an API, Firecrawl handles the browser fleet, the anti-bot bypass, and the FIRE-1 agent for hard sites, and you pay per scrape after a 500/month free tier.
Choose Crawl4AI for data sovereignty, cost sensitivity, or local-LLM pipelines. Choose Firecrawl for the fastest time to results when the target uses serious anti-bot protection and you do not want to maintain proxies.
Where Crawl4AI does not help
Crawl4AI is a crawler and extraction framework, not a complete anti-bot bypass. On heavily protected targets — Akamai sensor.js, F5 Shape, top-tier DataDome — Crawl4AI's Playwright-based browser layer hits the same fingerprinting walls as any other CDP-driven automation. The 0.8.x proxy escalation reduces but does not eliminate this. For those targets you either drop in CloakBrowser or PatchRight as the browser layer (Crawl4AI is pluggable), or you route those URLs to a managed API.
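The "route those URLs to a managed API" option can be as simple as a host-based split in front of the crawler. The sketch below is one way to do it, not something Crawl4AI ships: the host list, the backend labels, and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical hosts known to sit behind heavy bot protection; maintain your own list.
PROTECTED_HOSTS = {"shop.example.com", "tickets.example.org"}


def pick_backend(url: str, protected_hosts: set[str] = PROTECTED_HOSTS) -> str:
    """Return "managed_api" for protected hosts and their subdomains, else "self_hosted"."""
    host = (urlsplit(url).hostname or "").lower()
    for protected in protected_hosts:
        if host == protected or host.endswith("." + protected):
            return "managed_api"
    return "self_hosted"


def partition(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by the backend that should fetch them, preserving input order."""
    routed: dict[str, list[str]] = {"self_hosted": [], "managed_api": []}
    for url in urls:
        routed[pick_backend(url)].append(url)
    return routed
```

Feed the `self_hosted` bucket to Crawl4AI and the `managed_api` bucket to whichever service you pay for. Keeping the list explicit means you only pay per scrape for the targets that actually need it.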
