What it gives you out of the box
- URL → Markdown. Give it a URL and the default output is clean Markdown that keeps the page's structure. It first strips navigation, ads, and boilerplate, so you ingest the substance, not the page chrome around it.
- Adaptive crawling. A built-in rule decides when it has enough information to answer a query and stops, instead of mechanically visiting every link down to a fixed depth (a full breadth-first search, or BFS, to depth-N). Handy for RAG ingestion — loading content into an AI knowledge base — where "good enough" beats exhaustive.
- Schema-based extraction. You describe the data you want in one of two ways: a CSS schema, which needs no LLM, or a Pydantic class (a Python way to define a data shape) plus a prompt and an LLM provider, in which case Crawl4AI makes the LLM call for you. The LLM route works with OpenAI, Anthropic, Gemini, and any local model reachable via LiteLLM (Ollama, vLLM).
- Browser primitives. It is built on Playwright, the browser-automation engine, so you get JavaScript execution, session management, custom JS injection, lazy-load handling, and proxy support. Full-page scanning simulates scrolling to trigger content that only loads as you move down the page.
- Anti-bot detection (0.8.x). Recent versions add proxy escalation (switching to a fresh proxy when a site flags the crawler as a bot) plus Shadow DOM flattening, which reads components built with isolated DOM trees that a crawler reading only the regular DOM cannot see into.
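To make the adaptive-crawling idea concrete, here is a toy sketch that contrasts a fixed depth-N BFS with a crawl that stops once the collected text covers the query. It does not use Crawl4AI and is not its actual stopping rule: the in-memory `SITE`, the `coverage` score, and the `threshold` parameter are all assumptions made up for illustration.

```python
from collections import deque

# Hypothetical in-memory "site": path -> (page text, outgoing links).
SITE = {
    "/": ("welcome to the docs", ["/install", "/pricing", "/blog"]),
    "/install": ("install with pip and configure a proxy", ["/proxy"]),
    "/pricing": ("plans and billing", []),
    "/blog": ("release notes and news", ["/blog/old"]),
    "/proxy": ("proxy rotation settings", []),
    "/blog/old": ("archived posts", []),
}


def coverage(query_terms, texts):
    """Fraction of query terms present in the text collected so far."""
    seen = set(" ".join(texts).split())
    return sum(term in seen for term in query_terms) / len(query_terms)


def crawl(start, query, max_depth, threshold=None):
    """Breadth-first crawl from `start`.

    With threshold=None every page up to max_depth is visited.
    Otherwise the crawl stops as soon as coverage reaches the threshold.
    """
    terms = query.lower().split()
    queue = deque([(start, 0)])
    seen = {start}
    visited, texts = [], []
    while queue:
        page, depth = queue.popleft()
        text, links = SITE[page]
        visited.append(page)
        texts.append(text)
        # Assumed stopping rule: enough query terms have been found.
        if threshold is not None and coverage(terms, texts) >= threshold:
            break
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited


exhaustive = crawl("/", "install proxy", max_depth=2)
adaptive = crawl("/", "install proxy", max_depth=2, threshold=1.0)
print(len(exhaustive), "pages visited by depth-2 BFS")
print(len(adaptive), "pages visited with the stopping rule")
```

On this toy site the exhaustive crawl visits all six pages, while the adaptive one stops after two because `/install` already mentions both query terms. A real stopping rule would score relevance far more carefully than this word-overlap check.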
Crawl4AI vs Firecrawl
The two are often compared head-to-head; they solve overlapping problems in different ways. Crawl4AI is self-hosted by default — you run the Python library or Docker image yourself, so your data never leaves your infrastructure, you bring your own LLM (including a local Ollama model), and there is no per-scrape charge. Firecrawl is managed-cloud-first — you call an API and Firecrawl runs the browser fleet, the anti-bot handling, and its FIRE-1 agent for hard sites, charging per scrape after a 500/month free tier.
Choose Crawl4AI when you need to keep data in-house, watch costs, or run local-LLM pipelines. Choose Firecrawl for the fastest time-to-results when the target has real anti-bot defenses and you do not want to maintain proxies yourself.
Where Crawl4AI does not help
Crawl4AI is a crawler and extraction framework, not a full anti-bot solution. On heavily protected targets (Akamai sensor.js, F5 Shape, top-tier DataDome) its Playwright-based browser hits the same fingerprinting walls as any other CDP-driven automation. CDP is the Chrome DevTools Protocol, which tools like Playwright use to control the browser and which defenses can detect. The 0.8.x proxy escalation reduces this problem but does not remove it. For those targets you either swap in CloakBrowser or PatchRight as the browser layer (Crawl4AI is pluggable, so the browser can be replaced), or route those URLs to a managed API.
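The routing option can be as simple as a per-host lookup placed in front of the crawler. The sketch below is a minimal illustration using only the standard library. The host list, the backend labels `"managed_api"` and `"self_hosted"`, and the function name `choose_backend` are hypothetical and not part of Crawl4AI or any managed service.

```python
from urllib.parse import urlparse

# Hypothetical hosts known to sit behind heavy anti-bot protection.
PROTECTED_HOSTS = {"shop.example.com", "tickets.example.org"}


def choose_backend(url, protected_hosts=PROTECTED_HOSTS):
    """Return which backend should fetch `url`.

    Protected hosts and their subdomains go to the managed API;
    everything else stays on the self-hosted crawler.
    """
    host = (urlparse(url).hostname or "").lower()
    for protected in protected_hosts:
        if host == protected or host.endswith("." + protected):
            return "managed_api"
    return "self_hosted"


for url in (
    "https://shop.example.com/item/1",
    "https://eu.shop.example.com/item/2",
    "https://docs.example.net/guide",
):
    print(url, "->", choose_backend(url))
```

Keeping the decision in one function means the protected-host list can grow from observed block rates without touching the crawl code itself.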
