What Is Crawl4AI?

Crawl4AI is the most-starred open-source, LLM-friendly web crawler on GitHub, with 66.3k stars. It is released under the Apache 2.0 license and maintained by UncleCode. It returns clean Markdown by default (LLM-optimised, fewer tokens than raw HTML), supports any LLM via LiteLLM (including Ollama for local extraction), and ships adaptive crawling that uses information-foraging algorithms to decide when enough has been gathered. The current 0.8.x line adds anti-bot detection with proxy escalation and Shadow DOM flattening.

Quick facts

License: Apache 2.0 (not MIT, a common misconception)
GitHub stars: 66.3k (May 2026)
Maintainer: UncleCode (github.com/unclecode)
Python: 3.10+; installs Playwright as a dependency
LLM support: any provider via LiteLLM, including OpenAI, Anthropic, Gemini, and Ollama (local)

What it gives you out of the box

  • URL → Markdown. The default output is clean Markdown with structural formatting preserved. The library prunes navigation, ads, and boilerplate before conversion, so you ingest substance, not chrome.
  • Adaptive crawling. A built-in heuristic decides when sufficient information has been gathered for a query, so the crawler stops itself instead of running the full BFS to depth-N. Useful for RAG ingestion where "good enough" beats exhaustive.
  • Schema-based extraction. Define a CSS schema or pass a Pydantic class plus a prompt and an LLM provider; Crawl4AI handles the call. Works with OpenAI, Anthropic, Gemini, and any local model accessible via LiteLLM (Ollama, vLLM).
  • Browser primitives. Built on Playwright, so you get JS execution, session management, custom JS injection, lazy-load handling, and proxy support. Full-page scanning simulates scrolling to load dynamic content.
  • Anti-bot detection (0.8.x). Recent versions escalate to proxies when bot detection is triggered, and flatten Shadow DOM so that content inside web components, which many crawlers cannot read, becomes extractable.
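
To make the "prune boilerplate, then convert" idea from the first bullet concrete, here is a minimal standard-library sketch. It is not Crawl4AI's implementation: the tag list, the class name MarkdownSketch, and the helper html_to_markdown are assumptions chosen for illustration, and it only handles headings, paragraphs, and list items.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate (an assumption for this sketch).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADINGS) | {"p", "li"}


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter that drops boilerplate subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.prefix = None    # Markdown prefix of the open block, if any
        self.buffer = []      # text pieces collected for the open block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            if tag in HEADINGS:
                self.prefix = HEADINGS[tag]
            elif tag == "li":
                self.prefix = "- "
            else:
                self.prefix = ""

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            # Collapse runs of whitespace before emitting the block.
            text = " ".join("".join(self.buffer).split())
            if text and self.prefix is not None:
                self.blocks.append(self.prefix + text)
            self.prefix, self.buffer = None, []

    def handle_data(self, data):
        # Keep text only when it sits inside a kept block.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = (
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Blue Widget</h1>"
    "<p>A sturdy widget, <b>now</b> in blue.</p>"
    "<footer>(c) Example Corp</footer>"
)
print(html_to_markdown(sample))
```

The navigation and footer never reach the output, which is where the token savings over raw HTML come from. A production pruner also has to score content density and handle links, tables, and images, which this sketch ignores.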

Crawl4AI vs Firecrawl

The two are often compared head-to-head, but they solve overlapping problems differently. Crawl4AI is self-hosted by default: you run the Python library or Docker image, your data never leaves your infrastructure, you bring your own LLM (including local Ollama), and you pay nothing per scrape. Firecrawl is managed-cloud-first: you call an API, Firecrawl handles the browser fleet and anti-bot bypass and provides its FIRE-1 agent for hard sites, and you pay per scrape after a 500/month free tier.

Choose Crawl4AI for data sovereignty, cost sensitivity, or local-LLM pipelines. Choose Firecrawl for the fastest time-to-results when the target uses serious anti-bot protection and you do not want to maintain proxies.

Where Crawl4AI does not help

Crawl4AI is a crawler and extraction framework, not a complete anti-bot bypass. On heavily protected targets — Akamai sensor.js, F5 Shape, top-tier DataDome — Crawl4AI's Playwright-based browser layer hits the same fingerprinting walls as any other CDP-driven automation. The 0.8.x proxy escalation reduces but does not eliminate this. For those targets you either drop in CloakBrowser or PatchRight as the browser layer (Crawl4AI is pluggable), or you route those URLs to a managed API.
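
The fallback order described here (plain browser first, then a proxy, then a managed API for the hardest URLs) can be sketched as a chain of fetchers. This is an illustration of the routing idea only, not Crawl4AI's internal proxy escalation. FetchResult, looks_blocked, fetch_with_escalation, the status-code list, and the stub fetchers are all hypothetical names and assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Status codes often returned by bot-protection layers (an assumption for this
# sketch; real detection also has to inspect the page content).
BLOCK_STATUSES = {403, 429, 503}


@dataclass
class FetchResult:
    status: int
    body: str
    tier: str = ""  # filled in with the name of the tier that succeeded


Fetcher = Callable[[str], FetchResult]


def looks_blocked(result: FetchResult) -> bool:
    """Crude check: a blocking status code or a challenge marker in the body."""
    return result.status in BLOCK_STATUSES or "captcha" in result.body.lower()


def fetch_with_escalation(
    url: str, tiers: Sequence[tuple[str, Fetcher]]
) -> Optional[FetchResult]:
    """Try each tier in order and return the first response that is not blocked."""
    for name, fetch in tiers:
        result = fetch(url)
        if not looks_blocked(result):
            result.tier = name
            return result
    return None  # every tier was blocked; the caller decides what to do next


# Stub fetchers standing in for a plain browser, a proxied browser, and a managed API.
def direct(url: str) -> FetchResult:
    return FetchResult(403, "Access denied")


def via_proxy(url: str) -> FetchResult:
    return FetchResult(200, "<html>Please solve this CAPTCHA</html>")


def managed_api(url: str) -> FetchResult:
    return FetchResult(200, "<html>product page</html>")


result = fetch_with_escalation(
    "https://store.example.com/product/123",
    [("direct", direct), ("proxy", via_proxy), ("managed_api", managed_api)],
)
print(result.tier if result else "all tiers blocked")
```

Ordering the tiers from cheapest to most expensive keeps per-scrape cost low for the majority of URLs and reserves the paid path for the ones that actually need it.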

Code example

python
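# Crawl4AI with local Ollama: extraction never leaves your machine.
# pip install crawl4ai && crawl4ai-setup
# Written against the LLMConfig / CrawlerRunConfig style of recent releases;
# check the signatures against your installed version.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price_usd: float
    in_stock: bool

async def main():
    strategy = LLMExtractionStrategy(
        # Local model served by Ollama: no API token, no third-party call.
        llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None),
        schema=Product.model_json_schema(),
        extraction_type="schema",  # ask for JSON that matches the schema
        instruction="Extract the product details from this page.",
    )
    run_config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://store.example.com/product/123",
            config=run_config,
        )
        # extracted_content is a JSON string; report the error if the crawl failed.
        print(result.extracted_content if result.success else result.error_message)

asyncio.run(main())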
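
The extracted_content printed above is a JSON string, and LLM output does not always match the schema, so it is worth validating before use. The sketch below assumes the string holds either one object or a list of objects with the Product fields. parse_products and REQUIRED are hypothetical helpers written with the standard library only; with Pydantic installed, validating each record against the Product model would be the more natural route.

```python
import json

# Field name -> accepted type(s), mirroring the Product model above.
REQUIRED = {"name": str, "price_usd": (int, float), "in_stock": bool}


def _matches(value, kind) -> bool:
    # bool is a subclass of int, so reject it explicitly for non-bool fields.
    if kind is not bool and isinstance(value, bool):
        return False
    return isinstance(value, kind)


def parse_products(raw):
    """Return only the records that carry every Product field with the right type."""
    if not raw:
        return []
    data = json.loads(raw)
    records = data if isinstance(data, list) else [data]
    return [
        record
        for record in records
        if isinstance(record, dict)
        and all(_matches(record.get(key), kind) for key, kind in REQUIRED.items())
    ]


raw = (
    '[{"name": "Blue Widget", "price_usd": 19.99, "in_stock": true},'
    ' {"name": "Broken", "price_usd": "n/a", "in_stock": true}]'
)
print(parse_products(raw))
```

Records that fail the check are dropped silently here; a real pipeline would more likely log them or retry the extraction.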

Frequently asked questions

Is Crawl4AI MIT licensed?

No, it is Apache 2.0. This is a common misconception that appears in many secondary write-ups. Apache 2.0 is functionally similar to MIT for most use cases (both are permissive) but adds an explicit patent grant. Redistribution requires keeping the license text and attribution notices.

Does Crawl4AI work with local LLMs?

Yes. LiteLLM is the underlying provider library, and it routes to any LLM addressable by a provider string. Strings such as ollama/llama3.3 and ollama/mistral work, as do vLLM-hosted models and self-hosted OpenAI-compatible endpoints. This is a common pattern for cost-sensitive or privacy-sensitive pipelines.

How does adaptive crawling decide when to stop?

Crawl4AI scores each crawled page for new information relative to what has already been gathered. When the marginal information from new pages drops below a threshold, the crawl stops itself. Useful for RAG ingestion where you want comprehensive coverage but not infinite recursion through near-duplicate content.
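
The stopping rule can be illustrated with a toy version: score each page by the share of terms it adds, and stop once several pages in a row fall below a threshold. This is a simplified sketch, not Crawl4AI's scoring. The word-set novelty measure, the threshold and patience parameters, and all function names are assumptions for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for the 'information' in a page."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def novelty(page_terms: set[str], seen: set[str]) -> float:
    """Fraction of a page's terms not seen on earlier pages (0.0 to 1.0)."""
    if not page_terms:
        return 0.0
    return len(page_terms - seen) / len(page_terms)


def crawl_until_saturated(pages, threshold=0.2, patience=2):
    """Consume pages in order; stop after `patience` consecutive low-novelty pages.

    Returns the pages that were kept.
    """
    seen: set[str] = set()
    kept = []
    low_streak = 0
    for page in pages:
        terms = tokens(page)
        if novelty(terms, seen) < threshold:
            low_streak += 1
            if low_streak >= patience:
                break  # marginal information has dried up
            continue
        low_streak = 0
        kept.append(page)
        seen |= terms
    return kept


pages = [
    "Widgets ship in blue and red.",
    "Widgets ship in red and blue.",             # near-duplicate, skipped
    "Returns are accepted within thirty days.",  # new information, kept
    "Widgets ship in blue.",                     # low novelty
    "Blue and red widgets ship.",                # second low page in a row: stop
    "Warranty covers two years.",                # never visited
]
print(len(crawl_until_saturated(pages)))
```

The patience parameter keeps a single near-duplicate page from ending the crawl early, at the cost of fetching a few extra pages before stopping.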

Can Crawl4AI handle heavily protected sites like Akamai?

It uses Playwright underneath, so it inherits Playwright's fingerprinting problems on Akamai sensor.js and similar deep-fingerprint targets. The 0.8.x line added proxy escalation, which helps with moderately protected targets. For the hardest sites, either swap the browser layer to CloakBrowser or Camoufox, or route them via a managed API.

Last updated: 2026-05-26