What Is Crawl4AI?

Crawl4AI is the most-starred open-source, LLM-friendly web crawler on GitHub: 66.3k stars, Apache 2.0 licensed, maintained by UncleCode. A web crawler is a program that visits pages and pulls out their content; "LLM-friendly" means the output is shaped for feeding into AI models. By default it returns clean Markdown instead of raw HTML, which uses fewer tokens (the units of text that LLM providers bill for and that models reason over). It can call any LLM through LiteLLM, a library that talks to many providers behind one interface, including Ollama for running a model on your own machine. It also ships adaptive crawling, which uses information-foraging algorithms to judge when it has gathered enough and stop. The current 0.8.x line adds anti-bot detection with proxy escalation and Shadow DOM flattening.

Quick facts

License: Apache 2.0 (not MIT, a common misconception)
GitHub stars: 66.3k (May 2026)
Maintainer: UncleCode (github.com/unclecode)
Python: 3.10+ (installs Playwright as a dependency)
LLM support: Any provider via LiteLLM, including OpenAI, Anthropic, Gemini, and Ollama (local)

What it gives you out of the box

  • URL → Markdown. Give it a URL and the default output is clean Markdown that keeps the page's structure. It first strips navigation, ads, and boilerplate, so you ingest the substance, not the page chrome around it.
  • Adaptive crawling. A built-in rule decides when it has enough information to answer a query and stops, instead of mechanically visiting every link down to a fixed depth (a full breadth-first search, or BFS, to depth-N). Handy for RAG ingestion — loading content into an AI knowledge base — where "good enough" beats exhaustive.
  • Schema-based extraction. You describe the data you want in one of two ways: a CSS schema, which needs no LLM, or a Pydantic class (a Python way to define a data shape) plus a prompt and an LLM provider, in which case Crawl4AI makes the LLM call for you. The LLM route works with OpenAI, Anthropic, Gemini, and any local model reachable via LiteLLM (Ollama, vLLM).
  • Browser primitives. It is built on Playwright, the browser-automation engine, so you get JavaScript execution, session management, custom JS injection, lazy-load handling, and proxy support. Full-page scanning simulates scrolling to trigger content that only loads as you move down the page.
  • Anti-bot detection (0.8.x). Recent versions add proxy escalation (switching to a fresh proxy when a site flags the crawler as a bot) plus Shadow DOM flattening, which reads components built on isolated DOM trees that many crawlers cannot see into.
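
To make the URL → Markdown idea concrete, here is a minimal standard-library sketch of the general technique: drop page chrome, keep headings, paragraphs, and list items. It is not Crawl4AI's implementation; the class name, the list of skipped tags, and the sample HTML are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as page chrome (an assumption).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADINGS) | {"p", "li"}


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter: headings, paragraphs, list items only."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.prefix = ""     # Markdown prefix of the block being read
        self.buffer = []     # text fragments of the block being read
        self.blocks = []     # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            # Start a fresh block; "li" becomes a bullet, "p" has no prefix.
            self.buffer = []
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            # Collapse runs of whitespace and emit the finished block.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.buffer, self.prefix = [], ""

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    return "\n\n".join(parser.blocks)


SAMPLE_HTML = """
<html><body>
<nav><a href="/">Home</a> <a href="/shop">Shop</a></nav>
<h1>Blue Widget</h1>
<p>A sturdy widget for <b>everyday</b> use.</p>
<ul><li>Price: $19.99</li><li>In stock</li></ul>
<footer>Copyright Example Store</footer>
<script>trackVisitor();</script>
</body></html>
"""

markdown = html_to_markdown(SAMPLE_HTML)
print(markdown)
print(len(SAMPLE_HTML), "chars of HTML ->", len(markdown), "chars of Markdown")
```

The size comparison on the last line is the point of the exercise: once navigation, footer, and script are gone, the same substance fits in far less text, which is where the token savings come from.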

Crawl4AI vs Firecrawl

The two are often compared head-to-head; they solve overlapping problems in different ways. Crawl4AI is self-hosted by default — you run the Python library or Docker image yourself, so your data never leaves your infrastructure, you bring your own LLM (including a local Ollama model), and there is no per-scrape charge. Firecrawl is managed-cloud-first — you call an API and Firecrawl runs the browser fleet, the anti-bot handling, and its FIRE-1 agent for hard sites, charging per scrape after a 500/month free tier.

Choose Crawl4AI when you need to keep data in-house, watch costs, or run local-LLM pipelines. Choose Firecrawl for the fastest time-to-results when the target has real anti-bot defenses and you do not want to maintain proxies yourself.

Where Crawl4AI does not help

Crawl4AI is a crawler and extraction framework, not a full anti-bot solution. On heavily protected targets — Akamai sensor.js, F5 Shape, top-tier DataDome — its Playwright-based browser hits the same fingerprinting walls as any other CDP-driven automation (CDP is the Chrome DevTools Protocol that tools like Playwright use to control the browser, and which defenses can spot). The 0.8.x proxy escalation reduces this problem but does not remove it. For those targets you either swap in CloakBrowser or PatchRight as the browser layer (Crawl4AI is pluggable, so the browser can be replaced), or route those URLs to a managed API.
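
The escalation idea can be sketched in a few lines of plain Python. This is an illustration of the general pattern, not Crawl4AI's code: the function name `fetch_with_escalation`, the block status codes, the proxy URLs, and the fake fetcher are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Status codes this sketch treats as "the site flagged us" (an assumption).
BLOCK_STATUSES = {403, 429}

Fetcher = Callable[[str, Optional[str]], tuple[int, str]]


def fetch_with_escalation(
    url: str,
    proxy_tiers: Sequence[Optional[str]],
    fetch: Fetcher,
) -> tuple[Optional[str], int, str]:
    """Try each proxy tier in order, moving on only when the response is blocked.

    Returns (proxy_used, status, body) for the first non-blocked response,
    or the last blocked response if every tier is blocked.
    """
    if not proxy_tiers:
        raise ValueError("proxy_tiers must contain at least one entry")
    result = None
    for proxy in proxy_tiers:
        status, body = fetch(url, proxy)
        result = (proxy, status, body)
        if status not in BLOCK_STATUSES:
            break  # this tier got through, no need to escalate further
    return result


def fake_fetch(url: str, proxy: Optional[str]) -> tuple[int, str]:
    # Stand-in for a real HTTP call: only the residential proxy gets through.
    if proxy == "http://residential.proxy.example:8000":
        return 200, "<html>ok</html>"
    return 403, "blocked"


TIERS = [
    None,                                     # direct connection first
    "http://datacenter.proxy.example:8000",   # cheaper proxy next
    "http://residential.proxy.example:8000",  # most expensive tier last
]

proxy, status, body = fetch_with_escalation(
    "https://store.example.com/product/123", TIERS, fake_fetch
)
print(proxy, status)
```

When every tier comes back blocked, the returned status is still a block code. That is the signal to hand the URL to a different browser layer or a managed API rather than retrying the same setup.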

Code example

python
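# Crawl4AI with local Ollama: extraction never leaves your machine.
# pip install crawl4ai && crawl4ai-setup
# Written against the config-object style of recent releases (LLMConfig,
# CrawlerRunConfig); check the docs for the version you have installed.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price_usd: float
    in_stock: bool

async def main():
    strategy = LLMExtractionStrategy(
        # Local model served by Ollama: no API token, no third-party API call.
        llm_config=LLMConfig(provider="ollama/llama3.3"),
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract the product details from this page.",
    )
    run_config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://store.example.com/product/123",
            config=run_config,
        )
        if result.success:
            # extracted_content is a JSON string shaped by the schema.
            print(result.extracted_content)
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())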
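
What you do with `result.extracted_content` afterwards is up to you. The sketch below assumes it arrives as a JSON string holding either one object or a list of objects that match the `Product` schema; the sample payload is made up, and a standard-library dataclass stands in for the Pydantic model so the snippet runs without extra installs.

```python
import json
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price_usd: float
    in_stock: bool


def parse_products(extracted_content: str) -> list[Product]:
    """Turn the extractor's JSON string into typed Product records."""
    data = json.loads(extracted_content)
    if isinstance(data, dict):
        data = [data]  # tolerate a single object as well as a list
    return [
        Product(
            name=str(item["name"]),
            price_usd=float(item["price_usd"]),
            in_stock=bool(item["in_stock"]),
        )
        for item in data
    ]


# Hypothetical payload, shaped like the schema in the example above.
SAMPLE = '[{"name": "Blue Widget", "price_usd": 19.99, "in_stock": true}]'

products = parse_products(SAMPLE)
print(products[0])
```

With the Pydantic class from the main example, `Product.model_validate(item)` would do the same job and add validation errors for malformed items.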

Frequently asked questions

Is Crawl4AI MIT licensed?

No — it is Apache 2.0. The MIT claim is a common misconception that shows up in many secondary write-ups. For most use cases Apache 2.0 behaves like MIT (both are permissive, meaning you can use, modify, and redistribute freely), but Apache 2.0 adds explicit patent-grant language. Attribution is required when you redistribute.

Does Crawl4AI work with local LLMs?

Yes. Under the hood it uses LiteLLM, which routes to any LLM you name with a provider string. Provider strings such as ollama/llama3.3 and ollama/mistral, vLLM-hosted models, and self-hosted OpenAI-compatible endpoints all work. Running the model locally like this is the standard approach when cost or privacy matters.
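
A provider string follows a backend/model pattern. The helper below is a hypothetical illustration of that pattern, not part of Crawl4AI or LiteLLM; the set of "local" backends and the `openai/gpt-4o-mini` example are assumptions added for contrast.

```python
def split_provider(provider: str) -> tuple[str, str]:
    """Split 'backend/model' on the first slash; model names may contain more."""
    backend, sep, model = provider.partition("/")
    if not (sep and backend and model):
        raise ValueError(f"expected 'backend/model', got {provider!r}")
    return backend, model


# Backends this sketch treats as running on your own hardware (an assumption).
LOCAL_BACKENDS = {"ollama"}


def runs_locally(provider: str) -> bool:
    return split_provider(provider)[0] in LOCAL_BACKENDS


for name in ("ollama/llama3.3", "ollama/mistral", "openai/gpt-4o-mini"):
    where = "local" if runs_locally(name) else "remote"
    print(name, "->", split_provider(name), where)
```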

How does adaptive crawling decide when to stop?

It scores each page it visits for how much new information it adds beyond what it already has. When that gain from new pages falls below a set threshold, the crawl stops itself. This is useful for RAG ingestion — building an AI knowledge base — where you want broad coverage but do not want to keep recursing through near-duplicate pages forever.
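
The stopping rule can be illustrated with a toy version. The sketch below measures "new information" as the share of a page's words not seen on earlier pages, which is a deliberately crude stand-in and not Crawl4AI's actual scoring; the function names, the 0.2 threshold, and the sample pages are all assumptions.

```python
import re
from typing import Iterable


def terms(text: str) -> set[str]:
    """Lowercased word set of a page; a crude proxy for its information."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def crawl_until_saturated(
    pages: Iterable[tuple[str, str]], min_gain: float = 0.2
) -> list[str]:
    """Visit pages in order and stop once a page adds too little that is new.

    Gain is the fraction of a page's terms not seen on any earlier page.
    """
    seen: set[str] = set()
    visited: list[str] = []
    for url, text in pages:
        page_terms = terms(text)
        gain = len(page_terms - seen) / len(page_terms) if page_terms else 0.0
        visited.append(url)
        seen |= page_terms
        if gain < min_gain:
            break  # information gain fell below the threshold
    return visited


# Made-up pages: the third is a near-duplicate of the first.
PAGES = [
    ("/docs/install", "install crawl4ai with pip and run setup"),
    ("/docs/quickstart", "quickstart crawl a url and print markdown"),
    ("/docs/install-copy", "install crawl4ai with pip and run setup"),
    ("/docs/proxies", "configure proxies for hard targets"),
]

visited = crawl_until_saturated(PAGES)
print(visited)
```

The near-duplicate third page contributes nothing new, so the crawl ends there and the fourth page is never fetched. A fixed-depth crawl would have visited all four.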

Can Crawl4AI handle heavily protected sites like Akamai?

It runs on Playwright, so it inherits Playwright's fingerprint problems against Akamai sensor.js and similar deep-fingerprinting targets — defenses that profile the browser in detail to tell bots from people. The 0.8.x line added proxy escalation, which helps with medium-difficulty targets. For the hardest sites, either swap the browser layer to CloakBrowser/Camoufox or route via a managed API.

Last updated: 2026-05-31