What Is Crawl4AI?

By the Scrappey Research Team

Crawl4AI is the most-starred open-source, LLM-friendly web crawler on GitHub, with 60K+ stars. It is released under the Apache 2.0 license and maintained by UncleCode. A web crawler is a program that visits pages and pulls out their content; "LLM-friendly" means the output is shaped for feeding into AI models. By default it returns clean Markdown instead of raw HTML, which uses fewer tokens (the chunks of text an LLM processes and bills for). It can call any LLM through LiteLLM, a library that talks to many providers behind one interface, including Ollama for running a model on your own machine. It also ships adaptive crawling, which uses information-foraging algorithms to judge when it has gathered enough and stop. The current 0.8.x line adds anti-bot detection with proxy escalation and Shadow DOM flattening.

Quick facts

License	Apache 2.0 (not MIT, a common misconception)
GitHub stars	60K+
Maintainer	UncleCode (github.com/unclecode)
Python	3.10+ (installs Playwright as a dependency)
LLM support	Any provider via LiteLLM: OpenAI, Anthropic, Gemini, Ollama (local)

What it gives you out of the box

URL → Markdown. Give it a URL and the default output is clean Markdown that keeps the page's structure. It first strips navigation, ads, and boilerplate, so you ingest the substance, not the page chrome around it.
Adaptive crawling. A built-in rule decides when it has enough information to answer a query and stops, instead of mechanically visiting every link down to a fixed depth (a full breadth-first search, or BFS, to depth-N). Handy for RAG ingestion — loading content into an AI knowledge base — where "good enough" beats exhaustive.
Schema-based extraction. You describe the data you want, either with a CSS schema or a Pydantic class (a Python way to define a data shape) plus a prompt and an LLM provider, and Crawl4AI makes the call for you. Works with OpenAI, Anthropic, Gemini, and any local model reachable via LiteLLM (Ollama, vLLM).
Browser primitives. It is built on Playwright, the browser-automation engine, so you get JavaScript execution, session management, custom JS injection, lazy-load handling, and proxy support. Full-page scanning simulates scrolling to trigger content that only loads as you move down the page.
Anti-bot detection (0.8.x). Recent versions add proxy escalation — switching to a fresh proxy when a site flags the crawler as a bot — plus Shadow DOM flattening, which reads components built with isolated DOM trees that other crawlers cannot see into.
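
To make the URL → Markdown idea concrete, here is a minimal standard-library sketch of stripping page chrome and keeping headings, paragraphs, and list items. It is an illustration only, not Crawl4AI's actual cleaning pipeline; the class name, the SKIP_TAGS list, and the sample HTML are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this sketch (an assumption, not Crawl4AI's list).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
# Markdown prefix for each block-level tag the sketch keeps.
PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownSketch(HTMLParser):
    """Collects headings, paragraphs, and list items; ignores chrome."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.prefix = None    # Markdown prefix of the block being read
        self.buffer = []      # text pieces of the current block
        self.lines = []       # finished Markdown lines

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in PREFIX:
            self.prefix = PREFIX[tag]
            self.buffer = []

    def handle_data(self, data):
        # Keep text only when inside a kept block and outside chrome.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in PREFIX and self.prefix is not None:
            text = " ".join("".join(self.buffer).split())  # collapse whitespace
            if text:
                self.lines.append(self.prefix + text)
            self.prefix, self.buffer = None, []


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    return "\n\n".join(parser.lines)


# Hypothetical page: navigation and footer are chrome, the rest is content.
SAMPLE = """
<html><body>
<nav><ul><li>Home</li><li>Pricing</li></ul></nav>
<h1>Blue Widget</h1>
<p>A sturdy widget for <b>everyday</b> use.</p>
<ul><li>Weight: 2 kg</li></ul>
<footer><p>Copyright Example Inc.</p></footer>
</body></html>
"""

print(html_to_markdown(SAMPLE))
```

The output is shorter than the input because the navigation and footer never reach it, which is the token saving the Markdown-first default is after.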

Crawl4AI vs Firecrawl

The two are often compared head-to-head; they solve overlapping problems in different ways. Crawl4AI is self-hosted by default — you run the Python library or Docker image yourself, so your data never leaves your infrastructure, you bring your own LLM (including a local Ollama model), and there is no per-scrape charge. Firecrawl is managed-cloud-first — you call an API and Firecrawl runs the browser fleet, the anti-bot handling, and its FIRE-1 agent for hard sites, charging per scrape after a 500/month free tier.

Choose Crawl4AI when you need to keep data in-house, watch costs, or run local-LLM pipelines. Choose Firecrawl for the fastest time-to-results when the target has real anti-bot defenses and you do not want to maintain proxies yourself.

Where Crawl4AI does not help

Crawl4AI is a crawler and extraction framework, not a full anti-bot solution. On heavily protected targets — Akamai sensor.js, F5 Shape, top-tier DataDome — its Playwright-based browser hits the same fingerprinting walls as any other CDP-driven automation (CDP is the Chrome DevTools Protocol that tools like Playwright use to control the browser, and which defenses can spot). The 0.8.x proxy escalation reduces this problem but does not remove it. For those targets you either swap in CloakBrowser or PatchRight as the browser layer (Crawl4AI is pluggable, so the browser can be replaced), or route those URLs to a managed API.
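
The escalate-then-route idea can be sketched in a few lines. This is a hedged illustration of the general pattern, not Crawl4AI's implementation: the tier names, the BLOCK_STATUSES set, and the fake_fetch stand-in are assumptions made for this example.

```python
from typing import Callable, Optional, Sequence

# HTTP status codes this sketch treats as "blocked" (an assumption).
BLOCK_STATUSES = {403, 429, 503}

# Hypothetical escalation order: no proxy first, then costlier proxy pools.
TIERS = [None, "datacenter", "residential"]


def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, Optional[str]], int],
    proxy_tiers: Sequence[Optional[str]],
):
    """Try each tier in order and return (tier, status) for the first
    response that is not blocked, or for the last attempt if all fail."""
    tier, status = None, 0
    for tier in proxy_tiers:
        status = fetch(url, tier)
        if status not in BLOCK_STATUSES:
            break
    return tier, status


def fake_fetch(url: str, proxy: Optional[str]) -> int:
    """Stand-in for a real browser fetch, used only to exercise the logic."""
    if "hard.example" in url:
        return 403                      # blocked on every tier
    return 403 if proxy is None else 200  # any proxy gets through


for target in ("https://medium.example/p", "https://hard.example/p"):
    tier, status = fetch_with_escalation(target, fake_fetch, TIERS)
    if status in BLOCK_STATUSES:
        print(target, "still blocked; route to a managed API or another browser layer")
    else:
        print(target, "fetched via", tier)
```

A URL that is still blocked after the last tier is the case the paragraph above describes: escalation has run out, so the request goes to a different browser layer or a managed API.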

Code example

python

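# Crawl4AI with local Ollama: extraction never leaves your machine.
# pip install crawl4ai && crawl4ai-setup
# Requires a running Ollama server with the model already pulled.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price_usd: float
    in_stock: bool

async def main():
    strategy = LLMExtractionStrategy(
        # Recent releases take the provider through LLMConfig rather than a
        # bare provider= argument. A local Ollama model needs no API token.
        llm_config=LLMConfig(provider="ollama/llama3.3"),
        schema=Product.model_json_schema(),
        extraction_type="schema",  # return JSON shaped by the schema
        instruction="Extract the product details from this page.",
    )
    # Per-crawl options, including the extraction strategy, go in CrawlerRunConfig.
    run_config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://store.example.com/product/123",
            config=run_config,
        )
        if result.success:
            print(result.extracted_content)  # JSON string
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())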
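
LLM output can drift from the schema, so it is worth checking the rows before using them. The sketch below assumes extracted_content is a JSON string holding a list of objects (or a single object) and validates it with the standard library only; ProductRow, parse_products, and the sample payload are hypothetical names and data for this example. With Pydantic installed, Product.model_validate covers the same ground.

```python
import json
from dataclasses import dataclass


@dataclass
class ProductRow:
    name: str
    price_usd: float
    in_stock: bool


def parse_products(extracted_content: str) -> list:
    """Parse the JSON string and keep only rows that match the expected shape."""
    rows = json.loads(extracted_content)
    if isinstance(rows, dict):  # tolerate a single object instead of a list
        rows = [rows]
    products = []
    for row in rows:
        try:
            if not isinstance(row["in_stock"], bool):
                raise TypeError("in_stock must be a JSON boolean")
            products.append(
                ProductRow(
                    name=str(row["name"]),
                    price_usd=float(row["price_usd"]),  # accepts "19.99" or 19.99
                    in_stock=row["in_stock"],
                )
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed fields
    return products


# Hypothetical payload: one good row, one row missing required fields.
sample = (
    '[{"name": "Blue Widget", "price_usd": "19.99", "in_stock": true},'
    ' {"name": "Broken row"}]'
)
print(parse_products(sample))
```

Dropping bad rows silently is a design choice for the sketch; a production pipeline would more likely log them or retry the extraction.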

CSS-schema extraction (no LLM, deterministic)

python

import asyncio, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Deterministic CSS extraction: no LLM, no tokens, no API key.
schema = {
    "name": "Products",
    "baseSelector": "div.product-card",   # one match per product card
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link",  "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    # Per-crawl options, including the extraction strategy, go in CrawlerRunConfig.
    run_config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://store.example.com/catalog",
            config=run_config,
        )
        # extracted_content is a JSON string; fall back to an empty list if missing.
        rows = json.loads(result.extracted_content or "[]") if result.success else []
        print(rows[:3])

asyncio.run(main())
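
CSS extraction returns text exactly as it appears on the page, so prices arrive as strings and links may be relative. The following is a small post-processing sketch under those assumptions; clean_row, the US-style price format, and the sample rows are hypothetical and not part of Crawl4AI.

```python
import re
from urllib.parse import urljoin


def clean_row(row: dict, base_url: str) -> dict:
    """Normalize one extracted row: trim the title, parse the price,
    and resolve the link against the page it came from."""
    # Keep digits and the decimal point; assumes US-style "$1,299.00" prices.
    digits = re.sub(r"[^\d.]", "", row.get("price") or "")
    try:
        price = float(digits) if digits else None
    except ValueError:
        price = None  # e.g. stray dots left over from unexpected text
    return {
        "title": (row.get("title") or "").strip(),
        "price": price,
        "link": urljoin(base_url, row.get("link") or ""),
    }


# Hypothetical rows shaped like the schema's output (title, price, link).
rows = [
    {"title": " Blue Widget ", "price": "$1,299.00", "link": "/product/123"},
    {"title": "Red Widget", "price": "Sold out", "link": "https://cdn.example.com/r"},
]
cleaned = [clean_row(r, "https://store.example.com/catalog") for r in rows]
for item in cleaned:
    print(item)
```

Absolute links pass through urljoin unchanged, so the same function is safe for mixed catalogs.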

Crawl many URLs concurrently

python

import asyncio
from crawl4ai import AsyncWebCrawler

# Crawl many URLs concurrently; each result carries clean Markdown.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls)
        for r in results:
            status = "ok" if r.success else "fail"
            print(r.url, status, len(r.markdown or ""), "chars")

asyncio.run(main())
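
arun_many handles scheduling for you. To show the general idea behind a concurrent crawl with per-URL success flags, here is a plain asyncio sketch that caps how many fetches run at once. It is not how Crawl4AI dispatches work internally; fake_fetch, crawl_all, and the limit parameter are hypothetical names for this example.

```python
import asyncio

URLS = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]


async def fake_fetch(url: str) -> str:
    """Stand-in for a real page fetch; fails for one URL on purpose."""
    await asyncio.sleep(0.01)
    if url.endswith("/b"):
        raise RuntimeError("simulated failure")
    return f"# Page at {url}"


async def crawl_all(urls, limit: int = 2):
    """Fetch all URLs with at most `limit` in flight at any time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(url):
        async with semaphore:
            try:
                return url, True, await fake_fetch(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return url, False, str(exc)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(one(u) for u in urls))


results = asyncio.run(crawl_all(URLS))
for url, ok, payload in results:
    print(url, "ok" if ok else "fail", len(payload), "chars")
```

Capturing the error per URL keeps one bad page from discarding the results of the others, which mirrors the success check in the loop above.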

Frequently asked questions

Is Crawl4AI MIT licensed?

No — it is Apache 2.0. The MIT claim is a common misconception that shows up in many secondary write-ups. For most use cases Apache 2.0 behaves like MIT (both are permissive, meaning you can use, modify, and redistribute freely), but Apache 2.0 adds explicit patent-grant language. Attribution is required when you redistribute.

Does Crawl4AI work with local LLMs?

Yes. Under the hood it uses LiteLLM, which routes to any LLM you name with a provider string. ollama/llama3.3, ollama/mistral, vLLM-hosted models, and self-hosted OpenAI-compatible endpoints all work. Running the model locally like this is the standard approach when cost or privacy matters.

How does adaptive crawling decide when to stop?

It scores each page it visits for how much new information it adds beyond what it already has. When that gain from new pages falls below a set threshold, the crawl stops itself. This is useful for RAG ingestion — building an AI knowledge base — where you want broad coverage but do not want to keep recursing through near-duplicate pages forever.
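
The stopping rule can be illustrated with a toy scoring function. This is a sketch under stated assumptions, not the library's actual scoring: here "new information" is simply the fraction of a page's words not seen on earlier pages, and tokens, adaptive_crawl, min_gain, and the sample pages are all hypothetical.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word set; a crude proxy for a page's information content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def adaptive_crawl(pages, min_gain: float = 0.2):
    """Visit (url, text) pairs in order and stop at the first page whose
    share of unseen terms drops below min_gain."""
    seen = set()
    kept = []
    for url, text in pages:
        terms = tokens(text)
        if not terms:
            continue
        gain = len(terms - seen) / len(terms)  # fraction of terms that are new
        if kept and gain < min_gain:
            break                              # little new information: stop
        kept.append((url, round(gain, 2)))
        seen |= terms
    return kept


# Hypothetical crawl order; the third page duplicates the first.
PAGES = [
    ("/docs/install", "install crawl4ai with pip and run setup"),
    ("/docs/quickstart", "run the crawler and print markdown output"),
    ("/docs/install-copy", "install crawl4ai with pip and run setup"),
    ("/docs/advanced", "hooks sessions proxies and custom javascript"),
]

for url, gain in adaptive_crawl(PAGES):
    print(url, gain)
```

The crawl ends at the near-duplicate page and never reaches the fourth URL. A real scorer would weigh relevance to the query as well, but the threshold logic has the same shape.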

Can Crawl4AI handle heavily protected sites like Akamai?

It runs on Playwright, so it inherits Playwright's fingerprint problems against Akamai sensor.js and similar deep-fingerprinting targets — defenses that profile the browser in detail to tell bots from people. The 0.8.x line added proxy escalation, which helps with medium-difficulty targets. For the hardest sites, either swap the browser layer to CloakBrowser/Camoufox or route via a managed API.

Last updated: 2026-06-16 · Facts last verified: 2026-06-16
