LlamaIndex

Feed real web pages into your LlamaIndex RAG pipeline

The Scrappey reader for LlamaIndex turns any URL into clean, LLM-ready Document objects. Load JavaScript-heavy and modern websites as Markdown, then index them for retrieval in a few lines of Python.

Quick start

Requires a LlamaIndex install (pip install llama-index) and a Scrappey API key.

Install from PyPI

```bash
pip install llama-index-readers-scrappey
```

1. Install the reader
Add the Scrappey reader package alongside LlamaIndex.
```bash
pip install llama-index llama-index-readers-scrappey
```
2. Set your API key
Grab a key after registering, then expose it as an environment variable.
```bash
export SCRAPPEY_API_KEY="YOUR_API_KEY"
```
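
A script can fail fast when the variable is missing instead of sending a request without a key. The helper below is a minimal sketch; `get_scrappey_api_key` is a hypothetical name used for illustration and is not part of the reader package.

```python
import os
from typing import Mapping, Optional


def get_scrappey_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Scrappey API key from the environment, or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and exists so the
    function can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("SCRAPPEY_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SCRAPPEY_API_KEY is not set. Export it before running the pipeline."
        )
    return key
```

The returned string can then be passed as `api_key` when constructing the reader.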

Load URLs into Documents

Instantiate ScrappeyReader and call load_data with the URLs you want to ingest. Each page comes back as a LlamaIndex Document whose text is Markdown by default.

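```python
import os
from llama_index.readers.scrappey import ScrappeyReader

# Read the key from the environment variable set in step 2
reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])

# load_data takes a list of URLs and returns one Document per page
documents = reader.load_data(["https://example.com"])

# Preview the first 500 characters of the Markdown content
print(documents[0].text[:500])
```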

Code examples

Build a queryable index from live web pages

```python
import os
from llama_index.core import VectorStoreIndex
from llama_index.readers.scrappey import ScrappeyReader

# as_markdown=True (the default) returns clean, LLM-ready Markdown
reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])

documents = reader.load_data([
    "https://example.com",
    "https://en.wikipedia.org/wiki/Web_scraping",
])

# Note: with its default settings LlamaIndex uses OpenAI for embeddings and
# answers, so OPENAI_API_KEY must be set unless you configure other models.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize what web scraping is."))
```

What the reader calls under the hood

```bash
# The reader wraps this canonical Scrappey request for each URL.
# markdown:true gives the Markdown that lands in Document.text.
curl -X POST "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "request.get",
    "url": "https://example.com",
    "markdown": true
  }'
```
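
For readers who prefer Python over curl, the same request can be assembled with the standard library. This is a sketch of the request shown above, not the reader's actual implementation; `build_scrappey_request` is a hypothetical helper. The response format is not described on this page, so the sketch stops at building the request and does not parse a reply.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above
SCRAPPEY_ENDPOINT = "https://publisher.scrappey.com/api/v1"


def build_scrappey_request(api_key: str, url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    query = urllib.parse.urlencode({"key": api_key})
    payload = {"cmd": "request.get", "url": url, "markdown": True}
    return urllib.request.Request(
        f"{SCRAPPEY_ENDPOINT}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scrappey_request("YOUR_API_KEY", "https://example.com")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it is one more call, `urllib.request.urlopen(request, timeout=120)`, which needs a real key and network access.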

Why use the Scrappey reader

Returns native Document objects

load_data hands you a list of LlamaIndex Documents, ready to pass straight into VectorStoreIndex.from_documents or any node parser.

Markdown by default

as_markdown is on out of the box, so pages arrive as clean, LLM-ready Markdown instead of raw HTML noise. Set it to False if you want HTML.

Handles modern websites

Full-browser rendering and automatic web access handling mean JavaScript-heavy and dynamic pages load with high success rates.

Sync and async

Use load_data for scripts or aload_data inside async ingestion pipelines without blocking the event loop.
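
One way to use aload_data in a larger ingestion job is to split the URL list into batches and cap how many batches are in flight at once. The sketch below assumes only that the reader exposes an awaitable `aload_data(urls)` returning a list, as described above; `load_concurrently`, `batch_size`, and `max_concurrency` are hypothetical names, and whether the reader already parallelizes internally is not stated here.

```python
import asyncio
from typing import Any, List, Sequence


async def load_concurrently(
    reader: Any,
    urls: Sequence[str],
    batch_size: int = 5,
    max_concurrency: int = 3,
) -> List[Any]:
    """Await reader.aload_data on batches of URLs, a few batches at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    batches = [list(urls[i : i + batch_size]) for i in range(0, len(urls), batch_size)]

    async def load_batch(batch: List[str]) -> List[Any]:
        # The semaphore caps how many batches are being fetched at once
        async with semaphore:
            return await reader.aload_data(batch)

    # gather returns results in the same order as the batches were submitted
    results = await asyncio.gather(*(load_batch(batch) for batch in batches))
    return [doc for batch_docs in results for doc in batch_docs]
```

In a script this can be driven with `documents = asyncio.run(load_concurrently(reader, urls))`; inside an existing event loop, await it directly.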

Residential proxies included

Managed sessions and residential proxies are built into every request, with no separate proxy setup or billing.

Pay only for successes

Pay-as-you-go with no subscription. Failed requests are free, so a flaky URL never shows up on your bill.

Popular use cases

RAG over public documentation

Pull product docs, knowledge bases, or articles into a vector store and answer questions grounded in current web content.

Agent web tools

Give a LlamaIndex agent a tool that fetches and indexes live pages on demand for up-to-date retrieval.
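
A minimal sketch of such a tool is a plain function that fetches one URL through the reader and returns its text. It assumes only `load_data` and `Document.text` as described on this page; `make_fetch_page_tool` and `max_chars` are hypothetical names, and indexing the fetched page is left out to keep the example short.

```python
from typing import Any, Callable


def make_fetch_page_tool(reader: Any, max_chars: int = 4000) -> Callable[[str], str]:
    """Return a function an agent can call to fetch one page as text."""

    def fetch_page(url: str) -> str:
        """Fetch a web page and return its content as Markdown text."""
        documents = reader.load_data([url])
        if not documents:
            return f"No content returned for {url}"
        # Truncate so a long page does not overflow the agent's context window
        return documents[0].text[:max_chars]

    return fetch_page
```

With LlamaIndex installed, the returned function can be wrapped via `FunctionTool.from_defaults(fn=fetch_page)` from `llama_index.core.tools` and handed to an agent; the docstring of `fetch_page` then serves as the tool description.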

Periodic content refresh

Re-run load_data on a schedule to keep your index in sync with sources that change often.
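
To avoid re-embedding pages that have not changed between runs, a refresh job can compare content hashes. The sketch below is one possible approach under stated assumptions: it works on `(url, text)` pairs that you assemble yourself (for example by loading one URL at a time), because this page does not say which metadata each Document carries. `content_hash` and `find_changed` are hypothetical helpers.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_changed(
    previous: Dict[str, str],
    pages: Iterable[Tuple[str, str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Compare freshly loaded pages against the hashes stored from the last run.

    Returns the URLs whose content is new or changed, plus the updated hash map
    to persist for the next run.
    """
    current = dict(previous)
    changed: List[str] = []
    for url, text in pages:
        digest = content_hash(text)
        if previous.get(url) != digest:
            changed.append(url)
        current[url] = digest
    return changed, current
```

Only the URLs in the first return value need to be re-indexed; the second value is what the job stores (in a file or database of your choice) until the next scheduled run.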

LlamaIndex FAQ

What does the reader return?

ScrappeyReader.load_data(urls) returns a list of LlamaIndex Document objects. By default Document.text holds clean Markdown, ready for indexing and retrieval.

How is it priced?

Scrappey is pay-as-you-go with no subscription. After the free trial, pricing is EUR 0.20 per 1,000 direct HTTP requests or EUR 1.00 per 1,000 full-browser requests. Residential proxies are included, and you pay only for successful requests.
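
As a worked example of these rates, the snippet below estimates spend from counts of successful requests. It is a back-of-the-envelope sketch using only the two prices quoted above; `estimate_cost_eur` is a hypothetical helper, and which rate applies to a given reader call is not specified in this answer.

```python
from decimal import Decimal

# Rates quoted above, in EUR per 1,000 successful requests
EUR_PER_1000_DIRECT = Decimal("0.20")
EUR_PER_1000_BROWSER = Decimal("1.00")


def estimate_cost_eur(direct_successes: int, browser_successes: int) -> Decimal:
    """Estimate spend from successful request counts; failed requests are not billed."""
    cost = (
        Decimal(direct_successes) * EUR_PER_1000_DIRECT
        + Decimal(browser_successes) * EUR_PER_1000_BROWSER
    ) / Decimal(1000)
    return cost.quantize(Decimal("0.01"))


# 5,000 direct + 2,000 full-browser: 5 * 0.20 + 2 * 1.00 = 3.00
print(estimate_cost_eur(5_000, 2_000))
```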

Do I get HTML or Markdown?

Markdown by default, because as_markdown defaults to True. Pass as_markdown=False to ScrappeyReader if you prefer raw HTML in Document.text.

Can I use it in an async pipeline?

Yes. Alongside load_data there is an async aload_data(urls) method you can await inside async ingestion or agent code.

What constructor options are available?

ScrappeyReader takes api_key (required), plus optional api_url (defaults to the Scrappey API endpoint), timeout (defaults to 120 seconds), and as_markdown (defaults to True).
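
One way to keep these options in a single place is a small settings object that mirrors the documented defaults and produces keyword arguments for the constructor. This is a sketch, not part of the package: `ReaderSettings` and `to_kwargs` are hypothetical names, and `api_url` is left unset so the reader falls back to its own default endpoint.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ReaderSettings:
    """Mirror of the documented ScrappeyReader options and their defaults."""

    api_key: str
    api_url: Optional[str] = None  # None: let the reader use its default endpoint
    timeout: int = 120             # seconds, the documented default
    as_markdown: bool = True       # False returns raw HTML in Document.text

    def to_kwargs(self) -> Dict[str, Any]:
        """Validate the settings and return constructor keyword arguments."""
        if not self.api_key:
            raise ValueError("api_key is required")
        if self.timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        kwargs: Dict[str, Any] = {
            "api_key": self.api_key,
            "timeout": self.timeout,
            "as_markdown": self.as_markdown,
        }
        if self.api_url is not None:
            kwargs["api_url"] = self.api_url
        return kwargs
```

With the package installed, the reader would then be created as `ScrappeyReader(**ReaderSettings(api_key="YOUR_API_KEY", timeout=60).to_kwargs())`.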

Does it handle JavaScript-heavy pages?

Yes. Requests use full-browser rendering with automatic web access handling and managed sessions, so dynamic and modern websites load with high success rates.
