

The Scrappey reader for LlamaIndex turns any URL into clean, LLM-ready Document objects. Load JavaScript-heavy and modern websites as Markdown, then index them for retrieval in a few lines of Python.
Requires a LlamaIndex install (pip install llama-index) and a Scrappey API key.
pip install llama-index-readers-scrappey

Add the Scrappey reader package alongside LlamaIndex.

pip install llama-index llama-index-readers-scrappey

Grab a key after registering, then expose it as an environment variable.

export SCRAPPEY_API_KEY="YOUR_API_KEY"

Instantiate ScrappeyReader and call load_data with the URLs you want to ingest. Each page returns a LlamaIndex Document with Markdown content by default.
import os
from llama_index.readers.scrappey import ScrappeyReader
reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])
documents = reader.load_data(["https://example.com"])
print(documents[0].text[:500])

import os
from llama_index.core import VectorStoreIndex
from llama_index.readers.scrappey import ScrappeyReader
# as_markdown=True (the default) returns clean, LLM-ready Markdown
reader = ScrappeyReader(api_key=os.environ["SCRAPPEY_API_KEY"])
documents = reader.load_data([
    "https://example.com",
    "https://en.wikipedia.org/wiki/Web_scraping",
])
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize what web scraping is."))

# The reader wraps this canonical Scrappey request for each URL.
# markdownResponse:true gives the Markdown that lands in Document.text.
curl -X POST "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "request.get",
    "url": "https://example.com",
    "markdownResponse": true
  }'

load_data hands you a list of LlamaIndex Documents, ready to pass straight into VectorStoreIndex.from_documents or any node parser.
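For readers who want to see the canonical request from the curl example in Python, here is a minimal sketch using only the standard library. The endpoint, the key query parameter, and the three payload fields are copied from the curl example; the function names build_scrappey_request and send are hypothetical, and the response schema is not described here, so send returns the decoded JSON untouched rather than guessing at field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
SCRAPPEY_ENDPOINT = "https://publisher.scrappey.com/api/v1"


def build_scrappey_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "cmd": "request.get",
        "url": url,
        "markdownResponse": True,  # ask for Markdown instead of raw HTML
    }
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        f"{SCRAPPEY_ENDPOINT}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body as-is.

    The response layout is an unknown here, so nothing is extracted from it.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

In normal use ScrappeyReader makes this call for you; building the request by hand is mainly useful for debugging a single URL outside LlamaIndex.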
as_markdown is on out of the box, so pages arrive as clean, LLM-ready Markdown instead of raw HTML noise. Set it to False if you want HTML.
Full-browser rendering and automatic web access handling mean JavaScript-heavy and dynamic pages load with high success rates.
Use load_data for scripts or aload_data inside async ingestion pipelines without blocking the event loop.
Managed sessions and residential proxies are built in on every request, with no separate proxy setup or billing.
Pay-as-you-go with no subscription. Failed requests are free, so a flaky URL never shows up on your bill.
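As an illustration of the async path mentioned above, the sketch below splits a URL list into batches and awaits aload_data on each batch with a cap on concurrency. It assumes aload_data accepts a list of URLs just like load_data and returns a list of Documents; the helper name ingest_concurrently and the batch_size and max_concurrency parameters are hypothetical, not part of the reader.

```python
import asyncio


async def ingest_concurrently(reader, urls, batch_size=5, max_concurrency=2):
    """Load URLs in batches via reader.aload_data without blocking the event loop.

    Assumption: reader.aload_data(list_of_urls) is awaitable and returns a list.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    batches = [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

    async def load_batch(batch):
        # The semaphore caps how many batches are in flight at once.
        async with semaphore:
            return await reader.aload_data(batch)

    # gather returns results in the order the batches were submitted.
    results = await asyncio.gather(*(load_batch(batch) for batch in batches))
    return [doc for batch_docs in results for doc in batch_docs]
```

With a real reader this would be called as asyncio.run(ingest_concurrently(reader, urls)) from a script, or awaited directly inside an existing async pipeline.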
Pull product docs, knowledge bases, or articles into a vector store and answer questions grounded in current web content.
Give a LlamaIndex agent a tool that fetches and indexes live pages on demand for up-to-date retrieval.
Re-run load_data on a schedule to keep your index in sync with sources that change often.
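One way to approach the scheduled re-run is to reload every source but only pass along pages whose content actually changed. The sketch below does this with a SHA-256 hash per URL. It assumes load_data accepts a list of URLs and that each returned Document exposes a text attribute, as in the examples above; the helper name changed_documents and the seen_hashes mapping are hypothetical, and how the changed Documents are written back into your index is left to the caller.

```python
import hashlib


def changed_documents(reader, urls, seen_hashes):
    """Reload each URL and return only Documents whose text changed.

    seen_hashes maps URL -> SHA-256 hex digest of the text from the previous
    run. It is updated in place so it can be persisted between runs.
    """
    changed = []
    for url in urls:
        # One URL per call so the returned Documents can be tied to that URL.
        docs = reader.load_data([url])
        text = "\n".join(doc.text for doc in docs)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(url) != digest:
            seen_hashes[url] = digest
            changed.extend(docs)
    return changed
```

Calling one URL at a time trades some throughput for a simple URL-to-content mapping; a cron job or any scheduler can invoke this function and persist seen_hashes (for example as JSON) between runs.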