LangChain

Feed clean web data into your LangChain chains and agents

@scrappey/langchain is the official Scrappey document loader for LangChain.js. It fetches JavaScript-heavy and modern websites as LLM-ready Markdown, so the output drops straight into a text splitter, vector store, or agent — no local HTML parsing required.

npm: @scrappey/langchain | GitHub: langchain-scrappey

Quick start

Requires Node 18+. @langchain/core is the only peer dependency.

Install with npm

bash
```
npm install @scrappey/langchain @langchain/core
```

Set your API key
Register at scrappey.com to get your key (free trial included), then expose it to your environment. The loader reads SCRAPPEY_API_KEY by default, or you can pass apiKey directly.
bash
```
export SCRAPPEY_API_KEY="your_api_key"
```
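
The lookup described above can be sketched as a small helper. This is an illustration only: `resolveApiKey` is a hypothetical function, and the assumption that an explicitly passed key takes precedence over the environment variable is not something this page confirms about the loader's internals.

```typescript
// Hypothetical helper: mirrors the documented inputs (an explicit apiKey or
// the SCRAPPEY_API_KEY variable), not the loader's actual implementation.
type Env = Record<string, string | undefined>;

function resolveApiKey(explicit: string | undefined, env: Env): string {
  // Assumption: a non-empty explicit key wins over the environment variable.
  const candidate = explicit?.trim() || env["SCRAPPEY_API_KEY"]?.trim();
  if (!candidate) {
    throw new Error("Missing API key: set SCRAPPEY_API_KEY or pass apiKey.");
  }
  return candidate;
}
```

In Node you would call it as `resolveApiKey(options.apiKey, process.env)`, where `options` stands for whatever configuration object your application uses. Failing early with a clear message is usually easier to debug than a rejected request later on.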

Load a page as Markdown

Create a ScrappeyLoader with one or more URLs and call load(). Each URL becomes one LangChain Document whose pageContent holds Markdown generated on Scrappey's servers.

typescript
```
import { ScrappeyLoader } from "@scrappey/langchain";

const loader = new ScrappeyLoader({
  urls: ["https://example.com", "https://news.ycombinator.com"],
});

const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 120));
```

Stream large URL lists

Use lazyLoad() to process documents one at a time as each page lands, so you can embed and persist without buffering the whole batch.

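typescript
```
import { ScrappeyLoader } from "@scrappey/langchain";

// Placeholder list: substitute the URLs you want to ingest.
const bigUrlList: string[] = ["https://example.com/page-1", "https://example.com/page-2"];

const loader = new ScrappeyLoader({ urls: bigUrlList, concurrency: 2 });
for await (const doc of loader.lazyLoad()) {
  // Embed and persist each Document as soon as it arrives.
}
```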
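
Embedding APIs are often cheaper to call in small batches than one document at a time. The sketch below groups items from any async iterable into fixed-size batches. `batched`, `fakeLazyLoad`, and `collectBatchSizes` are hypothetical names, and `fakeLazyLoad` is a stand-in that yields canned documents so the example runs without network access.

```typescript
// Generic helper: groups items from any async iterable into arrays of `size`.
async function* batched<T>(source: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let batch: T[] = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Stand-in for loader.lazyLoad(): yields fake documents, no network involved.
async function* fakeLazyLoad(urls: string[]) {
  for (const url of urls) {
    yield { pageContent: `# Page at ${url}`, metadata: { source: url } };
  }
}

async function collectBatchSizes(urls: string[], size: number): Promise<number[]> {
  const sizes: number[] = [];
  for await (const batch of batched(fakeLazyLoad(urls), size)) {
    // In a real pipeline this is where you would embed and persist `batch`.
    sizes.push(batch.length);
  }
  return sizes;
}
```

Swapping `fakeLazyLoad(urls)` for `loader.lazyLoad()` should work the same way, since the `for await` loop above implies it returns an async iterable. Only one batch is held in memory at a time.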

Full examples

End-to-end RAG pipeline

typescript
```
// Also install: langchain @langchain/textsplitters @langchain/openai
// OpenAIEmbeddings reads OPENAI_API_KEY from the environment by default.
import { ScrappeyLoader } from "@scrappey/langchain";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// Scrappey returns LLM-ready Markdown, so it drops straight into a splitter.
const loader = new ScrappeyLoader({
  urls: ["https://en.wikipedia.org/wiki/Web_scraping"],
  concurrency: 2,
  skipOnError: true,
});

const docs = await loader.load();

const splits = await new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
}).splitDocuments(docs);

const store = await MemoryVectorStore.fromDocuments(
  splits,
  new OpenAIEmbeddings()
);

const hits = await store.similaritySearch("What is web scraping?", 3);
console.log(hits.map((h) => h.metadata.source));
```
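
To make the `chunkSize` and `chunkOverlap` settings concrete, here is a deliberately simplified fixed-window chunker. It is not how RecursiveCharacterTextSplitter works (that splitter tries to break on separators such as paragraphs and lines before falling back to smaller units). `slidingWindowChunks` is a hypothetical helper that only shows how the two numbers interact.

```typescript
// Simplified illustration: fixed-size windows that overlap by `chunkOverlap`
// characters. Real splitters respect text boundaries; this one does not.
function slidingWindowChunks(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize < 1 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("require chunkSize >= 1 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

With the values used above (1000 and 150), each window advances 850 characters, so under this simplified scheme a 2,000-character page yields three chunks starting at offsets 0, 850, and 1700. The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk.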

Calling the Scrappey API directly

bash
```
# The loader wraps this canonical Scrappey request under the hood.
# Add "markdown": true for LLM-ready Markdown instead of HTML.
curl -X POST "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "request.get",
    "url": "https://example.com",
    "markdown": true
  }'
```
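
If you prefer to make the same call from TypeScript, the request can be assembled as below. The endpoint and body fields are copied from the curl example. `buildScrappeyRequest` and the `ScrappeyRequest` type are hypothetical names introduced for this sketch, and the response format is not described on this page, so the sketch stops at building the request.

```typescript
// Endpoint and body fields are taken from the curl example above.
const SCRAPPEY_ENDPOINT = "https://publisher.scrappey.com/api/v1";

interface ScrappeyRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the request without sending it.
function buildScrappeyRequest(apiKey: string, targetUrl: string, markdown = true): ScrappeyRequest {
  // The key travels in the query string, so encode it defensively.
  const url = `${SCRAPPEY_ENDPOINT}?key=${encodeURIComponent(apiKey)}`;
  const body = JSON.stringify({ cmd: "request.get", url: targetUrl, markdown });
  return {
    url,
    init: { method: "POST", headers: { "Content-Type": "application/json" }, body },
  };
}
```

On Node 18+ you can send it with `const res = await fetch(req.url, req.init);` and inspect `await res.json()` before relying on any particular field. Keeping request construction separate from sending makes the payload easy to unit-test without network access.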

Why use the LangChain loader

Drop-in Document loader

ScrappeyLoader implements the standard LangChain loader interface with load() and lazyLoad(), so it plugs into existing chains, splitters, and vector stores.
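
Because the loader exposes only `load()` and `lazyLoad()`, it is straightforward to substitute an in-memory stand-in when unit-testing a pipeline. The sketch below uses a simplified local mirror of that shape. `Doc`, `DocLoader`, and `InMemoryLoader` are hypothetical names, nothing is imported from @langchain/core, and the real base class may carry more members.

```typescript
// Simplified local mirror of the loader shape described above.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface DocLoader {
  load(): Promise<Doc[]>;
  lazyLoad(): AsyncGenerator<Doc>;
}

// Hypothetical test double: serves canned Markdown instead of fetching pages.
class InMemoryLoader implements DocLoader {
  private readonly pages: Record<string, string>;

  constructor(pages: Record<string, string>) {
    this.pages = pages;
  }

  // Yields one document per entry, in insertion order.
  async *lazyLoad(): AsyncGenerator<Doc> {
    for (const [source, markdown] of Object.entries(this.pages)) {
      yield { pageContent: markdown, metadata: { source } };
    }
  }

  // Eager variant: drains lazyLoad() into an array.
  async load(): Promise<Doc[]> {
    const docs: Doc[] = [];
    for await (const doc of this.lazyLoad()) docs.push(doc);
    return docs;
  }
}
```

Code written against the `DocLoader` shape can then be exercised in tests with `new InMemoryLoader({ "https://example.com": "# Hello" })`, with no API key or network access required.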

Server-side Markdown

Pages come back as clean, LLM-ready Markdown converted on Scrappey's side — no local HTML-to-Markdown step before chunking and embedding.

JavaScript-heavy sites handled

Scrappey renders each page in a full browser with automatic web access handling, so content from modern, dynamic websites loads reliably.

Rich metadata per Document

Each Document carries metadata like source URL, status code, final URL after redirects, and the session Scrappey used — useful for citations and debugging.

Concurrency and error control

Tune the concurrency option for parallel fetches and set skipOnError to omit failed URLs instead of throwing.
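
To illustrate what these two options mean in practice, here is a minimal sketch of a concurrency-limited map that can either skip failures or propagate them. `mapWithLimit` is a hypothetical helper written for this page. It shows the general technique, not the loader's actual implementation.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight. Results keep input order; failures are dropped when
// skipOnError is true and rethrown otherwise.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  skipOnError: boolean,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: (R | undefined)[] = new Array(items.length);
  const ok: boolean[] = new Array(items.length).fill(false);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = await worker(items[index]);
        ok[index] = true;
      } catch (error) {
        if (!skipOnError) throw error;
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results.filter((_, i) => ok[i]) as R[];
}
```

A low limit such as 2 keeps memory use and request bursts predictable. Skipping errors suits bulk ingestion where a few dead links are expected, while throwing suits jobs where every URL must succeed.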

Zero runtime dependencies

Built on native fetch with @langchain/core as the only peer dependency. The package ships as a dual ESM + CJS build with bundled TypeScript types.

Popular use cases

RAG over public web pages

Load documentation, articles, or knowledge bases as Markdown, chunk them, and embed into a vector store for retrieval-augmented generation.

Agent web research tools

Give a LangChain agent the ability to pull fresh content from JavaScript-heavy sites the user has the right to access.

Scheduled ingestion jobs

Stream a large URL list with lazyLoad() and persist embeddings incrementally for nightly index refreshes.

LangChain FAQ

How much does it cost to run?

Scrappey is pay-as-you-go with no subscription. You get a free trial, then pay €0.10 per 1,000 direct HTTP requests or €1.00 per 1,000 full-browser requests. Residential proxies are included, and you only pay for successful requests.
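
As a quick way to budget a job, the arithmetic can be sketched as follows. The two prices come from the answer above. `estimateCostEur` is a hypothetical helper that assumes strictly linear billing on successful requests, with no minimums or rounding rules, which this page does not spell out.

```typescript
// Prices quoted above, expressed in euro cents per 1,000 successful requests.
const DIRECT_CENTS_PER_1000 = 10; // €0.10
const BROWSER_CENTS_PER_1000 = 100; // €1.00

// Hypothetical estimator: assumes linear billing with no minimum charge.
function estimateCostEur(directRequests: number, browserRequests: number): number {
  // Work in cents to avoid accumulating floating-point error.
  const cents =
    (directRequests * DIRECT_CENTS_PER_1000 + browserRequests * BROWSER_CENTS_PER_1000) / 1000;
  return cents / 100;
}
```

For example, 50,000 direct requests plus 10,000 full-browser requests comes to €5.00 + €10.00 = €15.00 under these assumptions.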

How do I pass my API key?

Set the SCRAPPEY_API_KEY environment variable and the loader picks it up automatically, or pass apiKey directly in the ScrappeyLoader constructor. Never hardcode the key in committed code.

Do I get Markdown or HTML?

Markdown by default. The loader's mode option accepts "markdown" (server-side, the default) or "html" for the raw body. Markdown is recommended for RAG and LLM ingestion.

What does each loaded Document contain?

One Document per URL. pageContent holds the Markdown or HTML, and metadata includes fields such as source, statusCode, verified, and timeElapsedMs, plus the Scrappey session and the final URL after redirects.
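
These metadata fields make it easy to screen documents before embedding them. The sketch below separates usable pages from ones worth retrying. It assumes `statusCode` is the numeric HTTP status of the target page and that `source` is a string, neither of which is specified here. `LoadedDoc` and `splitByStatus` are hypothetical names.

```typescript
// Field names follow the list above; their exact types are assumptions.
interface LoadedDoc {
  pageContent: string;
  metadata: { source: string; statusCode?: number; [key: string]: unknown };
}

// Hypothetical helper: keeps documents with a 2xx status and non-empty
// content, and reports the source URL of everything else.
function splitByStatus(docs: LoadedDoc[]): { usable: LoadedDoc[]; rejected: string[] } {
  const usable: LoadedDoc[] = [];
  const rejected: string[] = [];
  for (const doc of docs) {
    const status = doc.metadata.statusCode;
    const isSuccess = typeof status === "number" && status >= 200 && status < 300;
    if (isSuccess && doc.pageContent.trim().length > 0) {
      usable.push(doc);
    } else {
      rejected.push(doc.metadata.source);
    }
  }
  return { usable, rejected };
}
```

The `rejected` list can feed a retry queue or a log entry, while `usable` goes on to the splitter. Keeping `source` on each chunk is what later lets a RAG answer cite the page it came from.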

Can it handle JavaScript-heavy sites?

Yes. Scrappey renders pages in a full browser with automatic web access handling, so dynamic, modern websites load reliably.

Does it work with TypeScript and CommonJS?

Yes. The package ships a dual ESM + CJS build with bundled TypeScript types and supports Node 18 and above.
