LangChain

Feed clean web data into your LangChain chains and agents

@scrappey/langchain is the official Scrappey document loader for LangChain.js. It fetches JavaScript-heavy and modern websites as LLM-ready Markdown, so the output drops straight into a text splitter, vector store, or agent — no local HTML parsing required.

Quick start

Requires Node 18+. @langchain/core is the only peer dependency.

Install with npm

bash
npm install @scrappey/langchain @langchain/core
  1. Set your API key

    Register at scrappey.com to get your key (150 free requests included), then export it as an environment variable. The loader reads SCRAPPEY_API_KEY by default, or you can pass apiKey directly.

    bash
    export SCRAPPEY_API_KEY="your_api_key"
  2. Load a page as Markdown

    Create a ScrappeyLoader with one or more URLs and call load(). Each URL becomes a LangChain Document whose pageContent is server-side Markdown.

    typescript
    import { ScrappeyLoader } from "@scrappey/langchain";
    
    const loader = new ScrappeyLoader({
      urls: ["https://example.com", "https://news.ycombinator.com"],
    });
    
    const docs = await loader.load();
    console.log(docs[0].pageContent.slice(0, 120));
  3. Stream large URL lists

    Use lazyLoad() to process documents one at a time as each page lands, so you can embed and persist without buffering the whole batch.

    typescript
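    import { ScrappeyLoader } from "@scrappey/langchain";

    // Placeholder: any array of URL strings works here.
    const bigUrlList: string[] = ["https://example.com", "https://news.ycombinator.com"];
    const loader = new ScrappeyLoader({ urls: bigUrlList, concurrency: 2 });
    for await (const doc of loader.lazyLoad()) {
      // embed / persist each Document as soon as it arrives
    }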
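
When documents arrive one at a time, it is often cheaper to embed them in small groups than individually. The following is a minimal sketch of that idea, assuming a hypothetical `batched` helper that is not part of @scrappey/langchain; it works on any async iterable, including the one returned by lazyLoad().

```typescript
// Hypothetical helper: groups items from any async iterable into fixed-size batches.
async function* batched<T>(
  source: AsyncIterable<T>,
  size: number,
): AsyncGenerator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let batch: T[] = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Usage sketch (assumes `loader` is a ScrappeyLoader and `store` is a
// LangChain vector store, which exposes addDocuments()):
//
// for await (const docs of batched(loader.lazyLoad(), 16)) {
//   await store.addDocuments(docs);
// }
```

The batch size of 16 is an arbitrary illustration; a sensible value depends on the embedding provider's limits.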

Full examples

End-to-end RAG pipeline

typescript
import { ScrappeyLoader } from "@scrappey/langchain";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

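// Besides @scrappey/langchain, this example needs @langchain/textsplitters,
// langchain, and @langchain/openai installed, plus OPENAI_API_KEY set for the embeddings.
// Scrappey returns LLM-ready Markdown, so it drops straight into a splitter.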
const loader = new ScrappeyLoader({
  urls: ["https://en.wikipedia.org/wiki/Web_scraping"],
  concurrency: 2,
  skipOnError: true,
});

const docs = await loader.load();

const splits = await new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
}).splitDocuments(docs);

const store = await MemoryVectorStore.fromDocuments(
  splits,
  new OpenAIEmbeddings()
);

const hits = await store.similaritySearch("What is web scraping?", 3);
console.log(hits.map((h) => h.metadata.source));
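
To make the chunkSize and chunkOverlap settings used above more concrete, here is a rough sketch of how overlapping windows slide across a text. It is a naive fixed-window chunker written for illustration (the name `windowChunks` is hypothetical), not the algorithm RecursiveCharacterTextSplitter uses, which also tries to break on separators such as paragraphs and sentences.

```typescript
// Naive fixed-window chunker, for illustration only.
function windowChunks(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("need chunkSize > chunkOverlap >= 0");
  }
  const step = chunkSize - chunkOverlap; // each window starts this far after the last
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

// A 2,500-character text with the settings from the pipeline above.
const chunkLengths = windowChunks("a".repeat(2500), 1000, 150).map((c) => c.length);
console.log(chunkLengths); // [ 1000, 1000, 800 ]
```

Each chunk repeats the last 150 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.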

Calling the Scrappey API directly

bash
# The loader wraps this canonical Scrappey request under the hood.
# Add "markdownResponse": true for LLM-ready Markdown instead of HTML.
curl -X POST "https://publisher.scrappey.com/api/v1?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cmd": "request.get",
    "url": "https://example.com",
    "markdownResponse": true
  }'
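
For readers who prefer to stay in TypeScript, the same request as the curl command above can be sketched as follows. The endpoint, cmd, and markdownResponse values are copied from that command; the `buildMarkdownRequest` helper and the `ScrappeyRequest` type are hypothetical names introduced here. The sketch only builds the request, since the shape of the response is not described in this page.

```typescript
interface ScrappeyRequest {
  endpoint: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds the same request as the curl command, without sending it.
function buildMarkdownRequest(apiKey: string, url: string): ScrappeyRequest {
  return {
    endpoint: `https://publisher.scrappey.com/api/v1?key=${encodeURIComponent(apiKey)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ cmd: "request.get", url, markdownResponse: true }),
    },
  };
}

const req = buildMarkdownRequest("YOUR_API_KEY", "https://example.com");
console.log(req.init.body);
// {"cmd":"request.get","url":"https://example.com","markdownResponse":true}

// Sending it with the fetch built into Node 18+ would look like:
// const response = await fetch(req.endpoint, req.init);
```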

Why use the LangChain loader

Drop-in Document loader

ScrappeyLoader implements the standard LangChain loader interface with load() and lazyLoad(), so it plugs into existing chains, splitters, and vector stores.

Server-side Markdown

Pages come back as clean, LLM-ready Markdown converted on Scrappey's side — no local HTML-to-Markdown step before chunking and embedding.

JavaScript-heavy sites handled

Scrappey renders each page in a full browser and handles web access automatically, so content from modern, dynamic websites loads reliably.

Rich metadata per Document

Each Document carries metadata like source URL, status code, final URL after redirects, and the session Scrappey used — useful for citations and debugging.

Concurrency and error control

Tune the concurrency option for parallel fetches and set skipOnError to omit failed URLs instead of throwing.
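
To illustrate what these two options mean, here is a rough sketch of a bounded worker pool. It is not the loader's actual implementation; `mapWithLimit` and its parameters are hypothetical names, and the behaviour shown (at most `limit` tasks in flight, failed items either omitted or rethrown) is only an assumption about the general pattern.

```typescript
// Illustrative worker pool: runs at most `limit` tasks at once.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>,
  skipOnError = false,
): Promise<R[]> {
  const slots: Array<{ value: R } | undefined> = new Array(items.length).fill(undefined);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next index (safe: JS runs one callback at a time)
      try {
        slots[index] = { value: await task(items[index]) };
      } catch (error) {
        if (!skipOnError) throw error; // reject the whole call on the first failure
        // otherwise leave the slot empty so the failed item is omitted
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  // Keep input order and drop the slots that failed.
  const results: R[] = [];
  for (const slot of slots) if (slot) results.push(slot.value);
  return results;
}
```

A lower limit is gentler on the target site and on memory; a higher one finishes sooner when pages are slow to render.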

Zero runtime dependencies

Built on native fetch with @langchain/core as the only peer dependency. Dual ESM + CJS build with first-class TypeScript types.

Popular use cases

RAG over public web pages

Load documentation, articles, or knowledge bases as Markdown, chunk them, and embed into a vector store for retrieval-augmented generation.

Agent web research tools

Give a LangChain agent the ability to pull fresh content from JavaScript-heavy sites the user has the right to access.

Scheduled ingestion jobs

Stream a large URL list with lazyLoad() and persist embeddings incrementally for nightly index refreshes.

LangChain FAQ

How much does it cost to run?

Scrappey is pay-as-you-go with no subscription. You get 150 free requests, then pay €0.20 per 1,000 direct HTTP requests or €1.00 per 1,000 full-browser requests. Residential proxies are included, and you only pay for successful requests.
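
As a worked example of those prices, the sketch below estimates the bill for a batch of successful requests. The helper name `estimateCostEur` is hypothetical, and it assumes the 150 free requests are simply deducted from the total before billing, which the pricing text does not spell out. Amounts are computed in cents to avoid floating-point drift.

```typescript
type RequestKind = "direct" | "browser";

// Prices from the answer above, in euro cents per 1,000 successful requests.
const CENTS_PER_THOUSAND: Record<RequestKind, number> = { direct: 20, browser: 100 };
const FREE_REQUESTS = 150;

// Assumption: free requests are deducted from the total before billing.
function estimateCostEur(successfulRequests: number, kind: RequestKind): number {
  const billable = Math.max(0, successfulRequests - FREE_REQUESTS);
  const cents = (billable * CENTS_PER_THOUSAND[kind]) / 1000;
  return cents / 100;
}

console.log(estimateCostEur(10150, "direct"));  // 2
console.log(estimateCostEur(10150, "browser")); // 10
```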

How do I pass my API key?

Set the SCRAPPEY_API_KEY environment variable and the loader picks it up automatically, or pass apiKey directly in the ScrappeyLoader constructor. Never hardcode the key in committed code.
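
One way to fail fast when the key is missing is to resolve it yourself before constructing the loader. This is a minimal sketch, assuming a hypothetical `resolveApiKey` helper that is not part of @scrappey/langchain; the precedence it applies (explicit value first, then SCRAPPEY_API_KEY) is an assumption about sensible behaviour, not a statement of what the loader does internally.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: prefers an explicit key, then falls back to the environment.
function resolveApiKey(explicit: string | undefined, env: Env): string {
  const key = explicit?.trim() || env["SCRAPPEY_API_KEY"]?.trim() || "";
  if (key === "") {
    throw new Error("Missing API key: set SCRAPPEY_API_KEY or pass apiKey");
  }
  return key;
}

// Usage sketch in Node:
// const loader = new ScrappeyLoader({
//   urls: ["https://example.com"],
//   apiKey: resolveApiKey(undefined, process.env),
// });
```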

Do I get Markdown or HTML?

Markdown by default. The loader's mode option accepts "markdown" (server-side, the default) or "html" for the raw body. Markdown is recommended for RAG and LLM ingestion.

What does each loaded Document contain?

One Document per URL, with pageContent holding the Markdown or HTML and metadata fields such as source, statusCode, verified, and timeElapsedMs, plus the Scrappey session and the final URL.
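
As an illustration of using that metadata for citations, the sketch below collects the distinct source URLs from a list of documents. It relies only on the source and statusCode fields named above and assumes statusCode is a numeric HTTP status; `DocLike` and `citedSources` are hypothetical names, and the sample data is made up.

```typescript
// Minimal structural type; a LangChain Document has these two properties.
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Distinct `source` values of documents with a 2xx (or absent) status, in first-seen order.
function citedSources(docs: DocLike[]): string[] {
  const seen = new Set<string>();
  for (const doc of docs) {
    const { source, statusCode } = doc.metadata;
    const ok = typeof statusCode !== "number" || (statusCode >= 200 && statusCode < 300);
    if (typeof source === "string" && ok) seen.add(source);
  }
  return Array.from(seen);
}

const sample: DocLike[] = [
  { pageContent: "# A, part 1", metadata: { source: "https://example.com/a", statusCode: 200 } },
  { pageContent: "# A, part 2", metadata: { source: "https://example.com/a", statusCode: 200 } },
  { pageContent: "", metadata: { source: "https://example.com/b", statusCode: 404 } },
];
console.log(citedSources(sample)); // [ 'https://example.com/a' ]
```

Because splitters copy metadata onto every chunk, the same function works on the hits returned by a similarity search.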

Can it handle JavaScript-heavy sites?

Yes. Scrappey renders pages in a full browser with automatic web access handling, so dynamic, modern websites load reliably.

Does it work with TypeScript and CommonJS?

Yes. The package ships a dual ESM + CJS build with bundled TypeScript types and supports Node 18 and above.

Start building with Scrappey

Try it for free. No subscription required. No credit card required. Instant setup. 150 free requests are waiting for you!