Web Scraping With Node.js

By the Scrappey Research Team

Web scraping with Node.js means fetching a page (with Axios or the built-in fetch) and parsing it with Cheerio for static sites, or driving a real browser with Playwright or Puppeteer for JavaScript-rendered ones. JavaScript is a natural fit for scraping because the same language runs in the browser you are scraping. The 2026 stack is Axios + Cheerio for static pages, Playwright for dynamic pages, and Crawlee as the production crawler framework.

Static parsing	Cheerio v1.x — fast, jQuery-like server-side parsing
HTTP client	Axios or the built-in fetch / undici (node-fetch is legacy)
JavaScript pages	Playwright (recommended) or Puppeteer
Crawler framework	Crawlee — queues, proxies, retries built in
Install	npm install axios cheerio

Your first Node.js scraper with Axios + Cheerio

The canonical static-scraping combo is Axios to fetch and Cheerio to parse. Cheerio gives you a jQuery-like $ API on the server. Install with npm install axios cheerio.

const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  const { data: html } = await axios.get('https://books.toscrape.com/', {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' },
  });

  const $ = cheerio.load(html);

  $('article.product_pod').each((i, el) => {
    const title = $(el).find('h3 a').attr('title');
    const price = $(el).find('.price_color').text();
    console.log(`${title} | ${price}`);
  });
})();

Cheerio mirrors jQuery: $(selector) selects, .find() drills down, .text() reads text, .attr() reads attributes, and .each() iterates. Modern Node (18+) also ships a global fetch, so you can drop Axios for simple GETs if you prefer zero dependencies for the HTTP layer.
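As a rough sketch of the zero-dependency option, the helper below swaps Axios for the built-in fetch. The name fetchHtml and the injectable fetchImpl parameter are illustrative choices, not part of any library; the parameter is there only so the function can be exercised without network access.

```javascript
// Fetch a page with Node's built-in fetch (Node 18+), no Axios required.
// fetchHtml and fetchImpl are hypothetical names used for this sketch.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' },
  });

  // Unlike Axios, fetch does not reject on 4xx/5xx responses,
  // so the status has to be checked explicitly.
  if (!res.ok) {
    throw new Error(`Request failed with HTTP ${res.status} for ${url}`);
  }

  return res.text();
}
```

The returned string can go straight into Cheerio, for example cheerio.load(await fetchHtml('https://books.toscrape.com/')).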

Scraping JavaScript-rendered pages with Playwright

Cheerio only parses static HTML — it does not run JavaScript. For client-side-rendered pages, Playwright drives a real browser with built-in auto-waiting. Install with npm install playwright then npx playwright install chromium.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://quotes.toscrape.com/js/');

  await page.waitForSelector('.quote');   // wait for JS to render

  // $$eval passes ALL matching elements as an array; $eval would pass only the first.
  const quotes = await page.$$eval('.quote', (els) =>
    els.map((e) => ({
      text: e.querySelector('.text').innerText,
      author: e.querySelector('.author').innerText,
    }))
  );

  console.log(quotes);
  await browser.close();
})();

A common production pattern is hybrid: let Playwright render the page, grab await page.content(), then parse that HTML with Cheerio — you get the browser's rendering with Cheerio's fast, familiar extraction. Puppeteer is the Chrome-only alternative; Playwright is the recommended default in 2026 for its multi-browser support and cleaner API.
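The hybrid pattern could look roughly like the sketch below. The helper names renderAndParse and scrapeQuotes are made up for illustration, and the packages are required lazily inside scrapeQuotes so the file loads even before playwright and cheerio are installed.

```javascript
// Render a page in a browser, then hand the final HTML to any parser function.
// renderAndParse and scrapeQuotes are hypothetical helper names.
async function renderAndParse(page, url, waitFor, parse) {
  await page.goto(url);
  await page.waitForSelector(waitFor); // wait until client-side rendering is done
  const html = await page.content();   // serialized DOM after JavaScript has run
  return parse(html);
}

async function scrapeQuotes(url) {
  const { chromium } = require('playwright');
  const cheerio = require('cheerio');

  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    return await renderAndParse(page, url, '.quote', (html) => {
      const $ = cheerio.load(html);
      // .map(...).get() turns the Cheerio selection into a plain array.
      return $('.quote')
        .map((i, el) => ({
          text: $(el).find('.text').text(),
          author: $(el).find('.author').text(),
        }))
        .get();
    });
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}
```

Calling scrapeQuotes('https://quotes.toscrape.com/js/') should resolve to an array of { text, author } objects. Keeping the parser as a plain function of HTML means the same extraction code can be reused on pages fetched without a browser.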

Production crawlers with Crawlee

For real crawlers — queues, retries, proxy rotation, and automatic scaling — Crawlee is the framework most competitor guides miss. It wraps Cheerio and Playwright with production concerns built in. Install with npm install crawlee.

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, enqueueLinks }) {
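    // Collect the rows first, then await a single write so no promise is left unhandled.
    const books = $('article.product_pod')
      .map((i, el) => ({
        title: $(el).find('h3 a').attr('title'),
        price: $(el).find('.price_color').text(),
      }))
      .get();
    await Dataset.pushData(books);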

    // Automatically follow pagination links.
    await enqueueLinks({ selector: 'li.next a' });
  },
});

await crawler.run(['https://books.toscrape.com/']);

Crawlee handles the request queue, concurrency, retries, and result storage for you, and you can swap CheerioCrawler for PlaywrightCrawler when a target needs JavaScript — same structure, real browser underneath.
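To make the swap concrete, here is a minimal sketch of the same crawl shape on PlaywrightCrawler, pointed at the JavaScript-rendered quotes demo site used earlier. The makeQuoteHandler factory and its save parameter are illustrative names, and the sketch assumes playwright is installed alongside crawlee.

```javascript
// Build a Crawlee request handler that extracts quotes from a rendered page.
// makeQuoteHandler and `save` are hypothetical names; `save` receives the rows.
function makeQuoteHandler(save) {
  return async function requestHandler({ page, enqueueLinks }) {
    await page.waitForSelector('.quote'); // wait for client-side rendering

    // Runs inside the browser: map every .quote element to a plain object.
    const quotes = await page.$$eval('.quote', (els) =>
      els.map((e) => ({
        text: e.querySelector('.text').innerText,
        author: e.querySelector('.author').innerText,
      }))
    );

    await save(quotes);
    await enqueueLinks({ selector: 'li.next a' }); // follow pagination
  };
}

async function main() {
  // Required lazily so the file loads even before the packages are installed.
  const { PlaywrightCrawler, Dataset } = require('crawlee');

  const crawler = new PlaywrightCrawler({
    requestHandler: makeQuoteHandler((rows) => Dataset.pushData(rows)),
  });
  await crawler.run(['https://quotes.toscrape.com/js/']);
}
```

Run it by calling main(). The handler receives a Playwright page instead of Cheerio's $, while queueing and link following keep the same shape as in the CheerioCrawler version.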

Which Node.js library should you use?

Library	Type	Runs JS?	Best for
Axios / fetch	HTTP client	No	Fetching pages and APIs
Cheerio	HTML parser	No	Fast static parsing (jQuery-like)
Playwright	Browser automation	Yes	JavaScript pages — the default
Puppeteer	Browser automation	Yes	Chrome-only headless control
Crawlee	Crawler framework	Yes (optional)	Production crawlers at scale

Start with Axios + Cheerio, add Playwright for JavaScript, and adopt Crawlee when you are running a real, ongoing crawl.

The hard part: handling anti-bot blocking

The Node code is the easy part; anti-bot defenses are what break scrapers. Axios sends a TLS fingerprint no browser sends, and headless Playwright leaks automation signals that Cloudflare, DataDome, and Akamai flag. And when the response is a 403 or a CAPTCHA page, Cheerio has nothing useful to extract: the data you want is simply not in that HTML.
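One practical consequence is that a scraper should notice a block page before it tries to parse it. The check below is a rough heuristic sketch: the status codes and text markers are assumptions chosen for illustration, not an authoritative list of what any particular anti-bot vendor returns.

```javascript
// Heuristic block detection. Both lists are illustrative assumptions;
// tune them against the responses your own targets actually send.
const BLOCK_STATUSES = new Set([403, 429, 503]);
const BLOCK_MARKERS = ['captcha', 'just a moment', 'access denied', 'verify you are human'];

function looksBlocked(status, html) {
  if (BLOCK_STATUSES.has(status)) return true;

  // Some challenge pages come back with HTTP 200, so also scan the body.
  const text = String(html || '').toLowerCase();
  return BLOCK_MARKERS.some((marker) => text.includes(marker));
}
```

With Axios, passing validateStatus: () => true makes a 403 resolve instead of throwing, so you can call looksBlocked(res.status, res.data) before handing the body to cheerio.load. Text markers like these can produce false positives on pages that legitimately mention CAPTCHAs, so treat a positive as a signal to inspect the response, not as proof.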

Handling this means residential proxies and a real browser fingerprint — and keeping them coherent. A scraping API handles it server-side, so your Node code posts the URL and parses the returned HTML with Cheerio:

const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  const { data } = await axios.post(
    'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY',
    { cmd: 'request.get', url: 'https://example.com/protected' }
  );

  // Fully rendered, unblocked HTML -- parse it with Cheerio as usual.
  const html = data.solution.response;
  const $ = cheerio.load(html);
  console.log($('title').text());
})();

Frequently asked questions

What is the best library for web scraping with Node.js?

For static pages, Axios (or the built-in fetch) plus Cheerio is the standard — Cheerio gives you fast, jQuery-like parsing on the server. For JavaScript-rendered pages, Playwright is the recommended browser-automation choice in 2026, with Puppeteer as the Chrome-only alternative. For production crawlers with queues, retries, and proxy rotation, use Crawlee.

Is Playwright or Puppeteer better for Node.js scraping?

Playwright is the better default in 2026: it supports Chromium, Firefox, and WebKit, has built-in auto-waiting, and a cleaner API. Puppeteer is Chrome/Chromium-only but is still solid and well documented. Both run a real browser, so both are heavier and more detectable than a plain Axios + Cheerio request.

Can I use Cheerio for JavaScript-rendered pages?

Not directly — Cheerio only parses the static HTML you give it and does not execute JavaScript. The common pattern is to render the page with Playwright or Puppeteer, take page.content(), and then parse that HTML with Cheerio. Alternatively, call the JSON API the page fetches its data from and skip the browser entirely.
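As an illustration of the JSON route, the sketch below pages through the data endpoint that the quotes demo site appears to call from its scrolling page. The URL and the response shape (quotes, has_next, author.name) are assumptions to verify in your browser's network tab; fetchAllQuotes, fetchImpl, and maxPages are hypothetical names.

```javascript
// Read data straight from a JSON endpoint instead of rendering the page.
// The endpoint and field names are assumed, not confirmed by this guide.
async function fetchAllQuotes(fetchImpl = globalThis.fetch, maxPages = 20) {
  const quotes = [];

  for (let page = 1; page <= maxPages; page += 1) {
    const res = await fetchImpl(`https://quotes.toscrape.com/api/quotes?page=${page}`);
    if (!res.ok) throw new Error(`HTTP ${res.status} on page ${page}`);

    const body = await res.json();
    for (const q of body.quotes) {
      quotes.push({ text: q.text, author: q.author.name });
    }

    if (!body.has_next) break; // stop when the API reports no further pages
  }

  return quotes;
}
```

The maxPages cap is a safety net so a misread pagination flag cannot turn into an endless loop.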

Why does my Node.js scraper get blocked, and how do I fix it?

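Scrapers usually get blocked because their traffic does not look like a real browser: Axios sends a TLS fingerprint no browser sends, and headless Playwright leaks automation signals. Start with realistic headers, throttle requests, and rotate residential proxies. Against Cloudflare, DataDome, or Akamai you also need a coherent, browser-grade fingerprint, which Axios lacks and which headless Playwright undermines with its automation signals. Many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.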
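Throttling and retrying can be sketched in a few lines. The helper names (withRetry, scrapeSequentially), the delay values, and the injectable wait function are all illustrative assumptions, and this covers only pacing, not proxies or fingerprints.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a failing task with exponential backoff plus a little random jitter.
async function withRetry(task, { retries = 3, baseDelayMs = 500, wait = sleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of attempts, give up
      // Base delays of 500 ms, 1 s, 2 s, ... plus up to 250 ms of jitter.
      await wait(baseDelayMs * 2 ** attempt + Math.random() * 250);
    }
  }
  throw lastError;
}

// Fetch URLs one at a time with a pause between requests.
async function scrapeSequentially(urls, fetchOne, { delayMs = 1000, wait = sleep } = {}) {
  const results = [];
  for (const url of urls) {
    results.push(await withRetry(() => fetchOne(url), { wait }));
    await wait(delayMs); // throttle: never hit the site back-to-back
  }
  return results;
}
```

Here fetchOne is whatever function performs a single request, for example (url) => axios.get(url).then((res) => res.data). Once you adopt Crawlee, its built-in retries and concurrency controls replace helpers like these.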

Last updated: 2026-06-08