Your first Node.js scraper with Axios + Cheerio
The canonical static-scraping combo is Axios to fetch and Cheerio to parse. Cheerio gives you a jQuery-like $ API on the server. Install with npm install axios cheerio.
const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  const { data: html } = await axios.get('https://books.toscrape.com/', {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' },
  });
  const $ = cheerio.load(html);
  $('article.product_pod').each((i, el) => {
    const title = $(el).find('h3 a').attr('title');
    const price = $(el).find('.price_color').text();
    console.log(`${title} | ${price}`);
  });
})();

Cheerio mirrors jQuery: $(selector) selects, .find() drills down, .text() reads text, .attr() reads attributes, and .each() iterates. Modern Node (18+) also ships a global fetch, so you can drop Axios for simple GETs if you prefer zero dependencies for the HTTP layer.
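If you take the zero-dependency route, a minimal sketch of the fetch step might look like the following. The helper name fetchHtml and its optional fetchImpl parameter are illustrative, not part of any library. The one behavioral difference worth handling is that fetch resolves on 4xx and 5xx responses, whereas Axios rejects on them by default.

```javascript
// Minimal sketch: fetch a page with Node's built-in fetch (Node 18+).
// `fetchHtml` and `fetchImpl` are illustrative names; `fetchImpl` exists
// only so the network call can be swapped out (for example in tests).
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' },
  });
  // fetch does not reject on HTTP error statuses, so check explicitly.
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} for ${url}`);
  }
  return res.text();
}
```

The returned string can be passed straight to cheerio.load(html), so the rest of the scraper stays unchanged.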
Scraping JavaScript-rendered pages with Playwright
Cheerio only parses static HTML — it does not run JavaScript. For client-side-rendered pages, Playwright drives a real browser with built-in auto-waiting. Install with npm install playwright then npx playwright install chromium.
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://quotes.toscrape.com/js/');
  await page.waitForSelector('.quote'); // wait for JS to render
  // $$eval passes all matching elements as an array; $eval would pass only the first.
  const quotes = await page.$$eval('.quote', (els) =>
    els.map((e) => ({
      text: e.querySelector('.text').innerText,
      author: e.querySelector('.author').innerText,
    }))
  );
  console.log(quotes);
  await browser.close();
})();

A common production pattern is hybrid: let Playwright render the page, grab await page.content(), then parse that HTML with Cheerio. You get the browser's rendering with Cheerio's fast, familiar extraction. Puppeteer is the Chrome-first alternative; Playwright is the recommended default in 2026 for its multi-browser support and cleaner API.
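The hybrid pattern could be sketched roughly as follows. The function name scrapeRenderedQuotes is hypothetical, the selectors are reused from the Playwright example above, and both packages are assumed to be installed. Treat it as one possible arrangement rather than a reference implementation.

```javascript
// Sketch of the hybrid pattern: Playwright renders, Cheerio extracts.
// `scrapeRenderedQuotes` is an illustrative name, not a library API.
async function scrapeRenderedQuotes(url) {
  // Loaded inside the function so the packages are only needed when it runs.
  const { chromium } = require('playwright');
  const cheerio = require('cheerio');

  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector('.quote'); // wait for client-side rendering
    const html = await page.content(); // serialized DOM after JS has run
    const $ = cheerio.load(html);
    return $('.quote')
      .map((i, el) => ({
        text: $(el).find('.text').text(),
        author: $(el).find('.author').text(),
      }))
      .get(); // convert the Cheerio collection into a plain array
  } finally {
    await browser.close(); // release the browser even if a step above throws
  }
}
```

Calling scrapeRenderedQuotes('https://quotes.toscrape.com/js/') should resolve to an array of { text, author } objects. The try/finally is a design choice so a failed navigation does not leave a browser process running.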
Production crawlers with Crawlee
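For real crawlers that need queues, retries, proxy rotation, and automatic scaling, Crawlee is the framework many guides overlook. It wraps Cheerio and Playwright with production concerns built in. Install with npm install crawlee. The example below uses ES module syntax and top-level await, so save it as an .mjs file or set "type": "module" in package.json.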
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, enqueueLinks }) {
    // Collect every product on the page, then store them with one awaited write.
    const rows = $('article.product_pod')
      .map((i, el) => ({
        title: $(el).find('h3 a').attr('title'),
        price: $(el).find('.price_color').text(),
      }))
      .get();
    await Dataset.pushData(rows);
    // Automatically follow pagination links.
    await enqueueLinks({ selector: 'li.next a' });
  },
});

await crawler.run(['https://books.toscrape.com/']);

Crawlee handles the request queue, concurrency, retries, and result storage for you, and you can swap CheerioCrawler for PlaywrightCrawler when a target needs JavaScript. The structure stays the same, with a real browser underneath.
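As a rough illustration of that swap, the sketch below runs the same crawl through PlaywrightCrawler. The names extractProducts and crawlWithBrowser are hypothetical, and the code assumes playwright is installed alongside crawlee. Crawlee is loaded with a dynamic import so the snippet works from either CommonJS or ES modules.

```javascript
// Runs inside the browser via page.$$eval, so it must not reference
// variables from the surrounding Node scope.
function extractProducts(els) {
  return els.map((el) => ({
    title: el.querySelector('h3 a')?.getAttribute('title'),
    price: el.querySelector('.price_color')?.textContent.trim(),
  }));
}

// Sketch: same crawl structure as the CheerioCrawler example,
// but each page is rendered in a real browser first.
async function crawlWithBrowser(startUrls) {
  const { PlaywrightCrawler, Dataset } = await import('crawlee');

  const crawler = new PlaywrightCrawler({
    // The handler receives a Playwright `page` instead of Cheerio's `$`.
    async requestHandler({ page, enqueueLinks }) {
      const rows = await page.$$eval('article.product_pod', extractProducts);
      await Dataset.pushData(rows);
      // Follow pagination links, as in the Cheerio version.
      await enqueueLinks({ selector: 'li.next a' });
    },
  });

  await crawler.run(startUrls);
}
```

A call such as crawlWithBrowser(['https://books.toscrape.com/']) would start the crawl. Only the extraction step changes; queueing, retries, and storage are configured the same way.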
Which Node.js library should you use?
| Library | Type | Runs JS? | Best for |
|---|---|---|---|
| Axios / fetch | HTTP client | No | Fetching pages and APIs |
| Cheerio | HTML parser | No | Fast static parsing (jQuery-like) |
| Playwright | Browser automation | Yes | JavaScript pages — the default |
| Puppeteer | Browser automation | Yes | Chrome-first headless control |
| Crawlee | Crawler framework | Yes (optional) | Production crawlers at scale |
Start with Axios + Cheerio, add Playwright for JavaScript, and adopt Crawlee when you are running a real, ongoing crawl.
The hard part: handling anti-bot blocking
The Node code is the easy part; anti-bot defenses are what break scrapers. Axios sends a TLS fingerprint no browser sends, and headless Playwright leaks automation signals that Cloudflare, DataDome, and Akamai flag. Cheerio cannot help at that point: a 403 or CAPTCHA page parses without error but contains none of the data you wanted.
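Because a block page still parses cleanly, it can help to check for one before extracting anything. The sketch below is a heuristic only: the helper name looksBlocked, the status list, and the marker strings are assumptions chosen for illustration, not signatures documented by any anti-bot vendor.

```javascript
// Heuristic sketch: decide whether a response looks like a block page
// before handing it to Cheerio. Both lists are illustrative assumptions
// and will need tuning per target site.
const BLOCK_STATUSES = new Set([403, 429, 503]);
const BLOCK_MARKERS = ['captcha', 'access denied', 'verify you are human'];

function looksBlocked(status, html) {
  // Statuses that often accompany blocking or rate limiting.
  if (BLOCK_STATUSES.has(status)) return true;
  // Some block pages return 200, so also scan the body text.
  const body = String(html || '').toLowerCase();
  return BLOCK_MARKERS.some((marker) => body.includes(marker));
}
```

With Axios, passing validateStatus: () => true in the request config lets you inspect non-2xx responses instead of catching an exception, so a check like looksBlocked(res.status, res.data) can run on every response.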
Handling this means residential proxies and a real browser fingerprint — and keeping them coherent. A scraping API handles it server-side, so your Node code posts the URL and parses the returned HTML with Cheerio: