Websites are becoming increasingly complex and dynamic. The modern web is full of JavaScript-rendered apps that load content asynchronously, use auth systems involving multiple steps and JavaScript-based token handling, and block scraping bots.
JavaScript remains a great choice for collecting web data in 2025 precisely because its ecosystem is built to handle these challenges. But if you're a developer new to web scraping or unfamiliar with JavaScript, you're probably wondering which libraries and frameworks to try.
At Apify, we've been scraping the web with JavaScript and Node.js for a decade. This selection of 5 libraries is informed by our experience of using them for data extraction, from parsing HTML to navigating web pages and scraping dynamic content.
1. Crawlee
Juggling multiple libraries for requests, parsing, browser automation, and crawling logic quickly becomes a maintenance headache. You end up writing glue code to handle queues, rotate proxies, and merge results, only to find you still get blocked or stranded when scale increases.
Crawlee, developed by the Apify team, unifies everything under a single interface. Out of the box, it mimics real browsers (headers, TLS fingerprints, and even stealth plugins) so you avoid common anti-bot defenses without manual header or fingerprint tweaking. Instead of wiring together Cheerio + Playwright/Puppeteer + queue managers, Crawlee provides:
- Switchable crawler classes: CheerioCrawler for static HTML, PlaywrightCrawler or PuppeteerCrawler for dynamic pages, all sharing a common configuration style.
- Built-in queue management: Breadth-first or depth-first crawling with concurrency settings, retry logic, and automatic backoff. You define start URLs; Crawlee handles enqueuing, prioritization, and scheduling.
- Automatic proxy rotation and session handling: Effectively rotate proxies or manage cookies and browser contexts, so you stay under rate limits and maintain logins across multiple pages.
- Pluggable data storages: Datasets (exportable as JSON or CSV) and key-value stores are saved to a local storage directory, making it trivial to persist results or resume failed crawls.
- Lifecycle hooks and customizability: Logging, error handling, and custom request handlers via routers, so you can insert your own logic at enqueue, request success, or failure without rewriting core code.
- Native integration with the Apify platform: Once your crawler is ready, running apify push deploys it, and Apify handles autoscaling, proxy billing, and data exports. No extra configuration needed.
- Starter templates and file structure: When you run npx crawlee create my-crawler, you get a main.js and routes.js setup. Boilerplate code means you can focus on selectors rather than instantiating browser instances, setting headers, or wiring queues. The default file structure looks like this:
/my-crawler
├── main.js                # entry point: initializes the crawler class and starts run()
├── routes.js              # defines request handlers via createCheerioRouter/createPlaywrightRouter
├── storage/
│   ├── datasets/          # where results are stored as JSON files per page
│   └── key_value_stores/  # storage for arbitrary data (images, videos, JSON files…)
└── package.json
Code snapshot (Cheerio crawler)
// routes.js
import { Dataset, createCheerioRouter } from "crawlee";
export const router = createCheerioRouter();
router.addDefaultHandler(async ({ enqueueLinks, log, $ }) => {
log.info(`enqueueing new URLs`);
// Finds “next” pages and enqueues them
await enqueueLinks({ globs: ["https://news.ycombinator.com/?p=*"] });
// Extract post URL, title, rank
const data = $(".athing")
.map((idx, post) => ({
postUrl: $(post).find(".title a").attr("href"),
title: $(post).find(".title a").text(),
rank: $(post).find(".rank").text(),
}))
.toArray();
// Push to dataset for automatic file output
await Dataset.pushData({ data });
});
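For context, routes.js only defines the handlers; the generated main.js is what actually runs the crawl. A minimal sketch of what that entry point can look like (the maxRequestsPerCrawl cap is just an illustrative setting):
// main.js
import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const crawler = new CheerioCrawler({
  requestHandler: router,   // route each request to the handlers defined in routes.js
  maxRequestsPerCrawl: 50,  // illustrative safety cap for a local test run
});

// Start from the Hacker News front page; Crawlee manages the queue from here
await crawler.run(["https://news.ycombinator.com/"]);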
Why this matters for your workflow
- Simplicity: No more separate proxy rotation libraries, queue managers, or manual header generators. Crawlee handles it all.
- Scaling: You can start locally and then deploy to the Apify platform, where it auto-scales, monitors memory/CPU, and logs failures.
- Maintenance: Switching from CheerioCrawler to PlaywrightCrawler only requires changing one import and maybe tweaking selectors; the core logic stays the same (see the sketch below).
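To make that concrete, here is a hedged sketch of the Playwright variant of the earlier routes.js: main.js swaps CheerioCrawler for PlaywrightCrawler, and the handler receives a page object instead of Cheerio's $. The selectors and fields mirror the Cheerio example and are assumptions you'd adjust for your own target.
// routes.js (Playwright variant): same router pattern, page instead of $
import { Dataset, createPlaywrightRouter } from "crawlee";

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ page, enqueueLinks, log }) => {
  log.info("enqueueing new URLs");
  await enqueueLinks({ globs: ["https://news.ycombinator.com/?p=*"] });

  // Extract the same fields, this time from the rendered DOM
  const data = await page.$$eval(".athing", (posts) =>
    posts.map((post) => ({
      postUrl: post.querySelector(".title a")?.getAttribute("href"),
      title: post.querySelector(".title a")?.textContent?.trim(),
      rank: post.querySelector(".rank")?.textContent?.trim(),
    }))
  );
  await Dataset.pushData({ data });
});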
2. Impit
Sending vanilla HTTP requests often gets you blocked by modern anti-scraping systems. You might spend hours rotating user agents, randomizing delays, or solving CAPTCHAs manually, only to find your IP banned anyway.
Impit is an HTTP client for Node.js and Python, based on Rust's reqwest and specifically tailored for scraping. Instead of wrestling with header spoofing or TLS fingerprinting yourself, you get:
- Automatic fingerprint spoofing: Pick from a library of existing browser fingerprints, and impit builds a full set of realistic HTTP headers and matching TLS settings. This makes your requests indistinguishable from browser requests and reduces detection risk.
- Integrated tough-cookie support: Handle session cookies out of the box, so you can maintain login sessions or track redirects using the most popular JS cookie library.
- fetch API: Impit implements a subset of the well-known fetch API (MDN), so you can write your scrapers without having to read lengthy docs.
- Proxy integration: Support for HTTP and HTTPS proxies via a single option, so you can rotate IPs with minimal code.
Code snapshot (impersonating Firefox)
import { Impit } from "impit";
async function fetchHtml() {
const impit = new Impit({ browser: "firefox", http3: true });
const response = await impit.fetch("https://news.ycombinator.com/");
console.log(await response.text()); // raw HTML
}
fetchHtml();
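As a follow-up sketch of the proxy integration mentioned above, a single constructor option routes every request through your proxy. The proxyUrl option name and the proxy address here are assumptions to verify against the impit docs for your version:
import { Impit } from "impit";

async function fetchViaProxy() {
  const impit = new Impit({
    browser: "chrome",
    proxyUrl: "http://user:pass@proxy.example.com:8000", // hypothetical proxy endpoint
  });
  const response = await impit.fetch("https://news.ycombinator.com/");
  console.log(response.status); // confirm the request went through
}
fetchViaProxy();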
Why this matters for your workflow
- Stealth: You no longer manually assemble user-agent strings or randomize headers; impit covers the most common anti-bot checks for you.
- Error handling: Configurable retries and timeouts mean fewer surprises when a request fails.
3. Cheerio
Plain HTML is cluttered: nested tags, inconsistent class names, and no programmatic way to navigate the DOM on the server. If you’ve written custom regex or string-based parsers, you know how brittle that can be.
Cheerio loads raw HTML into a fast, jQuery-like API on the server. You can query for elements, attributes, and text using familiar CSS selectors, then extract exactly what you need without worrying about manual string manipulation.
Code snapshot (parsing Hacker News)
import { gotScraping } from "got-scraping";
import * as cheerio from "cheerio";
async function fetchTitles() {
const response = await gotScraping("https://news.ycombinator.com/");
const $ = cheerio.load(response.body);
$(".athing").each((_, post) => {
const title = $(post).find(".title a").text();
const rank = $(post).find(".rank").text();
console.log(`${rank} ${title}`);
});
}
fetchTitles();
Why this matters for your workflow
- Robustness: No more fragile regex. With Cheerio, you use .find(), .text(), and .attr() just like jQuery (see the short example after this list).
- Performance: Cheerio is lightweight and blazingly fast, so parsing large HTML documents doesn't become your scraper's bottleneck, especially compared to full headless browsers.
- Familiar syntax: If you’ve used jQuery on the front end, there’s almost zero onboarding time.
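If you want to see that chaining in isolation, here's a tiny self-contained sketch that parses an inline HTML fragment instead of a live page (the markup is made up for the example):
import * as cheerio from "cheerio";

// .attr() pulls attributes as easily as .text() pulls content
const html = `<div class="athing"><span class="title"><a href="https://example.com">Example post</a></span></div>`;
const $ = cheerio.load(html);

const links = $(".athing .title a")
  .map((_, a) => ({ title: $(a).text(), href: $(a).attr("href") }))
  .toArray();

console.log(links); // [ { title: 'Example post', href: 'https://example.com' } ]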
4. Playwright
Many modern websites rely on client-side JavaScript to populate the DOM, whether for lazy loading, infinite scrolling, or data fetched via XHR/AJAX. Cheerio can't help here, because it never executes that JavaScript.
Playwright, on the other hand, spins up a real browser (Chromium, Firefox, or WebKit), navigates pages as a human would, waits for selectors or network to idle, and then gives you a fully rendered DOM snapshot. You can even intercept requests to block ads or unwanted resources.
Code snapshot (Amazon product page)
import { firefox } from "playwright";
async function scrapeAmazon() {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
await page.goto(
"https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/"
);
const book = {
title: await page.locator("#productTitle").innerText(),
author: await page.locator("span.author a").innerText(),
kindlePrice: await page
.locator("#formats span.ebook-price-value")
.innerText(),
paperbackPrice: await page
.locator("#tmm-grid-swatch-PAPERBACK .slot-price span")
.innerText(),
hardcoverPrice: await page
.locator("#tmm-grid-swatch-HARDCOVER .slot-price span")
.innerText(),
};
console.log(book);
await browser.close();
}
scrapeAmazon();
Why this matters for your workflow
- Reliability: If the data isn’t in the initial HTML, you need a browser to run the page’s JS. Playwright ensures you get exactly what a real user sees.
- Flexible waits: You can await page.waitForSelector() or pass { waitUntil: "networkidle" } to page.goto() so you only scrape once all resources load, reducing flaky results.
- Intercepting resources: Block images, CSS, or analytics endpoints to speed up scrapes and reduce noise in your logs (see the sketch below).
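Here's a short sketch combining both ideas: abort requests for heavy static assets, then wait for the one selector you actually need before extracting anything. The list of blocked file extensions is just an example set:
import { firefox } from "playwright";

async function scrapeFast() {
  const browser = await firefox.launch({ headless: true });
  const page = await browser.newPage();

  // Abort image, font, and stylesheet requests to speed up page loads
  await page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2,css}", (route) => route.abort());

  await page.goto("https://news.ycombinator.com/", { waitUntil: "networkidle" });
  await page.waitForSelector(".athing"); // only continue once posts have rendered

  console.log(await page.locator(".athing").count(), "posts found");
  await browser.close();
}
scrapeFast();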
5. Puppeteer
Puppeteer and Playwright are very similar, with only minor differences in their APIs, but unlike Playwright, Puppeteer is limited to JavaScript and Node.js. Puppeteer is the older project, and it only recently gained official Firefox support.
Switching to Playwright isn't difficult, but if you prefer working with Chromium's engine, Puppeteer is a good option.
Puppeteer gives you a headless (or headed) Chrome instance with an easy API for navigation, selection, and evaluation. It supports intercepting requests, generating PDFs, and capturing screenshots. While it doesn’t include the same cross-browser support as Playwright, it’s been around for longer and integrates well with Chrome DevTools Protocol.
Code snapshot (basic Puppeteer scraper)
import puppeteer from "puppeteer";
async function scrapeSite() {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://news.ycombinator.com/", { waitUntil: "networkidle2" });
const articles = await page.$$eval(".athing", posts =>
posts.map(post => ({
title: post.querySelector(".title a")?.innerText.trim(),
url: post.querySelector(".title a")?.href,
rank: post.querySelector(".rank")?.innerText.trim(),
}))
);
console.log(articles);
await browser.close();
}
scrapeSite();
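Since the overview above also mentions PDF generation and screenshots, here's a quick sketch of those DevTools-backed extras. The file names are placeholders, and page.pdf() requires headless mode:
import puppeteer from "puppeteer";

async function capturePage() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://news.ycombinator.com/", { waitUntil: "networkidle2" });

  await page.screenshot({ path: "hn.png", fullPage: true }); // full-page screenshot
  await page.pdf({ path: "hn.pdf", format: "A4" });          // PDF export (headless only)

  await browser.close();
}
capturePage();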
Why this matters for your workflow
- Existing Puppeteer code: Migrate incrementally or reuse libraries that depend on Puppeteer.
- Chrome-only features: Use DevTools Protocol to capture screenshots, trace performance, or emulate network conditions without additional dependencies.
- Lightweight automation needs: If you only need a headless Chrome for a few pages and already have an IP rotation or session management solution, Puppeteer might be the simplest choice.
In summary
JavaScript remains a top choice for web scraping in 2025, thanks to its solid ecosystem of open-source libraries, which make it easier to parse HTML, interact with web pages, and deal with dynamic content. In our opinion, these five are the best:
- Crawlee - A comprehensive, all-in-one scraping framework that handles browser automation, proxy rotation, session management, queuing, and data storage. It simplifies scaling and maintenance by unifying multiple tools under one interface.
- Impit - A stealthy HTTP client tailored for scraping, with automatic realistic header generation, cookie jar support, and proxy integration - ideal for scraping without a full browser.
- Cheerio - A fast and lightweight HTML parser that mimics jQuery, perfect for extracting structured data from static HTML without using a browser.
- Playwright - A full browser automation library for scraping JavaScript-rendered sites. It supports multiple browsers, waits for content to load, and intercepts resources, making it highly reliable for dynamic pages.
- Puppeteer - A headless Chrome automation tool with strong DevTools support. It’s suitable for existing Puppeteer codebases or lightweight scraping needs focused on Chromium.
Whether you’re parsing static HTML or navigating complex, JavaScript-rendered pages, this toolkit helps you choose and combine the best options for performance, stealth, and scalability.