For years, Python has been the go-to language for web scraping. However, the JavaScript scraping community is growing steadily. Originally designed as a client-side language, JavaScript is one of the most popular programming languages out there and is rightfully referred to as the language of the web.
With powerful new libraries constantly emerging, the Node.js ecosystem has become one of the most reliable choices for modern web scraping.
This article will explore some of the best libraries and frameworks available for web scraping in Node.js and show how you can use them in your projects.
Why use Node.js?
Node.js is a great choice for web scraping thanks to its event-driven programming model and non-blocking asynchronous I/O, which allow for efficient, scalable scraping of large datasets and fast processing of many requests at once. Node.js also has a rich ecosystem of open-source packages, such as Cheerio and Puppeteer, that assist with web scraping tasks by providing robust APIs for parsing HTML, interacting with web pages, and handling dynamic content. On top of all this, Node.js is highly customizable, making it well suited to building scraping scripts tailored to specific project requirements.
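As a quick illustration of that non-blocking model (a minimal sketch assuming Node.js 18+ for the built-in fetch, with placeholder URLs), here's how a script can download several pages concurrently:

// Fetch several pages at once - Node.js interleaves the I/O instead of blocking
const urls = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
];

(async () => {
    const responses = await Promise.all(urls.map((url) => fetch(url)));
    const bodies = await Promise.all(responses.map((res) => res.text()));
    bodies.forEach((html, i) => console.log(`${urls[i]} -> ${html.length} bytes`));
})();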
Requirements
To fully understand the content and code samples showcased in this post, you should have a grasp of the following concepts:
1. Familiarity with JavaScript ES6
2. Understanding of CSS selectors
3. Familiarity with jQuery syntax
4. Comfort with navigating the browser DevTools to find and select page elements
5. Basic understanding of TypeScript (optional)
⭐
Editor's choice
Throughout this article, a star (⭐️) indicates the tools we usually prefer when working on our web scraping projects here at Apify. That’s not to discredit the other libraries mentioned here (we use all of them in one way or another, depending on the project) but to give you an insight into the tech stack of a Node.js web scraper.
Get data with an HTTP client
HTTP clients are applications that use the HTTP protocol to send requests to servers and receive their responses. In the context of web scraping, they're necessary to send requests to your target website and retrieve information such as the website's HTML markup or JSON payload.
We're going to use Got Scraping ⭐️ to send a request to a target website, retrieve the page's HTML code, and log it to the console.
📌
What is Got Scraping?
Got Scraping is a package extension of the Got HTTP client. Its primary purpose is to address common drawbacks in modern web scraping by offering built-in tools to make bot requests less likely to be detected and blocked by modern website anti-scraping protections.
Got Scraping sends browser-like requests, which enables web scraping bots to blend in with the website traffic, making it less likely for them to be detected and blocked.
⚒️ Main features
Out-of-the-box browser-like requests to blend in with the website traffic and reduce blocking rates
Default configuration to retry requests on failure
Option to generate browser-like headers
1. Installation
npm install got-scraping
2. Code sample
Here's an example of how to send a request to a website and retrieve its HTML with Got Scraping:
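A minimal sketch (the target URL is just an example):

const { gotScraping } = require("got-scraping");

(async () => {
    // Send a browser-like GET request to the target website
    const response = await gotScraping("https://news.ycombinator.com/");
    // Log the page's HTML markup to the console
    console.log(response.body);
})();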
Got Scraping also comes bundled with the header-generator package, which lets us generate headers for various browsers, operating systems, and devices.
Generating all the headers automatically can be handy when scraping websites that employ aggressive anti-bot blocking systems: making bot requests look browser-like reduces the chances of them getting blocked.
To demonstrate that, let's take a look at an example of a request using headerGeneratorOptions:
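Here's a minimal sketch (the option values are illustrative, and the target is an Apify endpoint that echoes back the headers it receives):

const { gotScraping } = require("got-scraping");

(async () => {
    const response = await gotScraping({
        url: "https://api.apify.com/v2/browser-info",
        // Ask header-generator for headers mimicking desktop Firefox on Windows
        headerGeneratorOptions: {
            browsers: [{ name: "firefox", minVersion: 80 }],
            devices: ["desktop"],
            operatingSystems: ["windows"],
            locales: ["en-US"],
        },
    });
    // The response body shows the headers the server received
    console.log(response.body);
})();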
📌
What is Axios?
Axios is a promise-based HTTP client that runs in both the browser and Node.js with the same codebase. On the server side, it uses the native Node.js HTTP module, while on the client side it uses XMLHttpRequest.
In some cases, Axios might be preferable to Got Scraping because it supports both the browser and Node.js environments, which makes it versatile for various projects.
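As a quick illustration, assuming Axios is installed (npm install axios), fetching a page's HTML looks like this:

const axios = require("axios");

(async () => {
    // Send a GET request; the response body is available on response.data
    const response = await axios.get("https://news.ycombinator.com/");
    console.log(response.data);
})();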
When it comes to web scraping in Node.js, you can’t go wrong with Cheerio, the most popular and widely used HTML and XML parser for the Node.js ecosystem.
Now, we're going to use Got Scraping together with Cheerio ⭐️ to extract text from all articles on Hacker News.
📌
What is Cheerio?
Cheerio is an efficient and flexible implementation of core jQuery designed to run on the server. Because of its incredible efficiency and familiar syntax, Cheerio is our best friend when scraping pages that don't require JavaScript to load their contents.
⚒️ Main features
Implements a subset of core jQuery, providing developers with a familiar and easy-to-use syntax.
Works with a simple and consistent DOM model, making parsing, manipulating, and rendering incredibly efficient.
Offers great flexibility, being able to parse nearly any HTML or XML document.
1. Installation
npm install cheerio
2. Code sample
Let's now see how we can use Cheerio + Got Scraping to extract the text content from all the articles on the first page of Hacker News.
const { gotScraping } = require("got-scraping");
const cheerio = require("cheerio");
(async () => {
const response = await gotScraping("https://news.ycombinator.com/");
const html = response.body;
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
// Select all the elements with the class name "athing"
const articles = $(".athing");
// Loop through the selected elements
for (const article of articles) {
const articleTitleText = $(article).text();
// Log each element's text to the terminal
console.log(articleTitleText);
}
})();
A few seconds after running the script, we'll see the titles and rankings of the 30 articles on the Hacker News front page logged to our terminal.
Output example:
1. US Department of Energy: Fusion Ignition Achieved (energy.gov)
2. Reddit's photo albums broke due to Integer overflow of Signed Int32 (reddit.com)
3. About the security content of iOS 16.2 and iPadOS 16.2 (support.apple.com)
4. Balloon framing is worse-is-better (2021) (constructionphysics.substack.com)
5. After 20 years the Dwarf Fortress devs have to get used to being millionaires (pcgamer.com)
...
25. How much decentralisation is too much? (shkspr.mobi)
26. What we can learn from vintage computing (github.com/readme)
27. Data2vec 2.0: Highly efficient self-supervised learning for vision, speech, text (facebook.com)
28. Pony Programming Language (github.com/ponylang)
29. Al Seckel on Richard Feynman (2001) (fotuva.org)
30. Hydra – the fastest Postgres for analytics [benchmarks] (hydras.io)
Scrape dynamic websites
Browser automation libraries are used for scraping dynamic pages. Their ability to emulate a real browser enables scrapers to access data on websites that require JavaScript to load their content.
Our tool of choice for scraping dynamic websites is Playwright ⭐️. Its ability to emulate a real browser allows it to render JavaScript. That's particularly useful when we want to extract data from pages that load their content dynamically, as we wouldn't be able to scrape it with just plain HTTP requests and Cheerio.
To demonstrate how to scrape dynamic pages, we'll use Playwright to extract data from Amazon.
📌
What is Playwright?
Playwright is an open-source framework for web testing and automation developed and maintained by Microsoft. While similar to its predecessor, Puppeteer, Playwright is considered a more modern and capable version.
⚒️ Main features
Auto-wait. Playwright, by default, waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.
Cross-browser support, being able to drive Chromium, WebKit, Firefox, and Microsoft Edge.
Playwright is available in multiple languages, including JavaScript and TypeScript, Python, Java, and .NET
1. Installation
# Run from your project's root directory
npm init playwright@latest
# Or create a new project
npm init playwright@latest new-project
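2. Code sample
As a minimal sketch, let's point Playwright at the same product page used in the Puppeteer example below, The Hitchhiker's Guide to the Galaxy, and extract a few details (the selectors mirror the Puppeteer sample and may break if Amazon changes its markup):

const playwright = require("playwright");

(async () => {
    // Launch a visible browser so we can watch Playwright work
    const browser = await playwright.chromium.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto(
        "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1"
    );
    // Playwright auto-waits for each locator before reading its text
    const book = {
        title: await page.locator("#productTitle").innerText(),
        author: await page
            .locator(".a-link-normal.contributorNameID")
            .first()
            .innerText(),
        kindlePrice: await page.locator("#kindle-price").first().innerText(),
    };
    console.log(book);
    await browser.close();
})();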
📌
What is Puppeteer?
Puppeteer is an open-source Node.js browser automation library developed and maintained by Google. It provides a high-level API to control a headless Chrome programmatically and can also be configured to run a full, non-headless browser.
Puppeteer is the predecessor of Playwright. Because it's been around for longer, Puppeteer has a strong developer community and documentation. Like Playwright, Puppeteer's ability to emulate a real browser allows it to render JavaScript and scrape dynamically loaded content.
⚒️ Main features
Crawl a single-page application and generate pre-rendered content (i.e., server-side rendering)
Take screenshots and generate PDFs of pages.
Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.
Supports Chromium and Firefox.
1. Installation
# npm
npm i puppeteer
# Yarn
yarn add puppeteer
# pnpm
pnpm i puppeteer
2. Code sample
To demonstrate some of Puppeteer's capabilities, let's again go to Amazon, scrape The Hitchhiker's Guide to the Galaxy product page, and save a screenshot of the accessed page.
By default, Puppeteer will launch a headless browser. In this example, we'll set the headless option to false so we can follow Puppeteer as it loads the browser and goes to the specified website.
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false, // Set headless to false, so we can see the browser working
});
const page = await browser.newPage();
await page.setViewport({ width: 1366, height: 768 }); // Set the browser viewport size
await page.goto(
"https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1"
);
const book = await page.evaluate(() => {
return {
title: document.querySelector("#productTitle").innerText,
author: document.querySelector(".a-link-normal.contributorNameID")
.innerText,
edition: document.querySelector("#productSubtitle").innerText,
digitalListPrice: document.querySelector("#digital-list-price")
.innerText,
printListPrice:
document.querySelector("#print-list-price").innerText,
kindlePrice: document.querySelector("#kindle-price").innerText,
};
});
await page.screenshot({ path: "book.png" }); // Take a screenshot of the page
console.log(book);
await browser.close();
})();
After the script finishes running, we'll see an object containing the book's title, author, edition, and prices logged to the console, and a screenshot of the page saved as book.png.
Output example:
{
title: "The Hitchhiker's Guide to the Galaxy: The Illustrated Edition ",
author: 'Douglas Adams',
edition: 'Kindle Edition',
digitalListPrice: '$7.24',
printListPrice: '$7.99',
kindlePrice: '$6.31'
}
Advanced dynamic scraping using Playwright with Cheerio
The primary reason why we need a browser automation library for web scraping is to load a browser so we can access dynamically generated content on web pages that require JavaScript to function.
However, the prospect of having to memorize yet another set of library-specific syntax doesn't sound that exciting. So, wouldn't it be nice if we could take advantage of Puppeteer and Playwright's functionalities while still being able to use Cheerio's jQuery syntax to select elements and extract data? Well, that's precisely what we will do in this section.
We'll start by accessing the target website with Playwright, then grab the page's HTML markup and save it to a variable, which we'll feed into Cheerio's load() function so it can parse the resulting HTML.
For this demonstration, we'll use https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ as our target website. Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.
Mint Mobile product page with JavaScript disabled:
Mint Mobile product page with JavaScript enabled:
1. Installation
npm install playwright cheerio
2. Code sample
So, without further ado, let's use Playwright + Cheerio to extract the product data highlighted in the image above.
const playwright = require("playwright");
const cheerio = require("cheerio");
(async () => {
const browser = await playwright.firefox.launch({
headless: false,
});
const page = await browser.newPage();
await page.goto(
"https://www.mintmobile.com/product/google-pixel-7-pro-bundle/"
);
const html = await page.evaluate(() => document.body.innerHTML); // Save the page's HTML to a variable
const $ = cheerio.load(html); // Use Cheerio to load the page's HTML code
// Continue writing your scraper using Cheerio's jQuery syntax
const phone = {
name: $("div.m-productCard__heading h1").text().trim(),
memory: $(
"div.composited_product_details_wrapper > div > div > div:nth-child(2) > div.label > span"
)
.text()
.split(" ")
.pop(),
payMonthlyPrice: $("div.composite_price_monthly span").text().trim(),
payTodayPrice: $("div.composite_price > p > ins > span").text().trim(),
};
console.log(phone);
await browser.close();
})();
Using the full-featured Node.js web scraping library: Crawlee ⭐️
📌
What is Crawlee?
Crawlee is an open-source Node.js web scraping and automation library developed and maintained by Apify. It builds on top of Got Scraping, Cheerio, Puppeteer, and Playwright, and takes advantage of the already great features of these tools while providing extra functionality tailored to the needs and preferences of web scraping developers.
One of Crawlee's major selling points is its extensive out-of-the-box collection of features to help scrapers overcome modern website anti-bot defenses and reduce blocking rates. It achieves that by making HTTP requests that mimic browser headers and TLS fingerprints without requiring extra configuration.
Another handy characteristic of Crawlee is that it functions as an all-in-one toolbox for web scraping. We can switch between the available classes, such as CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler, to quickly access the features we need for each specific scraping use case.
⚒️ Main features
Single interface for HTTP and headless browser crawling
Automatic generation of browser-like headers
Replication of browser TLS fingerprints
Zero-config generation of human-like fingerprints
Automatic scaling with available system resources
Integrated proxy rotation and session management (see the sketch after this list)
Lifecycles customizable with hooks
CLI to bootstrap your projects
Configurable routing, error handling, and retries
Dockerfiles ready to deploy
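As referenced in the features list above, here's a minimal sketch of how proxy rotation plugs into a crawler (the proxy URLs are placeholders; swap in your own):

import { CheerioCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        "http://proxy-1.example.com:8000",
        "http://proxy-2.example.com:8000",
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration, // Crawlee rotates these proxies automatically
    requestHandler: async ({ $, request }) => {
        console.log(request.url, $("title").text());
    },
});

await crawler.run(["https://crawlee.dev/"]);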
1. Installation
npx crawlee create my-crawler
2. File structure
Before we jump into some code examples, it's important to understand the basic file structure we can expect to see after running the npx crawlee create my-crawler command and choosing a starting template for our project.
To promote code modularity, the crawler logic is split between two files, main.js and routes.js. Once you run your scraper, the extracted data is automatically stored as JSON files in the datasets directory.
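A simplified view of the generated project (assuming the default JavaScript template; the exact files vary slightly by template):

my-crawler/
├── src/
│   ├── main.js    # Crawler setup and entry point
│   └── routes.js  # Request handlers with the scraping logic
├── storage/
│   ├── datasets/  # Scraped data is saved here as JSON files
│   ├── key_value_stores/
│   └── request_queues/
└── package.json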
3. Code samples
#1. Using CheerioCrawler
In the first code sample, we will use Crawlee's CheerioCrawler to recursively scrape the Hacker News website.
The crawler starts with a single URL, finds links to the following pages, enqueues them, and continues until no more page links are available. The results are then stored on your disk in the datasets directory.
// main.js
import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";
const startUrls = ["https://news.ycombinator.com/"];
const crawler = new CheerioCrawler({
requestHandler: router,
});
await crawler.run(startUrls);
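Since main.js imports the router from routes.js, here's a minimal sketch of what that file could contain (the .athing and a.morelink selectors are assumptions based on Hacker News markup and may change):

// routes.js
import { createCheerioRouter } from "crawlee";

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ $, request, enqueueLinks, pushData, log }) => {
    log.info(`Scraping ${request.url}`);

    // Extract the text of every article row on the page
    const articles = $(".athing")
        .map((_, el) => $(el).text().trim())
        .get();

    // Store the results as JSON in the datasets directory
    await pushData({ url: request.url, articles });

    // Enqueue the "More" link to keep crawling the following pages
    await enqueueLinks({ selector: "a.morelink" });
});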
Conclusion: why use JavaScript and Node.js for web scraping?
JavaScript, particularly when paired with Node.js, offers a powerful platform for web scraping. This language-runtime duo brings the full capabilities of JavaScript to the server side, letting developers use familiar syntax and a vast array of libraries for scraping tasks. And because JavaScript is ubiquitous across web technologies, using it for scraping simplifies interactions with web pages that are themselves heavily reliant on JavaScript.
Frequently asked questions
Can you use JavaScript for web scraping?
Yes, you can use JavaScript for web scraping. It is particularly effective for websites that are heavily dependent on JavaScript to render their content, allowing you to interact dynamically with the web page's elements.
Is JavaScript good for scraping?
Yes, JavaScript is good for web scraping because, when paired with a browser automation tool, it executes web page scripts the same way a browser does, enabling access to dynamically generated content that plain HTTP requests would miss.
What is the best web scraping tool for JavaScript?
The best web scraping tool for JavaScript depends on the task at hand. For sending requests, we recommend Got Scraping and Axios. For parsing, Cheerio is best. For scraping dynamic content, we recommend Playwright. For a complete web scraping library that combines all of these features, we recommend Crawlee.
Is Python or JavaScript better for web scraping?
Both Python and JavaScript are effective for web scraping, but the choice depends on the project's specifics. JavaScript is better for scraping dynamic content directly executed in the browser, while Python offers robust libraries like BeautifulSoup and Scrapy for diverse scraping needs.
JavaScript and Node.js resources
If you want to dive deeper into some of the libraries and frameworks presented in this post, here's a curated list of great videos and articles on the topic: