Web scraping with JavaScript and Node.js (Guide for 2024)

The best JavaScript libraries and frameworks for web scraping in Node.js and how to use them.

Getting started with web scraping in Node.js

For years, Python has been the go-to language for web scraping. However, the JavaScript scraping community is growing steadily. Originally designed as a client-side language, JavaScript is one of the most popular programming languages out there and is rightfully referred to as the language of the web.

With powerful new libraries emerging all the time, the Node.js ecosystem has become one of the most reliable choices for modern web scraping.

This article will explore some of the best libraries and frameworks available for web scraping in Node.js and show how you can use them in your projects.

Why use Node.js?

Node.js is a great choice for web scraping thanks to its event-driven programming model and non-blocking asynchronous I/O, which allow for efficient, scalable scraping of large datasets and fast processing of multiple requests simultaneously. Node.js also has a rich ecosystem of open-source packages, such as Cheerio and Puppeteer, that assist with web scraping tasks by providing robust APIs for parsing HTML, interacting with web pages, and handling dynamic content. On top of all this, Node.js is highly customizable, making it well suited to building custom scraping scripts tailored to specific project requirements.
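
To see what this asynchronous model buys you in practice, here's a minimal sketch that fetches several pages concurrently with Promise.all. It assumes Node.js 18 or later, where fetch is available globally; the URLs are arbitrary examples.

const urls = [
    "https://news.ycombinator.com/",
    "https://news.ycombinator.com/?p=2",
    "https://news.ycombinator.com/?p=3",
];

(async () => {
    // All three requests start immediately and run concurrently,
    // rather than each waiting for the previous one to finish.
    const pages = await Promise.all(
        urls.map(async (url) => {
            const response = await fetch(url); // global fetch, Node.js 18+
            return { url, html: await response.text() };
        })
    );

    for (const { url, html } of pages) {
        console.log(`${url}: ${html.length} characters of HTML`);
    }
})();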

Requirements

To fully understand the content and code samples showcased in this post, you should have a grasp of the following concepts:

  • Familiarity with JavaScript ES6
  • Understanding of CSS selectors
  • Familiarity with jQuery syntax
  • Comfort navigating the browser DevTools to find and select page elements
  • Basic understanding of TypeScript (optional)

Editor's choice

Throughout this article, a star (⭐️) indicates the tools we usually prefer when working on our web scraping projects here at Apify. That’s not to discredit the other libraries mentioned here (we use all of them in one way or another, depending on the project) but to give you an insight into the tech stack of a Node.js web scraper.

Get data with an HTTP client

HTTP clients are applications or libraries that use the HTTP protocol to send requests to servers and receive their responses. In the context of web scraping, they're what you need to send requests to your target website and retrieve information such as the website's HTML markup or a JSON payload.

We're going to use Got Scraping ⭐️ to send a request to a target website, retrieve the page's HTML, and log it to the console.

📌
What is Got Scraping?

Got Scraping is a package extension of the Got HTTP client. Its primary purpose is to address common drawbacks in modern web scraping by offering built-in tools to make bot requests less likely to be detected and blocked by modern website anti-scraping protections.

Got Scraping sends browser-like requests, which enables web scraping bots to blend in with the website traffic, making it less likely for them to be detected and blocked.

⚒️ Main features

  • Out-of-the-box browser-like requests to blend in with the website traffic and reduce blocking rates.
  • Default configuration to retry requests on failure (tunable, as shown in the sketch after this list)
  • Option to generate browser-like headers
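
Since Got Scraping extends Got, its retry behavior can be tuned through Got's standard retry option. Here's a minimal sketch; the limit of 3 is an arbitrary value for illustration:

const { gotScraping } = require("got-scraping");

(async () => {
    const response = await gotScraping({
        url: "https://news.ycombinator.com/",
        // Failed requests are retried automatically; the limit controls
        // how many additional attempts are made before giving up.
        retry: { limit: 3 },
    });
    console.log(response.statusCode);
})();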

1. Installation

npm install got-scraping

2. Code sample

Here's an example of how to send a request to a website and retrieve its HTML with Got Scraping:

const { gotScraping } = require("got-scraping");

(async () => {
    // Send a browser-like GET request to the target website
    const response = await gotScraping.get("https://news.ycombinator.com/");

    // The page's HTML markup is available on the response body
    const html = response.body;
    console.log(html);
})();

3. Using the header-generator package

Got Scraping also comes bundled with the header-generator package, which enables us to generate headers matching various browsers, operating systems, and devices.

It generates all the headers automatically, which comes in handy when scraping websites that employ aggressive anti-bot blocking systems. To work around this potential setback, we often need to make the bot requests look "browser-like" and reduce the chances of them getting blocked.

To demonstrate that, let's take a look at an example of a request using the headerGeneratorOptions object:

const { gotScraping } = require("got-scraping");

(async () => {
    const response = await gotScraping({
        url: "https://api.apify.com/v2/browser-info",
        headerGeneratorOptions: {
            browsers: [{ name: "firefox", minVersion: 80 }],
            devices: ["desktop"],
            locales: ["en-US", "en"],
            operatingSystems: ["linux"],
        },
    });
    console.log(response.body);
})();

Here's the kind of output you can expect the example above to log to the console:

"headers": {
    "user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "upgrade-insecure-requests": "1",
    "sec-fetch-site": "same-site",
    "sec-fetch-mode": "navigate",
    "sec-fetch-user": "?1",
    "sec-fetch-dest": "document"
  }

Alternative to Got Scraping: Axios

📌
What is Axios?

Axios is a promise-based HTTP client for Node.js and the browser, capable of running in both environments with the same codebase. On the server side, Axios uses the native Node.js http module, while on the client side it uses XMLHttpRequest.

In some cases, Axios might be preferable to Got Scraping because its support for both browser and Node.js environments makes it versatile across a wide range of projects.

⚒️ Main features

  • Make XMLHttpRequests from the browser
  • Make HTTP requests from Node.js
  • Promise API support
  • Intercept requests and responses (see the sketch after this list)
  • Transform request and response data
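
To illustrate interception, here's a small sketch using Axios's interceptors API; the logging logic is purely illustrative:

const axios = require("axios");

// Runs before every request is sent
axios.interceptors.request.use((config) => {
    console.log(`Requesting ${config.url}`);
    return config;
});

// Runs on every response before it reaches the calling code
axios.interceptors.response.use((response) => {
    console.log(`Received ${response.status} from ${response.config.url}`);
    return response;
});

(async () => {
    await axios.get("https://news.ycombinator.com/");
})();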

1. Installation

# npm
npm install axios

# Yarn
yarn add axios

# pnpm
pnpm add axios

2. Code sample

Send a request to the target website, retrieve its HTML code and log the result to the console.

const axios = require("axios");

(async () => {
    const response = await axios.get("https://news.ycombinator.com/");
    const html = response.data;
    console.log(html);
})();

Parse HTML and XML data

When it comes to web scraping in Node.js, you can’t go wrong with Cheerio, the most popular and widely used HTML and XML parser for the Node.js ecosystem.

Now, we're going to use Got Scraping together with Cheerio ⭐️ to extract text from all articles on Hacker News.

📌
What is Cheerio?

Cheerio is an efficient and flexible implementation of core jQuery designed to run on the server. Because of its incredible efficiency and familiar syntax, Cheerio is our best friend when scraping pages that don't require JavaScript to load their contents.

⚒️ Main features

  • Implements a subset of core jQuery, providing developers with a familiar and easy-to-use syntax.
  • Works with a simple and consistent DOM model, making parsing, manipulating, and rendering incredibly efficient.
  • Offers great flexibility, being able to parse nearly any HTML or XML document.
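
To get a feel for that jQuery-like syntax before wiring in an HTTP client, here's a tiny self-contained sketch; the HTML string is made up for illustration:

const cheerio = require("cheerio");

// Load a hardcoded HTML string - no network request involved
const $ = cheerio.load(`
    <ul class="books">
        <li><a href="/hitchhikers-guide">The Hitchhiker's Guide to the Galaxy</a></li>
        <li><a href="/dirk-gently">Dirk Gently's Holistic Detective Agency</a></li>
    </ul>
`);

// Select and iterate over elements with familiar jQuery syntax
$(".books a").each((index, element) => {
    console.log($(element).text(), "->", $(element).attr("href"));
});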

1. Installation

npm install cheerio

2. Code sample

Let's now see how we can use Cheerio + Got Scraping to extract the text content from all the articles on the first page of Hacker News.

const { gotScraping } = require("got-scraping");
const cheerio = require("cheerio");

(async () => {
    const response = await gotScraping("https://news.ycombinator.com/");
    const html = response.body;

    // Use Cheerio to parse the HTML
    const $ = cheerio.load(html);

    // Select all the elements with the class name "athing"
    const articles = $(".athing");

    // Loop through the selected elements
    for (const article of articles) {
        const articleTitleText = $(article).text();

        // Log each element's text to the terminal
        console.log(articleTitleText);
    }
})();

A few seconds after running the script, we'll see the titles and rankings of the 30 articles on Hacker News' front page logged to our terminal.

Output example:


      1.      US Department of Energy: Fusion Ignition Achieved (energy.gov)
      2.      Reddit's photo albums broke due to Integer overflow of Signed Int32 (reddit.com)
      3.      About the security content of iOS 16.2 and iPadOS 16.2 (support.apple.com)
      4.      Balloon framing is worse-is-better (2021) (constructionphysics.substack.com)
      5.      After 20 years the Dwarf Fortress devs have to get used to being millionaires (pcgamer.com)
      ...
      25.      How much decentralisation is too much? (shkspr.mobi)
      26.      What we can learn from vintage computing (github.com/readme)
      27.      Data2vec 2.0: Highly efficient self-supervised learning for vision, speech, text (facebook.com)
      28.      Pony Programming Language (github.com/ponylang)
      29.      Al Seckel on Richard Feynman (2001) (fotuva.org)
      30.      Hydra – the fastest Postgres for analytics [benchmarks] (hydras.io)

Scrape dynamic websites

Browser automation libraries are used for scraping dynamic pages. Their ability to emulate a real browser enables scrapers to access data on websites that require JavaScript to load their content.

Our tool of choice for scraping dynamic websites is Playwright ⭐️. Its ability to emulate a real browser allows it to render JavaScript. That's particularly useful when we want to extract data from pages that load their content dynamically, as we wouldn't be able to scrape it with just plain HTTP requests and Cheerio.

To demonstrate how to scrape dynamic pages, we'll use Playwright to extract data from Amazon.

📌
What is Playwright?

Playwright is an open-source framework for web testing and automation developed and maintained by Microsoft. While similar to its predecessor, Puppeteer, Playwright is considered a more modern and capable version.

⚒️ Main features

  • Auto-wait. Playwright, by default, waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.
  • Cross-browser support, being able to drive Chromium, WebKit, Firefox, and Microsoft Edge.
  • Playwright is available in multiple languages, including JavaScript and TypeScript, Python, Java, and .NET.

1. Installation

# Run from your project's root directory
npm init playwright@latest

# Or create a new project
npm init playwright@latest new-project

2. Code sample

Here's an example of using Playwright to extract some data from the Amazon page for Douglas Adams' The Hitchhiker's Guide to the Galaxy:

const playwright = require("playwright");

(async () => {
    const browser = await playwright.webkit.launch({
        headless: false, // Set headless to false, so we can see the browser working
    });
    const page = await browser.newPage();
    await page.setViewportSize({ width: 1366, height: 768 }); // Set the viewport size
    await page.goto(
        "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1"
    );

    const book = {
        bookTitle: await page.locator("#productTitle").innerText(),
        author: await page
            .locator(".a-link-normal.contributorNameID")
            .innerText(),
        edition: await page.locator("#productSubtitle").innerText(),
        digitalListPrice: await page.locator("#digital-list-price").innerText(),
        printListPrice: await page.locator("#print-list-price").innerText(),
        kindlePrice: await page.locator("#kindle-price").innerText(),
    };

    await page.screenshot({ path: "book.png" }); // Take a screenshot of the page
    console.log(book);
    await browser.close();
})();

After the scraper finishes its run, the browser controlled by Playwright will close, and the extracted data will be logged to the console.

🌐
How to scrape the web with Playwright - step by step from building a scraper to extracting data.

Alternative to Playwright: Puppeteer

📌
What is Puppeteer?

Puppeteer is an open-source Node.js browser automation library developed and maintained by Google. It provides a high-level API to control headless Chrome programmatically and can also be configured to run a full, non-headless browser.

Puppeteer is the predecessor of Playwright. Because it's been around for longer, Puppeteer has a strong developer community and documentation. Like Playwright, Puppeteer's ability to emulate a real browser allows it to render JavaScript and scrape dynamically loaded content.

⚒️ Main features

  • Crawl a Single-Page Application and generate pre-rendered content (i.e., server-side rendering)
  • Take screenshots and generate PDFs of pages.
  • Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.
  • Supports Chromium and Firefox.

1. Installation

# npm
npm i puppeteer

# Yarn
yarn add puppeteer

# pnpm
pnpm i puppeteer

2. Code sample

To demonstrate some of Puppeteer's capabilities, let's again go to Amazon, scrape The Hitchhiker's Guide to the Galaxy product page, and save a screenshot of the accessed page.

By default, Puppeteer will launch a headless browser. In this example, we'll set the headless option to false so we can follow Puppeteer as it loads the browser and goes to the specified website.

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({
        headless: false, // Set headless to false, so we can see the browser working
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1366, height: 768 }); // Set the viewport size
    await page.goto(
        "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1"
    );

    const book = await page.evaluate(() => {
        return {
            title: document.querySelector("#productTitle").innerText,
            author: document.querySelector(".a-link-normal.contributorNameID")
                .innerText,
            edition: document.querySelector("#productSubtitle").innerText,
            digitalListPrice: document.querySelector("#digital-list-price")
                .innerText,
            printListPrice:
                document.querySelector("#print-list-price").innerText,
            kindlePrice: document.querySelector("#kindle-price").innerText,
        };
    });

    await page.screenshot({ path: "book.png" }); // Take a screenshot of the page
    console.log(book);
    await browser.close();
})();

After the script finishes its run, we'll see an object containing the book's title, author, edition, and prices logged to the console, and a screenshot of the page saved as book.png.

Output example:

{
  title: "The Hitchhiker's Guide to the Galaxy: The Illustrated Edition ",
  author: 'Douglas Adams',
  edition: 'Kindle Edition',
  digitalListPrice: '$7.24',
  printListPrice: '$7.99',
  kindlePrice: '$6.31'
}

Saved PNG: a screenshot of the scraped product page (book.png).

📒
Playwright vs. Puppeteer: which is better?

Advanced dynamic scraping using Playwright with Cheerio

The primary reason why we need a browser automation library for web scraping is to load a browser so we can access dynamically generated content on web pages that require JavaScript to function.

However, the prospect of having to memorize yet another set of library-specific syntax doesn't sound that exciting. So, wouldn't it be nice if we could take advantage of Puppeteer and Playwright's functionalities while still being able to use Cheerio's jQuery syntax to select elements and extract data? Well, that's precisely what we will do in this section.

We'll start by accessing the target website using Playwright, then use it to grab the page's HTML markup and save it to a variable, which we'll then feed into Cheerio's load() function so it can parse the resulting HTML.

🌐
How to efficiently scrape any website with Cheerio Scraper.

For this demonstration, we'll use https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ as our target website. Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.

Mint Mobile product page with JavaScript disabled:


Mint Mobile product page with JavaScript enabled:


1. Installation

npm install playwright cheerio

2. Code sample

So, without further ado, let's use Playwright + Cheerio to extract the product data highlighted in the image above.

const playwright = require("playwright");
const cheerio = require("cheerio");

(async () => {
    const browser = await playwright.firefox.launch({
        headless: false,
    });
    const page = await browser.newPage();
    await page.goto(
        "https://www.mintmobile.com/product/google-pixel-7-pro-bundle/"
    );
    const html = await page.evaluate(() => document.body.innerHTML); // Save the page's HTML to a variable

    const $ = cheerio.load(html); // Use Cheerio to load the page's HTML code

    // Continue writing your scraper using Cheerio's jQuery syntax
    const phone = {
        name: $("div.m-productCard__heading h1").text().trim(),
        memory: $(
            "div.composited_product_details_wrapper > div > div > div:nth-child(2) > div.label > span"
        )
            .text()
            .split(" ")
            .pop(),
        payMonthlyPrice: $("div.composite_price_monthly span").text().trim(),
        payTodayPrice: $("div.composite_price > p > ins > span").text().trim(),
    };

    console.log(phone);
    await browser.close();
})();

Expected output:

{
  name: 'Google Pixel 7 Pro',
  memory: '128GB',
  payMonthlyPrice: '$50',
  payTodayPrice: '$589'
}

Web scraping and crawling with Crawlee

📌
What is Crawlee?

Crawlee is an open-source Node.js web scraping and automation library developed and maintained by Apify. It builds on top of Got Scraping, Cheerio, Puppeteer, and Playwright, and takes advantage of the already great features of these tools while providing extra functionality tailored to the needs and preferences of web scraping developers.

One of Crawlee's major selling points is its extensive out-of-the-box collection of features to help scrapers overcome modern website anti-bot defenses and reduce blocking rates. It achieves that by making HTTP requests that mimic browser headers and TLS fingerprints without requiring extra configuration.

Another handy characteristic of Crawlee is that it functions as an all-in-one toolbox for web scraping. We can switch between the available classes, such as CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler, to quickly access the features we need for each specific scraping use case.
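
As a rough sketch of what that single interface looks like, here's a deliberately minimal crawler without the router setup used in the samples later in this section:

import { CheerioCrawler } from "crawlee";

// Swap CheerioCrawler for PlaywrightCrawler (and `$` for `page` in the
// handler context) when the target site needs JavaScript to render.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`${request.url}: ${$("title").text()}`);
    },
});

await crawler.run(["https://news.ycombinator.com/"]);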

⚒️ Main features

  • Single interface for HTTP and headless browser crawling
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Zero-config generation of human-like fingerprints
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling, and retries
  • Dockerfiles ready to deploy

1. Installation

npx crawlee create my-crawler

2. File structure

Before we jump into some code examples, it's important to understand the basic file structure we can expect to see after running the npx crawlee create my-crawler command and choosing a starting template for our project.

To promote code modularity, the crawler logic is split between two files, main.js and routes.js. Once you run your scraper, the extracted data will be automatically stored as JSON files in the storage/datasets directory.

3. Code samples


#1. Using CheerioCrawler

In the first code sample, we will use Crawlee's CheerioCrawler to recursively scrape the Hacker News website.

The crawler starts with a single URL, finds links to the following pages, enqueues them, and continues until no more page links are available. The results are then stored on your disk in the datasets directory.

// main.js

import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const startUrls = ["https://news.ycombinator.com/"];

const crawler = new CheerioCrawler({
    requestHandler: router,
});

await crawler.run(startUrls);

// routes.js

import { Dataset, createCheerioRouter } from "crawlee";

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, $ }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ["https://news.ycombinator.com/?p=*"],
    });

    const data = $(".athing")
        .map((index, post) => {
            return {
                postUrL: $(post).find(".title a").attr("href"),
                title: $(post).find(".title a").text(),
                rank: $(post).find(".rank").text(),
            };
        })
        .toArray();

    await Dataset.pushData({
        data,
    });
});

Expected output (each stored JSON file will contain the results for the particular scraped page):


{
	"data": [
		{
			"postUrL": "https://www.withdiode.com/projects/62716731-5e1e-4622-86af-90d8e6b5123b",
			"title": "A circuit simulator that doesn't look like it was made in 2003withdiode.com",
			"rank": "1."
		},
		{
			"postUrL": "https://lwn.net/ml/linux-doc/20221214185714.868374-1-tytso@mit.edu/",
			"title": "Documentation/Process: Add Linux Kernel Contribution Maturity Modellwn.net",
			"rank": "2."
		},
		{
			"postUrL": "https://computoid.com/APPerl/",
			"title": "Actually Portable Perlcomputoid.com",
			"rank": "3."
		},
		{
			"postUrL": "https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1",
			"title": "How does GPT obtain its ability? Tracing emergent abilities of language modelsyaofu.notion.site",
			"rank": "4."
		},
		{
			"postUrL": "https://www.smashingmagazine.com/2022/09/javascript-api-guide/",
			"title": "Lesser-known JavaScript APIssmashingmagazine.com",
			"rank": "5."
		},
		{
			"postUrL": "item?id=33996871",
			"title": "Ask HN: What prevents a company from hiring remote employees internationally?",
			"rank": "6."
		},
		{
			"postUrL": "https://krebsonsecurity.com/2022/12/six-charged-in-mass-takedown-of-ddos-for-hire-sites/",
			"title": "Six charged in mass takedown of DDoS-for-hire siteskrebsonsecurity.com",
			"rank": "7."
		},
		{
			"postUrL": "https://arxiv.org/abs/2212.07332",
			"title": "Two temperate Earth-mass planets orbiting the nearby star GJ1002arxiv.org",
			"rank": "8."
		},
		{
			"postUrL": "https://techcrunch.com/2022/12/13/boom-takes-the-wraps-off-its-supersonic-symphony-engine-design/",
			"title": "Boom takes the wraps off its supersonic Symphony engine designtechcrunch.com",
			"rank": "9."
		},
		{
			"postUrL": "https://www.theregister.com/2022/12/14/firefox_108/",
			"title": "You can hook your MIDI keyboard up to a website with Firefox 108theregister.com",
			"rank": "10."
		},
        
      ...
      
     	{
			"postUrL": "https://www.nytimes.com/2022/12/12/business/ftx-sam-bankman-fried-bahamas.html",
			"title": "FTX’s Sam Bankman-Fried Said to Be Arrested in the Bahamasnytimes.com",
			"rank": "608."
		},
		{
			"postUrL": "https://arstechnica.com/gadgets/2022/12/report-apple-engineers-are-working-on-third-party-app-store-support-in-ios/",
			"title": "Apple plans to support sideloading and third-party app stores by 2024arstechnica.com",
			"rank": "609."
		},
		{
			"postUrL": "https://twitter.com/latraelrahming/status/1602446687712473088",
			"title": "Sam Bankman-Fried reportedly arrested in the Bahamastwitter.com/latraelrahming",
			"rank": "610."
		},
		{
			"postUrL": "https://www.axios.com/2022/12/14/twitter-elon-musk-jet-tracker-account-suspended",
			"title": "Musk Bans His Twitter's Jet Tracker Account and Its Authoraxios.com",
			"rank": "611."
		},
		{
			"postUrL": "https://www.youtube.com/watch?v=10pFCIFpAtY",
			"title": "Police Caught Red-Handed Making Bogus Traffic Stopyoutube.com",
			"rank": "612."
		},
		{
			"postUrL": "https://www.wsj.com/articles/tesla-investors-voice-concern-over-elon-musks-focus-on-twitter-11670948786",
			"title": "Tesla Investors Voice Concern over Elon Musk’s Focus on Twitterwsj.com",
			"rank": "613."
		}
      
    ]
}

#2. Using PlaywrightCrawler

In the second code example, we will use Crawlee's PlaywrightCrawler to scrape the product card of each phone on Mint Mobile's deals page.

// main.js

import { PlaywrightCrawler } from "crawlee";
import { router } from "./routes.js";

const startUrls = ["https://www.mintmobile.com/deals/"];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(startUrls);

// routes.js

import { Dataset, createPlaywrightRouter } from "crawlee";

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ log, page }) => {
    log.info("Extracting data");
    const data = await page.$$eval(
        ".m-filterable-products-container__card",
        ($phones) => {
            const scrapedData = [];

            // We're getting the URL, product name, total price, and monthly price of each phone on Mint Mobile's deals page.
            $phones.forEach(($phone) => {
                scrapedData.push({
                    productPageUrl: $phone.querySelector(
                        'a[data-elact="Click View Details"]'
                    ).href,
                    name: $phone.querySelector("h2.subHeader--phoneTitle")
                        .innerText,
                    totalPrice: $phone
                        .querySelector(".m-center__price p")
                        .innerText.split(":")
                        .pop(),
                    monthlyPrice: $phone
                        .querySelector(".a-pricing")
                        .innerText.replace(/\n/g, "")
                        .split("/")
                        .shift(),
                });
            });
            return scrapedData;
        }
    );
    log.info("Pushing scraped data to the dataset");
    await Dataset.pushData({
        data,
    });
});

Expected output:


{
	"data": [
		{
			"productPageUrl": "https://www.mintmobile.com/product/apple-iphone-14-plus-bundle/",
			"name": "*NEW* Apple iPhone 14 Plus",
			"totalPrice": " $1019",
			"monthlyPrice": "$85"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/google-pixel-7-pro-bundle/",
			"name": "Google Pixel 7 Pro",
			"totalPrice": " $589",
			"monthlyPrice": "$50"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/google-pixel-6a-bundle/",
			"name": "Google Pixel 6a",
			"totalPrice": " $389",
			"monthlyPrice": "$33"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/iphone-se-bundle-3rd-generation/",
			"name": "Apple iPhone SE (3rd generation)",
			"totalPrice": " $569",
			"monthlyPrice": "$48"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/apple-iphone-14-bundle/",
			"name": "*NEW* Apple iPhone 14",
			"totalPrice": " $919",
			"monthlyPrice": "$77"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/apple-iphone-14-pro-bundle/",
			"name": "*NEW* Apple iPhone 14 Pro",
			"totalPrice": " $1089",
			"monthlyPrice": "$91"
		}
	]
}

Conclusion: why use JavaScript and Node.js for web scraping?

JavaScript, particularly when paired with Node.js, offers a powerful platform for web scraping. This language-runtime duo brings the full capabilities of JavaScript to the server side, allowing developers to use familiar syntax and a vast array of libraries for scraping tasks. JavaScript's ubiquity across web technologies also means that using it for scraping can simplify interactions with web pages that are heavily reliant on JavaScript themselves.

Frequently asked questions

Can you use JavaScript for web scraping?

Yes, you can use JavaScript for web scraping. It is particularly effective for websites that are heavily dependent on JavaScript to render their content, allowing you to interact dynamically with the web page's elements.

Is JavaScript good for scraping?

Yes, JavaScript is good for web scraping. Paired with a headless browser, it can execute a page's scripts the same way a regular browser does, giving you access to dynamically generated content that plain HTTP requests would miss.

What is the best web scraping tool for JavaScript?

The best web scraping tool for JavaScript depends on the task at hand. For sending requests, we recommend Got Scraping and Axios. For parsing, Cheerio is best. For scraping dynamic content, we recommend Playwright. For a complete web scraping library that combines all of these features, we recommend Crawlee.

Is Python or JavaScript better for web scraping?

Both Python and JavaScript are effective for web scraping, but the choice depends on the project's specifics. JavaScript is better for scraping dynamic content directly executed in the browser, while Python offers robust libraries like BeautifulSoup and Scrapy for diverse scraping needs.

Finally, don't forget to join the Apify & Crawlee community on Discord to connect with other web scraping and automation enthusiasts 🚀

Percival Villalva
Developer Advocate on a mission to help developers build scalable, human-like bots for data extraction and web automation.