Web scraping with JavaScript and Node.js

Percival Villalva

Explore some of the best JavaScript libraries and frameworks available for web scraping in Node.js and learn how to use them in your projects.

Getting started with web scraping in Node.js

Originally designed as a client-side language, JavaScript is one of the most popular programming languages out there and is rightfully referred to as the language of the web.

With the introduction of Node.js, developers could suddenly do much more with the language. As a result, the JavaScript ecosystem exploded with new use cases and tools that turned the language into a powerhouse on both the client and server sides.

On top of that, TypeScript is becoming increasingly popular. This indicates that acclaim for the Node.js and JavaScript ecosystem will continue to grow in the coming years.

How does all that relate to web scraping? For years, Python has been the go-to language for data extraction. However, the JavaScript scraping community is growing steadily. As powerful new libraries come along, it’s safe to say that the Node.js ecosystem is one of the most reliable choices for modern web scraping.

This article will explore some of the best libraries and frameworks available for web scraping in Node.js and show how you can use them in your projects.

Requirements

To fully understand the content and code samples showcased in this post, you should have a grasp of the following concepts:

  • Familiarity with JavaScript ES6
  • Understanding of CSS selectors
  • Familiarity with jQuery syntax
  • Comfortable navigating the browser DevTools to find and select page elements

Editor's choice ⭐️

Throughout this article, a star (⭐️) indicates the tools we usually give preference to when working on our web scraping projects here at Apify. That’s not to discredit the other libraries mentioned here (we use all of them in one way or another depending on the project) but to give you an insight into the tech stack of a Node.js web scraper.

HTTP Clients

HTTP clients are end-point applications that use the HTTP protocol to send requests and get a response from servers. In the context of web scraping, they’re necessary to send requests to your target website and retrieve information such as the website’s HTML markup or JSON payload.

Axios

Axios is a promise-based HTTP client that runs in both Node.js and the browser with the same codebase.

On the server side, Axios uses the native Node.js http module, while on the client side, it uses XMLHttpRequest.

⚒️  Main Features

  • Make XMLHttpRequests from the browser
  • Make HTTP requests from Node.js
  • Promise API support
  • Intercept request and response
  • Transform request and response data

⚙️  Installation

# npm
npm install axios

# Yarn
yarn add axios

# pnpm
pnpm add axios

💡  Code Sample

Send a request to the target website, retrieve its HTML code and log the result to the console.

const axios = require("axios");

(async () => {
    const response = await axios.get("https://news.ycombinator.com/");
    const html = response.data;
    console.log(html);
})();

Got Scraping ⭐️


Got Scraping is a package extension of the Got HTTP client. Its primary purpose is to address common drawbacks in modern web scraping by offering built-in tools to make bot requests less likely to be detected and blocked by modern website anti-scraping protections.

In other words, Got-scraping sends browser-like requests, which enables web scraping bots to blend in with the website traffic, making it less likely for them to be detected and blocked.

⚒️  Main Features

  • Out-of-the-box browser-like requests to blend in with the website traffic and reduce blocking rates.
  • Default configuration to retry requests on failure
  • Option to generate browser-like headers

⚙️  Installation

$ npm install got-scraping

💡  Code Sample

Similar to the Axios example, we will send a request to the target website, retrieve the page's HTML code, and log it to the console.

const { gotScraping } = require("got-scraping");

(async () => {
    const response = await gotScraping.get("https://news.ycombinator.com/");
    const html = response.body;
    console.log(html);
})();

Got-scraping also comes bundled with the header-generator package, which enables us to choose from various browsers from different operating systems and devices.

It generates all the headers automatically, which is handy when scraping websites that employ aggressive anti-bot blocking systems: the more "browser-like" our bot's requests look, the lower the chances of them getting blocked.

To demonstrate that, let's take a look at an example of a request using the headerGeneratorOptions option:

const { gotScraping } = require("got-scraping");

(async () => {
    const response = await gotScraping({
        url: "https://api.apify.com/v2/browser-info",
        headerGeneratorOptions: {
            browsers: [{ name: "firefox", minVersion: 80 }],
            devices: ["desktop"],
            locales: ["en-US", "en"],
            operatingSystems: ["linux"],
        },
    });
    console.log(response.body);
})();

Here is the result you can expect to see logged to the console for the example above:

"headers": {
    "user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "upgrade-insecure-requests": "1",
    "sec-fetch-site": "same-site",
    "sec-fetch-mode": "navigate",
    "sec-fetch-user": "?1",
    "sec-fetch-dest": "document"
  }
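To make the idea of "browser-like" headers more concrete, here is a minimal hand-rolled sketch of what a header generator conceptually produces. This is not how the header-generator package works internally (real generators are considerably more sophisticated); the function name buildFirefoxLikeHeaders and its options are hypothetical, and the header values simply mirror the sample output above.

```javascript
// Hypothetical helper: builds a static set of Firefox-on-Linux style headers.
// It only illustrates the shape of the result, not the real package's logic.
function buildFirefoxLikeHeaders({ version = 107, locales = ["en-US", "en"] } = {}) {
    // The first locale has an implicit q=1; later ones get decreasing q-values
    const acceptLanguage = locales
        .map((locale, index) =>
            index === 0
                ? locale
                : `${locale};q=${Math.max(0.1, 1 - index * 0.1).toFixed(1)}`
        )
        .join(",");

    return {
        "user-agent": `Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:${version}.0) Gecko/20100101 Firefox/${version}.0`,
        accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "accept-language": acceptLanguage,
        "accept-encoding": "gzip, deflate, br",
        "upgrade-insecure-requests": "1",
    };
}

console.log(buildFirefoxLikeHeaders({ version: 107 }));
```

In practice, we let Got Scraping generate these values for us, since keeping a hand-written set of headers consistent with real browser releases quickly becomes a chore.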

HTML and XML parser

When it comes to web scraping in Node.js, you can’t go wrong with Cheerio, the most popular and widely-used HTML and XML parser for the Node.js ecosystem.

Cheerio ⭐️

Cheerio is an efficient and flexible implementation of core jQuery designed to run on the server. Because of its incredible efficiency and familiar syntax, Cheerio is our best friend when scraping pages that don't require JavaScript to load their contents.

⚒️ Main features

  • Implements a subset of core jQuery, providing developers with a familiar and easy-to-use syntax.
  • Works with a simple and consistent DOM model, making parsing, manipulating, and rendering incredibly efficient.
  • Offers great flexibility, being able to parse nearly any HTML or XML document.

⚙️  Installation

npm install cheerio

💡  Code Sample

Let's now see how we can use Cheerio + Got-scraping to extract the text content from all the articles on the first page of Hacker News.

const { gotScraping } = require("got-scraping");
const cheerio = require("cheerio");

(async () => {
    const response = await gotScraping("https://news.ycombinator.com/");
    const html = response.body;

    // Use Cheerio to parse the HTML
    const $ = cheerio.load(html);

    // Select all the elements with the class name "athing"
    const articles = $(".athing");

    // Loop through the selected elements
    for (const article of articles) {
        const articleTitleText = $(article).text();

        // Log each element's text to the terminal
        console.log(articleTitleText);
    }
})();

A few seconds after running the script, we will see the rank and title of the 30 articles currently on the Hacker News front page logged to our terminal.

Output example:


      1.      US Department of Energy: Fusion Ignition Achieved (energy.gov)
      2.      Reddit's photo albums broke due to Integer overflow of Signed Int32 (reddit.com)
      3.      About the security content of iOS 16.2 and iPadOS 16.2 (support.apple.com)
      4.      Balloon framing is worse-is-better (2021) (constructionphysics.substack.com)
      5.      After 20 years the Dwarf Fortress devs have to get used to being millionaires (pcgamer.com)
      ...
      25.      How much decentralisation is too much? (shkspr.mobi)
      26.      What we can learn from vintage computing (github.com/readme)
      27.      Data2vec 2.0: Highly efficient self-supervised learning for vision, speech, text (facebook.com)
      28.      Pony Programming Language (github.com/ponylang)
      29.      Al Seckel on Richard Feynman (2001) (fotuva.org)
      30.      Hydra – the fastest Postgres for analytics [benchmarks] (hydras.io)
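Each logged line is still one unstructured string. As a rough sketch of a possible next step, the helper below splits a line into its rank, title, and site. The function name parseArticleLine is our own invention, and the regular expression assumes the line layout shown in the output above.

```javascript
// Hypothetical helper: splits one line of the output above into its parts.
// Assumes the layout "<rank>. <title> (<site>)", where the site is optional.
function parseArticleLine(line) {
    const match = line.trim().match(/^(\d+)\.\s+(.*?)(?:\s+\(([^()]+)\))?$/);
    if (!match) return null; // The line does not follow the expected layout

    const [, rank, title, site] = match;
    return { rank: Number(rank), title, site: site ?? null };
}

console.log(
    parseArticleLine("      4.      Balloon framing is worse-is-better (2021) (constructionphysics.substack.com)")
);
```

Keep in mind that a title ending in parentheses with no site after it (for example, a text post ending in "(2021)") would be misread by this pattern, so selecting the rank, title, and site as separate elements with Cheerio is usually the more robust approach.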

Browser automation tools

Browser automation libraries have an off-label use for web scraping. Their ability to emulate a real browser enables scrapers to access data on websites that require JavaScript to load their content.

Puppeteer

Puppeteer is an open-source Node.js browser automation library developed and maintained by Google. It provides a high-level API to control headless Chrome programmatically and can also be configured to run a full, non-headless browser.

In the context of web scraping, Puppeteer's ability to emulate a real browser allows it to render JavaScript. That is particularly useful when we want to extract data from pages that load their content dynamically and that we, therefore, couldn't scrape with just plain HTTP requests and Cheerio.

⚒️ Main features

  • Crawl a Single-Page Application and generate pre-rendered content (i.e., server-side rendering)
  • Take screenshots and generate PDFs of pages.
  • Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.
  • Supports Chromium and Firefox.

⚙️  Installation

# npm
npm i puppeteer

# Yarn
yarn add puppeteer

# pnpm
pnpm i puppeteer

💡  Code Sample

To demonstrate some of Puppeteer's capabilities, let's go to Amazon, scrape The Hitchhiker's Guide to the Galaxy product page, and save a screenshot of the accessed page.

By default, Puppeteer will launch a headless browser. In this example, we will set the headless option to false so we can follow Puppeteer as it loads the browser and goes to the specified website.

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({
        headless: false, // Set headless to false, so we can see the browser working
    });
    const page = await browser.newPage();
    await page.setViewport({ width: 1366, height: 768 }); // Set the viewport size
    await page.goto(
        "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1"
    );

    const book = await page.evaluate(() => {
        return {
            title: document.querySelector("#productTitle").innerText,
            author: document.querySelector(".a-link-normal.contributorNameID")
                .innerText,
            edition: document.querySelector("#productSubtitle").innerText,
            digitalListPrice: document.querySelector("#digital-list-price")
                .innerText,
            printListPrice:
                document.querySelector("#print-list-price").innerText,
            kindlePrice: document.querySelector("#kindle-price").innerText,
        };
    });

    await page.screenshot({ path: "book.png" }); // Take a screenshot of the page
    console.log(book);
    await browser.close();
})();

After the script finishes its run, we will see an object containing the book's title, author, edition, and prices logged to the console, and a screenshot of the page saved as book.png.

Output example:

{
  title: "The Hitchhiker's Guide to the Galaxy: The Illustrated Edition ",
  author: 'Douglas Adams',
  edition: 'Kindle Edition',
  digitalListPrice: '$7.24',
  printListPrice: '$7.99',
  kindlePrice: '$6.31'
}
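One caveat with the script above: querySelector returns null when a selector matches nothing, so a single missing element makes the whole evaluate call throw. The sketch below shows one possible way to guard against that. The helper name textOrNull is hypothetical, and the fakeDocument object is only a stand-in so the snippet can run outside a browser.

```javascript
// Hypothetical helper: returns the trimmed text of the first matching element,
// or null when the selector finds nothing.
function textOrNull(root, selector) {
    const element = root.querySelector(selector);
    return element ? element.innerText.trim() : null;
}

// Stand-in for the browser's `document`, used only to demonstrate the helper
const fakeDocument = {
    querySelector: (selector) =>
        selector === "#productTitle"
            ? { innerText: "The Hitchhiker's Guide to the Galaxy: The Illustrated Edition " }
            : null,
};

console.log(textOrNull(fakeDocument, "#productTitle")); // Title without the trailing space
console.log(textOrNull(fakeDocument, "#print-list-price")); // null
```

Because the function passed to page.evaluate runs inside the browser context, a helper like this has to be declared inside that callback rather than in the surrounding Node.js scope.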

Playwright ⭐️

Playwright is an open-source framework for web testing and automation developed and maintained by Microsoft.

Fun fact: a significant part of Playwright's developer team is composed of the same engineers who worked on Puppeteer. Because of that, the two tools have many things in common in terms of functionality and syntax. This is a positive factor for developers already working with Puppeteer since it flattens the learning curve and reduces the hassle of migrating code from Puppeteer to Playwright.

Despite the similarities, there are some major differences between the two tools. In many ways, Playwright is considered a more modern and capable alternative to Puppeteer.

In addition to its extra functionalities and flexibility (see the Main Features section below), Playwright has performed incredibly well in speed benchmark tests compared to other popular web automation libraries and frameworks, including Puppeteer. For those reasons, Playwright is nowadays our go-to browser automation framework for web scraping.

⚒️ Main features

  • Auto-wait. Playwright, by default, waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.
  • Cross-browser support, being able to drive Chromium, WebKit, Firefox, and Microsoft Edge.
  • Playwright is available in multiple languages, including JavaScript and TypeScript, Python, Java, and .NET
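To illustrate the idea behind auto-waiting, here is a small conceptual sketch of a polling helper that keeps checking a condition until it holds or a deadline passes. This is not Playwright's actual implementation; the waitFor function and its timeoutMs and intervalMs options are hypothetical names chosen for this example.

```javascript
// Conceptual sketch: poll a condition until it returns a truthy value or the timeout expires.
async function waitFor(check, { timeoutMs = 5000, intervalMs = 100 } = {}) {
    const deadline = Date.now() + timeoutMs;

    while (true) {
        const result = await check();
        if (result) return result; // Condition met, hand the value back

        if (Date.now() >= deadline) {
            throw new Error(`Condition not met within ${timeoutMs} ms`);
        }
        // Pause briefly before checking again
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
}

// Demo: the simulated "element" becomes available on the third check
let attempts = 0;
waitFor(() => (++attempts >= 3 ? "ready" : null), { intervalMs: 10 }).then(console.log);
```

With Playwright, we don't need to write loops like this ourselves, because actions such as clicking or reading text already wait for the target element before proceeding.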

⚙️  Installation

# Run from your project's root directory
npm init playwright@latest

# Or create a new project
npm init playwright@latest new-project

💡  Code Sample

To highlight Playwright's features and syntax similarities with Puppeteer, let's go back to Amazon's website and extract some data from Douglas Adams' The Hitchhiker's Guide to the Galaxy.

Playwright version:

const playwright = require("playwright");

(async () => {
    const browser = await playwright.webkit.launch({
        headless: false, // Set headless to false, so we can see the browser working
    });
    const page = await browser.newPage();
    await page.setViewportSize({ width: 1366, height: 768 }); // Set the viewport size
    await page.goto(
        "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1"
    );

    const book = {
        bookTitle: await page.locator("#productTitle").innerText(),
        author: await page
            .locator(".a-link-normal.contributorNameID")
            .innerText(),
        edition: await page.locator("#productSubtitle").innerText(),
        digitalListPrice: await page.locator("#digital-list-price").innerText(),
        printListPrice: await page.locator("#print-list-price").innerText(),
        kindlePrice: await page.locator("#kindle-price").innerText(),
    };

    await page.screenshot({ path: "book.png" }); // Take a screenshot of the page
    console.log(book);
    await browser.close();
})();

After the scraper finishes its run, the browser controlled by Playwright will close, and the extracted data will be logged to the console.

Using Playwright with Cheerio

The primary reason why we need a browser automation library for web scraping is to load a browser so we can access dynamically generated content on web pages that require JavaScript to function.

However, the prospect of having to memorize yet another set of library-specific syntax does not sound that exciting. So, wouldn't it be nice if we could take advantage of Puppeteer and Playwright's functionalities while still being able to use Cheerio's jQuery syntax to select elements and extract data? Well, that's precisely what we will do in this section.

We will start by accessing the target website with Playwright and grabbing the page's HTML markup. We will then save that markup to a variable and feed it into Cheerio's load() function, which parses the resulting HTML.

For this demonstration, we will use https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ as our target website. Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.

Mint Mobile product page with JavaScript disabled:

Mint Mobile JavaScript disabled

Mint Mobile product page with JavaScript enabled:

Mint Mobile JavaScript enabled

⚙️  Installation

npm install playwright cheerio

💡  Code Sample

So, without further ado, let's use Playwright + Cheerio to extract the product data highlighted in the image above.

const playwright = require("playwright");
const cheerio = require("cheerio");

(async () => {
    const browser = await playwright.firefox.launch({
        headless: false,
    });
    const page = await browser.newPage();
    await page.goto(
        "https://www.mintmobile.com/product/google-pixel-7-pro-bundle/"
    );
    const html = await page.evaluate(() => document.body.innerHTML); // Save the page's HTML to a variable

    const $ = cheerio.load(html); // Use Cheerio to load the page's HTML code

    // Continue writing your scraper using Cheerio's jQuery syntax
    const phone = {
        name: $("div.m-productCard__heading h1").text().trim(),
        memory: $(
            "div.composited_product_details_wrapper > div > div > div:nth-child(2) > div.label > span"
        )
            .text()
            .split(" ")
            .pop(),
        payMonthlyPrice: $("div.composite_price_monthly span").text().trim(),
        payTodayPrice: $("div.composite_price > p > ins > span").text().trim(),
    };

    console.log(phone);
    await browser.close();
})();

Expected output:

{
  name: 'Google Pixel 7 Pro',
  memory: '128GB',
  payMonthlyPrice: '$50',
  payTodayPrice: '$589'
}


Crawlee ⭐️

Crawlee is an open-source Node.js web scraping and automation library developed and maintained by Apify. It builds on top of many of the previously mentioned libraries and frameworks, namely Got-scraping, Cheerio, Puppeteer, and Playwright, and takes advantage of the already great features of these tools while providing extra functionality tailored to the needs and preferences of web scraping developers.

Crawlee was born out of the necessity for a specialized web scraping library for the Node.js ecosystem. Modern data extraction is becoming increasingly challenging, as websites become more complex and continuously employ aggressive anti-scraping blocking methods.

One of Crawlee's major selling points is its extensive out-of-the-box collection of features to help scrapers overcome modern website anti-bot defenses and reduce blocking rates. It achieves that by making HTTP requests that mimic browser headers and TLS fingerprints without requiring extra configuration.

Another handy characteristic of Crawlee is that it functions as an all-in-one toolbox for web scraping. We can switch between the available classes, such as CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler, to quickly access the features we need for each specific scraping use case.

⚒️ Main features

  • Single interface for HTTP and headless browser crawling
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Zero-config generation of human-like fingerprints
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling, and retries
  • Dockerfiles ready to deploy

⚙️  Installation

npx crawlee create my-crawler

📂  File structure

Before we jump into some code examples, it's important to understand the basic file structure we can expect to see after running the npx crawlee create my-crawler command and choosing a starting template for our project.

To promote code modularity, the crawler logic is split between two files, main.js and routes.js. Once you run your scraper, the extracted data will be automatically stored as JSON files in the datasets directory.
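Since the results end up as plain JSON files on disk, we can read them back with nothing but Node.js built-ins. The sketch below is one possible way to do that. The loadDataset helper is hypothetical, and the storage/datasets/default path is an assumption about where the files land in a default project, so adjust it to match your setup.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: reads every .json file in a directory
// and returns the parsed items in file-name order.
function loadDataset(datasetDir) {
    return fs
        .readdirSync(datasetDir)
        .filter((fileName) => fileName.endsWith(".json"))
        .sort()
        .map((fileName) =>
            JSON.parse(fs.readFileSync(path.join(datasetDir, fileName), "utf8"))
        );
}

// Assumed default location of the stored results; change it if your project differs
const defaultDir = path.join("storage", "datasets", "default");
if (fs.existsSync(defaultDir)) {
    console.log(`Loaded ${loadDataset(defaultDir).length} item(s)`);
}
```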

💡  Code Sample


CheerioCrawler

In the first code sample, we will use Crawlee's CheerioCrawler to recursively scrape the Hacker News website.

The crawler starts with a single URL, finds links to the following pages, enqueues them, and continues until no more page links are available. The results are then stored on your disk in the datasets directory.
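Before looking at the Crawlee code, it may help to see the bare crawling loop described above in isolation. The following is a simplified sketch that uses an in-memory object instead of real HTTP requests. The fakeSite data and the crawl function are made up for illustration, and Crawlee's own request queue does far more than this (concurrency, retries, persistence).

```javascript
// In-memory stand-in for a website: each "page" lists the links found on it
const fakeSite = {
    "/": ["/?p=2", "/about"],
    "/?p=2": ["/?p=3", "/"],
    "/?p=3": [],
    "/about": [],
};

// Visit a page, enqueue unseen links that pass the filter, repeat until the queue is empty
function crawl(startUrl, getLinks, shouldEnqueue = () => true) {
    const queue = [startUrl];
    const seen = new Set(queue); // Prevents enqueueing the same URL twice
    const visited = [];

    while (queue.length > 0) {
        const url = queue.shift();
        visited.push(url);

        for (const link of getLinks(url)) {
            if (!seen.has(link) && shouldEnqueue(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }
    return visited;
}

const visitedPages = crawl(
    "/",
    (url) => fakeSite[url] ?? [],
    (link) => link.startsWith("/?p=") // Only follow pagination links
);
console.log(visitedPages); // [ '/', '/?p=2', '/?p=3' ]
```

The filter function plays a role similar to the globs option passed to enqueueLinks in the Crawlee example: it keeps the crawler on the pagination links and away from everything else.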

// main.js

import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const startUrls = ["https://news.ycombinator.com/"];

const crawler = new CheerioCrawler({
    requestHandler: router,
});

await crawler.run(startUrls);

// routes.js

import { Dataset, createCheerioRouter } from "crawlee";

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, $ }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ["https://news.ycombinator.com/?p=*"],
    });

    const data = $(".athing")
        .map((index, post) => {
            return {
                postUrL: $(post).find(".title a").attr("href"),
                title: $(post).find(".title a").text(),
                rank: $(post).find(".rank").text(),
            };
        })
        .toArray();

    await Dataset.pushData({
        data,
    });
});

Expected output (each stored JSON file will contain the results for the particular scraped page):


{
	"data": [
		{
			"postUrL": "https://www.withdiode.com/projects/62716731-5e1e-4622-86af-90d8e6b5123b",
			"title": "A circuit simulator that doesn't look like it was made in 2003withdiode.com",
			"rank": "1."
		},
		{
			"postUrL": "https://lwn.net/ml/linux-doc/20221214185714.868374-1-tytso@mit.edu/",
			"title": "Documentation/Process: Add Linux Kernel Contribution Maturity Modellwn.net",
			"rank": "2."
		},
		{
			"postUrL": "https://computoid.com/APPerl/",
			"title": "Actually Portable Perlcomputoid.com",
			"rank": "3."
		},
		{
			"postUrL": "https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1",
			"title": "How does GPT obtain its ability? Tracing emergent abilities of language modelsyaofu.notion.site",
			"rank": "4."
		},
		{
			"postUrL": "https://www.smashingmagazine.com/2022/09/javascript-api-guide/",
			"title": "Lesser-known JavaScript APIssmashingmagazine.com",
			"rank": "5."
		},
		{
			"postUrL": "item?id=33996871",
			"title": "Ask HN: What prevents a company from hiring remote employees internationally?",
			"rank": "6."
		},
		{
			"postUrL": "https://krebsonsecurity.com/2022/12/six-charged-in-mass-takedown-of-ddos-for-hire-sites/",
			"title": "Six charged in mass takedown of DDoS-for-hire siteskrebsonsecurity.com",
			"rank": "7."
		},
		{
			"postUrL": "https://arxiv.org/abs/2212.07332",
			"title": "Two temperate Earth-mass planets orbiting the nearby star GJ1002arxiv.org",
			"rank": "8."
		},
		{
			"postUrL": "https://techcrunch.com/2022/12/13/boom-takes-the-wraps-off-its-supersonic-symphony-engine-design/",
			"title": "Boom takes the wraps off its supersonic Symphony engine designtechcrunch.com",
			"rank": "9."
		},
		{
			"postUrL": "https://www.theregister.com/2022/12/14/firefox_108/",
			"title": "You can hook your MIDI keyboard up to a website with Firefox 108theregister.com",
			"rank": "10."
		},
        
      ...
      
     	{
			"postUrL": "https://www.nytimes.com/2022/12/12/business/ftx-sam-bankman-fried-bahamas.html",
			"title": "FTX’s Sam Bankman-Fried Said to Be Arrested in the Bahamasnytimes.com",
			"rank": "608."
		},
		{
			"postUrL": "https://arstechnica.com/gadgets/2022/12/report-apple-engineers-are-working-on-third-party-app-store-support-in-ios/",
			"title": "Apple plans to support sideloading and third-party app stores by 2024arstechnica.com",
			"rank": "609."
		},
		{
			"postUrL": "https://twitter.com/latraelrahming/status/1602446687712473088",
			"title": "Sam Bankman-Fried reportedly arrested in the Bahamastwitter.com/latraelrahming",
			"rank": "610."
		},
		{
			"postUrL": "https://www.axios.com/2022/12/14/twitter-elon-musk-jet-tracker-account-suspended",
			"title": "Musk Bans His Twitter's Jet Tracker Account and Its Authoraxios.com",
			"rank": "611."
		},
		{
			"postUrL": "https://www.youtube.com/watch?v=10pFCIFpAtY",
			"title": "Police Caught Red-Handed Making Bogus Traffic Stopyoutube.com",
			"rank": "612."
		},
		{
			"postUrL": "https://www.wsj.com/articles/tesla-investors-voice-concern-over-elon-musks-focus-on-twitter-11670948786",
			"title": "Tesla Investors Voice Concern over Elon Musk’s Focus on Twitterwsj.com",
			"rank": "613."
		}
      
    ]
}
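The output above has two quirks worth cleaning up: text posts come back with relative URLs (such as item?id=33996871), and the site name is glued to the end of each title, most likely because the .title a selector also matches the site link. Here is a minimal post-processing sketch that addresses both. The normalizePost function is a hypothetical helper, not part of Crawlee.

```javascript
const BASE_URL = "https://news.ycombinator.com/";

// Hypothetical helper: cleans up one scraped post record
function normalizePost(post) {
    // new URL() resolves relative links against the base and leaves absolute ones untouched
    const url = new URL(post.postUrL, BASE_URL);
    const site = url.hostname.replace(/^www\./, "");

    // Drop the site name when it was appended to the end of the title
    const title =
        site && post.title.endsWith(site)
            ? post.title.slice(0, -site.length).trim()
            : post.title;

    return { postUrl: url.href, title, rank: parseInt(post.rank, 10) };
}

console.log(
    normalizePost({
        postUrL: "https://computoid.com/APPerl/",
        title: "Actually Portable Perlcomputoid.com",
        rank: "3.",
    })
);
```

This only strips suffixes that match the link's hostname, so titles followed by a longer site label (for example, twitter.com/latraelrahming) are left as they are. Selecting the title link more precisely in the crawler itself would avoid the problem at the source.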

PlaywrightCrawler

In the second code example, we will use Crawlee's PlaywrightCrawler to scrape the product card of each phone on Mint Mobile's deals page.

// main.js

import { PlaywrightCrawler } from "crawlee";
import { router } from "./routes.js";

const startUrls = ["https://www.mintmobile.com/deals/"];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(startUrls);

// routes.js

import { Dataset, createPlaywrightRouter } from "crawlee";

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ log, page }) => {
    log.info("Extracting data");
    const data = await page.$$eval(
        ".m-filterable-products-container__card",
        ($phones) => {
            const scrapedData = [];

            // We're getting the URL, product name, total price and monthly price of each phone on Mint Mobile's first page.
            $phones.forEach(($phone) => {
                scrapedData.push({
                    productPageUrl: $phone.querySelector(
                        'a[data-elact="Click View Details"]'
                    ).href,
                    name: $phone.querySelector("h2.subHeader--phoneTitle")
                        .innerText,
                    totalPrice: $phone
                        .querySelector(".m-center__price p")
                        .innerText.split(":")
                        .pop(),
                    monthlyPrice: $phone
                        .querySelector(".a-pricing")
                        .innerText.replace(/\n/g, "")
                        .split("/")
                        .shift(),
                });
            });
            return scrapedData;
        }
    );
    log.info("Pushing scraped data to the dataset");
    await Dataset.pushData({
        data,
    });
});

Expected output:


{
	"data": [
		{
			"productPageUrl": "https://www.mintmobile.com/product/apple-iphone-14-plus-bundle/",
			"name": "*NEW* Apple iPhone 14 Plus",
			"totalPrice": " $1019",
			"monthlyPrice": "$85"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/google-pixel-7-pro-bundle/",
			"name": "Google Pixel 7 Pro",
			"totalPrice": " $589",
			"monthlyPrice": "$50"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/google-pixel-6a-bundle/",
			"name": "Google Pixel 6a",
			"totalPrice": " $389",
			"monthlyPrice": "$33"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/iphone-se-bundle-3rd-generation/",
			"name": "Apple iPhone SE (3rd generation)",
			"totalPrice": " $569",
			"monthlyPrice": "$48"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/apple-iphone-14-bundle/",
			"name": "*NEW* Apple iPhone 14",
			"totalPrice": " $919",
			"monthlyPrice": "$77"
		},
		{
			"productPageUrl": "https://www.mintmobile.com/product/apple-iphone-14-pro-bundle/",
			"name": "*NEW* Apple iPhone 14 Pro",
			"totalPrice": " $1089",
			"monthlyPrice": "$91"
		}
	]
}
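Notice that the prices in this output are strings, some with a leading space, and that a few names carry a *NEW* prefix. If we wanted to sort or compare the phones, a small normalization step like the sketch below could help. Both parsePrice and normalizePhone are hypothetical helpers written for this example.

```javascript
// Hypothetical helper: turns a price string such as " $1019" or "$85" into a number.
// Returns null when the text contains no number.
function parsePrice(text) {
    const match = text.replace(/,/g, "").match(/\d+(?:\.\d+)?/);
    return match ? Number(match[0]) : null;
}

// Hypothetical helper: cleans up one scraped phone record
function normalizePhone(phone) {
    return {
        ...phone,
        isNew: phone.name.startsWith("*NEW*"), // Keep the flag as a separate field
        name: phone.name.replace(/^\*NEW\*\s*/, ""),
        totalPrice: parsePrice(phone.totalPrice),
        monthlyPrice: parsePrice(phone.monthlyPrice),
    };
}

console.log(
    normalizePhone({
        productPageUrl: "https://www.mintmobile.com/product/apple-iphone-14-plus-bundle/",
        name: "*NEW* Apple iPhone 14 Plus",
        totalPrice: " $1019",
        monthlyPrice: "$85",
    })
);
```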

Learning resources 📚

If you want to dive deeper into some of the libraries and frameworks we presented during this post, here is a curated list of great videos and articles about the topic:

Crawlee

Open-source Crawlee project using TypeScript

Complete Zappos.com e-commerce scraper built using Crawlee's Cheerio Scraper.

GitHub - PerVillalva/zappos-scraper

Open-source Crawlee project using JavaScript

Example of API scraping using Crawlee's HttpCrawler to extract data from Sears.com.

GitHub - PerVillalva/sears-scraper

Discord

Finally, don't forget to join the Apify & Crawlee community on Discord to connect with other web scraping and automation enthusiasts. 🚀
