How to scale Puppeteer and Playwright

Scaling web scrapers horizontally across multiple containers for optimal concurrency. Tips and advice for large-scale scraping projects using Puppeteer or Playwright.


Lukáš Křivka – Head of Actor Development and Delivery at Apify – explains how he overcame the challenges of large-scale scraping with Puppeteer and scraped millions of product pages from one of the top 5 online marketplaces in the world.

So what are the challenges of large-scale scraping?

For large-scale web scraping, you need to use as few resources as possible. Browsers are computationally heavy and slow for scraping. But using browsers means getting blocked less, and avoiding blocking is one of the biggest challenges for web data extraction. I had to find a way to maximize performance by skipping extra resources, such as images, styles, fonts, and even a lot of JavaScript. Some files are required for rendering, but most just did things I didn’t need, so I had to skip them to speed things up and configure Puppeteer to do as little work as possible.
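
To give a concrete picture, here is a minimal sketch (not taken from the original project) of launching Puppeteer with as little extra work as possible. The Chromium flags below are common choices for scraping; what you can safely skip depends on the target site.

import puppeteer from 'puppeteer';

// Launch headless Chromium with image loading disabled at the Blink level,
// so images are never downloaded or decoded.
const browser = await puppeteer.launch({
    headless: true,
    args: [
        '--disable-gpu',
        '--disable-dev-shm-usage', // avoids shared-memory issues in containers
        '--blink-settings=imagesEnabled=false', // skip images entirely
    ],
});

const page = await browser.newPage();
await page.goto('https://example.com'); // placeholder URL
console.log(await page.title());
await browser.close();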

Scaling Puppeteer or Playwright - how it all came together on a large-scale scraping project

Why did you use Puppeteer and not Playwright?

Playwright had just been introduced when I started the project in 2020. Puppeteer was older and had been tested more, and there were more tutorials for it, so that helped me get going quicker.

If you did the project now, would you use Playwright instead?

Playwright is just an alternative to Puppeteer. There’s not much difference between them. Both are JavaScript libraries for driving browsers, and they have very similar APIs. Most of the computation doesn’t happen in the libraries but in the browsers themselves.

One advantage of Playwright is that it allows you to run other browsers as well as Chrome, so you can run Firefox and WebKit. Firefox is heavier, so when scaling, you might want to skip it, but it’s better for overcoming blocking. Firefox tries to make browsing as anonymous as possible, so it doesn’t leave much of a trace. That makes it harder for anti-scraping measures to distinguish bots from users based on the browser fingerprint.

If your main issue is computing resources, go with WebKit or Chromium, but if blocking is a problem, Firefox can help by making it unnecessary to use residential proxies. Residential proxies can be much more expensive than datacenter proxies, so you can save a lot if you don’t need them.
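
For illustration, here is a minimal Playwright sketch (not from the original project) showing that switching engines is just a matter of picking a different launcher; WebKit is available the same way via the webkit export.

import { chromium, firefox } from 'playwright';

// Chromium is lighter on resources; Firefox tends to leave a less
// distinctive fingerprint, which can help against blocking.
const browserType = process.env.USE_FIREFOX ? firefox : chromium;

const browser = await browserType.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com'); // placeholder URL
console.log(await page.title());
await browser.close();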

If you’re trying to choose between Puppeteer and Playwright and starting from scratch, definitely go with Playwright.


So how did you solve the problem and scale Puppeteer?

The e-commerce marketplace I was scraping has more than two million individual sellers. These sellers can appear and disappear without warning, so I needed to scrape each one as soon as its URL was identified, rather than finding all the URLs first and processing them later, which is how I would normally have scraped a large site like this. So it made sense to run a separate instance of the scraper for each seller. To achieve that, I had to scale the scraper horizontally across multiple containers for optimal concurrency.

There were two key things I needed to do: 1) scale inside a single container to maximize efficiency, and 2) duplicate the containers so that hundreds of them could run in parallel. I used the Apify platform, so my containers were actors (the platform’s name for serverless microapps), but the principle is the same for any container or serverless platform, such as AWS Lambda.

1. Scaling inside a single container

To scale inside a single container, I had to check the resources available to Node.js and the hardware, and scale concurrency up (or down) accordingly. I did this using Crawlee, a web scraping and browser automation library.

Crawlee was part of the Apify SDK at the time, but Apify released it as a standalone open-source library in August 2022.

Crawlee monitors the status of the CPU, memory, and event loop, and it manages concurrency out of the box. If the machine is doing too much work, it scales the crawler down, and vice versa.

Another feature that helped me was the ability to block unwanted requests without disabling the browser cache. This was crucial.

import { puppeteerUtils } from 'crawlee';

// Block all requests to URLs that include `adsbygoogle.js`,
// plus the default patterns like *.jpg, *.png, and so on.
await puppeteerUtils.blockRequests(page, {
    extraUrlPatterns: ['adsbygoogle.js'],
});

There are two ways to block requests in Puppeteer. One is `request interception`. For each request, you inspect the request object before it’s sent and decide whether to continue or abort it. That’s great, but with modern browsers it has a huge drawback: because of how the browser implements it (and there’s nothing you can do about this), it disables the browser cache. The cache saves you a lot of bandwidth, and it also keeps parsed JavaScript in memory, which in turn saves a lot of CPU power. So if you use `request interception`, the cache gets disabled and your crawler gets slower.
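
For reference, a minimal sketch of that first approach in plain Puppeteer (with the cache trade-off just described) might look like this:

// Note: turning on interception is what disables the browser cache.
await page.setRequestInterception(true);
page.on('request', (request) => {
    // Abort heavy resource types; let everything else through.
    if (['image', 'stylesheet', 'font'].includes(request.resourceType())) {
        request.abort();
    } else {
        request.continue();
    }
});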

The blockRequests function in Crawlee uses a little-known browser capability that doesn’t disable the cache.

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    autoscaledPoolOptions: {
        // You can set various options to fine-tune
        // the scaling behavior, but it works great
        // with default settings on most machines.
    },
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});

await crawler.run(['https://crawlee.dev']);

2. Scaling across multiple containers

The Apify SDK made it easy to spawn many different containers (actors), and it automatically provided the storage, proxies, and APIs I needed. I just had to tell one scraper to extract one set of URLs while another scraper extracted a different set, and they would all run simultaneously, dynamically creating new instances as required.
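
As a rough sketch of how that splitting could look with today's Apify SDK (the actor name, input shape, and batch size below are hypothetical, not the ones from the project):

import { Actor } from 'apify';

await Actor.init();

const sellerUrls = ['https://example-marketplace.com/seller/1' /* , ... */]; // hypothetical list
const chunkSize = 10000; // hypothetical batch size per container

// Start one scraper run (one container) per chunk of URLs; the runs execute in parallel.
for (let i = 0; i < sellerUrls.length; i += chunkSize) {
    const chunk = sellerUrls.slice(i, i + chunkSize);
    await Actor.start('my-user/seller-scraper', { startUrls: chunk }); // hypothetical actor ID
}

await Actor.exit();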

To put this into perspective, what was the scale of the project in numbers?

To put it as simply and concisely as possible: I scraped 90 million web pages with Puppeteer in just two months, which works out to roughly 17 pages per second, sustained around the clock. It’s important to say that scraping with headless browsers is far more resource-intensive than plain HTTP scraping. I had to run hundreds of containers over that period to make it happen.


What final tip would you give other developers who need to scrape at scale?

Use as few resources as possible. JSON APIs are the best way to do this, so if a site has one, use it, unless blocking is a problem. Some websites, such as Amazon, don’t expose a JSON API and serve everything as HTML. The JSON data might still be embedded in the page, but then you have to download the HTML and extract it. If you can get at the JSON directly, you can scale efficiently.
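
For illustration, a minimal sketch of plain HTTP scraping against a hypothetical JSON endpoint (the URL and response shape are made up) - no browser involved, which is what makes it so much cheaper to scale:

// Assumes Node.js 18+ with the built-in fetch.
const response = await fetch('https://example.com/api/products?page=1');
const data = await response.json();

for (const product of data.products ?? []) {
    console.log(product.title, product.price);
}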

Lukáš Křivka
Philosopher turned self-taught developer. I manage a team of devs at Apify, and still love going deep into programming concepts. I like gardening, sports, and dogs.

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
