Playwright to Cheerio: a 10x faster scraper

👉

This article was written by Guillaume Lancrenon as part of Write for Apify - a program for developers sharing original articles about what they've built with Apify and Crawlee.

Lately, I had a request to extract data for a client. The website was UGAP (Union des Groupements d'Achats Publics), France’s public procurement marketplace. So I built an Actor - like I usually do - that does the following:

Searches for a list of keywords (for example, table)
Collects product links from search results
Visits product pages and extracts title, prices (HT/TTC - pre-tax and tax-included), description, and reference code
Pushes clean structured data into an Apify dataset

As usual, Playwright was my first choice, as it handles almost any site. But that flexibility comes at a cost: it's slow, consumes a lot of memory, and the Docker image is large.

This January, instead of getting fit myself, I reshaped this Actor to use CheerioCrawler instead of Playwright. That single decision changed everything: 10x faster, 10x lower memory usage, better stability, lighter builds, and much higher concurrency on the Apify platform.

This article is my migration story, based on the actual code.

Why UGAP was a good target

UGAP is a central purchasing body for French public procurement. It has a huge catalog (hundreds of thousands of items) and the data is available in static HTML, which makes it a good target for scraping.

General rule here: public-sector websites usually serve their data in static HTML since their goal is to let people access the data easily (and for anyone) and not to make it difficult to scrape.

No need to log in, no need to interact with private areas.

Apify Actor page for UGAP Website Scraper showing the overview — UGAP Scraper on Apify Store

Prerequisites

This isn't a beginner's guide to web scraping. I’m assuming you’re comfortable with Node.js and you’ve used Apify or Crawlee before. What you need:

Node.js 18+
Basic familiarity with Crawlee routers and the Apify SDK

Useful docs I reference in this article:

I'm not going to mention proxy settings, dataset handling or any logging in this article. The only focus is on the performance.

Where I started: PlaywrightCrawler

When I created the project, I used the Playwright-based Actor template. To be honest, I hadn't yet looked closely at how UGAP loaded product details - I just defaulted to what I knew. My reasoning at the time was simple: I wanted robustness. If the site rendered prices via JavaScript, a browser would be the easiest path. I also wanted a single scraping approach for both search and product pages.

So my initial stack looked like this:

PlaywrightCrawler in src/main.ts
createPlaywrightRouter() in src/routes.ts
Extractors built around Playwright’s Page (so everything was async and browser-driven)

It worked. I could run a small input and get correct results. But performance could be better.

The performance problem with Playwright

With a browser, each “request” is a navigation inside a real Chromium instance. Even if you technically increase concurrency, you’re fighting CPU usage, RAM usage, and execution time.

And per-request latency is dominated by overhead. For every page, Playwright needs to create or reuse a browser context, navigate to the page, wait for the load state (I was using networkidle in the product handler), and evaluate selectors inside the browser context.

All of this added up, and the Actor was slow. Here's what the migration looked like in numbers.

Before (with Playwright):

After (with Cheerio):

The execution time dropped from 3 minutes to 14 seconds. As you can see, the memory usage is considerably lower with Cheerio.

Dry January: static is a good fit for Cheerio

As I mentioned earlier, the static nature of the website made CheerioCrawler a good alternative to test. Sure, it would require me to rewrite some parts of the Actor, but it would be worth it. I treated it as an experiment.

If you’ve never migrated from a browser crawler to Cheerio, here’s the core mental shift: with Playwright, you “drive” a browser like a user; with Cheerio, you just parse HTML as text.

The migration: what I changed in practice

I touched very few files:

package.json (and package-lock.json)
src/main.ts
src/routes.ts
src/utils/extractors.util.ts (where the logic for extracting data from pages is)

1) Dependencies: removing Playwright from the runtime

Before the migration, the Actor depended on:

@crawlee/playwright
playwright
A postinstall that installed Playwright browsers - a heavy install that slowed down every build.

After the migration, the only runtime dependency was @crawlee/cheerio.

This is the part people underestimate. Removing Playwright isn’t just about speed - it changes package size, installation time, and build time. And the main advantage is that you can now iterate faster on your Actor.

2) `main.ts`: switch `PlaywrightCrawler` → `CheerioCrawler`

With Playwright, my main entry was essentially:

Create proxy config
New PlaywrightCrawler({ requestHandler: router, ... })
Run the start URLs

The Cheerio version has the same structure, just with different options.

Before:

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    requestHandler: router,
    launchContext: {
        launchOptions: {
            args: [
                '--disable-gpu', 
            ],
        },
    },
});

After:

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    requestHandler: router,
    minConcurrency: 10,
    maxConcurrency: 50,
    requestHandlerTimeoutSecs: 30,
    maxRequestRetries: 3,
});

3) `routes.ts`: switch `page` → `$`

The handler signature changes a little bit. So there was some code to update.

Search handler: browser evaluation → HTML link parsing

In Playwright, I used:

page.waitForSelector('a[href*="-p"]')
page.$$eval('a[href*="-p"]', ...) to get product links

Before:

router.addHandler('search', async ({ request, page, enqueueLinks, log }) => {
    await page.waitForSelector('a[href*="-p"]', { timeout: 10000 })
    .catch(() => {
        log.warning('No product links found on search results page', { url: request.loadedUrl });
    });
});

In Cheerio, no need to wait for the page to load. I switched to:

router.addHandler('search', async ({ request, $, enqueueLinks }) => {
    const productLinks: string[] = [];
    $('a[href*="-p"]').each((_, element) => {
        const href = $(element).attr('href');
        const absoluteUrl = new URL(href, request.loadedUrl).href;
        productLinks.push(absoluteUrl);
    });
});

`extractors.util.ts`: async calls → synchronous calls

In the Playwright version, my extractors looked like:

extractTitle(page): Promise<string>
extractPrices(page): Promise<{ priceHT, priceTTC }>

After migrating, the extractors became synchronous:

extractTitle($): string
extractPrices($): { priceHT, priceTTC }

Same logic as applied in the routes.ts file.

Why the speedup hit ~10x

By speedup, I mean crawl throughput: how many pages (search + product) I can process per unit time on the same budget. That throughput is a product of two things:

Per-request latency
Safe concurrency

On static sites, Playwright moves both in the wrong direction.

Per-request latency: browser tax vs. raw HTML fetch

On a static page:

An HTTP fetch + HTML parse can take tens of milliseconds to a few hundred milliseconds.
A Playwright navigation often lands in the seconds range once you include load state, network settling, and page lifecycle overhead.

Even if the server is fast, the browser workflow is doing more work than needed.

Concurrency: 50 parallel HTTP requests vs. a few pages at once

With CheerioCrawler, running maxConcurrency: 50 is normal. With PlaywrightCrawler, 50 concurrent pages is usually a stress test, not a default. You can do it, but you’ll pay in memory, CPU, and instability.

So the “10x” story is not one knob. It’s the combination:

Lower latency per request
Much higher safe concurrency
Lower memory footprint, which keeps autoscaling happy
Fewer timeouts and retries, which reduces wasted work

What surprised me

A few things stood out.

The migration was small

Switching from Playwright to Cheerio touched fewer than 200 lines of code.

See the git diff here:

Switch the crawler from PlaywrightCrawler to CheerioCrawler
Switch the router from page to $
Switch the extractors from async to synchronous

Synchronous router is much simpler

Playwright routers often include waiting logic:

Wait for the selector
Wait for the load state
Handle partial renders

Once I moved to Cheerio, those concerns disappeared. The page is either in the HTML, or it isn’t.

The build got a lot lighter

Removing postinstall browser setup is underrated. It reduced friction immediately: The final image is around 450MB, down from about 2.3GB.

What I’d do differently next time

Two things I'd change on the next Actor:

I would validate “static vs. dynamic” on day zero
I would design for easier “crawler swap”, keeping separation of concerns in mind

Conclusion

I didn’t switch from Playwright to Cheerio because Playwright is bad. I switched because it was the wrong tool for this specific job.

UGAP serves the product data I need in static HTML. That fact makes CheerioCrawler the natural fit, as it removes the browser overhead, unlocks real concurrency, and cuts memory dramatically.

If you’re building Actors and you want a free performance upgrade, here’s my recommendation:

Start with website inspection: static or dynamic?
Use the lightest crawler that can do the job.
Keep your extraction logic modular so you can switch later.

That’s how my Actor is now 10x faster, thinner, and less RAM hungry: all thanks to picking the right tool.

FAQ

Should I always replace `PlaywrightCrawler` with `CheerioCrawler`?

No. Only do it when your target pages are server-rendered and the fields you need exist in the raw HTML.

How do I know if the content is in static HTML?

Use view source or fetch the page HTML directly (curl, HTTP client). If the fields you need are missing and only appear after JS runs, you likely need Playwright.

You can also test it by disabling JavaScript in your browser and checking if the fields you need are still present. In Chrome, you can do this in the settings as shown below:

Can I mix both approaches?

Yes. A hybrid approach can be the best of both worlds:

CheerioCrawler for most pages
PlaywrightCrawler for a small subset that truly requires interaction or JS rendering

Your Docker build image will be considerably bigger, but it can be worth it if you need to scrape a lot of pages that are static and some that are dynamic.

What’s the biggest win from switching?

Lower RAM
Higher concurrency
Fewer timeouts
Lighter builds

Is the UGAP Actor available publicly?

Yes, you can find it on Apify Store: UGAP Website Scraper

What do the results look like?

Here's a preview of the dataset with extracted title, prices, reference code, and image:

Dataset preview showing extracted title, prices, reference, and image