Scraping single-page applications with Playwright

Automate browser interactions, wait for dynamic content, intercept API requests, and extract data using the DOM API.


Extracting data from single-page applications (SPAs) has always been a challenge for web scrapers. Unlike traditional websites, SPAs generate content dynamically and asynchronously, which makes it harder to retrieve data with conventional scraping methods. Playwright is a powerful browser automation tool that can scrape SPAs efficiently. In this article, we’ll show you how.

What is Playwright?

Playwright is an open-source Node.js library that automates Chromium, Firefox, and WebKit through a single API. Although it was first released by Microsoft only in 2020, it has quickly become a popular tool for developers: it lets you write scripts that interact with web pages in a human-like way and gives you automated control of a browser in just a few lines of code. A more powerful and full-featured successor to Puppeteer, Playwright offers rich capabilities for web scraping and automation, including network interception, emulation of user interactions, and screenshot capture.

How to scrape SPAs with Playwright

To scrape data from a single-page application with Playwright, we need to understand how a SPA works.

Unlike traditional web applications, SPAs dynamically generate content using JavaScript. When a user requests a page, the server returns a static HTML file that contains the SPA's basic layout and some JavaScript code. That JavaScript then fetches data from the server and generates the page's content dynamically. This means the data we want to scrape is not available in the initial HTML file but is generated later by the SPA.
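To make this concrete, here's a minimal sketch of the client-side rendering step described above. The data shape and markup are hypothetical, but the pattern is the one that matters to a scraper: the HTML the server sends is empty, and the content appears only after JavaScript turns fetched JSON into markup.

```javascript
// A minimal sketch of the SPA pattern (hypothetical data shape): the server
// returns an empty shell, and client-side code turns fetched JSON into the
// markup a scraper actually wants.
const renderArticle = (article) => {
    // In a real SPA this string would be inserted into the DOM, e.g.
    // document.querySelector('#app').innerHTML = renderArticle(data);
    return `<div><h1>${article.title}</h1><p>${article.body}</p></div>`;
};

// Example: the JSON the client fetches after the initial page load
const data = { title: 'Hello SPA', body: 'Rendered on the client.' };
console.log(renderArticle(data));
```

None of this markup exists in the initial HTML response, which is exactly why a plain HTTP scraper comes back empty-handed.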

To scrape data from a SPA, we need to emulate a user's interactions with the page. We have to load the page, wait for the SPA to generate the data, and then extract the data we want.

Switch to headless Chrome

Traditional scraping tools are designed to work with static web pages. Single-page applications, on the other hand, dynamically load content and update the page without requiring a full page reload. This can make it difficult to extract data using traditional scraping methods.

Switching to Playwright with headless Chrome provides a more powerful and flexible way to scrape SPAs. You can automate browser interactions, wait for dynamic content updates, intercept API requests, and extract data using the DOM API. This lets you extract data from SPAs that would otherwise be difficult or impossible to scrape with traditional tools.


➡️ This Playwright crawler example demonstrates how to use PlaywrightCrawler in combination with RequestQueue to recursively scrape the Hacker News website using headless Chrome / Playwright.


Here's an example of how to use Playwright with headless Chrome to scrape data from a single-page application:

const { chromium } = require("playwright");

(async () => {
    const browser = await chromium.launch();
    const page = await browser.newPage();

    // Navigate to the single-page application
    await page.goto("https://example.com/spa");

    // Wait for the content to load
    await page.waitForSelector("div > h1");

    // Extract the data using the DOM API
    const data = await page.evaluate(() => {
        const element = document.querySelector("div > h1");
        return element.innerText;
    });

    console.log(data);

    await browser.close();
})();

In this example, we use Playwright's chromium module to launch headless Chrome (headless mode is the default when no options are passed to launch()). We then create a new page and navigate to the single-page application. We use the waitForSelector() method to wait for the content to load, then use the DOM API to extract the data we want. Finally, we log the data to the console and close the browser.

In real-world scenarios, we’d likely need to modify this example to handle the common problems associated with scraping SPAs. Let’s go through a few of them.

Common problems and solutions

Problem 1: Content loading asynchronously

In single-page applications, content is often loaded asynchronously using AJAX requests or WebSocket connections. This means that the data we want to scrape may not be available when the page loads. To solve this problem, we need to wait for the content to load before extracting the data we want.

Solution: Waiting for the content to load

Playwright provides a waitForSelector() method that waits for a specific selector to appear on the page. We can use this method to wait for the element containing the data we want to appear on the page.

await page.waitForSelector('#content');

We can also use the waitForFunction() method to wait for a JavaScript function to return a truthy value. This can be useful if we need to wait for a specific condition to be met before scraping the data.

await page.waitForFunction(() => window.myApp.isDataLoaded);
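Conceptually, waitForFunction() keeps re-evaluating the predicate inside the page until it returns a truthy value or a timeout expires. The sketch below is a hypothetical Node-side helper that illustrates the same polling idea, not Playwright's actual implementation:

```javascript
// Hypothetical polling helper, roughly analogous to what waitForFunction()
// does: keep evaluating a predicate until it's truthy or a timeout expires.
const waitUntil = async (predicate, { timeout = 30000, interval = 100 } = {}) => {
    const deadline = Date.now() + timeout;
    while (Date.now() < deadline) {
        const result = await predicate();
        if (result) return result;
        // back off briefly before checking again
        await new Promise((resolve) => setTimeout(resolve, interval));
    }
    throw new Error(`waitUntil: condition not met within ${timeout} ms`);
};
```

In real scripts, prefer the built-in waitForFunction(): it evaluates the predicate inside the page context, so it can see variables like window.myApp directly.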

Problem 2: Dynamic content updates

In single-page applications, content may be updated dynamically without refreshing the page. This can be challenging when scraping the data, as the content we want may not be available when the page loads.

Solution: Observing mutations

The browser's MutationObserver API can be used to observe changes to the DOM. Since Playwright's page.evaluate() runs code inside the page, we can use it to set up an observer and detect when the content we want has been added to the page.

const observeMutations = () => {
    return new Promise((resolve) => {
        const observer = new MutationObserver((mutations) => {
            observer.disconnect(); // stop observing after the first batch of mutations
            resolve(mutations.map((mutation) => mutation.type)); // return the mutation types
        });
        // start observing the whole document for added/removed nodes
        observer.observe(document, { childList: true, subtree: true });
    });
};

const mutationTypes = await page.evaluate(observeMutations);

You can learn more about the kind of information you can extract from the mutations array in the MDN Web Docs.
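Note that page.evaluate() can only return serializable data, so what comes back to Node is an array of plain values rather than live MutationRecord objects. Once you have it, a small helper can summarize the records before you decide what to extract. A sketch, assuming the array of type strings returned by observeMutations above:

```javascript
// Tally mutation types from the serialized records returned by
// observeMutations, e.g. ['childList', 'childList', 'attributes']
// becomes { childList: 2, attributes: 1 }.
const countMutationTypes = (types) =>
    types.reduce((counts, type) => {
        counts[type] = (counts[type] || 0) + 1;
        return counts;
    }, {});

console.log(countMutationTypes(['childList', 'childList', 'attributes']));
// { childList: 2, attributes: 1 }
```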

However, a much more straightforward solution is to use the waitForFunction() method to wait for a specific condition to be met before scraping the data.

await page.waitForFunction(() => window.myApp.isDataLoaded);

Problem 3: AJAX requests and APIs

Single-page applications often use AJAX requests and APIs to fetch data from the server. Scraping this data can be challenging, as it may not be available when the page loads.

Solution: Intercepting network requests

Playwright provides a route() method that can be used to intercept network requests. We can use this method to intercept the AJAX requests and APIs used by the SPA and return the data we want.

await page.route('**/api/data', (route) => {
  route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ data: 'my data' }),
  });
});
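In practice, a route handler often has to decide per request whether to mock, block, or pass through. Keeping that decision in a small pure function makes it easy to test. The endpoint path and blocked file extensions below are assumptions for illustration:

```javascript
// Hypothetical triage for a route handler: mock our API endpoint, block
// heavy image downloads to speed up scraping, and let everything else through.
const routeAction = (url) => {
    if (new URL(url).pathname === '/api/data') return 'fulfill';
    if (/\.(png|jpe?g|gif|webp|svg)(\?.*)?$/.test(url)) return 'abort';
    return 'continue';
};

// Sketch of wiring it into Playwright:
// await page.route('**/*', (route) => {
//     const action = routeAction(route.request().url());
//     if (action === 'fulfill') return route.fulfill({ status: 200, body: '{}' });
//     return action === 'abort' ? route.abort() : route.continue();
// });
```

Blocking images and other heavy assets this way is a common trick to make headless scraping noticeably faster.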

We can also use the waitForResponse() method to wait for a specific network response before scraping the data.

const response = await page.waitForResponse((response) =>
  response.url().endsWith('/api/data')
);
const data = await response.json();
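Once you have the parsed response, you can often skip the DOM entirely and work with the JSON, which tends to be cleaner and more stable than scraping rendered markup. A sketch of picking out the fields you care about (the payload shape here is an assumption):

```javascript
// Hypothetical shape of the JSON an SPA's /api/data endpoint might return.
// Extract only the fields we want, dropping anything internal.
const extractItems = (payload) =>
    (payload.items || []).map(({ id, title }) => ({ id, title }));

const payload = {
    items: [
        { id: 1, title: 'First', internal: 'x' },
        { id: 2, title: 'Second', internal: 'y' },
    ],
};
console.log(extractItems(payload));
```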

Let’s put it all together

Let’s update our first code example so that it waits for dynamic content updates and intercepts AJAX requests and APIs.

The code below is only intended for educational purposes and is not a fully functional piece of code. In order to turn it into a functional script, you should apply the learnings from this tutorial and replace the relevant fields with code applicable to your particular use case.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Intercept AJAX requests and APIs before navigating, so the route
  // also catches requests fired while the page is loading
  await page.route('**/api/data', (route) => {
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ data: 'my data' }),
    });
  });

  // Navigate to the single-page application
  await page.goto('https://example.com/spa');

  // Wait for the content to load
  await page.waitForSelector('#content');

  // Wait for dynamic content updates
  await page.waitForFunction(() => {
    const element = document.querySelector('#content');
    if (element) {
      return element.innerText.includes('Data Loaded');
    }
    return false;
  });

  // Extract the data using the DOM API
  const data = await page.evaluate(() => {
    const element = document.querySelector('#content');
    return element.innerText;
  });

  console.log(data);

  await browser.close();
})();

In this final example, we first register a route with the route() method, before navigating. The order matters: a route only affects requests made after it's set up, so registering it up front ensures we also catch the AJAX calls the SPA fires while the page is loading. Here we intercept requests to the /api/data endpoint and return a dummy response containing the data we want to scrape.

Next, we navigate to the SPA and wait for the content to load using the waitForSelector() method.

We then wait for dynamic content updates using the waitForFunction() method. In this example, we wait for the element with the ID content to contain the text Data Loaded. This ensures that the dynamic content we’re interested in has been loaded and rendered by the SPA.

Finally, we extract the data we want using the DOM API and log it to the console. Once the scraping is complete, we close the browser using the close() method.

Now you’re ready to begin

This example should help you get started with using Playwright to scrape single-page applications. However, it's important to note that each SPA is unique and may require different strategies to extract the desired data. You may need to experiment and adapt the code as needed to effectively scrape a particular SPA.

You can learn more about web scraping with Playwright below.

Further reading

🔖 How to scrape the web with Playwright
🔖 How to scale Puppeteer and Playwright
🔖 Playwright vs. Selenium: which one to use for web scraping?
🔖 Playwright vs. Puppeteer: which is better?

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
