Playwright is a rising star in the web scraping and automation world. Thought Puppeteer was powerful? Playwright will blow your mind.
Playwright is a browser automation library very similar to Puppeteer. Both allow you to control a web browser with only a few lines of code. The possibilities are endless, from automating mundane tasks and testing web applications to data mining.
With Playwright, you can run Firefox and Safari (WebKit), not only Chromium-based browsers. It will also save you time because Playwright automates away repetitive code, such as waiting for buttons to appear on the page.
In this tutorial, you’ll learn how to:
- Start a browser with Playwright
- Click buttons and wait for actions
- Extract data from a website
To showcase the basics of Playwright, we will create a simple scraper that extracts data about GitHub Topics. You’ll be able to select a topic and the scraper will return information about repositories tagged with this topic.
We will use Playwright to start a browser, open the GitHub topic page, click the Load more button to display more repositories, and then extract the following information:
- Number of stars
- List of repository topics
To use Playwright you’ll need Node.js version 14 or higher and a package manager. We’ll use npm, which comes preinstalled with Node.js. You can confirm both are installed on your machine by running:
node -v && npm -v
If you’re missing either Node.js or npm, visit the installation tutorial to get started.
Now that we know our environment checks out, let’s create a new project and install Playwright.
mkdir playwright-scraper && cd playwright-scraper
npm init -y
npm i playwright
The first time you install Playwright, it will download browser binaries, so the installation may take a bit longer.
Building a scraper
In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor. First, we will confirm that Playwright is correctly installed and working by running a simple script.
Now run it using your code editor or by executing the following command in your project folder:
node scraper.js
If you saw a Chromium window open and the GitHub Topics page successfully loaded, congratulations, you just robotized your web browser with Playwright!
Loading more repositories
When you first open the topic page, the number of displayed repositories is limited to 20. You can load more by clicking the Load more… button at the bottom of the page.
There are two things we need to tell Playwright to load more repositories:
- Click the Load more… button.
- Wait for the repositories to load.
Clicking buttons is extremely easy with Playwright. By prefixing text= to a string you’re looking for, Playwright will find the element that includes this string and click it. It will also wait for the element to appear if it’s not rendered on the page yet.
This is a huge improvement over Puppeteer and it makes Playwright lovely to work with.
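As a rough sketch, the click comes down to a single call. The clickLoadMore wrapper is a hypothetical helper added here for illustration. It expects a Playwright page that already has the topic page open, and it assumes the button label contains the words Load more.

```javascript
// Hypothetical helper: clicks the button whose text contains "Load more".
// `page` is a Playwright Page that already shows the topic page.
async function clickLoadMore(page) {
  // The text= prefix makes Playwright look for an element containing this
  // string; page.click() also waits for that element to appear first.
  await page.click('text=Load more');
}
```

Inside your script you would call it as await clickLoadMore(page) once the topic page has loaded.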
After clicking, we need to wait for the repositories to load. If we didn’t, the scraper could finish before the new repositories show up on the page and we would miss that data.
page.waitForFunction() allows you to execute a function inside the browser and wait until the function returns a truthy value.
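Here is a sketch of how this could look for our page. It counts article.border elements, which we take to be the repository cards, and waits until there are more of them than before the click. The waitForMoreRepos helper and its previousCount parameter are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: resolves once the page shows more repository cards
// than it did before the click. `page` is a Playwright Page.
async function waitForMoreRepos(page, previousCount) {
  await page.waitForFunction(
    // Runs inside the browser, so `document` is the page's DOM.
    (count) => document.querySelectorAll('article.border').length > count,
    // Second argument: a value passed from Node.js to the function above.
    previousCount,
  );
}
```

Passing the previous count in as an argument avoids hard-coding how many repositories the page shows at first.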
To find the article.border selector, which matches the repository cards, we used browser Dev Tools, which you can open in most browsers by right-clicking anywhere on the page and selecting Inspect. It means: Select the <article> tag with the border class.
Let’s plug this into our code and do a test run.
If you watch the run, you’ll see that the browser first scrolls down and clicks the Load more… button, which changes the text into Loading more. After a second or two, you’ll see the next batch of 20 repositories appear. Great job!
Now that we know how to load more repositories, we will extract the data we want. To do this, we’ll use the page.$$eval() function.
It works like this:
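Here is a minimal sketch limited to the star count and the topic list. The selectors inside the callback (#repo-stars-counter-star and a.topic-tag) are assumptions about GitHub's markup and may need adjusting after a look in Dev Tools. extractRepos is a hypothetical helper name.

```javascript
// Hypothetical helper: returns star counts and topics for every repo card.
// `page` is a Playwright Page showing a GitHub topic page.
async function extractRepos(page) {
  return page.$$eval('article.border', (repoCards) => {
    // This callback runs in the browser; repoCards is an Array of elements.
    const toText = (element) => (element ? element.textContent.trim() : null);
    return repoCards.map((card) => {
      // Assumed selectors; inspect the page if they stop matching.
      const stars = card.querySelector('#repo-stars-counter-star');
      const topics = card.querySelectorAll('a.topic-tag');
      return {
        stars: toText(stars), // kept as displayed text, not parsed to a number
        topics: Array.from(topics, toText),
      };
    });
  });
}
```

The toText helper is defined inside the callback on purpose: the callback is sent to the browser, so it cannot see variables from the surrounding Node.js code.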
page.$$eval() finds our repositories and executes the provided function in the browser. We get repoCards, which is an Array of all the repo elements. The return value of the function becomes the return value of the page.$$eval() call. Thanks to Playwright, you can pull data out of the browser and save it to a variable in Node.js. Magic!
If you’re struggling to understand the extraction code itself, be sure to check out this guide on working with CSS selectors and this tutorial on using those selectors to find HTML elements.
And here’s the code with extraction included. When you run it, you’ll see 40 repositories with their information printed to the console.
In this tutorial we learned how to start a browser with Playwright and control its actions with some of Playwright’s most useful functions:
- page.click() to emulate mouse clicks
- page.waitForFunction() to wait for things to happen
- page.$$eval() to extract data from a browser page
But we’ve only scratched the surface of what’s possible with Playwright. You can log into websites, fill forms, intercept network communication, and most importantly, do all of it in Chromium, Firefox, or WebKit. Where will you take this project next? How about turning it into a command-line interface (CLI) tool that takes a topic and a number of repositories as input and outputs a file with the repositories? You can do it now. Happy scraping!