Scrapy Playwright tutorial

The Scrapy Playwright library adds JavaScript rendering capabilities to Scrapy. Learn how to use it here.


What is Scrapy?

Scrapy is a fast and powerful Python web scraping framework that can be used to efficiently crawl websites and extract their data. However, Scrapy struggles with websites that rely heavily on JavaScript to render their content. That's where Playwright comes in.

The purpose of this article is to show you how to combine them.

🔖
New to web scraping with Python? Check out this Python web scraping tutorial.

What is Playwright?

Playwright is an open-source automation library that is a great tool for end-to-end testing and can also perform other tasks, such as web scraping.

By combining Scrapy's web crawling capabilities with Playwright's browser interaction, you can conveniently carry out complex web scraping tasks.

This tutorial will walk you through setting up Scrapy-Playwright and help you understand its basic commands and advanced features.

💡
Crawlee's PlaywrightCrawler

Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox, and WebKit browsers with Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

Setting up Scrapy with Playwright

Before creating a scraping task, you need to set up your working environment.

Step 1

Ensure you have Python installed on your computer. Run python --version or python3 --version to confirm.


Step 2

Choose a directory (e.g., Desktop, Documents, or Downloads), create a folder named scrapy-python for the project, and change into it.

The command below creates the folder and changes your working directory into it:

mkdir scrapy-python && cd scrapy-python

Step 3

Create a separate Python environment for your project with venv: run python3 -m venv myenv to create the environment, then source myenv/bin/activate to activate it (on Windows, run myenv\Scripts\activate instead).

Step 4

Install Scrapy (pip install scrapy), Playwright (pip install playwright), and the browser binaries using playwright install.

Step 5

Initialize a Scrapy project: scrapy startproject apify_sp. Change your directory to the newly created folder using cd apify_sp.

Step 6

Install scrapy-playwright. This library bridges the gap between Scrapy and Playwright. Install it using pip install scrapy-playwright.

📌
If python3 doesn't work for you while running the commands in the terminal/cmd, replace it with python. If you encounter other installation errors, look up the error on the GitHub issues pages of Scrapy, Playwright, and Scrapy-Playwright for similar issues and possible solutions.

Step 7

Open your project in your desired code editor, navigate to settings.py and add the configuration below if it's not there.

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = ... configures Scrapy to use an event loop provided by Twisted (an event-driven networking engine) that is compatible with Python's asynchronous I/O framework, asyncio. DOWNLOAD_HANDLERS = ... routes both HTTP and HTTPS requests through scrapy-playwright's download handler so they can be rendered in a real browser.

These settings are essential for making sure that both asynchronous operations from Playwright and Scrapy's architecture work together effectively.

Troubleshooting installation issues

If you encounter any installation issues, it might be due to any of the points highlighted below:

  • Virtual environment: keep every project separate. Create a folder and set up a virtual environment in it, and don't forget to activate it with source myenv/bin/activate.
  • Verify that your current Python version is compatible with the versions of Scrapy, Playwright, and Scrapy-Playwright that you're installing (run pip show scrapy playwright scrapy-playwright to see what's installed, or use the sanity check below).
  • Playwright requires additional dependencies and browser binaries, which don't come along with just running pip install playwright. You install these binaries by running playwright install.
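As a quick sanity check, you can confirm that the key packages import cleanly in the active environment. A minimal sketch, assuming the virtual environment from the steps above is active:

# sanity_check.py - verify the environment before running any spiders
import scrapy
import scrapy_playwright
import playwright

print('Scrapy version:', scrapy.__version__)
print('All imports succeeded - the environment looks good.')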
🔖
Scrapy vs. Crawlee

Learn when to use Scrapy and when it would be better to use Crawlee instead.

Basic Scrapy Playwright usage

After setting up your project, verify that everything works. Below is a basic script that tests your project setup.

To create this script, open the spiders folder, create a file called event.py, and paste the code below into it.


To run the script, open your terminal or command prompt. Navigate to the directory where scrapy.cfg is.

If you're following this article's project structure, you should be in the scrapy-python/apify_sp directory.

Run scrapy crawl clickable -o output.json. This will run the spider and save the results in a file named output.json.

from playwright.async_api import Dialog, Response as PlaywrightResponse
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class EventsSpider(Spider):
    """Handle page events and extract the first five link URLs."""

    name = "clickable"

    def start_requests(self):
        yield Request(
            url="https://apify.com/store",
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for at least one <a> tag to be loaded
                    PageMethod("wait_for_selector", "a"),
                ],
                # Attach the spider methods below to Playwright page events
                "playwright_page_event_handlers": {
                    "dialog": "handle_dialog",
                    "response": "handle_response",
                },
            },
            callback=self.parse
        )

    async def handle_dialog(self, dialog: Dialog) -> None:
        self.logger.info(f"Handled dialog with message: {dialog.message}")
        await dialog.dismiss()

    async def handle_response(self, response: PlaywrightResponse) -> None:
        self.logger.info(f"Received response with URL {response.url}")

    async def parse(self, response, **kwargs):
        # Use the Playwright page object included in the response
        page = response.meta["playwright_page"]

        # Find all <a> tags on the page
        links = await page.query_selector_all("a")
        link_urls = []

        # Take up to the first five links
        for link in links[:5]:
            url = await link.get_attribute("href")
            link_urls.append(url)

        # Close the page to free browser resources
        await page.close()

        return {"first_five_link_urls": link_urls}

This script visits https://apify.com/store, waits for at least one anchor tag to load, extracts the first five link URLs, and uses the registered page event handlers to log responses and dismiss any dialogs that appear along the way.

📌
The use of asynchronous methods (async) ensures that the spider can perform multiple operations concurrently.

Examples of using Playwright's API

Playwright's API allows you to interact with web elements in a very precise and controlled manner. Below are some basic examples demonstrating how to use Playwright's API to interact with different types of elements on a web page:

1. Navigating to a URL

from playwright.async_api import async_playwright

async def navigate_to_url():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        
        await page.goto('https://apify.com')
        
        await browser.close()

2. Typing text into an input field

async def type_into_field():
    async with async_playwright() as ap:
        browser = await ap.chromium.launch()
        page = await browser.new_page()
        
        await page.goto('https://apify.com/store')
        
        # Fill the search field which has the `data-test` attribute
        await page.fill('[data-test="actor-store-search"]', 'playwright')
        await browser.close()

3. Extracting text from an element

async def get_text():
    async with async_playwright() as ap:
        browser = await ap.chromium.launch()
        page = await browser.new_page()
        
        await page.goto('https://apify.com/store')
        
        # Locate the <p> element with the class 'storeDescription'
        text = await page.text_content('p.storeDescription')
        print(text)
        await browser.close()

4. Taking a screenshot

async def take_screenshot():
    async with async_playwright() as ap:
        browser = await ap.chromium.launch()
        
        page = await browser.new_page()
        await page.goto('https://blog.apify.com')
        
        # Saves a screenshot of the page as apify.png
        await page.screenshot(path='apify.png')
        await browser.close()
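Each of the snippets above defines a coroutine, so it needs an event loop to run. A minimal way to execute any of them, using the function names defined above:

import asyncio

# Run one of the example coroutines to completion
asyncio.run(take_screenshot())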

Advanced techniques in Scrapy-Playwright

Now let's go through different advanced techniques in scrapy-playwright that can be used to handle complex web scraping tasks efficiently.

Handling JavaScript-rendered sites with Playwright

Websites that rely heavily on JavaScript can be difficult to navigate because of their dynamic content-loading methods. Scrapy-playwright executes JavaScript, allowing elements to be fully loaded, similar to what a user's browser would display. This approach ensures that the extracted data matches what users actually see on their screens.

To effectively handle JavaScript-rendered sites, ensure that your Scrapy settings enable Playwright and configure the spider to use Playwright's browser capabilities:

import scrapy

class JavaScriptHeavySpider(scrapy.Spider):
    name = 'js_spider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://github.com/topics',
            meta={'playwright': True}
        )

    def parse(self, response):
        # The response HTML has been fully rendered by the browser
        self.logger.info(f"Rendered page length: {len(response.text)}")

Waiting for dynamic content with waitForSelector

Dynamic content that loads at different times (e.g., through JavaScript) can be managed using the wait_for_selector method (waitForSelector in Playwright's JavaScript API). It waits for a specific element to appear in the DOM, ensuring that the content is fully loaded before data extraction proceeds.

Here's how to implement wait_for_selector in a Scrapy Playwright spider:

from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        url='https://github.com/topics',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                PageMethod('wait_for_selector', selector='button.ajax-pagination-btn')
            ]
        }
    )

This code tells the spider to wait until the <button> with the class ajax-pagination-btn appears, indicating that the button is ready to be clicked.
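Once the selector resolves, you can chain further page methods in the same list, for example, clicking that button and waiting for the new content to settle. A sketch reusing the same hypothetical pagination selector:

'playwright_page_methods': [
    PageMethod('wait_for_selector', 'button.ajax-pagination-btn'),
    # Click the button once it is present...
    PageMethod('click', 'button.ajax-pagination-btn'),
    # ...then wait until network activity quiets down
    PageMethod('wait_for_load_state', 'networkidle'),
]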

Managing Playwright sessions and concurrency in Scrapy

Managing sessions and concurrency is important when handling multi-page websites or when scraping at scale. Playwright sessions should be handled carefully to avoid resource leakage and ensure that each browser instance is properly closed after use.

To manage concurrency, you can control the number of concurrent requests (and therefore Playwright pages) by setting CONCURRENT_REQUESTS in your Scrapy settings (settings.py).

# Adjust this number based on your system's capability
CONCURRENT_REQUESTS = 8 

Each request with playwright set to True in its meta is rendered in a browser page, which can be resource-intensive: scrapy-playwright reuses a single browser but opens a new page (and, if configured, a separate context) per request.
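scrapy-playwright also exposes its own settings for capping browser usage. A minimal sketch for settings.py (both settings come from the library's documentation; the values here are illustrative):

# Maximum number of browser contexts open at once
PLAYWRIGHT_MAX_CONTEXTS = 4

# Maximum number of pages open per context at once
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4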

It is also essential to properly close each Playwright page and browser session to free up resources. This can be done in the spider’s parse method or in dedicated middleware.

async def parse(self, response, **kwargs):
    # Requires 'playwright_include_page': True in the request meta
    page = response.meta['playwright_page']
    # Perform scraping tasks here
    await page.close()
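When a request includes the page like this, the scrapy-playwright documentation also recommends closing it in an errback so that failed requests don't leak pages:

def start_requests(self):
    yield scrapy.Request(
        url='https://apify.com',
        meta={'playwright': True, 'playwright_include_page': True},
        callback=self.parse,
        errback=self.errback_close_page,
    )

async def errback_close_page(self, failure):
    # Close the page even though the request failed
    page = failure.request.meta['playwright_page']
    await page.close()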

Implementing these advanced techniques will improve the capability to scrape complex and dynamic websites.

Utilizing Playwright features

Using Playwright in Scrapy brings more than just dynamic content handling to your project. It extends to other capabilities of Playwright, such as taking screenshots, running browsers in headless mode for efficiency, and automating interactions like form submissions.

Below are a few examples of how you can utilize Playwright's capabilities in scrapy-playwright:

Taking screenshots of web pages

Screenshots are useful for debugging, archiving, or even content verification purposes.

import scrapy
from scrapy_playwright.page import PageMethod

class ScreenshotSpider(scrapy.Spider):
    name = 'screenshot_blog'

    def start_requests(self):
        yield scrapy.Request(
            url="https://blog.apify.com",
            meta={
                'playwright': True,
                'playwright_page_methods': [
                    PageMethod('screenshot', path='apify_blog.png', full_page=True)
                ]
            }
        )

    def parse(self, response):
        # The screenshot has already been taken by the PageMethod above
        self.logger.info(f"Saved screenshot for {response.url}")

In this example, after navigating to "https://blog.apify.com", Playwright takes a full-page screenshot and saves it as apify_blog.png. This is especially helpful for ensuring that the page renders content as expected.

Running Playwright in headless mode

Headless mode means running the browser without a graphical user interface. This mode is significantly faster than the headful mode and is particularly advantageous for running scrapers on servers or in environments without a display.

In scrapy-playwright, the browser type and launch options (including headless mode) are configured in settings.py rather than per request:

# settings.py
PLAYWRIGHT_BROWSER_TYPE = 'chromium'
# Playwright runs headless by default; this just makes it explicit
PLAYWRIGHT_LAUNCH_OPTIONS = {'headless': True}

Requests themselves only need meta={'playwright': True}, as in the earlier examples.

This configuration is great for automated tasks where visual rendering is not needed, thereby enhancing performance and resource efficiency.
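Conversely, while developing a spider, it can help to watch the browser work. A sketch for local debugging (slow_mo is a standard Playwright launch option that delays each operation):

# settings.py - for local debugging only
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': False,  # show the browser window
    'slow_mo': 500,     # pause 500 ms between operations so you can follow along
}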

Automating form submissions and capturing AJAX data

Automating form submissions and handling AJAX-driven data is critical for interacting with modern web applications. Playwright can simulate user interactions such as filling out forms and clicking buttons, enabling the capture of dynamically loaded data.

Here's an example:

async def parse_form(self, response, **kwargs):
    # Requires 'playwright_include_page': True in the request meta (see below)
    page = response.meta['playwright_page']

    await page.fill('input[name="username"]', 'myusername')
    await page.fill('input[name="password"]', 'mypassword')
    await page.click('button#login')

    # Wait for the user data element to appear, where 'user-data' is the id of the div element
    await page.wait_for_selector('div#user-data')

    # Get the text content of the user data element
    user_data = await page.text_content('div#user-data')

    # Close the page
    await page.close()
    
    yield {'user_data': user_data}

In this code, the spider fills out the login form, submits it, and waits for the AJAX-loaded user data element to appear before extracting its text. The request that triggers this callback must include the Playwright page in its meta, as sketched below.
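A minimal companion start_requests for the callback above (the login URL is a placeholder):

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com/login',  # placeholder URL
        meta={
            'playwright': True,
            # Makes response.meta['playwright_page'] available in parse_form
            'playwright_include_page': True,
        },
        callback=self.parse_form,
    )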

Recap and further reading

In this article, you've learned how to create a Python environment, install Scrapy and Playwright, and configure the necessary settings in Scrapy to work with Playwright. Also, you’ve learned different approaches, with examples of how the two frameworks can operate together efficiently.

If you want to deepen your knowledge of Scrapy or Playwright for web scraping, explore the other tutorials below.

Ayodele Aransiola
Ayodele is a Developer Relations engineer with experience in a few other tech skills, such as frontend, technical writing, early-stage startup advisory, product management, and consulting.
