Scrapy Playwright tutorial (Step-by-step guide for 2024)

The Scrapy Playwright library adds JavaScript rendering capabilities to Scrapy. Learn how to use it here.


Scrapy is a fast and powerful Python web scraping framework that can be used to efficiently crawl websites and extract their data. But for websites that rely heavily on JavaScript to render their content, you need to combine it with a tool that can execute that JavaScript. That's where Playwright comes in.

Playwright is an open-source automation library that is a great tool for end-to-end testing and can also perform other tasks, such as web scraping.

By combining Scrapy's web crawling capabilities with Playwright's browser interaction, you can conveniently carry out complex web scraping tasks.

This tutorial will walk you through setting up Scrapy-Playwright and help you understand its basic commands and advanced features.

💡
Crawlee's PlaywrightCrawler

Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and Webkit browsers with Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.

Setting up Scrapy with Playwright

Before creating a scraping task, you need to set up your working environment.

Ensure you have Python installed on your computer. Run python --version or python3 --version to confirm.

Python3 for Scrapy-Playwright environment

Choose your desired directory (desktop, documents, or downloads), create a folder called scrapy-python for the project, and change your directory into it.

The command below creates the folder and changes your working directory into the folder.

# Create the project directory
mkdir scrapy-python

# Change to newly created directory
cd scrapy-python

Create a separate Python environment for your project: run python3 -m venv myenv to create a virtual environment and source myenv/bin/activate to activate it.

Install Scrapy (pip install scrapy), Playwright (pip install playwright), and the browser binaries using playwright install.

Initialize a Scrapy project: scrapy startproject apify_sp. Change your directory to the newly created folder using cd apify_sp.

Install scrapy-playwright. This library bridges the gap between Scrapy and Playwright. Install it using pip install scrapy-playwright.
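
For reference, here is the full setup sequence in one place (assuming a Unix-like shell and the folder and project names used in this tutorial):

# Create and activate a virtual environment
python3 -m venv myenv
source myenv/bin/activate

# Install Scrapy, Playwright, and the browser binaries
pip install scrapy playwright
playwright install

# Create the Scrapy project and move into it
scrapy startproject apify_sp
cd apify_sp

# Install the scrapy-playwright integration
pip install scrapy-playwright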

📌
If python3 doesn't work for you while running the commands in the terminal/cmd, replace it with python. If you encounter other installation errors, look up the error on the GitHub issues pages of Scrapy, Playwright, and Scrapy-Playwright for similar issues and possible solutions.
Scrapy-Playwright myenv

Open the project in your desired code editor, navigate to settings.py and add the configuration below.

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}

TWISTED_REACTOR configures Scrapy to use a specific event loop provided by Twisted (an event-driven networking engine) that is compatible with Python’s asynchronous I/O framework, asyncio. Meanwhile, the DOWNLOAD_HANDLERS settings tell Scrapy to use Playwright’s download handler for both HTTP and HTTPS requests.

These settings are essential for making sure that both asynchronous operations from Playwright and Scrapy's architecture work together effectively.

🔖
Scrapy vs. Crawlee

Learn when to use Scrapy and when it would be better to use Crawlee instead.

How to use Scrapy Playwright

After setting up your project, you should verify that everything is configured correctly. Below is a basic script that tests your project setup.

To create this script, open the spiders folder, create a file called event.py, and paste the code below into it.

Scrapy-Playwright Spiders

To run the script, open your terminal or command prompt. Navigate to the directory where scrapy.cfg is.

If you're following this article's project structure, you should be in the scrapy-python/apify_sp directory.

Run scrapy crawl clickable -o output.json. This will run the spider and save the results in a file called output.json.

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class EventsSpider(Spider):
    """Handle page events and extract the first clickable link."""
    
    name = 'clickable'

    def start_requests(self):
        yield Request(
            url='https://apify.com/store',
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
                "playwright_page_methods": [
                     # Wait for the search input to be loaded
                    PageMethod("wait_for_selector", 'input[data-test="actor-store-search"]'),
                    PageMethod("fill", 'input[data-test="actor-store-search"]', 'tiktok'),
                    PageMethod("click", 'input[data-test="actor-store-search"] + svg'),
                    # Wait 1 second for the next page to load
                    PageMethod("wait_for_timeout", 1000),
                ],
            },
            callback=self.parse
        )

    async def parse(self, response, **kwargs):
        # Use the Playwright page object included in the response
        page = response.meta['playwright_page']

        # Find all Actor Cards on the page
        actor_cards = await page.query_selector_all('a[data-test="actor-card"]')
        actor_data = []

        # Get data from the first five Actors
        for actor_card in actor_cards[:5]:
            url = await actor_card.get_attribute('href')
            actor_name_element = await actor_card.query_selector('div.ActorStoreItem-title h3')
            actor_name = await actor_name_element.inner_text()
            actor_data.append(
                {
                    "url": url,
                    "actor_name": actor_name
                }
            )

        yield {'actors_data': actor_data}
        
        # Close the page
        await page.close()

This script starts by visiting https://apify.com/store, where it waits for the search input to show up, types in "tiktok," and clicks the search button. After a quick pause to let the page load any new content, it grabs the URLs and names of the first five resulting Actors it finds and yields the extracted data as a Python dictionary.

Even though some of the syntax looks familiar if you’ve used the Playwright API before, scrapy-playwright tweaks things a bit to fit nicely within the Scrapy framework. So next, we’ll break down each part into smaller, more manageable pieces to help understand what is happening in each section of the code.

📌
The use of asynchronous methods (async) ensures that the spider can perform multiple operations concurrently.

1. Navigating to a URL

from scrapy import Spider, Request

class SimpleSpider(Spider):
    name = 'clickable'

    def start_requests(self):
        yield Request(
            url='https://apify.com/store',
            meta={
                "playwright": True,
            },
            callback=self.parse
        )

    async def parse(self, response):
        # Handle the response after interacting with the page
        pass

The example above is a simplified version of what we did in the main code. Here, we’re just accessing the target website using scrapy_playwright. Notice the meta parameter with the option "playwright": True in the Request. This tells Scrapy that we want to use Playwright to load the page.

2. Typing text into an input field

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class EventsSpider(Spider):
    name = 'clickable'

    def start_requests(self):
        yield Request(
            url='https://apify.com/store',
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the search input to be loaded
                    PageMethod("wait_for_selector", 'input[data-test="actor-store-search"]'),
                    # Type some text in the input and click the search button
                    PageMethod("fill", 'input[data-test="actor-store-search"]', 'tiktok'),
                    PageMethod("click", 'input[data-test="actor-store-search"] + svg'),
                    # Wait 2 seconds for the next page to load
                    PageMethod("wait_for_timeout", 2000),
                ],
            },
            callback=self.parse
        )
        
    async def parse(self, response):
        # Handle the response after interacting with the page
        pass

Continuing from the previous example, this code takes things a step further by actually interacting with the page. It’s set up to visit https://apify.com/store, wait for the search input to appear, type “tiktok” into the search bar, and then click the search button.

A key part of making all this work is the PageMethod. In the context of scrapy-playwright, PageMethod allows you to define specific actions that Playwright will perform on the page. In this example, we use PageMethod to wait for the search input to load (wait_for_selector), fill in the input field with “tiktok” (fill), and then click the search button (click). We also add a brief delay (wait_for_timeout) to ensure the page has enough time to load the results.

Playwright Automation GIF
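
If you prefer, you can perform the same interactions directly on the Playwright page object inside parse instead of declaring them as PageMethod entries. Here is a minimal sketch of that approach, assuming "playwright_include_page": True is set in the request meta:

    async def parse(self, response, **kwargs):
        # Grab the Playwright page object included in the response
        page = response.meta['playwright_page']

        # Perform the same actions with native Playwright calls
        await page.wait_for_selector('input[data-test="actor-store-search"]')
        await page.fill('input[data-test="actor-store-search"]', 'tiktok')
        await page.click('input[data-test="actor-store-search"] + svg')
        await page.wait_for_timeout(2000)

        # Extraction logic would go here...
        await page.close()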

3. Extracting text from an element

# rest of the code

    async def parse(self, response, **kwargs):
        # Use the Playwright page object included in the response
        page = response.meta['playwright_page']

        # Find all Actor Cards on the page
        actor_cards = await page.query_selector_all('a[data-test="actor-card"]')
        actor_data = []

        # Get data from the first five Actors
        for actor_card in actor_cards[:5]:
            url = await actor_card.get_attribute('href')
            actor_name_element = await actor_card.query_selector('div.ActorStoreItem-title h3')
            actor_name = await actor_name_element.inner_text()
            actor_data.append(
                {
                    "url": url,
                    "actor_name": actor_name
                }
            )

        yield {'actors_data': actor_data}
        
        # Close the page
        await page.close()

When using scrapy-playwright, you’ll often want to extract text from an element within the parse method. You can easily do this by using a variation of page.query_selector("<selector>") that best fits your needs.

In our parse method, we start by grabbing the Playwright page object from response.meta['playwright_page']. This page object is super important because it lets us interact directly with the fully rendered page.

Next, we use page.query_selector_all() to find all the <a> tags that match 'a[data-test="actor-card"]'. These tags are the links to different Actor pages on the site. We loop through the first five of these links, using get_attribute('href') to pull the URL from each one and query_selector() plus inner_text() to grab the Actor name, storing both in a list called actor_data.

Finally, we yield an actors_data dictionary with the first five Actor URLs and names. You can save the data to a JSON file by adding -o output.json to your crawl command, like this: scrapy crawl clickable -o output.json.
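
Since scrapy-playwright hands Scrapy the fully rendered HTML, you could also extract the same data with Scrapy's own selectors instead of the Playwright page object. A rough sketch, reusing the selectors from the code above (here "playwright_include_page" can be omitted, since the page object isn't needed):

    async def parse(self, response, **kwargs):
        actor_data = []
        # Work directly on the rendered response with Scrapy selectors
        for actor_card in response.css('a[data-test="actor-card"]')[:5]:
            actor_data.append({
                "url": actor_card.attrib.get('href'),
                "actor_name": actor_card.css('div.ActorStoreItem-title h3::text').get(),
            })

        yield {'actors_data': actor_data}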

4. Taking a screenshot

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class EventsSpider(Spider):
    """Handle page events and extract the first clickable link."""
    
    name = 'clickable'

    def start_requests(self):
        yield Request(
            url='https://apify.com/store',
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
                "playwright_page_methods": [
                     # Wait for the search input to be loaded
                    PageMethod("wait_for_selector", 'input[data-test="actor-store-search"]'),
                    PageMethod("fill", 'input[data-test="actor-store-search"]', 'tiktok'),
                    PageMethod("click", 'input[data-test="actor-store-search"] + svg'),
                    # Wait 1 second for the next page to load
                    PageMethod("wait_for_timeout", 1000),  
                    # Take a screenshot
                    PageMethod("screenshot", path="apify_store_screenshot.png", full_page=False),
                ],
            },
            callback=self.parse
        )
        
    async def parse(self, response, **kwargs):
        # Data extraction logic...
        pass

As you can see from the code above, taking a screenshot with scrapy-playwright is pretty simple. You just use PageMethod as before, but this time with the "screenshot" option, specifying the path where you want to save the screenshot. You can also decide whether to capture the entire page or just a portion of it. Below you can see the screenshot generated by the script:

Apify Store Screenshot
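
Alternatively, if "playwright_include_page": True is set, you can take the screenshot from inside parse with the page object itself, which is handy when you want to capture the page after your extraction logic has run. A minimal sketch:

    async def parse(self, response, **kwargs):
        page = response.meta['playwright_page']
        # Data extraction logic...
        # Capture the visible viewport; pass full_page=True for the whole page
        await page.screenshot(path="apify_store_screenshot.png", full_page=False)
        await page.close()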

5. Handling JavaScript-rendered sites with Playwright

One of the main reasons to use scrapy-playwright is to scrape data from dynamic websites since Scrapy alone can’t render JavaScript.

If Playwright can load a dynamic webpage in a browser, you can scrape the data from it. To do this with scrapy-playwright, simply pass the meta dictionary to your Request and set "playwright": True, as shown in the code below.

from scrapy import Spider, Request

class SimpleSpider(Spider):
    name = 'dynamic'

    def start_requests(self):
        yield Request(
            url='https://phones.mintmobile.com/',
            meta={
                "playwright": True,
            },
            callback=self.parse
        )

    async def parse(self, response):
        # Handle the response after interacting with the page
        pass

6. Waiting for dynamic content

When scraping JavaScript-rendered content, it’s important to wait for the dynamic elements to load before grabbing the data.

With scrapy-playwright, this is easy to do. Just add the "playwright_page_methods" array to your Request, and use PageMethod to tell Playwright what to wait for — like using "wait_for_selector" with the element’s selector. This way, Playwright waits until the content is fully loaded before scraping, as shown in the example below.

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class EventsSpider(Spider):
    name = 'dynamic'

    def start_requests(self):
        yield Request(
            url='https://phones.mintmobile.com/',
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait for the phone listings to be loaded
                    PageMethod("wait_for_selector", 'div[data-product-category="Phones"]'),
                ],
            },
            callback=self.parse
        )
        
    async def parse(self, response):
        # Handle the response after interacting with the page
        pass
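
Waiting for a specific selector isn't your only option. Playwright also exposes wait_for_load_state, which you can call through PageMethod when you'd rather wait for network activity to settle, a reasonable fallback when there is no obvious element to wait on. The spider below is just an illustrative sketch:

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class NetworkIdleSpider(Spider):
    name = 'dynamic_idle'

    def start_requests(self):
        yield Request(
            url='https://phones.mintmobile.com/',
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait until there have been no network connections for at least 500 ms
                    PageMethod("wait_for_load_state", "networkidle"),
                ],
            },
            callback=self.parse
        )

    async def parse(self, response):
        # Handle the fully rendered response
        pass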

7. Scrolling down infinite scroll pages

To handle infinite scrolling, we need to evaluate the scroll height and create a loop that keeps scrolling until we reach the bottom of the page and can’t scroll any further.

Since scrapy-playwright’s PageMethod doesn’t support loops, we’ll need to use some native Playwright syntax and implement the scrolling logic directly in the parse function of our scraper. The example below shows how to scroll through Nike’s shoe collection until all the shoes are loaded.

from scrapy import Spider, Request

class EventsSpider(Spider):
    """Handle page events and extract the first clickable link."""
    
    name = 'nike'

    def start_requests(self):
        yield Request(
            url='https://www.nike.com/w/',
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
            },
            callback=self.parse
        )

    async def parse(self, response, **kwargs):
        # Use the Playwright page object included in the response
        page = response.meta['playwright_page']

        # Scroll the page to load more content
        previous_height = await page.evaluate("document.body.scrollHeight")
        while True:
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1000)  # Wait 1 second for new content to load
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == previous_height:
                break
            previous_height = new_height

        # Close the page
        await page.close()
Playwright Infinite Scroll Example
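
Scrolling by itself doesn't yield any data; in practice, you would follow the loop with extraction before closing the page. The fragment below sketches what that tail end of parse might look like. Note that 'div.product-card' and its child selector are hypothetical placeholders, so check the live page for the real ones:

        # After scrolling, extract the loaded products
        # NOTE: these selectors are placeholders; inspect the page for the real ones
        product_cards = await page.query_selector_all('div.product-card')
        products = []
        for card in product_cards:
            title_element = await card.query_selector('div.product-card__title')
            if title_element:
                products.append({'title': await title_element.inner_text()})

        yield {'products': products}

        # Close the page
        await page.close()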

8. Scraping multiple pages

When scraping data from websites with multiple pages, the goal is to automate the process of navigating through each page and extracting the relevant data.

With scrapy-playwright, you can easily interact with elements like "next" buttons to navigate from one page to the next. While this is just one technique for paginating through websites, it’s surprisingly effective in a wide range of scenarios despite its simplicity, as shown in the example below.

from scrapy import Spider, Request

class EventsSpider(Spider):
    name = 'pages'

    def start_requests(self):
        yield Request(
            url='https://apify.com/store/categories/ai',
            meta={
                "playwright": True,
                # Include the page object in the response
                "playwright_include_page": True,
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        # Scrape the current page
        async for item in self.scrape_page(page):
            yield item

        # Handle pagination
        while True:
            next_button = await page.query_selector('button[data-test="page-next-button"]')
            if next_button:
                # Stop when the next button is disabled (last page reached)
                if await next_button.is_disabled():
                    break
                await next_button.click()
                await page.wait_for_timeout(2000)  # Wait for 2 seconds for the next page to load
                async for item in self.scrape_page(page):
                    yield item
            else:
                break

        # Close the page to ensure resources are released
        await page.close()

    async def scrape_page(self, page):
        # Extract data from the current page
        actors = await page.query_selector_all('a[data-test="actor-card"]')
        for actor in actors:
            actor_name = await (await actor.query_selector('div.ActorStoreItem-title h3')).inner_text()
            actor_description = await (await actor.query_selector('p.ActorStoreItem-desc')).inner_text()
            developer_name = await (await actor.query_selector('p.ActorStoreItem-user-fullname')).inner_text()
            print(actor_name, actor_description, developer_name)
            yield {
                'name': actor_name,
                'description': actor_description,
                'developer': developer_name,
            }

The script starts by loading the initial page with Playwright enabled, ensuring the content fully renders. The parse function then scrapes data from the current page and checks for a “next” button. If the button is present and clickable, the script clicks it, waits for the next page to load, and continues scraping. This loop repeats until there are no more pages left, ensuring that you extract data from all pages.

9. Managing Playwright sessions and concurrency in Scrapy

Managing sessions and concurrency is important when handling multi-page websites or when scraping at scale. Playwright sessions should be handled carefully to avoid resource leakage and ensure that each browser instance is properly closed after use.

To manage concurrency, you can control how many requests (and therefore how many Playwright pages) Scrapy handles simultaneously by setting CONCURRENT_REQUESTS in your Scrapy settings file (settings.py). This setting determines how many requests Scrapy will handle at once, allowing you to adjust the load based on your system's capability.

In addition to CONCURRENT_REQUESTS, you can also set PLAYWRIGHT_MAX_PAGES_PER_CONTEXT to limit the number of concurrent Playwright pages.

# Adjust this number based on your system's capability
CONCURRENT_REQUESTS = 8

# Set the maximum number of concurrent Playwright pages
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4

Each request with "playwright": True in its meta opens a new browser page, which can be resource-intensive at scale.

It is also essential to properly close each Playwright page and browser session to free up resources. This can be done in the spider’s parse method or in dedicated middleware.

async def parse(self, response, **kwargs):
    page = response.meta['playwright_page']
    # Perform scraping tasks here
    await page.close()
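
When a request fails, the parse callback never runs, so a page opened for that request would stay open. A common pattern is to register an errback that closes the page in that case. Here is a minimal sketch, assuming the same meta keys used throughout this tutorial:

from scrapy import Spider, Request

class CleanupSpider(Spider):
    name = 'cleanup'

    def start_requests(self):
        yield Request(
            url='https://apify.com/store',
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
            errback=self.errback_close_page,
        )

    async def parse(self, response, **kwargs):
        page = response.meta['playwright_page']
        # Perform scraping tasks here
        await page.close()

    async def errback_close_page(self, failure):
        # Close the page even when the request fails
        page = failure.request.meta['playwright_page']
        await page.close()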

10. Running Playwright in headless mode

By default, scrapy-playwright runs Playwright in headless mode, so you won’t see your script as it executes. If you want to watch the script in action, you can change that by adding the following configuration to settings.py:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
}

11. Automating form submissions

Playwright can mimic user interactions like filling out forms and clicking buttons, making it possible to perform tasks like logging into a website and extracting data afterward.

In the example below, we use scrapy-playwright’s PageMethod to access a test website with a login form, fill in the credentials, and then take a screenshot after logging in to show that the automation worked.

from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class EventsSpider(Spider):
    name = 'form'

    def start_requests(self):
        yield Request(
            url='https://practicetestautomation.com/practice-test-login/',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Fill the username and password fields and click the submit button
                    PageMethod("fill", 'input#username', "student"),
                    PageMethod("fill", 'input#password', "Password123"),
                    PageMethod("click", 'button#submit'),
                    PageMethod("wait_for_selector", 'h1.post-title'),
                    PageMethod("screenshot", path="login.png", full_page=False),
                ],
            },
            callback=self.parse
        )

    async def parse(self, response):
        pass
Playwright login example

12. Capturing AJAX data

AJAX (Asynchronous JavaScript and XML) is a widely used technique that allows webpages to update specific parts without reloading the entire page by fetching data dynamically from the server. For example, that’s how social media sites load new content as you scroll, without needing to reload the entire page.

Capturing AJAX data can make your web scraping process much more efficient. Often, AJAX requests retrieve data in a well-structured format like JSON, meaning you might not need to scrape the HTML at all.

By directly accessing AJAX data from the server’s API, you can avoid the complexity of parsing HTML and work with organized, reliable data that's less prone to changes in the website's structure. In the example below, we access the Nike website, which uses AJAX to load new data as you scroll. Instead of scraping the data from the page, we'll capture the AJAX data from the server.

from scrapy import Spider, Request

class EventsSpider(Spider):
    name = 'nike'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.captured_ajax_data = []

    def start_requests(self):
        yield Request(
            url='https://www.nike.com/w/jordan-shoes-37eefzy7ok',
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        # Function to handle AJAX requests
        async def handle_route(route):
            request = route.request
            if "api.nike.com" in request.url:
                # Continue the request and get the response
                await route.continue_()
                response = await request.response()
                if response:
                    # Capture the AJAX data
                    ajax_data = await response.body()
                    self.captured_ajax_data.append(ajax_data)
                else:
                    self.logger.warning(f"No response for request: {request.url}")
            else:
                await route.continue_()

        # Intercept all network requests
        await page.route("**/*", handle_route)

        # Function to scroll the page and load more data
        async def scroll_page():
            previous_height = await page.evaluate("document.body.scrollHeight")
            while True:
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1000)  # Wait for 1 second
                new_height = await page.evaluate("document.body.scrollHeight")
                if new_height == previous_height:
                    break
                previous_height = new_height

        # Scroll the page to load more data
        await scroll_page()

        # Close the page
        await page.close()

        # Yield the captured AJAX data
        for ajax_data in self.captured_ajax_data:
            yield {'ajax_data': ajax_data.decode('utf-8')}  # Decode bytes to string

This spider starts by loading the Nike webpage using Playwright, which allows us to interact with the page and capture dynamic content.

Within the parse method, we define a handle_route function that intercepts all network requests made by the page. If an AJAX request is detected (in this case, requests to api.nike.com), the response data is captured and stored in self.captured_ajax_data.

The script then scrolls the page to trigger more AJAX requests, ensuring all dynamic content is loaded. After scrolling is complete, the captured AJAX data is yielded as the output of the spider, decoded from bytes to a readable string format.

You can get the full AJAX data output in JSON by running the command scrapy crawl nike -o nike_data.json.
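
The captured payloads are raw JSON strings, so you will usually want to parse them before yielding. Here is a small variation of the final loop, assuming the intercepted responses are valid JSON (the field names inside each payload depend on Nike's API and aren't shown here):

# Add at the top of the file
import json

        # ...inside parse(), replacing the final loop:
        for ajax_data in self.captured_ajax_data:
            try:
                parsed = json.loads(ajax_data.decode('utf-8'))
            except (UnicodeDecodeError, json.JSONDecodeError):
                # Skip payloads that aren't valid JSON (e.g. tracking beacons)
                continue
            yield {'ajax_data': parsed}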

How to use proxies with Scrapy Playwright

Using proxies with scrapy-playwright is quite easy: simply pass a playwright_context_kwargs dictionary in the request meta with your proxy configuration.

from scrapy import Spider, Request

class EventsSpider(Spider):
    name = 'proxy'
    
    def start_requests(self):
        yield Request(
            url='https://httpbin.org/get',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_context_kwargs": {
                    "proxy": {
                        "server": "http://your-proxy-server:port",
                        "username": "your-username",
                        "password": "your-password",  
                    }
                },
            },
            callback=self.parse
        )

    async def parse(self, response):
        pass
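
If you would rather apply the same proxy to every request, you can configure it globally instead. Since PLAYWRIGHT_LAUNCH_OPTIONS is passed to Playwright's browser launch, a proxy set there (a sketch with placeholder credentials) applies to the whole browser:

# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://your-proxy-server:port",
        "username": "your-username",
        "password": "your-password",
    },
}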

Conclusion: Scrapy Playwright is great for dynamic content

Overall, scrapy-playwright is a great addition to the Scrapy ecosystem, especially for websites that rely on JavaScript to load content.

While vanilla Scrapy works well for static pages, it can't deal with dynamic content. scrapy-playwright solves this by incorporating Playwright's capabilities to simulate user interactions and capture fully rendered pages. This allows us to handle complex tasks like capturing AJAX data, loading content from infinite scroll pages, and managing login forms.

Percival Villalva
Developer Advocate on a mission to help developers build scalable, human-like bots for data extraction and web automation.
