How to scrape dynamic websites with Python

To scrape dynamic content, you need tools like Playwright. This guide shows you how to use it.

Scraping dynamic websites that load content through JavaScript after the initial page load can be a pain in the neck, as the data you want to scrape may not exist in the raw HTML source code.

I'm here to help you with that problem.

In this article, you'll learn how to scrape dynamic websites with Python and Playwright. By the end, you'll know how to:

  • Set up and install Playwright
  • Create a browser instance
  • Navigate to the page
  • Interact with the page
  • Scrape the data you need

What are dynamic websites?

Dynamic websites load content dynamically using client-side scripting languages like JavaScript. Unlike static websites, where the content is pre-rendered on the server, dynamic websites generate content on the fly based on user interactions, data fetched from APIs, or other dynamic sources. This makes them more complex to scrape compared to static websites.

What's the difference between a dynamic and static web page?

Static web pages are pre-rendered on the server and delivered as complete HTML files. Their content is fixed and does not change unless the underlying HTML file is modified. Dynamic web pages, on the other hand, generate content on the fly using client-side scripting languages like JavaScript.

Dynamic content is often generated using JavaScript frameworks and libraries like React, Angular, and Vue.js. These frameworks manipulate the Document Object Model (DOM) based on user interactions or data fetched from APIs using technologies like AJAX (Asynchronous JavaScript and XML).

The dynamic content is not initially present in the HTML source code and requires additional processing to be captured.
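As a quick illustration, fetching such a page with a plain HTTP client returns only the initial HTML shell; the data injected later by JavaScript simply isn't there yet. Here's a minimal sketch, assuming a hypothetical URL and a hypothetical new-content CSS class for the dynamically loaded items:

import requests

# Fetch the raw HTML the way a plain HTTP client would (no JavaScript runs here)
raw_html = requests.get("https://example.com/infinite-scroll").text

# The items that the page injects with JavaScript are missing from this response
print("new-content" in raw_html)  # typically False for a dynamically rendered page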

Tools and libraries for scraping dynamic content

To scrape dynamic content, you need tools that can execute JavaScript and interact with web pages like a real browser. One such tool is Playwright, a Python library for automating Chromium, Firefox, and WebKit browsers.

Playwright allows you to simulate user interactions, execute JavaScript, and capture the resulting DOM changes.

In addition to Playwright, you may also need libraries like Beautiful Soup for parsing HTML and extracting relevant data from the rendered DOM.

Step-by-step guide to using Playwright

1. Setup and installation

  • Install the Python Playwright library: pip install playwright
  • Install the required browser binaries (e.g., Chromium): playwright install chromium

2. Create a browser instance

Import the necessary Playwright modules and create a browser instance.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()

3. Launch a new browser context and create a new page

page = browser.new_page()  # new_page() also creates a fresh browser context behind the scenes

4. Navigate to the target website

page.goto("https://example.com/infinite-scroll")
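By default, goto() waits for the page's load event. If the site keeps fetching data over the network after that, you can optionally ask Playwright to wait until network activity settles; whether you need this depends on the site:

# Optionally wait until there have been no network connections for 500 ms
page.goto("https://example.com/infinite-scroll", wait_until="networkidle")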

5. Interact with the page as needed

Scroll, click buttons, fill forms, and so on to trigger dynamic content loading. Note that wait_for_selector() raises a TimeoutError when nothing matches within the timeout, rather than returning None, so the loop below catches that exception to detect when no more content is loading.

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

# Scroll to the bottom to load more content
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Wait up to 1 second for newly loaded items to appear
        page.wait_for_selector(".new-content", timeout=1000)
    except PlaywrightTimeoutError:
        # No new content appeared, so stop scrolling
        break
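
Scrolling is only one kind of interaction. If the content is hidden behind a "Load more" button or a search form, you can trigger it the same way; the selectors below are hypothetical and need to be adapted to your target page:

# Click a "Load more" button (hypothetical selector)
page.click("button.load-more")

# Fill in and submit a search form (hypothetical selectors)
page.fill("input#search", "web scraping")
page.press("input#search", "Enter")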

6. Wait for content to load

Wait for the desired content to load using Playwright's built-in wait mechanisms. Keep in mind that wait_for_selector() returns the matching element handle once the element appears and raises a TimeoutError if it doesn't show up within the given timeout.

new_content = page.wait_for_selector(".new-content", timeout=1000)

7. Extract the data

Extract the desired data from the rendered DOM using Playwright's evaluation mechanisms or in combination with Beautiful Soup.

content = page.inner_html("body")
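
If you prefer Beautiful Soup for the parsing step, you can hand it the HTML captured above. A minimal sketch, assuming the dynamically loaded items carry the new-content class used earlier:

from bs4 import BeautifulSoup

# Parse the rendered HTML captured by Playwright
soup = BeautifulSoup(content, "html.parser")

# Extract the text of each dynamically loaded item (the class is a placeholder)
items = [element.get_text(strip=True) for element in soup.select(".new-content")]
print(items)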

Here's the complete example of scraping an infinite scrolling page using Playwright:

from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

with sync_playwright() as p:
    # Launch a new Chromium browser instance
    browser = p.chromium.launch()

    # Create a new page object
    page = browser.new_page()

    # Navigate to the target website with infinite scrolling
    page.goto("https://example.com/infinite-scroll")

    # Scroll to the bottom to load more content
    while True:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        try:
            # Wait for new content to load (timeout after 1 second)
            page.wait_for_selector(".new-content", timeout=1000)  # Check for a specific class
        except PlaywrightTimeoutError:
            # If no new content is loaded, break out of the loop
            break

    # Extract the desired data from the rendered DOM
    content = page.inner_html("body")

    # Close the browser instance
    browser.close()

Challenges and solutions

Web scraping dynamic content can present several challenges, such as handling CAPTCHAs, IP bans, and other anti-scraping measures implemented by websites.

Here are some common solutions:

  • CAPTCHAs: Playwright doesn't solve CAPTCHAs on its own, but you can plug in third-party solving services or custom solutions. Libraries like python-anticaptchacloud or python-anti-captcha let you solve CAPTCHAs programmatically.
  • IP bans: Use rotating proxies to spread your requests across different IP addresses and mimic real user behavior. Playwright accepts a proxy configuration at launch (see the sketch after this list), and proxy services like Bright Data or Oxylabs provide rotating IPs; tools like requests-html and Selenium work with the same services.
  • Anti-scraping measures: Implement techniques like randomized delays, user agent rotation, and other tactics to make your scraper less detectable. Libraries like fake-useragent and scrapy-fake-useragent can help with user agent rotation.
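
To illustrate the proxy and user-agent points above, Playwright lets you set a proxy when launching the browser and a custom user agent when creating a context. This is a minimal sketch; the proxy address, credentials, and user agent string are placeholders:

import random
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Route all browser traffic through a proxy (placeholder address and credentials)
    browser = p.chromium.launch(proxy={
        "server": "http://proxy.example.com:8000",
        "username": "user",
        "password": "pass",
    })

    # Use a custom user agent for this browser context
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    )
    page = context.new_page()
    page.goto("https://example.com/infinite-scroll")

    # Add a randomized delay between actions to look less like a bot
    time.sleep(random.uniform(1, 3))

    browser.close()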

Summary and next steps

Due to anti-scraping measures implemented by websites, web scraping dynamic content can be more challenging than scraping static websites. So, in addition to tools like Playwright that can execute JavaScript, you may need to employ additional techniques like rotating proxies, handling CAPTCHAs, and mimicking real user behavior to avoid detection and ensure successful scraping.

For further learning and additional resources, consider exploring Playwright's official documentation or one of our more in-depth tutorials.

Saurav Jain
Developer Community Manager at Apify. A developer and content writer who loves to play with new dev tools and manage developer communities.
