Scraping dynamic websites that load content through JavaScript after the initial page load can be a pain in the neck, as the data you want to scrape may not exist in the raw HTML source code.
I'm here to help you with that problem.
In this article, you'll learn how to scrape dynamic websites with Python and Playwright. By the end, you'll know how to:
- Set up and install Playwright
- Create a browser instance
- Navigate to the page
- Interact with the page
- Scrape the data you need
What are dynamic websites?
Dynamic websites load content dynamically using client-side scripting languages like JavaScript. Unlike static websites, where the content is pre-rendered on the server, dynamic websites generate content on the fly based on user interactions, data fetched from APIs, or other dynamic sources. This makes them more complex to scrape compared to static websites.
What's the difference between a dynamic and static web page?
Static web pages are pre-rendered on the server and delivered as complete HTML files. Their content is fixed and does not change unless the underlying HTML file is modified. Dynamic web pages, on the other hand, generate content on the fly using client-side scripting languages like JavaScript.
Dynamic content is often generated using JavaScript frameworks and libraries like React, Angular, and Vue.js. These frameworks manipulate the Document Object Model (DOM) based on user interactions or data fetched from APIs using technologies like AJAX (Asynchronous JavaScript and XML).
The dynamic content is not initially present in the HTML source code and requires additional processing to be captured.
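You can see this for yourself by comparing what a plain HTTP request returns with what a real browser renders. Here's a minimal sketch (the URL is a placeholder):

```python
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/infinite-scroll"  # placeholder URL

# A plain HTTP request returns only the initial markup; JavaScript never runs
raw_html = requests.get(url).text

# A real browser executes JavaScript and builds the full DOM
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()
    browser.close()

# On a dynamic page, the rendered HTML is typically much larger
print(len(raw_html), len(rendered_html))
```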
Tools and libraries for scraping dynamic content
To scrape dynamic content, you need tools that can execute JavaScript and interact with web pages like a real browser. One such tool is Playwright, a Python library for automating Chromium, Firefox, and WebKit browsers.
Playwright allows you to simulate user interactions, execute JavaScript, and capture the resulting DOM changes.
In addition to Playwright, you may also need libraries like Beautiful Soup for parsing HTML and extracting relevant data from the rendered DOM.
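For instance, you can hand the rendered HTML from Playwright to Beautiful Soup (`pip install beautifulsoup4`) for parsing. A minimal sketch, assuming a hypothetical `.item` selector and a placeholder URL:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/infinite-scroll")  # placeholder URL
    # Parse the fully rendered DOM, not the raw source
    soup = BeautifulSoup(page.content(), "html.parser")
    browser.close()

# Extract the text of every element matching a hypothetical CSS class
items = [el.get_text(strip=True) for el in soup.select(".item")]
print(items)
```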
Step-by-step guide to using Playwright
1. Setup and installation
- Install the Python Playwright library:

```bash
pip install playwright
```

- Install the required browser binaries (e.g., Chromium):

```bash
playwright install chromium
```
2. Create a browser instance
Import the necessary Playwright modules and create a browser instance.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
```

The snippets in the steps below continue inside this `with` block.
3. Launch a new browser context and create a new page
```python
page = browser.new_page()
```
4. Navigate to the target website
```python
page.goto("https://example.com/infinite-scroll")
```
5. Interact with the page as needed
Scroll, click buttons, fill forms, etc., to trigger dynamic content loading.
```python
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

# Scroll to the bottom repeatedly to load more content
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Wait up to 1 second for new content to appear
        page.wait_for_selector(".new-content", timeout=1000)
    except PlaywrightTimeoutError:
        # No new content loaded in time, so stop scrolling
        break
```

Note that `wait_for_selector` raises a `TimeoutError` when the selector doesn't appear in time, rather than returning `None`, so the loop exits through the `except` branch.
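Besides scrolling, Playwright can simulate other interactions that trigger dynamic content. A short sketch with hypothetical selectors:

```python
page.fill("#search-input", "playwright")  # type into a form field
page.click("button[type='submit']")       # click a button
page.keyboard.press("End")                # scroll via the keyboard
```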
6. Wait for content to load
Wait for the desired content to load using Playwright's built-in wait mechanisms.
```python
page.wait_for_selector(".new-content", timeout=1000)
```
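Playwright offers other built-in wait mechanisms besides `wait_for_selector`. A brief sketch (the selector and threshold are hypothetical):

```python
# Wait until the network has been idle for 500 ms
page.wait_for_load_state("networkidle")

# Wait until a JavaScript predicate becomes truthy
page.wait_for_function("document.querySelectorAll('.new-content').length > 10")
```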
7. Extract the data
Extract the desired data from the rendered DOM using Playwright's evaluation mechanisms or in combination with Beautiful Soup.
```python
content = page.inner_html("body")
```
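Alternatively, Playwright's locators can extract structured fields directly, without a separate parser. A sketch that reuses the `.new-content` selector from above; the inner selectors are hypothetical:

```python
for item in page.locator(".new-content").all():
    title = item.locator("h2").inner_text()            # hypothetical child element
    link = item.locator("a").first.get_attribute("href")
    print(title, link)
```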
Here's the complete example of scraping an infinite scrolling page using Playwright:
```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

with sync_playwright() as p:
    # Launch a new Chromium browser instance
    browser = p.chromium.launch()

    # Create a new page object
    page = browser.new_page()

    # Navigate to the target website with infinite scrolling
    page.goto("https://example.com/infinite-scroll")

    # Scroll to the bottom repeatedly to load more content
    while True:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait for new content to load (timeout after 1 second)
            page.wait_for_selector(".new-content", timeout=1000)
        except PlaywrightTimeoutError:
            # If no new content is loaded, break out of the loop
            break

    # Extract the desired data from the rendered DOM
    content = page.inner_html("body")

    # Close the browser instance
    browser.close()
```
Challenges and solutions
Web scraping dynamic content can present several challenges, such as handling CAPTCHAs, IP bans, and other anti-scraping measures implemented by websites.
Here are some common solutions:
- CAPTCHAs: CAPTCHAs can be solved programmatically through third-party services or custom solutions, using libraries like `python-anticaptchacloud` or `python-anti-captcha`.
- IP bans: Use rotating proxies to avoid IP bans and mimic real user behavior. Playwright, as well as tools like `requests-html` and `selenium`, can be used in conjunction with proxy services like Bright Data or Oxylabs.
- Anti-scraping measures: Implement techniques like randomized delays, user agent rotation, and other tactics to make your scraper less detectable; see the sketch after this list. Libraries like `fake-useragent` and `scrapy-fake-useragent` can help with user agent rotation.
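Some of these tactics can be applied in Playwright itself. A sketch with placeholder proxy and user agent values:

```python
import random
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Route all traffic through a proxy (placeholder address)
    browser = p.chromium.launch(proxy={"server": "http://myproxy.example.com:3128"})

    # Override the default user agent for this context (placeholder string)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    )
    page = context.new_page()
    page.goto("https://example.com/infinite-scroll")

    # Randomized delay between actions to look less robotic
    time.sleep(random.uniform(1.0, 3.0))

    browser.close()
```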
Summary and next steps
Due to anti-scraping measures implemented by websites, web scraping dynamic content can be more challenging than scraping static websites. So, in addition to tools like Playwright that can execute JavaScript, you may need to employ additional techniques like rotating proxies, handling CAPTCHAs, and mimicking real user behavior to avoid detection and ensure successful scraping.
For further learning and additional resources, consider exploring Playwright's official documentation or one of our more in-depth tutorials.