Hi, we're Apify, a full-stack web scraping and browser automation platform. We're experts at web data collection, so we know how to deal with the challenges of data extraction. This article focuses on one of those challenges: JavaScript-rendered pages.
Introduction
Web scraping can be pretty challenging, especially when the website you're trying to scrape contains dynamic web pages, aka JavaScript-rendered pages.
Traditional scraping methods often fail to capture dynamically loaded content, which leaves a lot of valuable data out of reach.
A typical example is an e-commerce product page where prices and availability only appear after the page's JavaScript has run.
This article will guide you through effective methods for scraping JavaScript-rendered pages like these.
Understanding JavaScript-rendered web pages
"Static web pages are pre-rendered on the server and delivered as complete HTML files. Dynamic content is not initially present in the HTML source code and requires additional processing to be captured."
Because JavaScript-rendered pages load content dynamically after the initial HTML has been retrieved, the source code you see when you inspect the page often doesn't contain the data you're trying to scrape. Instead, JavaScript executes in the browser to fetch and display content.
📌
The difference between static and JavaScript-rendered pages
Static pages: content is fully present in the initial HTML.
JavaScript-rendered pages: content is loaded or generated after the initial page load.
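You can check which category a page falls into by fetching it with a plain HTTP client and looking for the data you're after. Here's a minimal sketch using the requests library; the URL and the data-qa marker come from the Mint Mobile example used later in this article, and whether the marker appears in the raw HTML depends on how the site serves its pages.
import requests

url = 'https://www.mintmobile.com/product/google-pixel-7-pro-bundle/'

# Download only the initial HTML, before any JavaScript has run
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# If the product data is rendered by JavaScript, this check will likely come back False
print('product name present in raw HTML:', 'data-qa="device-name"' in response.text)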
Tools for scraping JavaScript-rendered pages
To scrape JavaScript-rendered pages, we need a tool that can execute JavaScript and interact with the page. Four popular Python options are:
Selenium: A widely-used web automation tool that can control a real web browser (e.g., Chrome, Firefox) programmatically. Selenium is capable of executing JavaScript, rendering dynamic content, and interacting with web elements.
Playwright: A modern, high-performance library for automating browsers. Playwright offers a Python API for controlling browsers like Chromium, Firefox, and WebKit (see the short sketch after this list).
Scrapy Playwright: A library that combines Scrapy (a popular Python web scraping framework) with Playwright's browser interaction capabilities. Scrapy can't scrape JavaScript-rendered content on its own, so this combination remedies that limitation.
Scrapy Splash: A combination of Scrapy and Splash (a lightweight browser rendering service). This approach offloads JavaScript rendering to a separate service, making Scrapy more efficient for large-scale scraping operations.
Crawlee for Python: An all-in-one approach to web scraping. Crawlee is a complete web crawling and browser automation library built on top of BeautifulSoup and Playwright. It lets you easily switch between these libraries depending on whether you need to scrape JavaScript-rendered or static pages.
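To give a sense of how these browser-automation tools work, here's a minimal Playwright sketch that opens the Mint Mobile product page used in the Selenium tutorial below and reads the product name. The CSS selector is borrowed from that tutorial and is an assumption about the page's current markup.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium browser and open a new page
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://www.mintmobile.com/product/google-pixel-7-pro-bundle/')
    # Wait for the JavaScript-rendered element to appear, then read its text
    page.wait_for_selector('h1[data-qa="device-name"]')
    print(page.inner_text('h1[data-qa="device-name"]'))
    browser.close()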
Scraping JavaScript-rendered web pages with Selenium
Selenium is the oldest and one of the most popular options for scraping JavaScript-rendered web pages, so we'll use that framework for this tutorial.
We'll use the Mint Mobile product page as an example.
⚙️
Prerequisites
Python 3.x installed on your system
Basic knowledge of Python and HTML
Step 1. Set up your environment
a). Create a new directory for your project:
mkdir selenium_scraper
cd selenium_scraper
b). Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
c). Install the required packages:
pip install selenium webdriver-manager
🏹
Troubleshooting
On macOS, you may need to use pip3 instead of pip and python3 instead of python when running these commands.
Step 2. Import required modules
Create a new file named main.py and add the following import statements:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time
🏹
Troubleshooting
If you encounter issues with Chrome driver versions, the webdriver_manager should handle this automatically. If problems persist, try updating Chrome to the latest version.
Step 3. Create the SeleniumScraper class
Add the class definition and initialization method:
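class SeleniumScraper:
    # Initialize the web driver for Chrome; webdriver-manager downloads a matching driver automatically
    def __init__(self):
        chrome_options = Options()
        self.driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)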
The scraper's parse method, shown in the complete script below, uses CSS selectors to extract specific data points from the web page.
📌
Dynamic class names
In some web applications, especially those built with frameworks like React or Angular, class names may be dynamically generated. That is, they may change every time the page is loaded.
Stable data attributes, such as data-qa (short for 'quality assurance'), counteract this: they don't change between page loads, which makes them reliable selectors for automation scripts.
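For example, the Selenium code in this tutorial locates the product name through its data-qa attribute rather than a generated class name. The hashed class in the commented-out line is a hypothetical illustration of what frameworks often produce.
from selenium.webdriver.common.by import By

# Brittle: hashed class names like this (hypothetical) change whenever the site is rebuilt
# name = driver.find_element(By.CSS_SELECTOR, 'h1.productHeader_title__9f3ab').text

# Robust: the data-qa attribute stays stable across builds and page loads
# ('driver' is the Selenium WebDriver created in the scraper's __init__)
name = driver.find_element(By.CSS_SELECTOR, 'h1[data-qa="device-name"]').text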
Step 7. Execute the web scraping script
Finally, add this code to make sure the scraper is properly initialized and executed when you run the script:
if __name__ == "__main__":
    with SeleniumScraper() as scraper:
        scraper.start_requests()
Complete script and output
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time


class SeleniumScraper:
    # Initialize the web driver for Chrome
    def __init__(self):
        chrome_options = Options()
        self.driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

    # Allow use of 'with' statement for clean resource management
    def __enter__(self):
        return self

    # Ensure the web driver is closed after use
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()

    # Start the web scraping process
    def start_requests(self):
        # Open the target webpage
        self.driver.get('https://www.mintmobile.com/product/google-pixel-7-pro-bundle/')
        self.parse()

    # Helper method to find elements and extract data
    def find_element_by_css(self, css_selector, attribute="text", child_index=None):
        # Wait for the element to be present before finding it
        element = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
        # Extract text or an attribute based on parameters
        if attribute == "text":
            if child_index is None:
                return element.text.replace('\n', ' ').strip()
            return element.find_elements(By.CSS_SELECTOR, "*")[child_index].text.strip()
        return element.get_attribute(attribute)

    # Parse the webpage to extract data
    def parse(self):
        # Wait a bit for all elements to fully load
        time.sleep(5)
        # Extract and print data from the page
        data = {
            'name': self.find_element_by_css('h1[data-qa="device-name"]', 'text'),
            'memory': self.find_element_by_css('span[class^="simtype_selectedMemory__"]', 'text'),
            'pay_monthly_price': self.find_element_by_css('p[class^="paymentOptions_subtotalContent__"] > span:nth-child(1)', 'text'),
            'pay_today_price': self.find_element_by_css('p[class^="paymentOptions_subtotalContent__"] > span:nth-child(2) > span', 'text') + "/mo",
        }
        print(data)

    # Close the web driver manually if needed
    def close_spider(self):
        self.driver.quit()


# Main execution
if __name__ == "__main__":
    # Use 'with' statement for automatic cleanup
    with SeleniumScraper() as scraper:
        scraper.start_requests()
When you run the script, it will open a Chrome browser, navigate to the Mint Mobile product page, and print a dictionary containing the scraped product name, memory option, and prices.
This tutorial demonstrated how to use Selenium to scrape data from a JavaScript-rendered web page. The key advantages of the above approach are:
The ability to interact with dynamic content
Automatic handling of the Chrome driver installation
Clean code structure using a class-based approach
But you don’t need to do all the heavy lifting.
Apify provides code templates for Python, including Selenium and Playwright. These templates save you development time.
What’s more, with these templates, you can turn your scraping scripts into Apify Actors to give you immediate access to scheduling, integrations, storage, proxies, and all the other features the Apify cloud platform has to offer.
I used to write books. Then I took an arrow in the knee. Now I'm a technical content marketer, crafting tutorials for developers and conversion-focused content for SaaS.