Web scraping with Selenium (Python tutorial)

How to locate and wait for elements, take screenshots, and execute JavaScript code.


Python scraping with Selenium

In this straightforward example of using Selenium with Python for web scraping, we'll target the Monty Python online store. You'll learn how to locate and wait for elements, take screenshots, and execute JavaScript code.

What is Selenium?

Until 2004, selenium was known only as a chemical element, but that year it also became the name of one of the most popular software testing frameworks. Initially designed for cross-browser end-to-end testing, Selenium is a powerful open-source browser automation platform that supports Java, Python, C#, Ruby, JavaScript, and Kotlin.

🧪
Why is it called Selenium?

The name came from a joke in an email by its creator, Jason Huggins. Wishing to mock his competitor, Mercury Interactive Corporation, Huggins quipped that you could cure mercury poisoning by taking selenium supplements. Thus the name Selenium caught on, and the rest, as they say, is history.

Why use Selenium with Python for web scraping?

Python is by far the most popular choice of programming language for web scraping. Combining it with Selenium WebDriver provides an easy API to write functional tests and web scraping scripts.

Selenium offers several ways to interact with websites, such as clicking buttons, filling in forms, scrolling pages, taking screenshots, and executing JavaScript code. That means Selenium can be used to scrape dynamically loaded content. Add to this its cross-language and cross-browser support, and it's little wonder that Selenium is one of the preferred frameworks for web scraping in Python.
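For instance, here's a minimal sketch of a couple of those interactions. The URL and the element locators ('q' and 'submit-button') are hypothetical placeholders, not taken from any particular site:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')

# Fill in a text field and click a button (hypothetical locators)
driver.find_element(By.NAME, 'q').send_keys('selenium')
driver.find_element(By.ID, 'submit-button').click()

driver.quit()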

😃
Fun fact

The Python language was not named after the snake. When Guido van Rossum was implementing the language, he wanted a name for it that would be short, unique, and somewhat mysterious. It just so happened that he was reading the published scripts of Monty Python and the Flying Circus at the time. That influenced him to go with the name Python.

Using Selenium for dynamically loaded content

Dynamically loaded content refers to web pages where the DOM (Document Object Model) gets updated somehow after the initial load. Some websites (like X/Twitter, for example) return content on the first load and then render more content "dynamically" based on certain actions (scrolling, clicking, hovering). This is the definition of a dynamic web page.

If a web page meets any of these criteria, it's likely to be dynamic:

  • Built with a JavaScript framework or library (such as React, Vue, or Angular)
  • Lazy-loads content
  • Makes Fetch/XHR requests to load the data it renders

An HTML parser (like Beautiful Soup) won't be enough to scrape such content. You'll need a browser automation tool like Selenium to spin up a browser instance and execute the page's JavaScript.

However, a word of warning.

While Selenium can extract data from virtually any web page, it's not a good idea to always use it for web scraping. Generating a browser instance is more resource-intensive than retrieving a page’s HTML. This can become a performance bottleneck for large scraping jobs, as it will take longer to complete and become considerably more expensive. So you should limit the use of Selenium to the necessary tasks and use it together with another Python library like Beautiful Soup or Scrapy whenever possible.
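One common pattern, sketched below, is to let Selenium render the page and then hand the resulting HTML over to Beautiful Soup for parsing. This assumes the beautifulsoup4 package is installed (pip install beautifulsoup4); the target URL is the Monty Python store used later in this tutorial:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://montypythononlinestore.com')

# Let Selenium render the JavaScript, then parse the static HTML with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html.parser')
headings = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]
print(headings)

driver.quit()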

How to scrape a website with Selenium in Python

With that brief introduction out of the way, it’s time to show you how to scrape a website using Python with Selenium.

Let's start coding!

⚒️ Setting up the environment for web scraping

To follow this tutorial, you’ll need to have the following installed:

  1. Python 3.8 or later
  2. Selenium package (pip install selenium==4.8.3)
  3. Chrome web browser
  4. The Chrome driver that matches your Chrome browser version

You'll have to import the necessary packages for your Selenium script. For this tutorial, you'll need:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

🚀 Launching the browser and navigating to the website

Once you have the required packages, you can launch the browser with the webdriver module and navigate to the website you want to scrape. In this case, we'll use Chrome as the browser and navigate to the Monty Python online store: https://montypythononlinestore.com. In Selenium 4, the path to ChromeDriver is passed to the browser via a Service object.

DRIVER_PATH = '/usr/local/bin/chromedriver'  # This path works for macOS
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
driver.get('https://montypythononlinestore.com')

🧢 Switching to headless mode

If you want to switch to headless Chrome, you need to instantiate an Options object and pass --headless=new to its add_argument() method.

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)
driver.get('https://montypythononlinestore.com')

🔎 Locating and interacting with elements

Now that you've navigated to the website, you'll need to locate elements on the page and interact with them. For example, you might want to search for a product in the e-shop.

search_box = driver.find_element(By.ID, 'search-field')
search_box.send_keys('t-shirt')
search_box.send_keys(Keys.ENTER)

With Selenium WebDriver, you can use find_element for a single element or find_elements for a list of them. For example, if you want to select the first <h2> element in an HTML document:

h2 = driver.find_element(By.TAG_NAME, 'h2')

If you want to select all elements with the class name 'product' on a page:

all_products = driver.find_elements(By.CLASS_NAME, 'product')
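Selenium also supports other locator strategies, such as CSS selectors and XPath expressions. The selectors below are hypothetical examples rather than ones taken from the Monty Python store:

# CSS selector: the first <h2> inside an element with the class 'product'
first_title = driver.find_element(By.CSS_SELECTOR, '.product h2')

# XPath: all links whose href contains '/product/' (hypothetical URL pattern)
product_links = driver.find_elements(By.XPATH, '//a[contains(@href, "/product/")]')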

⏳ Waiting for elements to load

Sometimes, content on a web page is loaded dynamically after the initial page load. In such cases, you can wait for the required element to appear using the WebDriverWait class together with an expected condition.

In the example below, we wait up to 10 seconds for an <h2> element to be present in the DOM.

wait = WebDriverWait(driver, 10)
element = wait.until(ec.presence_of_element_located((By.TAG_NAME, 'h2')))

Once the element has loaded, you can read its content from the element's text attribute.

element_text = element.text
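Besides the visible text, you can also read an element's attributes with get_attribute(). For example, a quick sketch that grabs the href of the first link on the page:

# Read an attribute instead of the visible text
first_link = driver.find_element(By.TAG_NAME, 'a')
print(first_link.get_attribute('href'))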

📸 Taking a screenshot

If you need to screenshot the website at any point, you can do that in your script using the driver's save_screenshot() method.

driver.save_screenshot('screenshot.png')
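If you only need a picture of one element rather than the whole viewport, Selenium's WebElement objects have a screenshot() method of their own. A small sketch (the class name here is a hypothetical example):

# Screenshot a single element instead of the whole viewport
logo = driver.find_element(By.CLASS_NAME, 'site-logo')  # hypothetical class name
logo.screenshot('logo.png')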

📜 Executing JavaScript code

To execute JavaScript code, use the execute_script() method. For example, if you want to scroll to the bottom of the page to take a screenshot:

driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

You can use time.sleep to give the browser time to finish scrolling (and render any lazy-loaded content) before taking the screenshot. In the example below, we wait 5 seconds.

driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(5)
driver.save_screenshot('screenshot.png')
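execute_script() can also return values from the page to your Python code. For example, a small sketch that reads the document title and the total page height:

# execute_script returns whatever the JavaScript 'return' statement produces
title = driver.execute_script('return document.title;')
page_height = driver.execute_script('return document.body.scrollHeight;')
print(title, page_height)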

🚪 Closing the browser

When you're done, you can close the browser with the driver.quit() method. Note that quit() is different from close(): close() only closes the current window and leaves the WebDriver session active, so use quit() to close all browser windows and end the WebDriver session.
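To make sure the session gets cleaned up even if your script raises an exception along the way, you can wrap the scraping logic in a try/finally block. A minimal sketch:

driver = webdriver.Chrome()
try:
    driver.get('https://montypythononlinestore.com')
    # ... scraping logic goes here ...
finally:
    driver.quit()  # always ends the WebDriver session, even after an error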

Final code for scraping the Monty Python store with Selenium

Now, let’s put it all into a script for scraping the Monty Python store:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

DRIVER_PATH = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
driver.get('https://montypythononlinestore.com')

search_box = driver.find_element(By.ID, 'search-field')
search_box.send_keys('t-shirt')
search_box.send_keys(Keys.ENTER)

wait = WebDriverWait(driver, 10)
element = wait.until(ec.presence_of_element_located((By.TAG_NAME, 'h2')))

element_text = element.text
print(element_text)

all_products = driver.find_elements(By.CLASS_NAME, 'product')
print(f'There are {len(all_products)} t-shirts on the page.')

driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(5)
driver.save_screenshot('screenshot.png')

driver.quit()

Using proxies with Selenium

Before we wrap up, we need to say something about using a proxy with Selenium. There's good news and bad news. I'll start with the bad: Selenium can't do proxy authentication out of the box. The good news? You can solve this drawback with the Selenium Wire package.

Installing Selenium Wire

To use Selenium Wire, you first need to install it. Run the following command in your terminal:

pip install selenium-wire

Setting up a proxy with Selenium Wire

After installing Selenium Wire, configuring a proxy is straightforward. You can specify your proxy details in the seleniumwire_options when initializing the WebDriver. Here's an example configuration:

from seleniumwire import webdriver  # Import from seleniumwire

# Define seleniumwire_options
seleniumwire_options = {
    'proxy': {
        'http': 'http://myproxy:port',
        'https': 'http://myproxy:port',
        'no_proxy': 'localhost,127.0.0.1'  # Exclude localhost and 127.0.0.1 from proxying
    }
}

# Initialize the WebDriver with seleniumwire_options
driver = webdriver.Chrome(seleniumwire_options=seleniumwire_options)

This setup directs all HTTP and HTTPS traffic through the proxy server specified by http and https keys, respectively.
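A quick way to sanity-check that traffic really goes through the proxy is to visit a service that echoes your IP address; httpbin.org is just one example of such a service:

# The echoed IP should be the proxy's address, not your own
driver.get('https://httpbin.org/ip')
print(driver.page_source)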

Handling proxy authentication

If your proxy requires authentication, you can include the credentials directly in the proxy URL:

seleniumwire_options = {
    'proxy': {
        'http': 'http://username:password@myproxy:port',
        'https': 'http://username:password@myproxy:port',
    }
}

For more sophisticated authentication mechanisms or additional proxy configurations, refer to the Selenium Wire documentation.

Conclusion & further reading

We’ve shown you how to use Selenium with Python to scrape the Monty Python online store, but you can use what you've learned here to scrape data from any site you like with the driver.get method.

If you want to learn more about Selenium, check out the online literature below.

🔖 Selenium documentation
🔖 Selenium page object model: what is POM and how can you use it?
🔖 Selenium Webdriver: how to handle popups
🔖 Selenium Webdriver: how to handle iframes
🔖 Selenium Grid: what it is and how to set it up
🔖 Playwright vs. Selenium: which one to use for web scraping?
🔖 Puppeteer vs. Selenium for automation
🔖 Cypress vs. Selenium: choosing the right web testing and automation framework

Frequently asked questions about Selenium

Is Selenium good for web scraping?

Selenium is commonly used for scraping due to its ability to control headless browsers, render JavaScript on a page, and scrape dynamic websites. However, because it was designed for testing rather than web scraping, it isn't the most user-friendly option, nor is it ideal for large-scale data extraction: scraping large amounts of data with it is slow and inefficient compared to alternatives such as Scrapy and Playwright.

Is Selenium better than Beautiful Soup?

Selenium and Beautiful Soup were designed with different purposes in mind. Selenium is a multi-language web testing framework used for web scraping due to its ability to control headless browsers and render JavaScript. Beautiful Soup is a Python library for parsing HTML and easily navigating or modifying a DOM tree.

Beautiful Soup is easier to learn than Selenium. Extracting HTML and XML elements from a web page requires only a few lines of code, making it ideal for tackling simple scraping tasks with speed. However, its lack of asynchronous support means that it isn’t great for scalability and large web scraping projects.

Neither Selenium nor Beautiful Soup are full-fledged web scraping libraries like Scrapy or Crawlee, which are better options for large-scale data extraction.

Should I use Selenium or Scrapy?

Selenium beats Scrapy for cross-language support and efficiency in scraping dynamic content (though scraping dynamic content is possible with Scrapy via plugins). But if it is a Python web crawling framework you want, Scrapy is more powerful. It has built-in support for handling requests, processing responses, and exporting data. Scrapy makes it easy for you to post-process any data you find while crawling and scraping the web. It can handle many requests at the same time, which makes scraping runs faster. It also provides the building blocks you need to build spiders for web crawling that require minimum maintenance.

Is Selenium better than Playwright for web scraping?

Selenium and Playwright are very similar in their core functionality. Selenium benefits from a longer time on the market, which has given it a larger, more established community, extensive documentation, and a vast pool of resources. But Playwright is generally considered to have better performance, largely because its more modern architecture allows for more efficient browser automation and interaction. Playwright also offers helpful features like auto-waiting and a more modern API, which many developers find more intuitive and easier to work with.

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
