Hey, we're Apify, a full-stack web scraping and browser automation platform. If you're interested in using Python for web scraping, this short article provides you with some guidance on how to use CSS selectors with two popular Python libraries.
Python CSS selectors
To interact with or scrape web pages, Selenium and Beautiful Soup are two popular Python libraries that serve slightly different purposes.
Selenium is typically used for automating web browsers, so you can interact with web pages as a user would. That includes things like clicking buttons, filling out forms, and navigating. This makes Selenium a good choice for scraping dynamic pages.
Beautiful Soup, on the other hand, is used for parsing HTML and XML documents, making it great for scraping data from static content.
Selenium selector strategies: the “By” class
Selenium’s By class attributes are used to locate elements on a page. In essence, each attribute indicates which strategy you want to use to identify those elements.
These selection strategies range from the very specific, like matching an element by its ID, to more flexible ones, like CSS selectors or XPath. Here is a list of all the attributes Selenium provides:
# By ID
find_element(By.ID, "id")
# By NAME
find_element(By.NAME, "name")
# By XPATH
find_element(By.XPATH, "XPath")
# By Link Text
find_element(By.LINK_TEXT, "link text")
# By Partial Link Text
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
# By Tag Name
find_element(By.TAG_NAME, "tag name")
# By Class Name
find_element(By.CLASS_NAME, "class name")
# By CSS Selector
find_element(By.CSS_SELECTOR, "css selector")
These attributes are often used together with the find_element and find_elements methods, which return the first matching element or all matching elements, respectively.
Here's an example of how we can use the CSS selector attribute to select the first matching element with a particular “ID” and interact with it:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
# Assuming 'driver' has been previously set up
search_box = driver.find_element(By.CSS_SELECTOR, '#search-field')
search_box.send_keys('t-shirt')
search_box.send_keys(Keys.ENTER)
In this example, we're using By.CSS_SELECTOR to indicate that a CSS selector is being used to find the element, and '#search-field' is the CSS selector that targets the element with the ID of search-field. The hash symbol (#) before search-field is standard CSS syntax for selecting elements by their ID.
The rest of the code sends the string 't-shirt' to the input element and then simulates pressing the Enter key, likely to submit a search query on the webpage.
Let's look at a couple more examples.
- Finding an element by tag name
# Using the specific tag attribute
h2 = driver.find_element(By.TAG_NAME, 'h2')
# Using CSS selector
h2 = driver.find_element(By.CSS_SELECTOR, 'h2')
- Finding elements by class name
# Using the specific class name attribute
all_products = driver.find_elements(By.CLASS_NAME, 'product')
# Using CSS selector
all_products = driver.find_elements(By.CSS_SELECTOR, '.product')
In these examples, By.CSS_SELECTOR is used with the appropriate CSS selector strings: h2 selects elements by tag name (just as By.TAG_NAME, 'h2' does), and .product selects elements by class name (similar to By.CLASS_NAME, 'product'), with the dot (.) prefix indicating a class name in CSS selector syntax.
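Since the strings passed to By.CSS_SELECTOR are just standard CSS, any CSS engine resolves them the same way. As a quick sanity check that doesn't require launching a browser, here's a sketch that runs the same three selectors through Beautiful Soup's .select() methods against a small made-up HTML snippet (the markup and attribute values are purely illustrative):

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet used purely for illustration.
html = """
<form>
  <input id="search-field" name="q" class="product">
  <h2>Results</h2>
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# The same selector strings Selenium's By.CSS_SELECTOR accepts:
print(soup.select_one("#search-field")["name"])  # ID selector -> 'q'
print(soup.select_one("h2").text)                # tag selector -> 'Results'
print(len(soup.select(".product")))              # class selector -> 1 match
```

The point is that '#search-field', 'h2', and '.product' aren't Selenium-specific syntax; they behave identically wherever CSS selectors are supported.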
CSS selectors with Beautiful Soup
While Beautiful Soup supports a wide range of CSS selectors for parsing HTML documents, the categorization and naming can vary slightly compared to Selenium WebDriver. Here are some parallels and differences based on the aforementioned Selenium CSS selectors:
1. ID selector
🧪 Selenium WebDriver: Uses # to select elements by ID, e.g., #example.
🥣 Beautiful Soup: Similarly supports ID selectors using the # syntax in the .select() method, e.g., soup.select('#example').
2. ClassName selector
🧪 Selenium WebDriver: Uses . to select elements by class name, e.g., .example.
🥣 Beautiful Soup: Also supports class name selectors using the . syntax, e.g., soup.select('.example').
3. Attribute selector
🧪 Selenium WebDriver: Allows selection by any attribute, e.g., [attribute=value].
🥣 Beautiful Soup: Offers comprehensive support for attribute selectors, including presence [attr], exact value [attr=value], substring matches [attr*=value], starts with [attr^=value], and ends with [attr$=value].
4. Substring selector
🧪 Selenium WebDriver: Refers to using selectors based on substring matches within attribute values.
🥣 Beautiful Soup: Directly supports substring selectors within attributes, such as [attr*=value] for contains, [attr^=value] for starts with, and [attr$=value] for ends with.
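To make the attribute and substring selector forms concrete, here's a short sketch that exercises each one with Beautiful Soup against hypothetical markup (the URLs and attribute values are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup to exercise each attribute-selector form.
html = """
<a href="https://example.com/docs/intro">Intro</a>
<a href="https://example.com/blog/post.html">Post</a>
<a href="/download.pdf" download>PDF</a>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("[download]")))       # presence: has the attribute -> 1
print(len(soup.select('[href^="https"]')))  # starts with 'https' -> 2
print(len(soup.select('[href$=".pdf"]')))   # ends with '.pdf' -> 1
print(len(soup.select('[href*="blog"]')))   # contains 'blog' -> 1
```

Note that attribute values in selectors are quoted ([href^="https"]); this keeps the selector valid even when the value contains characters like dots or slashes.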
Here's an example of using a CSS selector with Beautiful Soup:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://news.ycombinator.com/')
html = response.text
# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(html, 'html.parser') # It's good practice to specify the parser
# Use CSS selectors to find elements
articles = soup.select('.athing')
Here, .select('.athing') uses a CSS selector to find all elements with the class athing. Let's adapt this to target the elements that hold the story titles, which have the class titleline:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://news.ycombinator.com/')
html = response.text
# Use Beautiful Soup to parse the HTML, specifying the 'html.parser' for clarity
soup = BeautifulSoup(html, 'html.parser')
# Use CSS selectors to find elements with the class name 'titleline'
titles = soup.select('.titleline')
# Print the text of each title found
for title in titles:
    if title.a:  # Check whether the 'title' element contains an 'a' (link) element
        print(title.a.text)  # Print the text of the link
    else:
        print(title.text)  # If no link, print the text of the 'title' directly
In this example, .select('.titleline') is used to find all elements with the class name titleline. Since titles on Y Combinator's Hacker News usually contain links (<a> elements), the script checks whether each title element contains an <a> element and prints the text accordingly. This ensures that the actual text of the title is printed, whether it's wrapped in a link or not.
Choosing between Selenium and Beautiful Soup
Beautiful Soup offers a rich set of CSS selector capabilities similar to those in Selenium, so the decision of which to use comes down to your particular use case.
As a rule of thumb, when web scraping, you should always start by inspecting the website to see whether its content is generated dynamically. If not, Beautiful Soup is a good choice thanks to its friendly syntax, minimal setup requirements, and speed.
If the website does require JavaScript to display its content, then your best bet would be to go with Selenium due to its ability to spawn and control an actual browser, as this means it's able to load and scrape dynamically generated content.
You can learn more about both Beautiful Soup and Selenium in the web scraping tutorials below.