Hey, we're Apify, a full-stack web scraping and browser automation platform. If you're interested in using Python for web scraping, this short article provides you with some guidance on how to use CSS selectors with two popular Python libraries.
Python CSS selectors
To interact with or scrape web pages, Selenium and Beautiful Soup are two popular Python libraries that serve slightly different purposes.
Selenium is typically used for automating web browsers, so you can interact with web pages as a user would. That includes things like clicking buttons, filling out forms, and navigating. This makes Selenium a good choice for scraping dynamic pages.
Beautiful Soup, on the other hand, is used for parsing HTML and XML documents, making it great for scraping data from static content.
Selenium selector strategies: the “By” class
Selenium’s By class attributes are used to locate elements on a page. In essence, each attribute indicates what strategy you want to use to identify these elements.
These selection strategies range from very specific, like selecting an element by its ID, to more flexible ones, like CSS selectors or XPath. Here is a list of all the attributes Selenium provides:
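- By.ID: locates elements by the value of their id attribute
- By.NAME: locates elements by the value of their name attribute
- By.XPATH: locates elements with an XPath expression
- By.LINK_TEXT: locates anchor elements by their exact link text
- By.PARTIAL_LINK_TEXT: locates anchor elements whose link text contains the given value
- By.TAG_NAME: locates elements by their tag name
- By.CLASS_NAME: locates elements by class name
- By.CSS_SELECTOR: locates elements with a CSS selector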
These attributes are often used together with the find_element and find_elements methods, which you can use to find the first matching element or all matching elements, respectively.
Here's an example of how we can use the CSS selector attribute to select the first matching element with a particular “ID” and interact with it:
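A minimal sketch of what that could look like (the URL is a placeholder; the '#search-field' selector and the 't-shirt' query are the ones discussed below):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL for illustration

# Find the first element whose ID is "search-field" using a CSS selector
search_field = driver.find_element(By.CSS_SELECTOR, '#search-field')

# Type a query and simulate pressing the Enter key to submit the search
search_field.send_keys('t-shirt')
search_field.send_keys(Keys.ENTER)

driver.quit()
```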
In this example, we're using By.CSS_SELECTOR to indicate that a CSS selector is being used to find the element, and '#search-field' is the CSS selector that targets the element with the ID of search-field. The use of the hash symbol (#) before search-field is standard CSS syntax for selecting elements by their ID.
The rest of the code sends the string 't-shirt' to the input element found and then simulates pressing the Enter key, likely to submit a search query on a webpage.
Let's look at a couple more examples.
- Finding an element by tag name
- Finding elements by class name
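The snippets below sketch both cases, reusing the driver instance from the previous example; the 'h2' and '.product' selectors are the ones discussed next:

```python
# Find the first <h2> element with a CSS selector
# (equivalent to driver.find_element(By.TAG_NAME, 'h2'))
heading = driver.find_element(By.CSS_SELECTOR, 'h2')
print(heading.text)

# Find all elements with the class "product"
# (similar to driver.find_elements(By.CLASS_NAME, 'product'))
products = driver.find_elements(By.CSS_SELECTOR, '.product')
for product in products:
    print(product.text)
```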
In these examples, By.CSS_SELECTOR is used with appropriate CSS selector strings: h2 selects elements by tag name (just as By.TAG_NAME, 'h2' does), and .product selects elements by class name (similar to By.CLASS_NAME, 'product'), with the dot (.) prefix indicating a class name in CSS selector syntax.
CSS selectors with Beautiful Soup
While Beautiful Soup supports a wide range of CSS selectors for parsing HTML documents, the categorization and naming can vary slightly compared to Selenium WebDriver. Here are some parallels and differences based on the aforementioned Selenium CSS selectors:
1. ID selector
🧪 Selenium WebDriver: Uses # to select elements by ID, e.g., #example.
🥣 Beautiful Soup: Similarly supports ID selectors using the # syntax in the .select() method, e.g., soup.select('#example').
2. ClassName selector
🧪 Selenium WebDriver: Uses . to select elements by class name, e.g., .example.
🥣 Beautiful Soup: Also supports class name selectors using the . syntax, e.g., soup.select('.example').
3. Attribute selector
🧪 Selenium WebDriver: Allows selection by any attribute, e.g., [attribute=value].
🥣 Beautiful Soup: Offers comprehensive support for attribute selectors, including presence [attr], exact value [attr=value], substring matches [attr*=value], starts with [attr^=value], and ends with [attr$=value].
4. Substring selector
🧪 Selenium WebDriver: Refers to using selectors based on substring matches within attribute values.
🥣 Beautiful Soup: Directly supports substring selectors within attributes, such as [attr*=value] for contains, [attr^=value] for starts with, and [attr$=value] for ends with, as shown in the sketch after this list.
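To make these parallels concrete, here's a small sketch that runs each selector type against a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<div id="example" class="example">
  <a href="/products/shoes">Shoes</a>
  <a href="/products/hats">Hats</a>
  <img src="banner.png" alt="Sale banner">
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#example'))              # ID selector
print(soup.select('.example'))              # class name selector
print(soup.select('[alt]'))                 # attribute presence
print(soup.select('a[href^="/products"]'))  # attribute starts with
print(soup.select('a[href*="hat"]'))        # attribute contains
print(soup.select('a[href$="shoes"]'))      # attribute ends with
```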
Here's an example of using a CSS selector with Beautiful Soup:
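A sketch of that example, assuming we fetch the Hacker News front page with the requests library:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the Hacker News front page and parse the HTML
response = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Select all elements with the class "athing"
stories = soup.select('.athing')
print(f'Found {len(stories)} elements')
```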
Here, .select('.athing') uses a CSS selector to find all elements with the class athing. Let's adapt this to demonstrate using a CSS selector to target the story titles, which are wrapped in elements with the class 'titleline':
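A sketch that builds on the soup object parsed above:

```python
# Select all elements with the class "titleline"
titles = soup.select('.titleline')

for title in titles:
    link = title.find('a')  # titles are usually wrapped in an <a> element
    if link:
        print(link.get_text())
    else:
        print(title.get_text())
```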
In this example, .select('.titleline') is used to find all elements with the class name 'titleline'. Since titles on Y Combinator’s Hacker News often contain links (<a> elements), the script checks whether each title element contains an <a> element and prints the text accordingly. This ensures that the actual text of the title is printed, whether or not it's wrapped in a link.
Choosing between Selenium and Beautiful Soup
Beautiful Soup offers a rich set of CSS selector capabilities similar to those in Selenium, so the decision of which to use comes down to your particular use case.
As a rule of thumb, when web scraping, you should always start by inspecting the website to understand whether its content is generated dynamically. If not, Beautiful Soup would be a good choice due to its friendly syntax, minimal setup requirements, and speed.
If the website does require JavaScript to display its content, then your best bet would be to go with Selenium due to its ability to spawn and control an actual browser, as this means it's able to load and scrape dynamically generated content.
You can learn more about both Beautiful Soup and Selenium in the web scraping tutorials below.