There's a reason Beautiful Soup is one of the most popular Python libraries for web scraping. Even for a mere dabbler in Python like myself, it's a straightforward and lightweight tool.
Of course, it has its limitations, but when it comes to parsing HTML and XML documents and extracting data from them, Beautiful Soup is the go-to tool for web scraping in Python. Its find-by-class feature saves a lot of time when collecting data from the web.
In this tutorial, you'll learn how to use Beautiful Soup to find elements by their class attributes so you can apply these learnings to your own data extraction projects.
Prerequisites
- You should have Python installed on your system.
- You need to install BeautifulSoup, requests, and lxml if you haven't already. You can install them using pip:
pip install beautifulsoup4 requests lxml
How to find by class in Beautiful Soup
For the impatient copy-pasters amongst you, let's begin with the final code. Then, I'll break it down for the rest of you so you can understand what I did and why it looks like this.
import requests
from bs4 import BeautifulSoup
url = "https://realpython.github.io/fake-jobs/"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    job_cards = soup.find_all('div', class_='card-content')
    for card in job_cards:
        title_element = card.find('h2', class_='title is-5')
        company_element = card.find('h3', class_='subtitle is-6 company')
        location_element = card.find('p', class_='location')
        date_element = card.find('time')
        title = title_element.text.strip() if title_element else "No Title Found"
        company = company_element.text.strip() if company_element else "No Company Found"
        location = location_element.text.strip() if location_element else "No Location Found"
        date_posted = date_element.text.strip() if date_element else "No Date Found"
        print(f"Job Title: {title}")
        print(f"Company: {company}")
        print(f"Location: {location}")
        print(f"Date Posted: {date_posted}\n")
else:
    print("Failed to retrieve the webpage, status code:", response.status_code)
As you may have noticed, we're targeting the Fake Python sandbox. The class we're hunting for is card-content, so we can extract all the job cards.
Let's break it down.
Step 1. Import libraries
First, you need to import the necessary libraries.
from bs4 import BeautifulSoup
import requests
Step 2. Make a request
url = "https://realpython.github.io/fake-jobs/"
response = requests.get(url)
Using requests.get(), we send an HTTP GET request to the specified URL and store the response in the response variable. This contains the webpage content we want to scrape.
Step 3. Create a Beautiful Soup object and find elements by class
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    job_cards = soup.find_all('div', class_='card-content')
Before proceeding, we check the HTTP status code. A status code of 200 indicates success. If the status is not 200, we output an error message.
If the request is successful, we pass the page content to BeautifulSoup, specifying the lxml parser. This creates a soup object, which represents the document as a nested data structure.
Here, soup.find_all() searches for all elements that match the given criteria. The class_ parameter (written with a trailing underscore because class is a reserved word in Python) looks for elements with the class card-content.
- Single element: If you're looking for the first occurrence, use .find().
element = soup.find('div', class_='your-class-name')
print(element)
- Multiple elements: If you want to retrieve all elements with that class, use .find_all().
elements = soup.find_all('div', class_='your-class-name')
for element in elements:
    print(element)
- CSS selectors: Alternatively, use the .select() method to find elements using CSS selectors. You can learn more about Python CSS selectors here.
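To compare the three approaches side by side without making a network request, here is a minimal sketch that runs against a small inline HTML snippet. The markup is invented for illustration (it is not the real job board page), and the sketch uses Python's built-in html.parser instead of lxml so that it runs with only beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a job board (not the real page).
html = """
<div class="card-content">First card</div>
<div class="card-content highlighted">Second card</div>
<div class="footer">Not a card</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .find() returns the first match, or None if nothing matches.
first_card = soup.find("div", class_="card-content")

# .find_all() returns a list-like ResultSet containing every match.
all_cards = soup.find_all("div", class_="card-content")

# .select() takes a CSS selector and also returns every match.
selected_cards = soup.select("div.card-content")

print(first_card.text)                      # First card
print(len(all_cards), len(selected_cards))  # 2 2
```

Note that the second div still matches even though it carries an extra class: a single class name passed to class_ matches any element that has that class among its classes.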
Step 4. Process each element
for card in job_cards:
    title_element = card.find('h2', class_='title is-5')
    company_element = card.find('h3', class_='subtitle is-6 company')
    location_element = card.find('p', class_='location')
    date_element = card.find('time')
Each card represents a job posting. We call the find method on each card to extract the individual pieces of information: the job title, company name, location, and date posted.
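One detail worth knowing about a value like 'title is-5': when class_ receives a string containing several class names, Beautiful Soup compares it against the exact value of the class attribute, so the order of the class names matters. The sketch below illustrates this on a made-up one-line snippet. The markup is an assumption, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical heading with two classes, in the order "title is-5".
html = '<h2 class="title is-5">Senior Python Developer</h2>'
soup = BeautifulSoup(html, "html.parser")

# Same order as in the markup: matches.
same_order = soup.find("h2", class_="title is-5")

# Same classes in a different order: no match, because the string is
# compared against the exact class attribute value.
swapped_order = soup.find("h2", class_="is-5 title")

# A single class name matches regardless of the other classes present.
single_class = soup.find("h2", class_="title")

print(same_order is not None)     # True
print(swapped_order is not None)  # False
print(single_class is not None)   # True
```

If the order of classes on the page might vary, a CSS selector such as h2.title.is-5 passed to .select_one() does not depend on the order.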
Step 5. Extract text data and handle missing data
Next, we scrape the text from each element and handle cases where any information might be missing:
title = title_element.text.strip() if title_element else "No Title Found"
company = company_element.text.strip() if company_element else "No Company Found"
location = location_element.text.strip() if location_element else "No Location Found"
date_posted = date_element.text.strip() if date_element else "No Date Found"
.text retrieves the text content of each HTML element, and .strip() removes leading and trailing whitespace from the string. The conditional expressions handle the possibility that an element was not found (in which case find returns None), which prevents the script from crashing.
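Since the same conditional expression is repeated for every field, one option is to move it into a small helper. The sketch below is only an illustration: text_or_default is a hypothetical name that is not part of Beautiful Soup, and the inline markup is invented so the example runs offline with html.parser.

```python
from bs4 import BeautifulSoup


def text_or_default(element, default):
    """Return the element's stripped text, or `default` if element is None."""
    return element.text.strip() if element is not None else default


# Hypothetical card that has a title but no location element.
html = """
<div class="card-content">
  <h2 class="title is-5">  Energy engineer  </h2>
</div>
"""
card = BeautifulSoup(html, "html.parser").find("div", class_="card-content")

title = text_or_default(card.find("h2", class_="title is-5"), "No Title Found")
location = text_or_default(card.find("p", class_="location"), "No Location Found")

print(title)     # Energy engineer
print(location)  # No Location Found
```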
Step 6. Output the results
Finally, print the extracted data:
print(f"Job Title: {title}")
print(f"Company: {company}")
print(f"Location: {location}")
print(f"Date Posted: {date_posted}\n")
This prints the job title, company, location, and date posted for each job listing.
Step 7. Handle unsuccessful requests
The final part of our script provides feedback if something goes wrong with the HTTP request.
else:
    print("Failed to retrieve the webpage, status code:", response.status_code)
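Checking for a 200 status code covers HTTP-level failures, but a request can also fail before any response arrives, for example through a timeout or a connection error. As a hedged alternative to the if/else above, the sketch below wraps the request in a helper that catches those cases too. The name fetch_html and the 10-second default timeout are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers HTTP errors, timeouts, connection errors, and invalid URLs.
        print("Failed to retrieve the webpage:", error)
        return None
    return response.text
```

You could then call fetch_html("https://realpython.github.io/fake-jobs/") and pass the result to BeautifulSoup only when it is not None.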
Complex class scenarios
In practical scenarios, you might encounter situations where simply using find_all(class_='class-name') isn't enough. For example:
- An element might have multiple classes, and you want to select it based on a combination of these. If you pass a list of classes to the class_ parameter, find_all() matches elements that have any of the listed classes. To require all of them, chain the classes in a CSS selector, such as .select('.first-class.second-class'), where first-class and second-class are placeholder names.
- You need to find elements that have one class name out of many but are also children or siblings of elements with another specific class. In such cases, Beautiful Soup's .select() method lets you use CSS selectors, which can target elements that find_all() might not easily select.
- You might only know a part of the class name, or the class name might have dynamic parts that change. In those instances, you can use CSS attribute selectors with ^, $, or * to match elements whose class attribute begins with (^), ends with ($), or contains (*) a specific string.
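The scenarios above can be sketched on a small inline document. The class names and markup below are invented for illustration, texts is a hypothetical helper that just makes the results easy to print, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering multi-class and partial-class cases.
html = """
<div class="job featured">A</div>
<div class="job">B</div>
<div class="featured">C</div>
<div class="job-card-x91">D</div>
<div class="promo-card">E</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(elements):
    """Collect the text of each element so results are easy to compare."""
    return [element.text for element in elements]


# A list passed to class_ matches elements that have ANY of the classes.
any_of = texts(soup.find_all("div", class_=["job", "featured"]))

# Chained class selectors match elements that have ALL of the classes.
all_of = texts(soup.select("div.job.featured"))

# Attribute selectors compare against the whole class attribute string.
starts_with = texts(soup.select('div[class^="job-"]'))
ends_with = texts(soup.select('div[class$="-card"]'))
contains = texts(soup.select('div[class*="card"]'))

print(any_of)       # ['A', 'B', 'C']
print(all_of)       # ['A']
print(starts_with)  # ['D']
print(ends_with)    # ['E']
print(contains)     # ['D', 'E']
```

Because ^ and $ look at the full attribute string, they only see the first and last class when an element has several classes. The * form is usually the safer choice for matching a fragment of a class name.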
Introduction to CSS selectors
CSS selectors are patterns used to select elements based on their attributes, classes, IDs, and the structure of the HTML document. Unlike the find() and find_all() methods, which select elements by tags and classes, .select() allows you to use these patterns, giving you the power to navigate complex HTML structures with precision.
When to use .select()
.select() becomes particularly useful when:
- You want to select elements that are descendants of another element.
- You need to select children elements directly within a parent element.
- You want to select elements based on attributes other than class or ID.
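Each of these three cases maps to a standard piece of CSS selector syntax. Here is a minimal sketch on an invented snippet (the markup, class names, and URLs are assumptions for illustration), again using html.parser so it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical card with a nested heading, a direct child, and two links.
html = """
<div class="card">
  <div class="media">
    <h2 class="title">Nested title</h2>
  </div>
  <p class="location">Direct child</p>
  <a href="https://example.com/apply" target="_blank">Apply</a>
  <a href="/learn">Learn</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (a space): h2.title at any depth inside div.card.
descendants = soup.select("div.card h2.title")

# Child combinator (>): only elements directly inside div.card.
direct_children = soup.select("div.card > p.location")
not_a_direct_child = soup.select("div.card > h2.title")

# Attribute selector: links whose href starts with "https://".
external_links = soup.select('a[href^="https://"]')

print([tag.text for tag in descendants])      # ['Nested title']
print([tag.text for tag in direct_children])  # ['Direct child']
print(not_a_direct_child)                     # []
print([tag.text for tag in external_links])   # ['Apply']
```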
Using .select() in Beautiful Soup
Let's apply .select() to our previous code to see how it can be used to scrape data with more specific requirements.
import requests
from bs4 import BeautifulSoup
url = "https://realpython.github.io/fake-jobs/"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    # CSS selector for all 'div' elements with the class 'card-content'
    job_cards = soup.select('div.card-content')
    for card in job_cards:
        # select_one() returns the first match for the selector, or None
        title_element = card.select_one('h2.title.is-5')  # 'h2' with both classes
        company_element = card.select_one('h3.subtitle.is-6.company')  # 'h3' with all three classes
        location_element = card.select_one('p.location')  # 'p' with the class 'location'
        date_element = card.select_one('time')  # the 'time' element
        title = title_element.text.strip() if title_element else "No Title Found"
        company = company_element.text.strip() if company_element else "No Company Found"
        location = location_element.text.strip() if location_element else "No Location Found"
        date_posted = date_element.text.strip() if date_element else "No Date Found"
        print(f"Job Title: {title}")
        print(f"Company: {company}")
        print(f"Location: {location}")
        print(f"Date Posted: {date_posted}\n")
else:
    print("Failed to retrieve the webpage, status code:", response.status_code)
The use of CSS selectors via the .select() method offers much greater flexibility than .find() and .find_all() when targeting elements. CSS selectors can match elements based on their relationship to other elements, attributes, and more complex patterns. This allows you to extract data from web pages that have a more nested or intricate HTML structure without much hassle.
Learn more about scraping and parsing with Beautiful Soup
Now you know the basics of using the .find() and .find_all() methods with Beautiful Soup and when to use CSS selectors via the .select() method. But if you want to learn more about using Beautiful Soup for web scraping and data parsing, let me refer you to greater experts than myself in the tutorials below.