What are the best Python web scraping libraries?

Python developers have access to some of the best web scraping libraries and frameworks available. See how they work in practice and how to choose.

Introduction

Web scraping is essentially a way to automate the process of extracting data from the web. As a Python developer, you have access to some of the best libraries and frameworks available to help you get the job done.

We're going to take a look at some of the most popular Python libraries and frameworks for web scraping and compare their pros and cons so you know exactly what tool to use to tackle any web scraping project you might come across.

HTTP Libraries - Requests and HTTPX

First up, let's talk about HTTP libraries. These are the foundations of web scraping since every scraping job starts by making a request to a website and retrieving its contents, usually as HTML.

Two popular HTTP libraries in Python are Requests and HTTPX.

Requests is easy to use and great for simple scraping tasks, while HTTPX offers some advanced features like async and HTTP/2 support.

Their core functionality and syntax are very similar, so I would recommend HTTPX even for smaller projects since you can easily scale up in the future without compromising performance.
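
To illustrate how close the two APIs are, here's a minimal sketch; the Hacker News URL is just a placeholder target, and the async part only matters once you need concurrent requests:

import asyncio

import httpx
import requests

URL = "https://news.ycombinator.com/news"  # placeholder target

# The synchronous APIs look almost identical...
print(requests.get(URL).status_code)
print(httpx.get(URL).status_code)

# ...but HTTPX also ships an async client, which makes concurrent requests easy
async def fetch_many(urls):
    async with httpx.AsyncClient() as client:
        responses = await asyncio.gather(*(client.get(url) for url in urls))
        return [response.status_code for response in responses]

print(asyncio.run(fetch_many([URL, URL])))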

Feature              HTTPX   Requests
Asynchronous         Yes     No
HTTP/2 support       Yes     No
Timeout support      Yes     Yes
Proxy support        Yes     Yes
TLS verification     Yes     Yes
Custom exceptions    Yes     Yes

Parsing HTML with Beautiful Soup

Once you have the HTML content, you need a way to parse it and extract the data you're interested in.

Beautiful Soup is the most popular HTML parser in Python, allowing you to navigate and search through the HTML tree structure easily. Its straightforward syntax and easy setup also make Beautiful Soup a great option for small- to medium-sized web scraping projects as well as web scraping beginners.

The two major drawbacks of Beautiful Soup are its inability to handle JavaScript-generated content and its limited scalability, which hurts performance at scale. For large projects, you would be better off using Scrapy, but more about that later.

Web scraping with Beautiful Soup and Requests
Detailed tutorial with code examples. And some handy tricks.

Next, let’s take a look at how Beautiful Soup works in practice:

from bs4 import BeautifulSoup
import httpx

# Send an HTTP GET request to the specified URL using the httpx library
response = httpx.get("https://news.ycombinator.com/news")

# Save the content of the response
yc_web_page = response.content

# Use the BeautifulSoup library to parse the HTML content of the webpage
soup = BeautifulSoup(yc_web_page, "html.parser")

# Find all elements with the class "athing" (which represent articles on Hacker News) using the parsed HTML
articles = soup.find_all(class_="athing")

# Loop through each article and extract relevant data, such as the URL, title, and rank
for article in articles:
    data = {
        "URL": article.find(class_="titleline").find("a").get('href'),  # Find the URL of the article by finding the first "a" tag within the element with class "titleline"
        "title": article.find(class_="titleline").getText(),  # Find the title of the article by getting the text content of the element with class "titleline"
        "rank": article.find(class_="rank").getText().replace(".", "")  # Find the rank of the article by getting the text content of the element with class "rank" and removing the period character
    }
    # Print the extracted data for the current article
    print(data)

Explaining the code:

1 - We start by sending an HTTP GET request to the specified URL using the HTTPX library. Then, we save the retrieved content to a variable.

2 - Now, we use the Beautiful Soup library to parse the HTML content of the webpage.

3 - This enables us to manipulate the parsed content using Beautiful Soup methods, such as find_all to find the content we need. In this particular case, we are finding all elements with the class athing, which represents articles on Hacker News.

4 - Next, we simply loop through all the articles on the page and use further find calls to specify exactly what data we would like to extract from each article. Finally, we print the scraped data to the console.
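
As a side note, Beautiful Soup also supports CSS selectors through its select and select_one methods (not used in the tutorial above); assuming the same soup object, a rough equivalent of the extraction would be:

# Alternative extraction using CSS selectors instead of find/find_all
for article in soup.select("tr.athing"):
    data = {
        "URL": article.select_one(".titleline a")["href"],
        "title": article.select_one(".titleline").get_text(),
        "rank": article.select_one(".rank").get_text().replace(".", ""),
    }
    print(data)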

Browser automation libraries - Selenium and Playwright

What if the website you're scraping relies on JavaScript to load its content? In that case, an HTML parser won't be enough: you'll need to launch a browser instance to render the page's JavaScript using a browser automation tool like Selenium or Playwright.

These are primarily testing and automation tools that allow you to control a web browser programmatically, including clicking buttons, filling out forms, and more. However, they are also often used in web scraping as a means to access dynamically generated data on a webpage.

While Selenium and Playwright are very similar in their core functionality, Playwright is more modern and complete than Selenium.

For example, Playwright offers some unique built-in features, such as automatically waiting for elements to be visible before performing actions and an asynchronous version of its API using asyncio.

What is Playwright and why use it?
Learn why Playwright is ideal for web scraping and automation.

To exemplify how we can use Playwright for web scraping, let's quickly walk through a code snippet where we use Playwright to extract data from an Amazon product page and save a screenshot of the page while we're at it.

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=False)
        page = await browser.new_page()

        await page.goto("https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C")

        # Define the CSS selectors for the data we want to scrape
        selectors = ['#productTitle', 'span.author a', '#productSubtitle', '.a-size-base.a-color-price.a-color-price']

        book_data = await asyncio.gather(*(page.query_selector(sel) for sel in selectors))

        # Pair each field name with its element and keep only the ones that were found
        book = {}
        for key, elem in zip(["book_title", "author", "edition", "price"], book_data):
            if elem:
                book[key] = await elem.inner_text()

        print(book)

        await page.screenshot(path="book.png")

        await browser.close()

asyncio.run(main())

Explaining the code:

  1. Import the necessary modules: asyncio and async_playwright from Playwright's async API.
  2. After importing the necessary modules, we start by defining an async function called main that launches a Firefox browser instance with headless mode set to False so we can actually see the browser working. We create a new page in the browser using the new_page method and finally navigate to the Amazon website using the goto method.
  3. Next, we define a list of CSS selectors for the data we want to scrape. Then, we use asyncio.gather to run page.query_selector concurrently for every selector in the list and store the results in the book_data variable.
  4. Now, we iterate over book_data to populate the book dictionary with the scraped data. Note that we also check that each element is not None and only add the elements that exist. This is considered good practice since websites can make small changes that will affect your scraper. You could even expand on this example and write more complex tests to ensure the data being extracted is not missing any values (see the sketch after this list).
  5. Finally, we print the book dictionary contents to the console and take a screenshot of the scraped page, saving it as a file called book.png.
  6. As a last step, we make sure to close the browser instance.
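
As a small illustration of the kind of check step 4 alludes to, here's a hypothetical helper (the field names are the ones used in the example above; it isn't part of the original snippet):

def check_scraped_fields(book: dict) -> list:
    # Hypothetical helper: return the expected fields that came back missing or empty
    expected_fields = ["book_title", "author", "edition", "price"]
    return [field for field in expected_fields if not book.get(field)]

# Example: reports ['price'] if the price selector stopped matching
print(check_scraped_fields({"book_title": "The Hitchhiker's Guide to the Galaxy", "author": "Douglas Adams", "edition": "Kindle Edition"}))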

But wait! If browser automation tools can be used to scrape virtually any webpage and, on top of that, can also make it easier for you to automate tasks, test and visualize your code working, why don’t we always use Playwright or Selenium for web scraping?

Well, despite being powerful scraping tools, these libraries and frameworks have noticeable drawbacks. It turns out that generating a browser instance is a very resource-heavy action when compared to simply retrieving the page’s HTML. This can easily become a huge performance bottleneck for large scraping jobs, which will take longer to complete and become considerably more expensive. For that reason, we usually want to limit the usage of these tools to only the necessary tasks and, when possible, use them together with Beautiful Soup or Scrapy.
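
As a rough sketch of that combined approach, the browser can be limited to rendering the page while Beautiful Soup does the parsing (the Hacker News page from earlier is reused here purely for illustration; it doesn't actually require JavaScript):

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Use the browser only for the expensive part: rendering the page
with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()
    page.goto("https://news.ycombinator.com/news")
    html = page.content()  # fully rendered HTML
    browser.close()

# Hand the rendered HTML over to Beautiful Soup for parsing
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all(class_="athing")
print([article.find(class_="titleline").getText() for article in articles[:5]])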

Web Scraping with Scrapy
A hands-on guide for web scraping with Scrapy.

Scrapy

Next, we have the most popular and arguably the most powerful web scraping framework for Python.

If you need to scrape large amounts of data regularly, then Scrapy could be a great option.

The Scrapy framework offers a full-fledged suite of tools to aid you even in the most complex scraping jobs.

On top of its superior performance compared to Beautiful Soup, Scrapy integrates easily with other Python data-processing tools and even other libraries, such as Playwright.

Not only that, but it comes with a handy collection of built-in features catered specifically to web scraping, such as:

  • Powerful and flexible spidering framework: Scrapy provides a built-in spidering framework that allows you to easily define and customize web crawlers to extract the data you need.
  • Fast and efficient: Scrapy is designed to be fast and efficient, allowing you to extract data from large websites quickly and with minimal resource usage.
  • Support for common web data formats: Export scraped data in multiple formats, such as CSV, XML, and JSON.
  • Extensible architecture: Easily add custom functionality through middleware, pipelines, and extensions.
  • Distributed scraping: Scrapy supports distributed scraping, allowing you to scale up your web scraping operation across multiple machines.
  • Error handling: Scrapy has robust error-handling capabilities, allowing you to handle common errors and exceptions that may occur during web scraping.
  • Support for authentication and cookies: Handle authentication and cookies to scrape websites that require login credentials.
  • Integration with other Python tools: Scrapy can be easily integrated with other Python tools, such as data processing and storage libraries, making it a powerful tool for end-to-end data pipelines.

Here's an example of how to use a Scrapy Spider to scrape data from a website:

import scrapy

class HackernewsSpiderSpider(scrapy.Spider):
    name = 'hackernews_spider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        articles = response.css('tr.athing')
        for article in articles:
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                "rank": article.css(".rank::text").get().replace(".", "")
            }

We can use the following command to run this script and save the resulting data to a JSON file:

scrapy crawl hackernews_spider -o hackernews.json
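
Scrapy's feed exports infer the output format from the file extension, so the same spider can produce, for example, CSV or XML output without any code changes:

scrapy crawl hackernews_spider -o hackernews.csv
scrapy crawl hackernews_spider -o hackernews.xml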

Explaining the code:

The code example uses Scrapy to scrape data from the Hacker News website (news.ycombinator.com). Let's break down the code step by step:

After importing the necessary modules, we define the Spider class we want to use:

class HackernewsSpiderSpider(scrapy.Spider):

Next, we set the Spider properties:

  • name: The name of the spider (used to identify it).
  • allowed_domains: A list of domains that the spider is allowed to crawl.
  • start_urls: A list of URLs to start crawling from.

name = 'hackernews_spider'
allowed_domains = ['news.ycombinator.com']
start_urls = ['http://news.ycombinator.com/']

Then, we define the parse method: This method is the entry point for the spider and is called with the response of the URLs specified in start_urls.

def parse(self, response):

In the parse method, we will extract data from the HTML response: The response object represents the HTML page received from the website. The spider uses CSS selectors to extract relevant data from the HTML structure.

articles = response.css('tr.athing')

Now, we use a for loop to iterate over each article found on the page.

for article in articles:

Finally, the spider extracts the URL, title, and rank information for each article using CSS selectors and yields a Python dictionary containing this data.

yield {
    "URL": article.css(".titleline a::attr(href)").get(),
    "title": article.css(".titleline a::text").get(),
    "rank": article.css(".rank::text").get().replace(".", "")
}
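
Since parse is just a callback, the same spider could also crawl beyond the first page. Here's a hedged sketch of that idea, assuming Hacker News exposes its "More" link with the class morelink (this isn't covered by the original example):

def parse(self, response):
    for article in response.css('tr.athing'):
        yield {
            "URL": article.css(".titleline a::attr(href)").get(),
            "title": article.css(".titleline a::text").get(),
            "rank": article.css(".rank::text").get().replace(".", ""),
        }

    # Follow pagination by requesting the next page, if one exists
    next_page = response.css('a.morelink::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
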
Python & JavaScript alternatives to Scrapy
5 Scrapy alternatives for web scraping you need to try.
Did you know you can now deploy your Scrapy spiders to Apify? Learn more here.

Which Python scraping library is right for you?

So, which library should you use for your web scraping project? The answer depends on the specific needs and requirements of your project. Each web scraping library and framework presented here has a unique purpose in an expert scraper's toolkit. Learning to use each one will give you the flexibility to select the best tool for each job, so don't be afraid to try them before deciding!

Whether you are scraping with BeautifulSoup, Scrapy, Selenium, or Playwright, the Apify Python SDK helps you run your project in the cloud at any scale.

Percival Villalva
Developer Advocate on a mission to help developers build scalable, human-like bots for data extraction and web automation.