8 best Python web scraping libraries in 2024

Python developers have access to some of the best web scraping libraries and frameworks available. Learn what they do and when to use them.

This article has been updated to include the new Crawlee for Python library. See how it compares to other web scraping libraries and frameworks in the Python ecosystem.


Web scraping is essentially a way to automate the process of extracting data from the web. Python has some of the best libraries and frameworks available to help you get the job done.

We're going to take a look at some of the most popular libraries and frameworks for web scraping in Python and compare their pros and cons so you know exactly what tool to use to tackle any web scraping project you might come across.

1. Crawlee

Crawlee is a complete web scraping and browser automation library designed to “help you build reliable crawlers fast”. Crawlee for Node.js and TypeScript was launched in the summer of 2022. Crawlee for Python was released in July 2024 and has already garnered over 3,500 stars on GitHub.

✨ Features

  • A unified interface for HTTP and headless browsers.
  • Type hint coverage and code maintainability.
  • Automatic parallel crawling.
  • Persistent queue for URLs to crawl.
  • Integrated proxy rotation and session management.
  • Configurable request routing.
  • Automatic error handling.
  • Pluggable storage of both tabular data and files.
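
To make "automatic parallel crawling" and a "queue for URLs to crawl" more concrete, here is a minimal conceptual sketch that uses only the standard library. It is not Crawlee's implementation: the `FAKE_SITE` dictionary and the `fetch_links` and `crawl` functions are hypothetical stand-ins, and the queue lives in memory rather than being persisted.

```python
import asyncio

# Hypothetical in-memory "website": each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


async def fetch_links(url: str) -> list[str]:
    """Stand-in for a real HTTP request plus link extraction."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return FAKE_SITE.get(url, [])


async def crawl(start_url: str, max_concurrency: int = 3) -> set[str]:
    """Visit every reachable URL once, using a shared queue and several workers."""
    queue: asyncio.Queue[str] = asyncio.Queue()
    seen = {start_url}  # deduplication: never enqueue the same URL twice
    await queue.put(start_url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if link not in seen:
                        seen.add(link)
                        await queue.put(link)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_concurrency)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


visited = asyncio.run(crawl("https://example.com/"))
print(sorted(visited))
```

A library like Crawlee handles this bookkeeping for you and adds retries, proxy rotation, and storage on top, which is the main reason to reach for it instead of hand-rolling a loop like this one.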

👍 Pros

  • Unlike the other full-fledged web crawling and scraping library in this list (Scrapy), Crawlee is quite easy to set up and learn. It provides ready-made templates and only a single file to add the code.
  • Combines multiple web scraping features and techniques.
  • Facilitates clean, maintainable code.

👎 Cons

  • A new library, so tutorials are still limited.

🤔 Alternatives

Scrapy, Playwright, Beautiful Soup

🔰 Install Crawlee

To get started with Crawlee for Python, run the following command:

pipx run crawlee create my-crawler

📜 Code example

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        # headless=False,
        # browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            "request_url": context.request.url,
            "page_url": context.page.url,
            "page_title": await context.page.title(),
            "page_content": (await context.page.content())[:10000],
        }
        await context.push_data(data)

    await crawler.run(["https://crawlee.dev"])


if __name__ == "__main__":
    asyncio.run(main())

Example of using Crawlee's built-in PlaywrightCrawler to extract a page's title and content.

2. Requests

Every scraping job starts by making a request to a website and retrieving its contents, usually as HTML. Requests is an HTTP library designed to make this task simple, earning its tagline, "HTTP for humans." That simplicity has made Requests one of the most downloaded Python packages.

✨ Features

  • Simple and intuitive API for making HTTP requests.
  • Handles GET, POST, PUT, DELETE, HEAD, and OPTIONS requests.
  • Automatically decodes content based on the response headers.
  • Allows for persistent connections across requests.
  • Built-in support for SSL/TLS verification, with the option to bypass it.
  • Easily add headers, parameters, and cookies to requests.
  • Set timeouts and retry policies for requests.
  • Supports large file downloads by streaming responses in chunks.
  • Supports proxy configuration.
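
Several of these features (persistent connections, custom headers, timeouts, and retry policies) are usually combined in a single `Session`. The sketch below shows one common way to do that. The `build_session` helper, the retry numbers, and the User-Agent string are illustrative assumptions, not recommended values.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Return a Session that reuses connections and retries transient failures."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,  # wait longer between successive attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": "my-scraper/0.1"})  # hypothetical UA string
    return session


session = build_session()

# Requests applies no timeout by default, so pass one on every call, for example:
# response = session.get("https://api.example.com/data", params={"page": 1}, timeout=10)
```

Mounting the adapter on both URL schemes means every request made through the session shares the same retry policy and connection pool.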

👍 Pros

  • Simplifies complex HTTP tasks with a clean and readable syntax.
  • Large user base and community support.
  • Well-documented with numerous examples and guides.

👎 Cons

  • Not as fast as some lower-level libraries like http.client or urllib3 for highly performance-sensitive applications.
  • Lacks built-in asynchronous capabilities, so non-blocking requests require a different library such as aiohttp or HTTPX.
  • The library can be considered heavy for minimalistic environments or resource-constrained applications.

🤔 Alternatives

httpx, urllib3, http.client, aiohttp

🔰 Install Requests

To install the Requests library, use pip, the Python package manager:

pip install requests

📜 Code example

import requests

response = requests.get('https://api.example.com/data')
if response.status_code == 200:
    data = response.json()  # Parse JSON response
    print(data)
else:
    print(f"Request failed with status code: {response.status_code}")

Simple example of making a GET request and handling the response

3. HTTPX

HTTPX is another HTTP library. What sets it apart from Requests is support for advanced features like async requests and HTTP/2. Because the two libraries share very similar core functionality, we recommend HTTPX even for smaller projects: you can scale up later without compromising performance.

✨ Features

  • Built-in async capabilities using asyncio, allowing for non-blocking HTTP requests.
  • Supports HTTP/2 for improved performance over HTTP/1.1 (enabled by installing the optional http2 extra and passing http2=True to the client).
  • Offers both sync and async interfaces to provide flexibility based on your needs.
  • Efficient management of connections with automatic connection pooling.
  • Follows redirects only when asked to (follow_redirects=True), unlike Requests, which follows them by default, giving more control over redirection behavior.
  • Ability to customize the HTTP transport, including the use of custom connection pools and proxies.
  • Supports streaming responses, cookie management, and multipart uploads.

👍 Pros

  • Allows for non-blocking requests, which makes it ideal for I/O-bound tasks or applications requiring high concurrency.
  • Built with modern web standards and practices in mind, including HTTP/2 support.

👎 Cons

  • For developers unfamiliar with asynchronous programming, there may be a steeper learning curve compared to Requests.
  • While rapidly gaining popularity, it is newer than Requests and may have a smaller community and fewer resources available.

🤔 Alternatives

Requests, aiohttp, urllib3, http.client

🔰 Install HTTPX

To install the HTTPX library, use pip, the Python package manager:

pip install httpx

📜 Code example

import httpx
import asyncio

async def fetch_data():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://api.example.com/data')
        if response.status_code == 200:
            data = response.json()  # Parse JSON response
            print(data)
        else:
            print(f"Request failed with status code: {response.status_code}")

# Run the asynchronous function
asyncio.run(fetch_data())

Simple example of making an asynchronous GET request and handling the response

4. Beautiful Soup

Once you have HTML content, you need a way to parse it and extract the data you're interested in. Enter Beautiful Soup, one of the most popular Python HTML parsers. It lets you navigate and search through the HTML tree structure easily. Its straightforward syntax and easy setup also make Beautiful Soup a great option for small- to medium-sized web scraping projects and web scraping beginners.

✨ Features

  • HTML/XML parsing.
  • Navigation of parse trees.
  • Handles different encodings and automatically converts documents to Unicode, ensuring compatibility.
  • Works with multiple parsers like lxml, html.parser, and html5lib, offering flexibility in handling different parsing needs.
  • Easily access and modify tags, attributes, and text within the document.
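
As a quick illustration of navigating a parse tree and of reading and modifying tags, the sketch below parses a small inline HTML snippet with CSS selectors. The snippet, the field names, and the `https://example.com` prefix are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
html = """
<ul id="books">
  <li class="book"><a href="/b/1">Dune</a> <span class="price">$9.99</span></li>
  <li class="book"><a href="/b/2">Emma</a> <span class="price">$4.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Read data: one dictionary per list item, located with CSS selectors.
books = []
for item in soup.select("li.book"):
    link = item.select_one("a")
    books.append(
        {
            "title": link.get_text(strip=True),
            "url": link["href"],
            "price": float(item.select_one(".price").get_text().lstrip("$")),
        }
    )

# Modify data: tags can be changed in place, here turning relative links absolute.
for link in soup.find_all("a"):
    link["href"] = "https://example.com" + link["href"]

print(books)
print(soup.find("a")["href"])
```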

👍 Pros

  • Designed to be simple and easy to use, even for beginners, with a gentle learning curve.
  • Works well with a variety of parsing libraries and is adaptable to different scraping tasks.
  • Comprehensive documentation and numerous tutorials available, making it easy to get started.
  • Effectively parses and extracts data from poorly structured HTML, which is common on the web.
  • Popular in the web scraping community, ensuring plenty of resources and community-driven solutions.

👎 Cons

  • Limited scalability.
  • Inability to scrape JavaScript-heavy websites.

🤔 Alternatives

lxml, html5lib

🔰 Install Beautiful Soup

To install Beautiful Soup, use pip to install the package beautifulsoup4. We also recommend installing lxml or html5lib for better parsing capabilities:

pip install beautifulsoup4 lxml

📜 Code example

from bs4 import BeautifulSoup
import httpx

# Send an HTTP GET request to the specified URL using the httpx library
response = httpx.get("https://news.ycombinator.com/news")

# Save the content of the response
yc_web_page = response.content

# Use the BeautifulSoup library to parse the HTML content of the webpage,
# naming the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(yc_web_page, "html.parser")

# Find all elements with the class "athing" (which represent articles on Hacker News) using the parsed HTML
articles = soup.find_all(class_="athing")

# Loop through each article and extract the URL, title, and rank
for article in articles:
    title_link = article.find(class_="titleline").find("a")  # First "a" tag inside the element with class "titleline"
    data = {
        "URL": title_link.get("href"),
        "title": title_link.getText(),  # Link text only, without the trailing site label
        "rank": article.find(class_="rank").getText().replace(".", ""),  # Rank text with the period removed
    }
    # Print the extracted data for the current article
    print(data)

Example of using the Beautiful Soup library to parse the HTML content of the webpage

5. Mechanical Soup

Mechanical Soup is a Python library that acts as a higher-level abstraction over the popular Requests and BeautifulSoup libraries. It simplifies the process of web scraping by combining the ease of use of Requests with the HTML parsing capabilities of Beautiful Soup.

✨ Features

  • Streamlines the process of making HTTP requests to websites and makes it easy to fetch web pages and interact with them.
  • Integrates with Beautiful Soup's powerful HTML parsing capabilities to allow easy data extraction from websites.
  • Has convenient methods for submitting HTML forms on web pages, which simplifies automated interaction with websites that require form submission.
  • Supports session management and helps maintain stateful interactions with websites across multiple requests.
  • Like Requests, Mechanical Soup supports proxy configuration, which lets you scrape data anonymously or bypass IP restrictions.

👍 Pros

  • Provides a simplified interface for web scraping tasks.
  • Seamless integration with Beautiful Soup for HTML parsing.
  • Supports form submission and session handling.
  • Offers proxy support for anonymity and bypassing restrictions.

👎 Cons

  • Limited advanced features compared to Crawlee, Scrapy, or Playwright.
  • May not be suitable for complex or large-scale scraping projects.

🤔 Alternatives

Selenium, Playwright, Beautiful Soup

🔰 Install Mechanical Soup

To install MechanicalSoup, run this command in your terminal or command prompt:

pip install MechanicalSoup

📜 Code example

import mechanicalsoup

# Create a MechanicalSoup browser instance
browser = mechanicalsoup.StatefulBrowser()

# Perform a GET request to a webpage
browser.open("https://example.com")

# Extract data using BeautifulSoup methods
page_title = browser.get_current_page().title.text

print("Page Title:", page_title)

Simple example of using Mechanical Soup to open a web page and extract its title

6. Selenium

Selenium is a widely used web automation tool that allows developers to programmatically interact with web browsers. It is commonly used for testing web applications, but it also serves as a powerful tool for web scraping, especially when dealing with JavaScript-rendered websites that require dynamic content loading.

✨ Features

  • Provides the ability to control a web browser programmatically, simulating user interactions like clicking, typing, and navigating between pages.
  • Supports a wide range of browsers (Chrome, Firefox, Safari, Edge, etc.) and platforms, allowing for cross-browser testing and scraping.
  • Handles dynamic content generated by JavaScript, making it ideal for scraping modern web applications.
  • Offers comprehensive support for capturing screenshots, managing cookies, and executing custom JavaScript code.
  • Supports headless mode, which allows for automated browsing without a GUI, making scraping faster and less resource-intensive.

👍 Pros

  • Excellent for scraping and automating interactions on dynamic, JavaScript-heavy websites.
  • Supports multiple programming languages (Python, Java, C#, etc.).
  • Capable of simulating complex user interactions and handling sophisticated web applications.
  • Cross-browser and cross-platform compatibility.

👎 Cons

  • Slower than tools like Scrapy, Crawlee, or Playwright because of the overhead of full browser automation.
  • Requires additional setup for different browsers (e.g., installing WebDriver).
  • More resource-intensive, especially for large-scale scraping tasks.

🤔 Alternatives

Playwright, Mechanical Soup, Crawlee, Scrapy

🔰 Install Selenium

To install Selenium, run this command in your terminal or command prompt:

pip install selenium

📜 Code example

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the WebDriver (using Chrome in this example)
driver = webdriver.Chrome()

# Navigate to a web page
driver.get("https://example.com")

# Interact with the page (e.g., click a button)
# find_element_by_id() was removed in Selenium 4; use find_element() with a By locator
button = driver.find_element(By.ID, "submit")
button.click()

# Extract data
content = driver.page_source

# Close the browser
driver.quit()

Example of Selenium navigating to and interacting with a webpage and extracting the data

7. Playwright

Playwright is a modern web automation framework developed by Microsoft. It offers powerful capabilities for interacting with web pages, supporting multiple browsers (Chromium, Firefox, WebKit) with a single API. Playwright is highly favored for testing and automation due to its speed, reliability, and ability to handle complex web applications. Like Selenium, it's a powerful tool for web scraping when dealing with websites that require dynamic content loading.

✨ Features

  • Supports multiple browser engines (Chromium, Firefox, WebKit) in both headless and headed modes.
  • Provides built-in capabilities for handling modern web features such as file uploads/downloads, network interception, and browser contexts.
  • Facilitates automated testing and scraping of websites that rely heavily on JavaScript for rendering content.
  • Offers robust tools for handling scenarios like auto-waiting for elements, taking screenshots, and capturing videos of sessions.
  • Supports parallel execution, which enhances performance for large-scale scraping or testing tasks.

👍 Pros

  • Superior performance in handling JavaScript-heavy sites compared to Selenium.
  • Supports all major browser engines with a single API.
  • Provides more advanced features for browser automation, including network interception and parallelism.
  • Reliable and less flaky for testing and automation compared to other tools.

👎 Cons

  • Slightly steeper learning curve due to its wide range of features.
  • Less community support compared to Selenium, although it is growing rapidly.

🤔 Alternatives

Selenium, Crawlee, Scrapy

🔰 Install Playwright

To install Playwright, run this command in your terminal or command prompt:

pip install playwright

Then, you need to install the necessary browser binaries:

playwright install

📜 Code example

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    # Interact with the page
    page.click('button#submit')

    # Extract data
    content = page.content()

    browser.close()

Example of Playwright launching headless Chromium to interact with a page and extract data

8. Scrapy

Scrapy is a powerful and highly flexible Python framework for web scraping. Unlike Selenium and Playwright, which are often used for web automation, Scrapy is specifically designed for scraping large amounts of data from websites in a structured and scalable manner.

✨ Features

  • Provides a built-in spidering framework that allows you to easily define and customize web crawlers to extract the data you need.
  • Designed to be fast and efficient, allowing you to extract data from large websites quickly and with minimal resource usage.
  • Exports data in multiple formats, such as JSON, CSV, and XML.
  • Ability to add custom functionality through middleware, pipelines, and extensions.
  • Supports distributed scraping, allowing you to scale up your web scraping operation across multiple machines.
  • Built-in handling of common errors and exceptions that may occur during web scraping.
  • Supports handling authentication and cookies to scrape websites that require login credentials.
  • Scrapy can be easily integrated with other Python tools, such as data processing and storage libraries, making it a powerful tool for end-to-end data processing pipelines.
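
Item pipelines are one of the extension points mentioned above. In Scrapy, a pipeline is a plain class with a `process_item` method, so a minimal one can be sketched and exercised without running a crawl. The `CleanRankPipeline` class, the `rank` field, and the `myproject` module path are hypothetical, and the `process_item(self, item, spider)` signature is assumed to match the Scrapy version you use.

```python
class CleanRankPipeline:
    """Hypothetical item pipeline: converts the scraped rank string to an int."""

    def process_item(self, item, spider):
        rank = str(item.get("rank") or "").strip().rstrip(".")
        # Keep malformed or missing ranks as None instead of raising.
        item["rank"] = int(rank) if rank.isdigit() else None
        return item


# Scrapy would call process_item() for every scraped item once the class is
# registered in settings.py, for example:
# ITEM_PIPELINES = {"myproject.pipelines.CleanRankPipeline": 300}

pipeline = CleanRankPipeline()
cleaned = pipeline.process_item({"title": "Example", "rank": "12."}, spider=None)
print(cleaned)
```

Moving cleanup like this out of the spider keeps `parse` focused on extraction and lets the same cleaning step be reused across spiders.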

👍 Pros

  • Highly efficient for large-scale scraping due to its asynchronous request handling.
  • Comprehensive framework with extensive customization options.
  • Handles complex scraping scenarios like link following, pagination, and data cleaning with ease.
  • Built-in support for exporting data in various formats like JSON, CSV, and XML.

👎 Cons

  • Higher learning curve, especially for beginners.
  • Less suited for scraping dynamic JavaScript content compared to Crawlee, Selenium, or Playwright.
  • Requires more setup and configuration for smaller projects compared to simpler libraries like Beautiful Soup and Crawlee.

🤔 Alternatives

Crawlee, Beautiful Soup, Selenium, Playwright

🔰 Install Scrapy

To install Scrapy, run this command in your terminal or command prompt:

pip install scrapy

📜 Code example

import scrapy


class HackerNewsSpider(scrapy.Spider):
    name = 'hackernews_spider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['https://news.ycombinator.com/']

    def parse(self, response):
        # Each story on the front page is a table row with the class "athing"
        articles = response.css('tr.athing')
        for article in articles:
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                # default="" avoids an AttributeError when a row has no rank
                "rank": article.css(".rank::text").get(default="").replace(".", ""),
            }

Example of how to use a Scrapy Spider to scrape data from a website

Which Python scraping library is right for you?

So, which library should you use for your web scraping project? This table summarizes the features, uses, pros, and cons of all the libraries covered here:

Complete Library Comparison Table

| Library | Use Case | Ease of Use | Features | Pros | Cons | Alternatives |
| --- | --- | --- | --- | --- | --- | --- |
| Crawlee | Large-scale scraping and browser automation | Easy | Automatic parallel crawling, proxy rotation, persistent queues | Easy setup, clean code, integrated features | New, limited tutorials | Scrapy, Playwright, Beautiful Soup |
| Requests | Making HTTP requests | Very Easy | Simple API, SSL/TLS support, streaming | Large community, well-documented | No async, slower for performance-sensitive tasks | httpx, urllib3, aiohttp |
| HTTPX | HTTP requests with async support | Easy | Async support, HTTP/2, customizable transport | Non-blocking requests, modern standards | Steeper learning curve, smaller community | Requests, aiohttp, urllib3 |
| Beautiful Soup | HTML/XML parsing | Very Easy | Tree traversal, encoding handling, multi-parser support | Simple syntax, excellent for beginners | Limited scalability, no JavaScript support | lxml, html5lib |
| Mechanical Soup | Form handling, simple web scraping | Easy | Requests + Beautiful Soup integration, form submission | Simplified interface, session handling | Limited advanced features | Selenium, Playwright |
| Selenium | Browser automation, JavaScript-heavy sites | Moderate | Cross-browser, dynamic content handling | Simulates complex interactions, multi-language support | Slower, resource-intensive | Playwright, Crawlee, Scrapy |
| Playwright | Advanced browser automation | Moderate | Multi-browser support, auto-wait, parallel execution | Handles JS-heavy sites, advanced features | Steeper learning curve, smaller community | Selenium, Crawlee, Scrapy |
| Scrapy | Large-scale web scraping | Hard | Asynchronous, distributed scraping, extensibility | Highly efficient, handles complex scenarios | Steeper learning curve, setup-heavy | Crawlee, Playwright, Selenium |

Each tool presented here has a unique purpose in an expert scraper's toolkit. Learning to use each one will give you the flexibility to select the best tool for each job, so don't be afraid to try them before deciding!


Percival Villalva
Developer Advocate on a mission to help developers build scalable, human-like bots for data extraction and web automation.
Theo Vasilis
I used to write books. Then I took an arrow in the knee. Now I'm a technical content marketer, crafting tutorials for developers and conversion-focused content for SaaS.
