Scrapy vs. Beautiful Soup for web scraping

Python dominates in data science, machine learning, and automation - and all of these depend on reliable, fresh datasets. That’s why Python devs are more likely than most to need web scraping at some point in their careers.

Two popular tools for web scraping in Python are Beautiful Soup and Scrapy. Let’s take a closer look at what each one does best so you can choose the right tool for your next project.

What are the main differences between Scrapy and BeautifulSoup?

Feature	Scrapy	BeautifulSoup
Type	Web scraping framework	Library for parsing HTML/XML
Asynchronous requests	Yes	No (it doesn't send requests at all; pair it with a library like requests)
Built-in functionalities	Extensive (cookies, sessions, redirects, etc.)	Limited to parsing
Data pipeline	Yes	No
Middleware/Extensions support	Yes	No
Error handling and logging	Robust	Basic
Learning curve	Steeper	Easier
Suitability	Large-scale projects	Small to medium-sized projects

What is Beautiful Soup?

Beautiful Soup is a Python library that allows you to parse HTML and XML documents and extract data from them. It provides a simple and intuitive way to navigate and search the HTML tree structure, using tags, attributes, and text content as search criteria.

Main features of the BeautifulSoup library

Library for parsing HTML/XML. BeautifulSoup is a library designed for parsing HTML and XML documents. It's excellent for extracting data from a webpage after you've already downloaded it.
Simplicity and flexibility. It's simple and easy to use, making it a great choice for small to medium-sized projects or for beginners just getting into web scraping.
Requires external libraries for requests. Unlike Scrapy, BeautifulSoup doesn't handle requests on its own. You'd typically use it with the requests library in Python to fetch web pages.
Fine-grained parsing. BeautifulSoup allows for more granular and precise parsing, which is excellent for extracting data from complicated or irregular HTML.
No built-in data pipeline. Unlike Scrapy, it doesn’t have a built-in data pipeline, so you'll need to handle data storage and processing manually.
Not asynchronous. BeautifulSoup doesn’t inherently support asynchronous requests, which can be a limitation for scraping a large number of pages.
Ease of learning. It's generally easier for beginners to pick up and start using in small projects.
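To make the "fine-grained parsing" point concrete, here is a minimal sketch that parses an inline HTML snippet without sending any request. The markup, class names, and field names are made up for illustration and are not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<ul id="products">
  <li class="item" data-sku="A1"><a href="/a1">Widget</a> <span class="price">$5</span></li>
  <li class="item sold-out" data-sku="B2"><a href="/b2">Gadget</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
# select() takes a CSS selector; find() searches by tag name and attributes.
for li in soup.select("#products li.item"):
    price_tag = li.find("span", class_="price")
    products.append({
        "sku": li["data-sku"],
        "name": li.a.get_text(strip=True),
        # Irregular markup: not every item has a price element.
        "price": price_tag.get_text(strip=True) if price_tag else None,
        "sold_out": "sold-out" in li.get("class", []),
    })

print(products)
```

Mixing CSS selectors with per-element checks like this is one way to cope with items that don't all share the same structure.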

🔖

Related reading: How to build a reliable scraper using Python's Beautiful Soup & Requests libraries.

How to install Beautiful Soup

Start off by using pip to install Beautiful Soup and Python Requests as project dependencies:

pip install beautifulsoup4 requests

To scrape a web page, first download its HTML content with an HTTP client like requests, then parse that content with BeautifulSoup:

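import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'

# Download the page and fail early on HTTP errors (4xx/5xx)
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML with Python's built-in parser
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')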

Then, you can use Beautiful Soup methods to extract the data you're interested in. For example, let's say we want to extract the website title and a list of all the URLs on the page:

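# find() returns None when the page has no <title> tag
title_tag = soup.find('title')
title = title_tag.get_text() if title_tag else None

url_list = []
for link in soup.find_all('a'):
    # get() returns None for <a> tags without an href attribute
    url_list.append(link.get('href'))

print(title, url_list)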

This code will print out the title and a URL list of all links on the page.
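The href values collected this way are often relative, missing, or not web links at all. As a rough sketch of one way to clean them up, the helper below resolves each value against the page URL using only the standard library. The function name absolute_links and the sample values are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping empty, non-HTTP, and duplicate links."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # None or "" for <a> tags without a usable href
            continue
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Example input shaped like the url_list built above (values are made up).
sample = ["/about", None, "contact.html", "https://other.example/x",
          "mailto:hi@example.com", "/about"]
links = absolute_links("https://www.example.com/docs/", sample)
print(links)
```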

🔖

Related reading: Python web scraping tutorial

What is Scrapy?

Scrapy is a Python framework for web scraping that provides a more powerful and customizable way to extract data from websites. It allows you to define a set of rules for how to navigate and extract data from multiple pages of a website and provides built-in support for handling different types of data.

Main features of Scrapy

Framework vs. library. Scrapy is a full-fledged web scraping framework, not just a library. This means it offers more built-in functionalities for managing requests, parsing, and data processing.
Asynchronous requests. Scrapy is built on Twisted, an asynchronous networking framework. This allows Scrapy to handle a large volume of requests simultaneously, making it faster and more efficient for large-scale web scraping.
Built-in features. Scrapy comes with a wide range of built-in features, including support for handling cookies, sessions, and following redirects, which can simplify complex scraping tasks.
Data pipeline. Scrapy provides a data pipeline to process and store scraped data, which is very useful for structured data extraction and storage.
Middlewares and extensions. It supports custom middlewares and extensions, allowing you to add or modify functionalities according to your needs.
Error handling and logging. Robust error handling and logging features make it easier to debug and maintain larger projects.
Learning curve. Scrapy might have a steeper learning curve compared to BeautifulSoup, especially for beginners.
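To illustrate the data pipeline feature, here is a minimal sketch of an item pipeline. Pipelines are ordinary Python classes with a process_item method; the class name CleanLinkPipeline, the module path in the comment, and the item fields are hypothetical.

```python
class CleanLinkPipeline:
    """Hypothetical item pipeline: tidies each scraped item before it is stored."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once for every item a spider yields.
        title = item.get("title")
        item["title"] = title.strip() if title else None
        return item  # returning the item hands it to the next pipeline stage


# To enable it, a project would list the class in settings.py, for example:
# ITEM_PIPELINES = {"myproject.pipelines.CleanLinkPipeline": 300}

# Pipelines are plain classes, so they can be tried without running a crawl.
cleaned = CleanLinkPipeline().process_item({"title": "  Home \n", "url": "/"}, spider=None)
print(cleaned)
```

Keeping cleanup in a pipeline rather than in the spider means every spider in the project can share the same post-processing.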

🔖

Related reading: Adding JavaScript rendering capabilities to Scrapy with Playwright.

How to install Scrapy

To use Scrapy, you first need to install it using pip:

# Install Scrapy
pip install scrapy

Then, you can create a new Scrapy project using the scrapy command:

# Create Scrapy project
scrapy startproject myproject

This will create a new directory called myproject with the basic structure of a Scrapy project. You can then generate a spider, which is the main component of Scrapy that does the actual scraping:

# Generate Spider (run from inside the project directory)
cd myproject
scrapy genspider myspider https://www.example.com

Now try a simple spider that extracts the titles and URLs of all the links on a web page:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Select every <a> element on the page
        for link in response.css('a'):
            yield {
                'title': link.css('::text').get(),
                # .get() avoids a KeyError for links without an href
                'url': link.attrib.get('href'),
            }

This spider defines a parse method that is called for each page that it visits, starting from the URLs defined in start_urls. It uses Scrapy's built-in selectors to extract the title and URL of each link and yields a dictionary with this data.

To run the spider, you then use the scrapy crawl command:

# Run the spider
scrapy crawl myspider

🔖

Related reading: 5 alternatives to Scrapy that offer simpler setup, built-in browser automation and better support for dynamic websites.

Advanced Scrapy features

Queue of URLs to scrape

Scrapy can manage a queue of requests to scrape, with automatic deduplication and checking of maximum recursion depth. For example, this spider scrapes the titles of all linked pages up to a depth of 5:

import scrapy


class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://www.example.com']
    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # response.follow resolves relative links against the current page URL
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Multiple output formats

Scrapy directly supports saving the output to many different formats, like JSON, CSV, and XML:

# Run the spider and save output into a JSON file
# (the output format is inferred from the file extension)
scrapy crawl myspider -o myfile.json

# Run the spider and save output into a CSV file
scrapy crawl myspider -o myfile.csv

# Run the spider and save output into an XML file
scrapy crawl myspider -o myfile.xml

Cookies

Scrapy receives and keeps track of cookies sent by servers and sends them back on subsequent requests as any regular web browser does.

If you want to send additional cookies with a request, pass them through the cookies argument of the scrapy.Request you're creating:

request_with_cookies = scrapy.Request(
    url="http://www.example.com",
    cookies={'currency': 'USD', 'country': 'UY'},
)

User-agent spoofing

Scrapy supports setting the user-agent of all requests to a custom value, which is useful, for example, if you want to scrape the mobile version of a website. Just put the user agent in the settings.py file in your project, and it will be automatically used for all requests:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.57 Mobile Safari/537.36'
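If a single user agent for the whole project is too coarse, the middleware support mentioned earlier offers another route. The sketch below is one possible downloader middleware that picks a user agent per request. The class name, the module path in the comment, the priority number, and the placeholder user-agent strings are all assumptions, and FakeRequest exists only so the example runs without Scrapy installed.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a user agent per request."""

    # Placeholder strings; a real project would list genuine user agents.
    USER_AGENTS = [
        "ExampleBot/1.0 (desktop)",
        "ExampleBot/1.0 (mobile)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request normally


# Enabled in settings.py with something like (the priority is illustrative):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 400}


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self):
        self.headers = {}


req = FakeRequest()
RandomUserAgentMiddleware().process_request(req, spider=None)
print(req.headers["User-Agent"])
```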

Apify fully supports Scrapy spiders as they are. Take advantage of the platform's features without having to modify your Scrapy spider.

When to use Beautiful Soup and when to use Scrapy

Here's a quick summary of the differences to keep in mind:

Beautiful Soup is generally easier to use and more flexible than Scrapy, which makes it a solid choice when you only need to extract data from a few simple web pages and don't expect them to block your scraper. It’s also a good choice if you need very detailed control over parsing individual pages.
Scrapy is more powerful and customizable, making it a better choice for when you want to scrape a whole website, follow links from one page to another, deal with cookies and blocking, as well as export a lot of data in multiple formats. It’s ideal for projects where efficiency, speed, and extensive built-in functionalities are required.

Your choice might also depend on the specific requirements of your project, such as the complexity of the websites you are scraping, the volume of data, and your comfort with Python programming. For some Python projects, even a combination of both web scraping libraries could be the best approach.
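As a rough sketch of what combining the two could look like, the helper below does its parsing with Beautiful Soup and could be called from a Scrapy callback, which would handle the crawling. The function name extract_headings and the sample HTML are made up for this example.

```python
from bs4 import BeautifulSoup


def extract_headings(html):
    """Return the text of every <h1>/<h2> in an HTML string using Beautiful Soup."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]


# Inside a Scrapy spider, a callback could delegate to the helper like this:
#
#     def parse(self, response):
#         yield {"url": response.url, "headings": extract_headings(response.text)}

headings = extract_headings("<h1> Intro </h1><p>text</p><h2>Setup</h2>")
print(headings)
```

In this arrangement Scrapy handles scheduling, retries, and export, while Beautiful Soup is used only where its parsing style is more convenient.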

An alternative to both tools - Crawlee

Crawlee is a full-fledged web scraping and browser automation library. It handles everything from sending requests, managing queues, and following links to parsing HTML and exporting data.

Crawlee helps you build reliable crawlers that automatically handle concurrency, retries, queues, and data storage. It supports CheerioCrawler for fast HTML scraping (like Beautiful Soup), and PlaywrightCrawler or PuppeteerCrawler for handling JavaScript-heavy sites.

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

PlaywrightCrawler example

🔖

Take a look at Crawlee's quick start documentation

Why you should consider Crawlee for web scraping

Crawlee was designed to combine the strengths of tools like Scrapy and Beautiful Soup, while making development faster and more scalable:

Headless browser crawling: Crawlee for Python provides a unified interface for HTTP and headless browser crawling.
It’s easy to set up and learn: It provides ready-made templates and only a single file to add the code. That makes it very easy to start building a scraper.
Type hint coverage: Crawlee's whole code base is fully type-hinted, and you get better autocompletion in your IDE. This not only enhances developer experience while developing scrapers with Crawlee but also reduces the number of bugs thanks to static type checking.
Asyncio-based: Crawlee for Python is built on asyncio, which makes integration with other applications or the rest of your system much easier.
State persistence: Crawlee supports state persistence during interruptions. This means you can resume a scraping pipeline without restarting from the beginning.
Separated result storage: Crawlee simplifies result handling by providing built-in storage options, such as datasets and key-value stores, to organize data for each scraping run.
Easy transition for Scrapy and Beautiful Soup users:

Crawlee also supports Parsel, an HTML parser that Scrapy users are familiar with, through the ParselCrawler. This allows you to reuse your existing CSS and XPath selectors without modification, significantly lowering the barriers for transitioning from Scrapy to Crawlee.

Crawlee’s built-in Cheerio parser works much like Beautiful Soup - both let you search and extract data from HTML using familiar selectors (like .find() or CSS queries). The main difference is that Crawlee uses JavaScript syntax and adds automation for handling requests, queues, and link following, so you can scrape more efficiently with less code.

🔖

How Daltix saved costs and improved efficiency by migrating its scrapers from Scrapy to Crawlee