Scrapy vs. Beautiful Soup for web scraping

We're Apify. You can build, deploy, share, and monitor any scrapers on the Apify platform. Check us out.

Python devs are more likely than most to need web scraping at some point in their careers. In the age of Big Data, knowing how to craft bots to "open" websites and extract information is almost a requirement for anyone who deals with digital data.

Two popular tools for web scraping in Python are Beautiful Soup and Scrapy. Which one is right for your scraping needs? Let's find out.

What are the main differences between Scrapy and BeautifulSoup?

Feature	Scrapy	BeautifulSoup
Type	Web scraping framework	Library for parsing HTML/XML
Asynchronous requests	Yes	No (makes no requests itself; pair it with an HTTP client such as requests)
Built-in functionalities	Extensive (cookies, sessions, redirects, etc.)	Limited to parsing
Data pipeline	Yes	No
Middleware/Extensions support	Yes	No
Error handling and logging	Robust	Basic
Learning curve	Steeper	Easier
Suitability	Large-scale projects	Small to medium-sized projects

What is Beautiful Soup?

Beautiful Soup is a Python library that allows you to parse HTML and XML documents and extract data from them. It provides a simple and intuitive way to navigate and search the HTML tree structure, using tags, attributes, and text content as search criteria.

Main features of the BeautifulSoup library

Library for parsing HTML/XML. BeautifulSoup is a library designed for parsing HTML and XML documents. It's excellent for extracting data from a webpage after you've already downloaded it.
Simplicity and flexibility. It's simple and easy to use, making it a great choice for small to medium-sized projects or for beginners just getting into web scraping.
Requires external libraries for requests. Unlike Scrapy, BeautifulSoup doesn't handle requests on its own. You'd typically use it with the requests library in Python to fetch web pages.
Fine-grained parsing. BeautifulSoup lets you locate elements by tag name, attributes, CSS selectors, text, or position in the tree, which helps when extracting data from complicated or irregular HTML.
No built-in data pipeline. Unlike Scrapy, it doesn’t have a built-in data pipeline, so you'll need to handle data storage and processing manually.
Not asynchronous. BeautifulSoup only parses documents and makes no requests itself, so any concurrency has to come from the HTTP client or your own code, which can be a limitation when scraping a large number of pages.
Ease of learning. It's generally easier for beginners to pick up and start using in small projects.
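
One common workaround for the missing concurrency is to download pages in a thread pool and parse each result with BeautifulSoup. The sketch below shows one way to do that; it is not a feature of the library. The names `extract_title`, `scrape_titles`, and the `fetch` parameter are made up for illustration, and the code assumes the beautifulsoup4 package is installed.

```python
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup


def extract_title(html):
    """Return the <title> text of an HTML document, or None if it has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


def scrape_titles(urls, fetch, max_workers=8):
    """Download pages in parallel threads, then parse each one.

    `fetch` is any callable that maps a URL to an HTML string. It is a
    hypothetical parameter that keeps the HTTP client swappable.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, urls))  # results keep the order of `urls`
    return {url: extract_title(html) for url, html in zip(urls, pages)}
```

With the requests library, a call could look like `scrape_titles(urls, lambda url: requests.get(url, timeout=10).text)`. Passing the download function in also makes the parsing logic easy to test without network access.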

How to install Beautiful Soup

Start off by using pip to install Beautiful Soup and Python Requests as project dependencies:

pip install beautifulsoup4 requests

To scrape a web page, first download its HTML with an HTTP client such as requests, then parse it with BeautifulSoup:

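import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'

# Download the page and fail early on HTTP errors such as 404 or 500
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, 'html.parser')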

Then, you can use Beautiful Soup methods to extract the data you're interested in. For example, let's say we want to extract the website title and a list of all the URLs on the page:

# The page title (None if the page has no <title> tag)
title = soup.title.get_text() if soup.title else None

# The href of every <a> tag that has one
url_list = [link['href'] for link in soup.find_all('a', href=True)]

print(title, url_list)

This code will print out the title and a URL list of all links on the page.
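
Searching by attribute, CSS class, or text content works along the same lines. The following sketch runs on an inline HTML string instead of a downloaded page, so it needs no network access; the HTML and `BASE_URL` are invented for illustration. It also resolves relative links with the standard library's `urljoin`, since `href` values are often not absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content and page URL, used only for illustration.
BASE_URL = "https://www.example.com/blog/"
HTML = """
<html>
  <head><title>Example blog</title></head>
  <body>
    <a class="post" href="/posts/1">First post</a>
    <a class="post" href="posts/2">Second post</a>
    <a name="top">Anchor without href</a>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Search by tag and attribute: only <a> tags that carry an href.
links_with_href = soup.find_all("a", href=True)

# Resolve relative hrefs against the URL of the page they came from.
absolute_urls = [urljoin(BASE_URL, a["href"]) for a in links_with_href]

# Search by CSS class, then read the text content of each match.
post_titles = [a.get_text(strip=True) for a in soup.find_all("a", class_="post")]

# Search by exact text content.
first_post = soup.find("a", string="First post")

print(absolute_urls, post_titles, first_post["href"])
```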

What is Scrapy?

Scrapy is a Python framework for web scraping that provides a more powerful and customizable way to extract data from websites. It allows you to define a set of rules for how to navigate and extract data from multiple pages of a website and provides built-in support for handling different types of data.

Main features of Scrapy

Framework vs. library. Scrapy is a full-fledged web scraping framework, not just a library. This means it offers more built-in functionalities for managing requests, parsing, and data processing.
Asynchronous requests. Scrapy is built on Twisted, an asynchronous networking framework. This allows Scrapy to handle a large volume of requests simultaneously, making it faster and more efficient for large-scale web scraping.
Built-in features. Scrapy comes with a wide range of built-in features, including support for handling cookies, sessions, and following redirects, which can simplify complex scraping tasks.
Data pipeline. Scrapy provides a data pipeline to process and store scraped data, which is very useful for structured data extraction and storage.
Middlewares and extensions. It supports custom middlewares and extensions, allowing you to add or modify functionalities according to your needs.
Error handling and logging. Robust error handling and logging features make it easier to debug and maintain larger projects.
Learning curve. Scrapy might have a steeper learning curve compared to BeautifulSoup, especially for beginners.

How to install Scrapy

To use Scrapy, you first need to install it using pip:

# Install Scrapy
pip install scrapy

Then, you can create a new Scrapy project using the scrapy command:

# Create Scrapy project
scrapy startproject myproject

This will create a new directory called myproject with the basic structure of a Scrapy project. Move into that directory and generate a spider, which is the main component of Scrapy that does the actual scraping:

# Generate Spider
cd myproject
scrapy genspider myspider https://www.example.com

Now try a simple spider that extracts the titles and URLs of all the links on a web page:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Select every <a> element on the page
        for link in response.css('a'):
            yield {
                'title': link.css('::text').get(),  # first text node of the link
                'url': link.attrib.get('href'),     # None if the link has no href
            }

This spider defines a parse method that is called for each page that it visits, starting from the URLs defined in start_urls. It uses Scrapy's built-in selectors to extract the title and URL of each link and yields a dictionary with this data.

To run the spider, you then use the scrapy crawl command:

# Run the spider
scrapy crawl myspider

Advanced Scrapy features

Queue of URLs to scrape

Scrapy can manage a queue of requests to scrape, with automatic deduplication and checking of maximum recursion depth. For example, this spider scrapes the titles of all linked pages up to a depth of 5:

import scrapy


class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://www.example.com']
    custom_settings = {
        'DEPTH_LIMIT': 5,
    }

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # response.follow resolves relative links against the current page
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
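
Conceptually, a request queue with deduplication and a depth limit behaves like a graph traversal that remembers which URLs it has already seen. The plain-Python sketch below illustrates that idea only; it is not Scrapy's scheduler (which, for instance, uses a last-in-first-out queue by default), and `crawl_order` and `link_graph` are hypothetical names standing in for real downloads.

```python
from collections import deque


def crawl_order(link_graph, start_url, depth_limit):
    """Return URLs in the order a deduplicating, depth-limited crawl visits them.

    `link_graph` is a hypothetical dict mapping each URL to the URLs it links to.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= depth_limit:
            continue  # pages at the depth limit are scraped but not followed
        for linked_url in link_graph.get(url, []):
            if linked_url not in seen:  # deduplication: enqueue each URL once
                seen.add(linked_url)
                queue.append((linked_url, depth + 1))
    return visited


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["e"]}
print(crawl_order(graph, "a", depth_limit=2))
```

In this toy graph, "e" is never visited because it sits three links away from the start page, and "a" and "d" are each visited once even though two pages link to them.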

Multiple output formats

Scrapy can save the output directly in several formats, such as JSON, CSV, and XML. The format is inferred from the file extension:

# Run the spider and save output into a JSON file
scrapy crawl myspider -o myfile.json

# Run the spider and save output into a CSV file
scrapy crawl myspider -o myfile.csv

# Run the spider and save output into an XML file
scrapy crawl myspider -o myfile.xml
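
Output files can also be configured in the project settings instead of on the command line. The fragment below is a sketch that assumes a Scrapy version with the `FEEDS` setting (2.1 or later); the file paths are placeholders.

```python
# settings.py (sketch): export every crawl to two files at once.
# The paths are placeholders; the "format" values name built-in exporters.
FEEDS = {
    "output/items.json": {"format": "json", "overwrite": True},
    "output/items.csv": {"format": "csv"},
}
```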

Cookies

Scrapy receives and keeps track of cookies sent by servers and sends them back on subsequent requests as any regular web browser does.

If you want to send additional cookies with a request, pass them in the cookies argument when you create the Request:

import scrapy

request_with_cookies = scrapy.Request(
    url="http://www.example.com",
    cookies={'currency': 'USD', 'country': 'UY'},
)

User-agent spoofing

Scrapy supports setting the user-agent of all requests to a custom value, which is useful, for example, if you want to scrape the mobile version of a website. Just set USER_AGENT in the settings.py file in your project, and it will be used automatically for all requests:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.57 Mobile Safari/537.36'

When to use Beautiful Soup and when to use Scrapy

Okay, so you'll ultimately have to make a choice between Beautiful Soup and Scrapy yourself, but here's a quick summary of the differences for you to keep in mind:

Beautiful Soup is generally easier to use and more flexible than Scrapy, making it a solid choice when you only need to extract data from a few simple web pages and don't expect them to block you.
Scrapy is more powerful and customizable, making it a better choice when you want to scrape a whole website, follow links from one page to another, deal with cookies and blocking, and export a lot of data in multiple formats.

So choose BeautifulSoup if you're working on a smaller or more straightforward project, or if you need very detailed control over parsing individual pages. Choose Scrapy if you're working on larger and more complex data extraction projects where efficiency, speed, and extensive built-in functionalities are required.

Remember, your choice might also depend on the specific requirements of your project, such as the complexity of the websites you are scraping, the volume of data, and your comfort with Python programming. For some Python projects, even a combination of both tools could be the best approach.