Python devs are more likely than most to need web scraping at some point in their careers. In the age of Big Data, knowing how to craft bots to "open" websites and extract information is almost a requirement for anyone who deals with digital data.
Two popular tools for web scraping in Python are Beautiful Soup and Scrapy. Which one is right for your scraping needs? Let's find out.
What are the main differences between Scrapy and BeautifulSoup?
| Feature | Scrapy | BeautifulSoup |
|---------|--------|---------------|
| Type | Web scraping framework | Library for parsing HTML/XML |
| Asynchronous requests | Yes | No (requires external libraries like requests) |
| Built-in functionalities | Extensive (cookies, sessions, redirects, etc.) | Limited to parsing |
| Data pipeline | Yes | No |
| Middleware/extensions support | Yes | No |
| Error handling and logging | Robust | Basic |
| Learning curve | Steeper | Easier |
| Suitability | Large-scale projects | Small to medium-sized projects |
What is Beautiful Soup?
Beautiful Soup is a Python library that allows you to parse HTML and XML documents and extract data from them. It provides a simple and intuitive way to navigate and search the HTML tree structure, using tags, attributes, and text content as search criteria.
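For instance, here's a minimal sketch of those three kinds of searches — by tag, by attribute, and by text content — run on a small hardcoded HTML snippet (the markup is made up purely for illustration):

```python
from bs4 import BeautifulSoup

# A small, hardcoded HTML snippet just for illustration
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <p class="intro">Hello, world!</p>
    <a href="https://www.example.com">A link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Search by tag name
print(soup.title.get_text())           # Example page

# Search by attribute
print(soup.find('p', class_='intro'))  # <p class="intro">Hello, world!</p>

# Search by text content
print(soup.find(string='A link'))      # A link
```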
Main features of the BeautifulSoup library
- **Library for parsing HTML/XML.** BeautifulSoup is a library designed for parsing HTML and XML documents. It's excellent for extracting data from a web page after you've already downloaded it.
- **Simplicity and flexibility.** It's simple and easy to use, making it a great choice for small to medium-sized projects or for beginners just getting into web scraping.
- **Requires external libraries for requests.** Unlike Scrapy, BeautifulSoup doesn't handle requests on its own. You'd typically use it with the requests library in Python to fetch web pages.
- **Fine-grained parsing.** BeautifulSoup allows for more granular and precise parsing, which is excellent for extracting data from complicated or irregular HTML.
- **No built-in data pipeline.** Unlike Scrapy, it doesn't have a built-in data pipeline, so you'll need to handle data storage and processing manually.
- **Not asynchronous.** BeautifulSoup doesn't inherently support asynchronous requests, which can be a limitation for scraping a large number of pages.
- **Ease of learning.** It's generally easier for beginners to pick up and start using in small projects.
Start off by using pip to install Beautiful Soup and Python Requests as project dependencies:
```bash
pip install beautifulsoup4 requests
```
To scrape a web page, you first need to download the HTML content of the page using an HTTP client like requests, and then parse the page content with Beautiful Soup (here, https://www.example.com stands in for whatever page you want to scrape):
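```python
import requests
from bs4 import BeautifulSoup

# Download the page's HTML with an HTTP client
response = requests.get('https://www.example.com')

# Parse the HTML so it can be searched and navigated
soup = BeautifulSoup(response.text, 'html.parser')
```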
Then, you can use Beautiful Soup methods to extract the data you're interested in. For example, let's say we want to extract the website title and a list of all the URLs on the page:
```python
# Extract the text of the <title> tag
title = soup.find('title').get_text()

# Collect the href attribute of every <a> tag on the page
url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get('href')
    url_list.append(url)

print(title, url_list)
```
This code will print the page title and a list of the URLs of all links on the page.
What is Scrapy?
Scrapy is a Python framework for web scraping that provides a more powerful and customizable way to extract data from websites. It allows you to define a set of rules for how to navigate and extract data from multiple pages of a website, and it provides built-in support for handling different types of data.
Main features of Scrapy
- **Framework vs. library.** Scrapy is a full-fledged web scraping framework, not just a library. This means it offers more built-in functionality for managing requests, parsing, and data processing.
- **Asynchronous requests.** Scrapy is built on Twisted, an asynchronous networking framework. This allows Scrapy to handle a large volume of requests simultaneously, making it faster and more efficient for large-scale web scraping.
- **Built-in features.** Scrapy comes with a wide range of built-in features, including support for handling cookies, sessions, and following redirects, which can simplify complex scraping tasks.
- **Data pipeline.** Scrapy provides a data pipeline to process and store scraped data, which is very useful for structured data extraction and storage.
- **Middlewares and extensions.** It supports custom middlewares and extensions, allowing you to add or modify functionality according to your needs.
- **Error handling and logging.** Robust error handling and logging features make it easier to debug and maintain larger projects.
- **Learning curve.** Scrapy might have a steeper learning curve compared to BeautifulSoup, especially for beginners.
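To get started, install Scrapy with pip and create a new project (the project name myproject here is just a placeholder):

```bash
pip install scrapy
scrapy startproject myproject
```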
This will create a new directory called myproject with the basic structure of a Scrapy project. You can then generate a spider, which is the main component of Scrapy that does the actual scraping:
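For example, the following command generates a spider named myspider for the example.com domain (both names are placeholders chosen to match the code below):

```bash
scrapy genspider myspider example.com
```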
Now try a simple spider that extracts the titles and URLs of all the links on a web page:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Select every link on the page and yield its text and URL
        links = response.css('a')
        for link in links:
            title = link.css('::text').get()
            url = link.attrib['href']
            yield {
                'title': title,
                'url': url,
            }
```
This spider defines a parse method that is called for each page that it visits, starting from the URLs defined in start_urls. It uses Scrapy's built-in selectors to extract the title and URL of each link and yields a dictionary with this data.
To run the spider, you then use the scrapy crawl command:
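```bash
# Run from inside the project directory
scrapy crawl myspider
```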
Scrapy can manage a queue of requests to scrape, with automatic deduplication of URLs and enforcement of a maximum crawl depth. For example, this spider scrapes the titles of all linked pages up to a depth of 5:
```python
import scrapy

class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://www.example.com']
    custom_settings = {
        'DEPTH_LIMIT': 5,
    }

    def parse(self, response):
        # Record the URL and title of the current page
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # Queue up every link on the page; response.follow resolves
        # relative URLs against the current page
        for link_href in response.css('a::attr(href)'):
            yield response.follow(link_href.get(), callback=self.parse)
```
Multiple output formats
Scrapy directly supports saving the output to many different formats, like JSON, CSV, and XML:
```bash
# Run the spider and save the output to a JSON file
scrapy crawl myspider -o myfile.json

# Run the spider and save the output to a CSV file
scrapy crawl myspider -o myfile.csv

# Run the spider and save the output to an XML file
scrapy crawl myspider -o myfile.xml
```
Cookies
Scrapy receives and keeps track of cookies sent by servers and sends them back on subsequent requests, just as a regular web browser does.
If you want to specify additional cookies for your requests, you can pass them to the scrapy.Request objects you create, for example (the cookie name and value below are made up for illustration):
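```python
import scrapy

class CookieSpider(scrapy.Spider):
    name = 'cookiespider'

    def start_requests(self):
        # Attach extra cookies to the request via the cookies argument
        yield scrapy.Request(
            'https://www.example.com',
            cookies={'currency': 'USD'},
        )

    def parse(self, response):
        # The response to the cookie-carrying request lands here
        yield {'url': response.url}
```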
Custom user agent
Scrapy supports setting the user agent of all requests to a custom value, which is useful, for example, if you want to scrape the mobile version of a website. Just put the user agent in the settings.py file of your project, and it will automatically be used for all requests:
```python
# settings.py
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.57 Mobile Safari/537.36'
```
Okay, so you'll ultimately have to make the choice between Beautiful Soup and Scrapy yourself, but here's a quick summary of the differences to keep in mind:
Beautiful Soup is generally easier to use and more flexible than Scrapy, making it a solid choice when you just need to extract data from a few simple web pages and don't expect the sites to try to block your scraper.
Scrapy is more powerful and customizable, making it a better choice when you want to scrape a whole website, follow links from one page to another, deal with cookies and blocking, and export a lot of data in multiple formats.
So choose BeautifulSoup if you're working on a smaller or more straightforward project, or if you need very detailed control over parsing individual pages. Choose Scrapy if you're working on larger and more complex data extraction projects where efficiency, speed, and extensive built-in functionalities are required.
Remember, your choice might also depend on the specific requirements of your project, such as the complexity of the websites you're scraping, the volume of data, and your comfort with Python programming. For some Python projects, a combination of both tools could even be the best approach.