We're Apify. You can build, deploy, share, and monitor any scrapers on the Apify platform. Check us out.
Python devs are more likely than most to need web scraping at some point in their careers. In the age of Big Data, knowing how to craft bots to "open" websites and extract information is almost a requirement for anyone who deals with digital data.
Two popular tools for web scraping in Python are Beautiful Soup and Scrapy. Which one is right for your scraping needs? Let's find out.
What are the main differences between Scrapy and BeautifulSoup?
Feature | Scrapy | BeautifulSoup |
---|---|---|
Type | Web scraping framework | Library for parsing HTML/XML |
Asynchronous requests | Yes | No (it doesn't send requests at all; pair it with an HTTP client such as requests) |
Built-in functionalities | Extensive (cookies, sessions, redirects, etc.) | Limited to parsing |
Data pipeline | Yes | No |
Middleware/Extensions support | Yes | No |
Error handling and logging | Robust | Basic |
Learning curve | Steeper | Easier |
Suitability | Large-scale projects | Small to medium-sized projects |
What is Beautiful Soup?
Beautiful Soup is a Python library that allows you to parse HTML and XML documents and extract data from them. It provides a simple and intuitive way to navigate and search the HTML tree structure, using tags, attributes, and text content as search criteria.
Main features of the BeautifulSoup library
- Library for parsing HTML/XML. BeautifulSoup is a library designed for parsing HTML and XML documents. It's excellent for extracting data from a webpage after you've already downloaded it.
- Simplicity and flexibility. It's simple and easy to use, making it a great choice for small to medium-sized projects or for beginners just getting into web scraping.
- Requires external libraries for requests. Unlike Scrapy, BeautifulSoup doesn't handle requests on its own. You'd typically use it with the requests library in Python to fetch web pages.
- Fine-grained parsing. BeautifulSoup allows for more granular and precise parsing, which is excellent for extracting data from complicated or irregular HTML.
- No built-in data pipeline. Unlike Scrapy, it doesn't have a built-in data pipeline, so you'll need to handle data storage and processing manually.
- Not asynchronous. BeautifulSoup doesn't inherently support asynchronous requests, which can be a limitation for scraping a large number of pages.
- Ease of learning. It's generally easier for beginners to pick up and start using in small projects.
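To make two of the points above concrete (fine-grained parsing of irregular HTML, and handling storage yourself), here is a minimal sketch. The markup, the field names, and the helper functions `extract_products` and `to_csv` are assumptions made up for this example, not part of Beautiful Soup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, slightly irregular markup: the second product has no price tag.
HTML = """
<ul id="products">
  <li class="product"><span class="name">Kettle</span> <span class="price">$25</span></li>
  <li class="product"><span class="name">Toaster</span></li>
</ul>
"""


def extract_products(html):
    """Return one dict per product, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.product"):
        name = item.select_one(".name")
        price = item.select_one(".price")
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "price": price.get_text(strip=True) if price else "",
        })
    return rows


def to_csv(rows):
    """Serialize rows to CSV text; Beautiful Soup has no exporter of its own."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


products = extract_products(HTML)
print(to_csv(products))
```

In a real script you would probably write to a file opened with `newline=""` rather than to an in-memory buffer.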
How to install Beautiful Soup
- Start off by using pip to install Beautiful Soup and Python Requests as project dependencies:
pip install beautifulsoup4 requests
- To scrape a web page, first download its HTML content with an HTTP client such as requests, then parse that content with BeautifulSoup:
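import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
# Download the page; fail early on HTTP errors instead of parsing an error page
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.text
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html_content, 'html.parser')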
- Then, you can use Beautiful Soup methods to extract the data you're interested in. For example, let's say we want to extract the website title and a list of all the URLs on the page:
title = soup.find('title').get_text()
url_list = []
links = soup.find_all('a')
for link in links:
    # link.get returns None when an anchor has no href attribute
    href = link.get('href')
    url_list.append(href)
print(title, url_list)
This code will print out the title and a URL list of all links on the page.
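Links on real pages are often relative, and some anchors have no href at all. As a rough sketch of one way to deal with both, the hypothetical helper `extract_links` below skips anchors without an href and resolves the rest against a base URL with the standard library's `urljoin`. The sample HTML is invented so the example runs without a network connection.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # href=True skips anchors without an href, so no None values end up in the list
    for link in soup.find_all("a", href=True):
        urls.append(urljoin(base_url, link["href"]))
    return urls


# Hypothetical page content used instead of a live download
sample_html = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="https://example.org/">External</a>'
)
links = extract_links(sample_html, "https://www.example.com/")
print(links)
```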
What is Scrapy?
Scrapy is a Python framework for web scraping that provides a more powerful and customizable way to extract data from websites. It allows you to define a set of rules for how to navigate and extract data from multiple pages of a website and provides built-in support for handling different types of data.
Main features of Scrapy
- Framework vs. library. Scrapy is a full-fledged web scraping framework, not just a library. This means it offers more built-in functionalities for managing requests, parsing, and data processing.
- Asynchronous requests. Scrapy is built on Twisted, an asynchronous networking framework. This allows Scrapy to handle a large volume of requests simultaneously, making it faster and more efficient for large-scale web scraping.
- Built-in features. Scrapy comes with a wide range of built-in features, including support for handling cookies, sessions, and following redirects, which can simplify complex scraping tasks.
- Data pipeline. Scrapy provides a data pipeline to process and store scraped data, which is very useful for structured data extraction and storage.
- Middlewares and extensions. It supports custom middlewares and extensions, allowing you to add or modify functionalities according to your needs.
- Error handling and logging. Robust error handling and logging features make it easier to debug and maintain larger projects.
- Learning curve. Scrapy might have a steeper learning curve compared to BeautifulSoup, especially for beginners.
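To give a feel for the data pipeline mentioned above: an item pipeline is an ordinary Python class with a `process_item` method that Scrapy calls for each scraped item. The class below is a hypothetical sketch (the name `CleanLinkPipeline` and the `title` field are assumptions for illustration), written without any Scrapy imports so it can be tried on its own.

```python
class CleanLinkPipeline:
    """Hypothetical item pipeline: tidies link items before they are stored."""

    def process_item(self, item, spider):
        # Scrapy calls this method once for every item a spider yields
        title = item.get("title") or ""
        item["title"] = " ".join(title.split())  # collapse runs of whitespace
        return item  # returning the item passes it on to the next pipeline


# Outside of a crawl, the pipeline can be exercised like any plain class
cleaned = CleanLinkPipeline().process_item(
    {"title": "  More \n information... ", "url": "https://www.example.com"},
    spider=None,
)
print(cleaned)
```

Inside a project, a pipeline is switched on through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.CleanLinkPipeline": 300}`, assuming the class is saved in `myproject/pipelines.py`.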
How to install Scrapy
- To use Scrapy, you first need to install it using pip:
# Install Scrapy
pip install scrapy
- Then, you can create a new Scrapy project using the scrapy command:
# Create Scrapy project
scrapy startproject myproject
- This will create a new directory called myproject with the basic structure of a Scrapy project. You can then generate a spider, which is the main component of Scrapy that does the actual scraping:
# Generate Spider
scrapy genspider myspider https://www.example.com
- Now try a simple spider that extracts the text and URL of every link on a web page:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Select every anchor element on the page
        links = response.css('a')
        for link in links:
            title = link.css('::text').get()
            # .get() avoids a KeyError on anchors without an href
            url = link.attrib.get('href')
            yield {
                'title': title,
                'url': url,
            }
This spider defines a parse method that is called for each page that it visits, starting from the URLs defined in start_urls. It uses Scrapy's built-in selectors to extract the title and URL of each link and yields a dictionary with this data.
- To run the spider, you then use the scrapy crawl command:
# Run the spider
scrapy crawl myspider
Advanced Scrapy features
Queue of URLs to scrape
Scrapy can manage a queue of requests to scrape, with automatic deduplication and checking of maximum recursion depth. For example, this spider scrapes the titles of all linked pages up to a depth of 5:
import scrapy


class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://www.example.com']
    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        for href in response.css('a::attr(href)').getall():
            # response.follow resolves relative links, which scrapy.Request rejects
            yield response.follow(href, callback=self.parse)
Multiple output formats
Scrapy directly supports saving the output to many different formats, like JSON, CSV, and XML:
# Run the spider and save output into a JSON file
scrapy crawl myspider -o myfile.json
# Run the spider and save output into a CSV file
scrapy crawl myspider -o myfile.csv
# Run the spider and save output into an XML file
scrapy crawl myspider -o myfile.xml
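Output files can also be configured in the project settings instead of on the command line, through the `FEEDS` setting. The fragment below is a sketch for settings.py; the file paths are placeholders chosen for this example.

```python
# settings.py (the same dict also works in a spider's custom_settings)
# File names below are placeholders chosen for this example
FEEDS = {
    "output/items.json": {"format": "json", "overwrite": True},
    "output/items.csv": {"format": "csv"},
}
```

With this in place, a plain scrapy crawl myspider writes both files in one run.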
Cookies
Scrapy receives and keeps track of cookies sent by servers and sends them back on subsequent requests as any regular web browser does.
If you want to send additional cookies with a request, pass them through the cookies argument of the scrapy.Request you're creating:
request_with_cookies = scrapy.Request(
    url="http://www.example.com",
    cookies={'currency': 'USD', 'country': 'UY'},
)
User-agent spoofing
Scrapy supports setting the user-agent of all requests to a custom value, which is useful, for example, if you want to scrape the mobile version of a website. Just put the user agent in the settings.py file in your project, and it will be automatically used for all requests:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.57 Mobile Safari/537.36'
When to use Beautiful Soup and when to use Scrapy
Okay, so you'll ultimately have to make a choice between Beautiful Soup and Scrapy yourself, but here's a quick summary of the differences for you to keep in mind:
- Beautiful Soup is generally easier to use and more flexible than Scrapy, making it a solid choice for when you just need to extract data from a few simple web pages, and you don't expect that they will try to block you from scraping them.
- Scrapy is more powerful and customizable, making it a better choice when you want to scrape a whole website, follow links from one page to another, deal with cookies and blocking, and export a lot of data in multiple formats.
So choose BeautifulSoup if you're working on a smaller or more straightforward project, or if you need very detailed control over parsing individual pages. Choose Scrapy if you're working on larger and more complex data extraction projects where efficiency, speed, and extensive built-in functionalities are required.
Remember, your choice might also depend on the specific requirements of your project, such as the complexity of the websites you are scraping, the volume of data, and your comfort with Python programming. For some Python projects, even a combination of the two tools could be the best approach.
Further reading
Beautiful Soup
- How to parse HTML in Python with PyQuery or Beautiful Soup
- Web scraping with Beautiful Soup and Requests
- How to parse XML in Python
- Python CSS selectors and how to use them with Beautiful Soup
Scrapy