Scrapy vs Selenium: when to use them for web scraping

Comparing Selenium and Scrapy is like comparing apples and oranges. One is a web testing automation toolset; the other is a complete web crawling framework.

And yet, both are popular choices for web scraping in Python.

Why is that? And why should you choose between them?

Both Scrapy and Selenium are used for web scraping for good reason. So, let's find out what they're suitable for and when you should use them.

What is Scrapy?

Scrapy is the preferred tool for large-scale scraping projects due to its advantages over other popular Python web scraping libraries

— Web scraping with Scrapy 101

Scrapy is an open-source framework written in Python and explicitly designed to crawl websites for data extraction. It provides an easy-to-use API for web scraping and built-in functionality for handling large-scale data scraping projects. Although it can only be used with Python, it's the most powerful and versatile tool for web scraping (except for Crawlee, the Node.js alternative).

Why developers use Scrapy

Scrapy is engineered for speed and efficiency in web crawling and scraping. It utilizes an event-driven, non-blocking IO model that facilitates asynchronous request handling, which significantly boosts its performance. Scrapy provides a suite of tools for data processing and storage, making it highly suitable for large data extraction tasks.
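To see why non-blocking request handling matters, here is a minimal sketch of the idea. It is not Scrapy code: Scrapy builds on the Twisted networking library, while this illustration uses Python's standard asyncio module, and the `fetch` and `crawl` helpers, the URLs, and the simulated latency are all assumptions made for the example.

```python
import asyncio
import time


async def fetch(url: str, latency: float) -> str:
    """Pretend to download a page; asyncio.sleep stands in for network wait."""
    await asyncio.sleep(latency)
    return f"<html>{url}</html>"


async def crawl(urls: list[str], latency: float = 0.1) -> list[str]:
    # Start every request at once; the event loop switches between them
    # whenever one is waiting on (simulated) I/O.
    return await asyncio.gather(*(fetch(url, latency) for url in urls))


if __name__ == "__main__":
    urls = [f"https://www.example.com/page/{i}" for i in range(10)]
    started = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - started
    # The ten waits overlap instead of running back to back.
    print(f"Fetched {len(pages)} pages in {elapsed:.2f}s")
```

Because the waits overlap, the total time stays close to the latency of a single request rather than the sum of all of them, which is the effect Scrapy relies on when it keeps many requests in flight.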

Scrapy features and code examples

Spiders

Scrapy provides a built-in spidering framework that allows you to easily define and customize web crawlers to collect the data you need. You can manage a queue of requests to scrape with automatic deduplication and checking of maximum recursion depth.

Here, for example, is a spider that scrapes the titles of all linked pages up to a depth of 5:

import scrapy


class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://www.example.com']
    custom_settings = {
        # Stop following links more than 5 hops away from the start URL
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        # Emit one item per crawled page
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # response.follow resolves relative hrefs against the current page URL
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Source: Scrapy vs. Beautiful Soup
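The deduplication and depth limit that Scrapy applies to its request queue can be pictured with a small standard-library sketch. This is not how Scrapy is implemented (its scheduler and visiting order differ); the `LINKS` graph and the `crawl_order` function are hypothetical stand-ins for a real site and a real crawler.

```python
from collections import deque

# Hypothetical in-memory "website": each page maps to the pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": ["/d"],
    "/d": [],
}


def crawl_order(start: str, depth_limit: int) -> list[str]:
    """Return pages in visiting order, skipping duplicates and deep links."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth >= depth_limit:
            continue  # links found on pages at the limit are not followed
        for link in LINKS.get(page, []):
            if link not in seen:  # deduplication: each page is queued once
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl_order("/", depth_limit=2))
```

With a depth limit of 2, the page `/d` is never visited because it sits three links away from the start page, and `/b` is visited only once even though two pages link to it.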

Support for data handling

Scrapy supports the handling and exporting of data in multiple formats, such as JSON, CSV, and XML:

# Run the spider and save output into a JSON file
scrapy crawl myspider -o myfile.json

# Run the spider and save output into a CSV file
scrapy crawl myspider -o myfile.csv

# Run the spider and save output into an XML file
scrapy crawl myspider -o myfile.xml
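Exports can also be declared in the project settings instead of on the command line. The following is a minimal sketch using Scrapy's `FEEDS` setting; the output paths are hypothetical and the options shown are only a small selection.

```python
# settings.py (or a spider's custom_settings); the file paths are hypothetical
FEEDS = {
    "output/books.json": {"format": "json", "encoding": "utf8", "indent": 2},
    "output/books.csv": {"format": "csv"},
}
```

With a setting like this in place, a plain `scrapy crawl myspider` would write both files in one run.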

Middleware

Scrapy middleware gives you the ability to tailor and improve your spiders for various scenarios. You can modify requests, efficiently manage responses, and add new functionalities to your spiders:

from scrapy.exceptions import IgnoreRequest


class KeywordFilterMiddleware:  # hypothetical class name for illustration

    def process_spider_input(self, response, spider):
        filter_keyword = "Apify as a data cloud platform for AI"  # Replace with the keyword you want to exclude
        if filter_keyword in response.text:
            spider.logger.info(f"Filtered response containing '{filter_keyword}': {response.url}")
            # Raise IgnoreRequest to stop processing this response
            raise IgnoreRequest(f"Response contains the filtered keyword: {filter_keyword}")
        spider.logger.info(f"Response does not contain the filtered keyword: {filter_keyword}")
        return None

Source: Scrapy middleware: customizing your Scrapy spider

Item pipelines for data cleaning and storage

Scrapy provides a structured way to process scraped data by executing a series of components sequentially. You can clean, validate, and transform data to make sure it meets the required format or quality before storing:

from scrapy import Spider
from ..items import BookItem

class CleaningPipeline:

    def process_item(self, item: BookItem, spider: Spider) -> BookItem:
        number_map = {
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
        }
        return BookItem(
            title=item['title'],
            price=float(item['price'].replace('£', '')),
            rating=number_map[item['rating'].split(' ')[1].lower()],
            in_stock=item['in_stock'].lower() == 'in stock',
        )

Source: Handling data in Scrapy: databases and pipelines

You can then export cleaned data to SQL databases, JSON files, or any other storage solution:

import psycopg
from scrapy import Spider
from ..items import BookItem

class StoringPipeline:

    def __init__(self) -> None:
        self.conn = psycopg.connect("host='localhost' dbname='postgres' user='postgres' password='postgres' port=5432")

    def process_item(self, item: BookItem, spider: Spider) -> BookItem:
        # Pass values as query parameters so psycopg handles quoting and escaping
        query = 'INSERT INTO books (title, price, rating, in_stock) VALUES (%s, %s, %s, %s);'
        params = (item['title'], item['price'], item['rating'], item['in_stock'])
        with self.conn.cursor() as cursor:
            cursor.execute(query, params)
        self.conn.commit()
        return item

Source: Handling data in Scrapy: databases and pipelines
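For the cleaning step to run before the storing step, both pipelines have to be registered with an order. A minimal sketch of that configuration, assuming a hypothetical project package called `books`, might look like this:

```python
# settings.py: "books" is a hypothetical project package name
ITEM_PIPELINES = {
    "books.pipelines.CleaningPipeline": 100,  # lower number, runs first
    "books.pipelines.StoringPipeline": 200,   # runs second, on cleaned items
}

# Scrapy passes each item through the pipelines in ascending order of
# these integers; sorting the dict the same way shows the resulting order.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```

The exact numbers are arbitrary (values between 0 and 1000 are customary); only their relative order matters.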

Scrapy pros and cons

What we've seen so far is an example of what Scrapy can do and why it's such a popular choice for web scraping. But no tool is perfect. There are some things it's not so great for. So, let's balance things out a little with a simple table of Scrapy pros and cons:

SCRAPY	Pros	Cons
Speed	High-speed crawling and scraping	-
Scale	Capable of large-scale data extraction	-
Efficiency	Memory-efficient processes	-
Customization	Highly customizable and extensible	-
Dynamic content	-	Doesn't support dynamic content rendering
Browser interaction	-	Lacks browser interaction and automation
Learning curve	-	Steep learning curve

As you can see, two significant disadvantages of Scrapy are that it can't scrape dynamic content on its own (though it is possible via plugins) and it lacks browser interaction and automation.

It's precisely in these areas that Selenium shines. So, let's now turn our attention to Selenium.

What is Selenium?

Selenium offers several ways to interact with websites, such as clicking buttons, filling in forms, scrolling pages, taking screenshots, and executing JavaScript code. That means Selenium can be used to scrape dynamically loaded content. Add to this its cross-language and cross-browser support, and it's little wonder that Selenium is one of the preferred frameworks for web scraping in Python.

— Web scraping with Selenium

Selenium's architecture is built around the WebDriver, an API providing a unified interface to interact with web browsers. This toolset supports multiple programming languages, including Java, JavaScript, Python, C#, PHP, and Ruby. As a result, it's a flexible platform for developers to automate web browser actions.

Why developers use Selenium

When it comes to web scraping, Selenium's strength lies in its ability to interact with dynamic web content rendered through JavaScript. That makes it indispensable for projects targeting AJAX-heavy websites. Whenever you have to scrape a dynamic page or a website using certain types of pagination, such as infinite scroll, you need a browser. That's when browser automation tools like Selenium or Playwright come into play.

Selenium features and code examples

Dynamic content handling

Selenium allows for the scraping of content that isn't immediately available in the page's HTML source but is loaded or altered through user interactions or after the initial page load.

Here's an example of a Selenium script scraping The Hitchhiker's Guide to the Galaxy product page on Amazon and saving a screenshot of the accessed page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Insert the website URL that we want to scrape
url = 'https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C'

# Selenium 4 takes the driver path through a Service object
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get(url)

# Create a dictionary with the scraped data
book = {
    'book_title': driver.find_element(By.ID, 'productTitle').text,
    'author': driver.find_element(By.CSS_SELECTOR, 'span.author a').text,
    'edition': driver.find_element(By.ID, 'productSubtitle').text,
    'price': driver.find_element(By.CSS_SELECTOR, '.a-size-base.a-color-price').text,
}

# Save a screenshot from the accessed page and print the dictionary contents to the console
driver.save_screenshot('book.png')
print(book)

# Close the browser and end the WebDriver session
driver.quit()

Source: Python web scraping: a comprehensive guide

Browser automation

Selenium simulates real user interactions with web browsers, including clicking buttons, filling out forms, scrolling, and navigating through pages. This capability is needed for accessing content that requires interaction or simulating a human user to bypass anti-scraping measures.

Headless browser testing

Selenium supports headless browser execution, where browsers run in the background with no visible UI. This feature is particularly useful for scraping tasks on server environments or for speeding up the scraping process, as it consumes fewer resources.

Locating elements

Selenium provides various strategies for locating web elements (by ID, name, XPath, CSS selectors, etc.). This enables precise targeting of the data to be extracted, even from complex page structures.

Here's a simple example of using Selenium to locate a product in an e-shop:

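from selenium.webdriver.common.keys import Keys

# Assumes `driver` and `By` are set up as in the earlier example
search_box = driver.find_element(By.ID, 'search-field')
search_box.send_keys('t-shirt')
search_box.send_keys(Keys.ENTER)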

Source: Web scraping with Selenium

Implicit and explicit waits

Selenium offers mechanisms to wait for certain conditions or a maximum time before proceeding so that dynamically loaded content is fully loaded before attempting to scrape it. This is necessary for reliable data extraction from pages where content loading is triggered by user actions or depends on asynchronous requests.

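Here's a simple code example of waiting up to 10 seconds for an h2 to appear, using the WebDriverWait class:

from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
element = wait.until(ec.presence_of_element_located((By.TAG_NAME, 'h2')))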

JavaScript execution

With Selenium, you can execute custom JavaScript code within the context of the current page. This feature can be used to modify web page behavior, access dynamically generated data, or interact with complex elements that are not easily accessible through standard web driver commands.

Here's an example of a Selenium scraping script that starts a browser instance so that the JavaScript on Amazon's product page runs before the data is extracted:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Insert the website URL that we want to scrape
url = "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0" \
      "?_encoding=UTF8&qid=1642536225&sr=8-1"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)

# Create a dictionary with the scraped data
book = {
    "book_title": driver.find_element(By.ID, 'productTitle').text,
    "author": driver.find_element(By.CSS_SELECTOR, '.a-link-normal.contributorNameID').text,
    "edition": driver.find_element(By.ID, 'productSubtitle').text,
    "price": driver.find_element(By.CSS_SELECTOR, '.a-size-base.a-color-price').text,
}

# Print the dictionary contents to the console
print(book)

# Close the browser and end the WebDriver session
driver.quit()

Source: Web scraping with JavaScript vs. Python
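Running your own JavaScript goes through the driver's `execute_script` method. One common use is scrolling a page with infinite scroll until no more content loads. The sketch below assumes that growth in `document.body.scrollHeight` is a good enough signal that new content arrived; the `scroll_to_bottom` helper, the pause length, and the round limit are illustrative choices.

```python
import time


def scroll_to_bottom(driver, pause: float = 1.0, max_rounds: int = 20) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Run JavaScript in the page to jump to the current bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height
    return rounds
```

After `driver.get(url)`, calling `scroll_to_bottom(driver)` loads the lazily rendered items so that the usual `find_element` calls can see them. The `max_rounds` cap keeps the loop from running forever on feeds that never end.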

Screenshot capture

Selenium lets you capture screenshots during the scraping process. This can be useful for debugging, monitoring the process, or verifying the content being scraped, especially when dealing with complex sites or when developing and testing your scraping scripts:

driver.save_screenshot('screenshot.png')

Selenium pros and cons

We did it for Scrapy, so let's give Selenium the same treatment. Here's a simple table that demonstrates the advantages and disadvantages of using Selenium for web scraping:

SELENIUM	Pros	Cons
Browser interactions	Can automate and interact with browsers	-
Dynamic content handling	Effectively handles dynamic web pages	-
Compatibility	Supports cross-browser and device testing	-
Usability	Relatively easy to use for automation tasks	-
Performance	-	Can be slow and resource-intensive
Scalability for scraping	-	Does not scale well for extensive data scraping

Scrapy vs. Selenium: comparison table

Now that we've looked at both Selenium and Scrapy one at a time, let's make our assessment a little clearer with this side-by-side comparison of the two:

	SCRAPY	SELENIUM
Main purpose	Web scraping and crawling	Web testing and automation
Supported languages	Python	Java, JavaScript, Python, C#, PHP, Ruby
Execution speed	Fast	Slower, depends on browser speed
Handling of dynamic content	Limited, requires middleware	Natively supports dynamic content
Resource efficiency	High (low resource consumption)	Lower (due to browser automation)
Scalability	Highly scalable for web scraping	Less scalable for scraping, better suited for testing
Browser interaction	No direct interaction, requires plugins	Direct browser interaction and automation
When to use	Large-scale data extraction from static and semi-dynamic websites	Testing web applications and scraping dynamic content requiring interaction

The verdict: use Scrapy and Selenium for the right tasks

If you need to scrape web pages that are built with a JavaScript library, lazy-load their content, or make Fetch/XHR requests for the data they render, you're probably dealing with a dynamic website.

In those instances, Selenium is the right tool to use.

However, you shouldn't use Selenium for scraping all the time.

Launching a browser instance with Selenium is more resource-intensive than retrieving a page's HTML with Scrapy. For large scraping jobs, Selenium will be painfully slow and considerably more expensive to run.

So you should limit the use of Selenium to the necessary tasks and use it together with Scrapy whenever possible.