Python web scraping tutorial

Detailed code examples for Python libraries like Requests, Beautiful Soup, and Scrapy.

Web scraping is the process of automatically extracting data from websites, and Python has been the go-to language for data extraction for years. It boasts a large community of developers and a wide range of web scraping tools to help scrapers extract almost any data from any website.

Here, we'll explore some of the best libraries and frameworks available for web scraping in Python and provide code examples for using them in different web scraping scenarios.

In this Python web scraping tutorial, you’ll learn how to:

  1. Prepare a Python coding environment for web scraping
  2. Web scrape in Python using HTTP clients
  3. Parse HTML content with libraries such as BeautifulSoup, LXML, and PyQuery
  4. Handle dynamic websites using Selenium and Playwright
  5. Utilize advanced web scraping techniques with Scrapy
  6. Export the scraped data to CSV and Excel
  7. Deploy Python scrapers in the cloud
  8. Find additional learning resources for web scraping with Python
  9. Answer frequently asked questions about web scraping

But before we start with the tutorial, let’s take a quick peek at this summary table. It gives an overview of all the Python web scraping libraries we’ll cover in this article. This table will help you navigate the content and provide you with an easy way to remember the topics covered.

| 📚 Library | 💡 Features | ⚡️ Performance | 👨‍💻 User-friendliness | 👥 Community | 📥 Installation Command |
|---|---|---|---|---|---|
| Requests | HTTP(S) proxy support; connection timeouts; chunked requests | Moderate | Beginner friendly | Well-established, strong community | pip install requests |
| HTTPX | Requests-compatible API; integrated command-line client; supports synchronous and asynchronous requests | Fast | Intermediate | New, growing community | pip install httpx |
| BeautifulSoup | Intuitive syntax; efficient DOM parsing, manipulation, and rendering; parses nearly any HTML or XML document | Moderate / Limited scalability | Beginner friendly | Well-established, strong community | pip install beautifulsoup4 |
| LXML | Fast XML/HTML processing; full feature set for XML, XPath, XSLT; compatible with the ElementTree API | Fast | Intermediate | Well-established, medium-sized community | pip install lxml |
| PyQuery | jQuery-like syntax for DOM manipulation; parses HTML documents | Moderate | Beginner friendly (for devs with a jQuery background) | Small, niche community | pip install pyquery |
| Selenium | Automates web browsers; supports multiple browsers and operating systems; handles JavaScript-generated content for scraping dynamic pages | Slow / Resource-intensive | Intermediate/Advanced | Well-established, strong community | pip install selenium |
| Playwright | Supports multiple browsers; handles JavaScript-generated content for scraping dynamic pages; synchronous and asynchronous APIs | Slow / Resource-intensive | Intermediate/Advanced | Fast-growing, strong community | pip install playwright |
| Scrapy | Fast data extraction and website crawling; asynchronous requests; tools for scraping, processing, and exporting data | Very fast / Highly scalable | Advanced | Well-established, strong community | pip install scrapy |

Preparing a Python coding environment for web scraping

Before diving into web scraping with Python, we need to make sure our development environment is ready. To set up your machine for web scraping, you need to install Python, choose an Integrated Development Environment (IDE), and understand the basics of how to install the Python libraries necessary for efficiently extracting data from the web.

Installing Python

  1. Download Python: Visit the official Python website and download the latest version for your operating system.
  2. Install Python: Run the installer and follow all the prompts until Python is properly installed on your computer.

IDE for Python

Once you have Python installed, you’ll need a place to write Python code. Basically, you'll need an IDE.

An Integrated Development Environment (IDE) provides tools to write, test, and debug your code. Some popular IDEs for Python development are:

  • PyCharm: Offers a robust environment specifically for Python development. Might be the ideal IDE for Python-exclusive developers.
  • Visual Studio Code (VS Code): A lightweight, versatile IDE that supports Python through extensions. However, VS Code doesn’t come equipped to run Python out of the box. To enable that, you'll need to follow a few extra steps described in the VS Code documentation.
  • Jupyter Notebook: This notebook is ideal for data analysis and exploratory work with Python. It requires minimal setup and allows you to run your code directly in the browser.

Ultimately, the IDE choice comes down to preference. All of the options above are well-equipped to run Python code and will work fine for our web scraping purposes. So, go ahead and choose an IDE that suits your preferences.

How to install Python libraries

In essence, Python libraries are collections of pre-packed functions and methods that allow you to perform many actions without writing everything from scratch. Libraries are an integral part of software development. The most commonly used way to install Python libraries is by using pip, Python's package-management system.

Installing a library with pip is very simple:

  1. Open your command line or terminal.
  2. Use the pip install command followed by the library name. For example, to install the requests library, you would type pip install requests.
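
As a quick illustration (the package names below are simply the ones covered later in this tutorial), pip can install a single library or several of them in one command:

# Install a single library
pip install requests

# Or install several libraries at once
pip install requests beautifulsoup4 lxml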

Easy, right? Now that you have the necessary basic knowledge, let's look at the Python libraries that make the language such a powerful and popular choice for web scraping.

Web scraping in Python using HTTP clients

In the context of web scraping, HTTP clients send requests to the target website and retrieve information such as the website’s HTML code or JSON payload.

Requests

Requests is the most popular HTTP library for Python. It is supported by solid documentation and has been adopted by a huge community, but it lacks support for asynchronous execution. For such cases, you can use the HTTPX library.

⚒️ Main features of Python Requests library:

  • Keep-Alive & Connection Pooling
  • Browser-style SSL Verification
  • HTTP(S) Proxy Support
  • Connection Timeouts
  • Chunked Requests

⚙️ Installing Requests

pip install requests

💡 Code sample

Send a request to the target website, retrieve its HTML code, and print the result to the console.

import requests

response = requests.get('https://news.ycombinator.com')

print(response.text)
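
The feature list above mentions proxy support and connection timeouts. Here's a minimal sketch of how those two options can be passed to requests.get(); the proxy address is a placeholder you'd replace with your own proxy:

import requests

# Placeholder proxy address; replace with your own proxy if you use one
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}

# timeout (in seconds) raises requests.exceptions.Timeout if the server is too slow
response = requests.get(
    'https://news.ycombinator.com',
    proxies=proxies,
    timeout=5,
)

print(response.status_code)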

HTTPX

HTTPX is a fully featured HTTP client library for Python. It includes an integrated command-line client and provides both synchronous and asynchronous APIs.

⚒️ Main features of HTTPX

  • A broadly requests-compatible API
  • An integrated command-line client
  • Standard synchronous interface, but with async support if you need it
  • Fully type annotated

⚙️ Installing HTTPX

pip install httpx
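
The integrated command-line client mentioned in the feature list ships as an optional extra. If you'd like to try it, a short sketch of installing and invoking it looks like this:

# Install HTTPX together with its optional command-line client
pip install 'httpx[cli]'

# Fetch a page straight from the terminal
httpx https://news.ycombinator.com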

💡 Code sample

Similar to the Requests example, we will send a request to the target website, retrieve the HTML of the page and print it to the console along with the request status code.

HTTPX synchronous code sample

import httpx

response = httpx.get('https://news.ycombinator.com')

status_code = response.status_code
html = response.text

print(status_code)
print(html[:200])  # print first 200 characters of html

HTTPX asynchronous code sample utilizing AsyncIO

import asyncio
import httpx

async def main() -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get('https://news.ycombinator.com')
    status_code = response.status_code
    html = response.text
    print(status_code)
    print(html[:200])  # print first 200 characters of html

if __name__ == '__main__':
    asyncio.run(main())

Web scraping with Python using HTML parsers

In web scraping, HTML and XML parsers provide an interface to obtain the data from the response we get back from our target website, often in the form of HTML code. A library such as Beautiful Soup will help us parse this response.

BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML files with just a few lines of code. BeautifulSoup is relatively easy to use and presents itself as a lightweight option for tackling simple scraping tasks with speed.

⚒️ Main features of BeautifulSoup

  • Provides simple, Pythonic idioms for navigating, searching, and modifying the parse tree.
  • Sits on top of a parser of your choice (such as html.parser or lxml), so you can balance speed and leniency.
  • Offers great flexibility, being able to parse nearly any HTML or XML document, even with messy or broken markup.

⚙️ Installing Beautiful Soup

pip install beautifulsoup4

💡 Code sample

Let's now see how we can use BeautifulSoup + HTTPX to extract the title, rank, and URL from all the articles on the first page of Hacker News.

import httpx
from bs4 import BeautifulSoup

# Function to get HTML content from a URL
def get_html_content(url: str, timeout: int = 10) -> str:
    response = httpx.get(url, timeout=timeout)
    return str(response.text)

# Function to parse a single article
def parse_article(article) -> dict:
    url = article.find(class_='titleline').find('a').get('href')
    title = article.find(class_='titleline').getText()
    rank = article.find(class_='rank').getText().replace('.', '')
    return {'url': url, 'title': title, 'rank': rank}

# Function to parse all articles in the HTML content
def parse_html_content(html: str) -> list:
    soup = BeautifulSoup(html, features='html.parser')
    articles = soup.find_all(class_='athing')
    return [parse_article(article) for article in articles]

# Main function to get and parse HTML content
def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')
    data = parse_html_content(html_content)
    print(data)

if __name__ == '__main__':
    main()

A few seconds after running the script, we will see a list of dictionaries containing each article's URL, rank, and title printed to our console.

LXML

lxml is widely regarded as a fast parsing library because it is built on top of two high-quality C libraries: libxml2 and libxslt.

⚒️ Main features of LXML

  • Provides the ElementTree API and XSLT support.
  • Great for when you need to parse complex and large documents.
  • Offers great flexibility, being able to parse nearly any XML or HTML document.

⚙️ Installing LXML

pip install lxml

💡 Code sample

import requests
from lxml import html

# Function to get HTML content from a URL
def get_html_content(url: str, timeout: int = 10) -> str:
    response = requests.get(url, timeout=timeout)
    return response.text

# Function to parse HTML content and extract data
def parse_html_content(html_content: str) -> list:
    root = html.fromstring(html_content)
    articles = root.xpath('//tr[@class="athing"]')
    data = []

    for article in articles:
        # Extract url, title, and rank from each article
        url = article.xpath('.//span[@class="titleline"]/a/@href')
        title = article.xpath('.//span[@class="titleline"]/a/text()')
        rank = article.xpath('.//span[@class="rank"]/text()')

        data.append({
            'url': url[0] if url else '',
            'title': title[0] if title else '',
            'rank': rank[0] if rank else '',
        })

    return data

# Main function to get and parse HTML content
def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')
    data = parse_html_content(html_content)
    print(data)

if __name__ == '__main__':
    main()

PyQuery

PyQuery is built on top of lxml and is very similar to jQuery. So, if you're comfortable with jQuery but quite new to Python, PyQuery is a good place to start.

⚒️ Main features of PyQuery

  • Capable of web element manipulation.
  • Provides element filtering and operation chaining.
  • Intuitive jQuery-like syntax and easy-to-use API.

⚙️ Installing PyQuery

pip install pyquery

💡 Code sample

import requests
from pyquery import PyQuery

# Function to get HTML content from a URL
def get_html_content(url: str, timeout: int = 10) -> str:
    response = requests.get(url, timeout=timeout)
    return str(response.text)

# Function to parse HTML content and extract data
def parse_html_content(html_content: str) -> list:
    pq = PyQuery(html_content)
    articles = pq('.athing')
    data = []

    for article in articles:
        pq_article = pq(article)
        url = pq_article('.titleline a')
        title = pq_article('.titleline a')
        rank = pq_article('.rank')

        data.append(
            {
                'url': url.attr('href') if url else '',
                'title': title.text() if title else '',
                'rank': rank.text().replace('.', '') if rank else '',
            }
        )

    return data

# Main function to get and parse HTML content
def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')

    data = parse_html_content(html_content)
    print(data)

if __name__ == '__main__':
    main()

Scraping dynamic websites with Python

Browser automation libraries and frameworks have an off-label use for web scraping. Their ability to emulate a real browser is essential for accessing data on websites that require JavaScript to load their content.

Selenium

Selenium is primarily a browser automation framework and ecosystem with an off-label use for web scraping. It uses the WebDriver protocol to control a headless browser and perform actions like clicking buttons, filling out forms, and scrolling.

Because of its ability to render JavaScript, Selenium can be used to scrape dynamically loaded content.

⚒️ Main features of Selenium

  • Multi-browser support (Firefox, Chrome, Safari, Opera…).
  • Multi-language compatibility
  • Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.
  • Dynamic web elements handling

⚙️ Installing Selenium

# Install Selenium
pip install selenium

# We will also need to install webdriver-manager to run the code sample below
pip install webdriver-manager

💡 Code sample

To demonstrate some of Selenium's capabilities, let's go to Amazon, scrape The Hitchhiker's Guide to the Galaxy product page, and save a screenshot of the accessed page.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Insert the website URL that we want to scrape
url = 'https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C'

# Download the matching ChromeDriver and start the browser (Selenium 4 syntax)
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get(url)

# Create a dictionary with the scraped data
book = {
    'book_title': driver.find_element(By.ID, 'productTitle').text,
    'author': driver.find_element(By.CSS_SELECTOR, 'span.author a').text,
    'edition': driver.find_element(By.ID, 'productSubtitle').text,
    'price': driver.find_element(By.CSS_SELECTOR, '.a-size-base.a-color-price.a-color-price').text,
}

# Save a screenshot from the accessed page and print the dictionary contents to the console
driver.save_screenshot('book.png')
print(book)

# Close the browser when we're done
driver.quit()

After the script finishes its run, we will see a dictionary containing the book's title, author, edition, and price logged to the console and a screenshot of the page saved as book.png.

Output example:

{
    "book_title": "The Hitchhiker's Guide to the Galaxy: The Illustrated Edition",
    "author": "Douglas Adams",
    "edition": "Kindle Edition",
    "price": "$7.99"
}

Saved image:

Saving image with Selenium

Playwright

Playwright is an open-source framework for web testing and automation developed and maintained by Microsoft.

Despite having many features in common with Selenium, Playwright is considered a more modern and capable choice for automation, testing, and web scraping with Python.

⚒️ Main features of Playwright

  • Auto-wait. By default, Playwright waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.
  • Cross-browser support. Able to drive Chromium, WebKit, Firefox, and Microsoft Edge.
  • Cross-platform support. Available on Windows, Linux, and macOS, locally or on CI, headless, or headed.

⚙️ Installing Playwright

pip install playwright==1.40.0

# Install the required browsers
playwright install

💡 Playwright code sample

To highlight Playwright's features as well as its similarities with Selenium, let's go back to Amazon's website and extract some data from The Hitchhiker's Guide to the Galaxy.

Playwright version:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C')

    # Create a dictionary with the scraped data
    book = {
        'book_title': page.query_selector('#productTitle').inner_text().strip(),
        'author': page.query_selector('span.author a').inner_text().strip(),
        'edition': page.query_selector('#productSubtitle').inner_text().strip(),
        'price': page.query_selector('.a-size-base.a-color-price.a-color-price').inner_text().strip(),
    }

    print(book)
    page.screenshot(path='book.png')
    browser.close()

After the scraper finishes its run, the Chromium browser controlled by Playwright will close, and the extracted data will be logged to the console.
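
Since Playwright also exposes an asynchronous API (one of the features highlighted above), here's a sketch of the same scrape written with async/await; it assumes the same product page and selectors as the synchronous example:

import asyncio
from playwright.async_api import async_playwright

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C')

        # Extract the same fields as in the synchronous example
        book = {
            'book_title': (await page.inner_text('#productTitle')).strip(),
            'author': (await page.inner_text('span.author a')).strip(),
            'edition': (await page.inner_text('#productSubtitle')).strip(),
            'price': (await page.inner_text('.a-size-base.a-color-price.a-color-price')).strip(),
        }

        print(book)
        await page.screenshot(path='book.png')
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())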

Advanced web scraping techniques with Scrapy

Scrapy is a fast, high-level web crawling and web scraping framework written with Twisted, a popular event-driven networking framework that gives it asynchronous capabilities.

Unlike the tools mentioned earlier, Scrapy is a full-fledged web crawling framework designed specifically for data extraction, with built-in support for handling requests, processing responses, and exporting data.

Additionally, Scrapy provides handy out-of-the-box features, such as support for following links, handling multiple request types, and error handling, making it a powerful tool for web scraping projects of any size and complexity.

⚒️ Main features of Scrapy

  • Feed exports in multiple formats, such as JSON, CSV, and XML.
  • Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions.
  • An interactive shell console for trying out the CSS and XPath expressions to scrape data and debug your spiders.
  • Built-in extensions and middlewares for handling cookies, HTTP authentication, caching, user-agent spoofing, and more.
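
To illustrate the interactive shell mentioned above, this is roughly how you'd use it to try out selectors before putting them in a spider (the selectors here are just examples for Hacker News, which we scrape later in this section):

# Open Scrapy's interactive shell against a target page
scrapy shell "https://news.ycombinator.com"

# Inside the shell, experiment with CSS or XPath expressions, e.g.:
# >>> response.css('tr.athing .titleline a::text').getall()[:5]
# >>> response.xpath('//span[@class="rank"]/text()').getall()[:5]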

⚙️ Installing Scrapy

pip install scrapy

📁 Setting up a project in Scrapy

To demonstrate some of Scrapy's features, we'll once again scrape data from articles displayed on Hacker News.

We'll start by scraping the top 30 articles and then use Scrapy's CrawlSpider to follow the available page links and scrape all the articles on the website.

To begin, let's initialize the Scrapy project (which creates a new directory for us) and generate a spider:

# Start a new Scrapy project
scrapy startproject scrapydemo

# Move into the newly created folder
cd scrapydemo

# Generate the spider
scrapy genspider demo "https://news.ycombinator.com"

After our spider is generated, let's specify the encoding for the output file that will contain the scraped data by adding FEED_EXPORT_ENCODING = "utf-8" to our settings.py file.

Scrapy project directory

💡 Scrapy code sample

Finally, go to the spiders/demo.py file and write some code:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        yield scrapy.Request(url='https://news.ycombinator.com/')

    def parse(self, response):
        for article in response.css('tr.athing'):
            yield {
                'URL': article.css('.titleline a::attr(href)').get(),
                'title': article.css('.titleline a::text').get(),
                'rank': article.css('.rank::text').get().replace('.', ''),
            }

Then, let's use the following command to run the spider and store the scraped data in a results.json file.

scrapy crawl demo -o results.json

Read web scraping with Scrapy for a detailed look at the library, or compare it with Crawlee, Apify's open-source web scraping and automation library.

🕷️ Using Scrapy's CrawlSpider

Now that we know how to extract data from the articles on the first page of Hacker News, let's use Scrapy's CrawlSpider to follow the next page links and collect the data from all the articles on the website.

To do that, we will make some adjustments to our spiders/demo.py file:

# Add imports CrawlSpider, Rule and LinkExtractor 👇
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Change the spider from "scrapy.Spider" to "CrawlSpider"
class DemoSpider(CrawlSpider):
    name = 'demo'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['https://news.ycombinator.com/news?p=1']

    # Define a rule that should be followed by the link extractor.
    # In this case, Scrapy will follow all the links with the "morelink" class
    # And call the "parse_article" function on every crawled page
    rules = (
        Rule(LinkExtractor(restrict_css='.morelink'), callback='parse_article', follow=True),
    )

    # When using the CrawlSpider we cannot use a parse function called "parse".
    # Otherwise, it will override the default function.
    # So, just rename it to something else, for example, "parse_article"
    def parse_article(self, response):
        for article in response.css('tr.athing'):
            yield {
                'URL': article.css('.titleline a::attr(href)').get(),
                'title': article.css('.titleline a::text').get(),
                'rank': article.css('.rank::text').get().replace('.', ''),
            }

Finally, let's add a small delay between each of Scrapy's requests to avoid overloading the server. We can do that by adding DOWNLOAD_DELAY = 0.5 to our settings.py file.
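
For reference, after both tweaks mentioned in this section, the relevant additions to scrapydemo/settings.py would look roughly like this:

# scrapydemo/settings.py (relevant additions)

# Encoding for exported feeds (e.g. results.json)
FEED_EXPORT_ENCODING = 'utf-8'

# Wait half a second between requests to avoid overloading the server
DOWNLOAD_DELAY = 0.5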

Adding download delay with Scrapy

Great! Now, we're ready to run our scraper and get the data from all the articles displayed on Hacker News. Just run the command scrapy crawl demo -o results.json and wait for the run to finish.

Expected output:

Scrapy output

How to scrape dynamic websites with Scrapy and Playwright

Scrapy and Playwright make up one of the most efficient combos for modern web scraping in Python.

This combo allows us to benefit from Playwright's ability to access dynamically loaded content and retrieve the rendered page so that Scrapy can extract data from it.

To integrate Playwright with Scrapy, we'll use the scrapy-playwright library. Then, we'll scrape https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ to demonstrate how to extract data from a website using Playwright and Scrapy.

Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.

Mint Mobile product page with JavaScript disabled:

Mint Mobile with JavaScript disabled

Mint Mobile product page with JavaScript enabled:

Mint Mobile with JavaScript enabled

⚙️ Playwright project setup

Start by creating a directory to house our project and installing the necessary dependencies:

# Create new directory and move into it
mkdir scrapy-playwright
cd scrapy-playwright

Installing Scrapy and scrapy-playwright

# Install Scrapy and scrapy-playwright
pip install scrapy==2.11.0 scrapy-playwright==0.0.34

# Install the required browsers if you are running Playwright for the first time
playwright install

# Or install a subset of the available browsers you plan on using
playwright install firefox chromium

Next, start the Scrapy project and generate a spider:

# Start a new Scrapy project
scrapy startproject pwsdemo

# Move into the newly created folder
cd pwsdemo

# Generate a Scrapy Spider
scrapy genspider demo "https://www.mintmobile.com"

Now, let's activate scrapy-playwright by adding DOWNLOAD_HANDLERS and TWISTED_REACTOR to the scraper configuration in settings.py:

# pwsdemo/settings.py
BOT_NAME = 'pwsdemo'
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
FEED_EXPORT_ENCODING = 'utf-8'
NEWSPIDER_MODULE = 'pwsdemo.spiders'
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['pwsdemo.spiders']
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Great! We're now ready to write some code to scrape our target website.

💡 Code sample for Playwright and Scrapy

So, let's use Playwright + Scrapy to extract data from Mint Mobile.

import scrapy
from scrapy_playwright.page import PageMethod

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        yield scrapy.Request(
            'https://www.mintmobile.com/product/google-pixel-7-pro-bundle/',
            meta={
                # Use Playwright
                'playwright': True,
                # Keep the page object so we can work with it later on
                'playwright_include_page': True,
                # Use PageMethods to wait for the content we want to scrape to be properly loaded before extracting the data
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', 'div.m-productCard--device'),
                ],
            },
        )

    def parse(self, response):
        yield {
            'name': response.css('div.m-productCard__heading h1::text').get().strip(),
            'memory': response.css('div.composited_product_details_wrapper > div > div > div:nth-child(2) > div.label > span::text').get().replace(':', '').strip(),
            'pay_monthly_price': response.css('div.composite_price_monthly > span::text').get(),
            'pay_today_price': response.css('div.composite_price p.price span.amount::attr(aria-label)').get().split()[0],
        }

Run the spider

Finally, run the spider using the following command to scrape the target data and store it in a results.json file.

scrapy crawl demo -o results.json

Expected content of results.json:

[
    {
        "name": "Google Pixel 7 Pro",
        "memory": "128GB",
        "pay_monthly_price": "50",
        "pay_today_price": "589"
    }
]

Exporting the data to CSV and Excel

There are many ways to export data to CSV and Excel in Python using built-in Python modules or external libraries like Pandas.

Exporting data to CSV using the CSV module

The csv module comes built in with Python, so there's no need to install it separately. We'll reuse the code we previously developed in the BeautifulSoup section, but instead of printing the data to the console, we'll save it as a CSV file.

In the code below, we've adjusted the main function to write the parsed data to a CSV file named data.csv using the csv module's DictWriter class.

import httpx
import csv
from bs4 import BeautifulSoup

# Function to get HTML content from a URL
def get_html_content(url: str, timeout: int = 10) -> str:
    response = httpx.get(url, timeout=timeout)
    return str(response.text)

# Function to parse a single article
def parse_article(article) -> dict:
    url = article.find(class_='titleline').find('a').get('href')
    title = article.find(class_='titleline').getText()
    rank = article.find(class_='rank').getText().replace('.', '')
    return {'url': url, 'title': title, 'rank': rank}

# Function to parse all articles in the HTML content
def parse_html_content(html: str) -> list:
    soup = BeautifulSoup(html, features='html.parser')
    articles = soup.find_all(class_='athing')
    return [parse_article(article) for article in articles]

# Main function to get and parse HTML content
def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')
    data = parse_html_content(html_content)

    # Write data to a CSV file
    with open('data.csv', 'w', newline='') as csvfile:
        fieldnames = ['url', 'title', 'rank']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for row in data:
            writer.writerow(row)

if __name__ == '__main__':
    main()

Exporting data to Excel using the Pandas library

Unfortunately, there isn't a built-in method for exporting data to Excel like there is for CSV. But don't worry - we can still easily achieve this using the popular Pandas library.

We'll continue working with the code we developed in the BeautifulSoup section, but this time, we'll adjust it to save the data as an Excel (xlsx) file.

In the code below, we've modified the main function to utilize the DataFrame class from the pandas module to write the data to an xlsx file. If you don't have the dependencies yet, install them with pip install pandas openpyxl (pandas relies on openpyxl as its engine for .xlsx files). Once you run the code, you'll find a newly created data.xlsx file saved in your directory.

import httpx
import pandas as pd
from bs4 import BeautifulSoup

# Function to get HTML content from a URL
def get_html_content(url: str, timeout: int = 10) -> str:
    response = httpx.get(url, timeout=timeout)
    return str(response.text)

# Function to parse a single article
def parse_article(article) -> dict:
    url = article.find(class_='titleline').find('a').get('href')
    title = article.find(class_='titleline').getText()
    rank = article.find(class_='rank').getText().replace('.', '')
    return {'url': url, 'title': title, 'rank': rank}

# Function to parse all articles in the HTML content
def parse_html_content(html: str) -> list:
    soup = BeautifulSoup(html, features='html.parser')
    articles = soup.find_all(class_='athing')
    return [parse_article(article) for article in articles]

# Main function to get and parse HTML content
def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')
    data = parse_html_content(html_content)

    # Convert data to a DataFrame
    df = pd.DataFrame(data)

    # Write DataFrame to an Excel file
    df.to_excel('data.xlsx', index=False)

if __name__ == '__main__':
    main()

How to deploy Python scrapers in the cloud

Next, we will learn how to deploy our scrapers to the cloud using Apify so we can configure them to run on a schedule and access many other features of the platform.

Apify uses serverless cloud programs called Actors that run on the Apify platform and do computing jobs.

To demonstrate this, we'll create a development template using the Apify SDK, BeautifulSoup, and HTTPX and adapt the generated boilerplate code to run our BeautifulSoup Hacker News scraper. So, let’s get started.

Installing the Apify CLI

Via homebrew

On macOS (or Linux), you can install the Apify CLI via the Homebrew package manager.

brew install apify/tap/apify-cli

Via NPM

Install or upgrade Apify CLI by running:

npm -g install apify-cli

Creating a new Actor

Once you have the Apify CLI installed on your computer, simply run the following command in the terminal:

apify create bs4-actor

Then, go ahead and choose Python → BeautifulSoup & HTTPX → Install template

Installing the Apify BeautifulSoup & HTTPX code template

This command will create a new folder named bs4-actor, install all the necessary dependencies, and generate boilerplate code that we can use to kickstart our development using BeautifulSoup, HTTPX, and the Apify SDK for Python.

Finally, move to the newly created folder and open it using your preferred code editor. In this example, I’m using VS Code.

cd bs4-actor
code .

Testing the Actor locally

The template already creates a fully functional scraper. You can run it using the command apify run if you would like to give it a try before we modify the code. The scraped results will be stored under storage/datasets.

Great! Now that we've familiarised ourselves with the template, let's go to src/main.py and modify the code there to scrape Hacker News.

With just a few adjustments, this is what the final code looks like:

from bs4 import BeautifulSoup
from httpx import AsyncClient

from apify import Actor

async def main() -> None:
    async with Actor:
        # Read the Actor input
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls')

        if not start_urls:
            Actor.log.info('No start URLs specified in actor input, exiting...')
            await Actor.exit()

        # Enqueue the starting URLs in the default request queue
        rq = await Actor.open_request_queue()
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            await rq.add_request({'url': url, 'userData': {'depth': 0}})

        # Process the requests in the queue one by one
        while request := await rq.fetch_next_request():
            url = request['url']
            Actor.log.info(f'Scraping {url} ...')

            try:
                # Fetch the URL using `httpx`
                async with AsyncClient() as client:
                    response = await client.get(url, follow_redirects=True)
                soup = BeautifulSoup(response.content, 'html.parser')
                articles = soup.find_all(class_='athing')

                for article in articles:
                    data = {
                        'URL': article.find(class_='titleline').find('a').get('href'),
                        'title': article.find(class_='titleline').getText(),
                        'rank': article.find(class_='rank').getText().replace('.', ''),
                    }
                    # Push the extracted data into the default dataset
                    await Actor.push_data(data)
            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')
            finally:
                # Mark the request as handled so it's not processed again
                await rq.mark_request_as_handled(request)

Finally, type the command apify run in your terminal, and you'll see the storage being populated with scraped data from HackerNews.

Before we move to the next step, go to .actor/input_schema.json and change the prefill URL to https://news.ycombinator.com/news. This will be important when we run the scraper on the Apify platform.
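
The exact schema depends on the template version, but after the change, the relevant part of .actor/input_schema.json might look roughly like this (field names other than prefill are illustrative of the template's defaults):

{
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://news.ycombinator.com/news" }]
        }
    }
}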

input schema

Deploying the Actor to Apify

Now that we know our Actor is working as expected, it is time to deploy it to the Apify platform. To follow along, you'll need to sign up for a free Apify account.

Once you have an Apify account, run the command apify login in the terminal. You will be prompted to provide your Apify API token, which you can find in Apify Console under Settings → Integrations.

Apify login

The final step is to run the apify push command. This will start an Actor build, and after a few seconds, you should be able to see your newly created Actor in Apify Console under Actors → My actors.

Building an Apify Actor

Perfect! Your scraper is ready to run on the Apify platform! Just hit the Start button, and after the run is finished, you can preview and download your data in multiple formats in the Storage tab.

Apify Actor run

Bonus: Creating tasks and scheduling Actor runs

One of the most useful features of running your scrapers on the Apify platform is the ability to save different configurations for the same Actor (Tasks) and schedule them to run at the times that are most convenient for us. So, let's do that with our Bs4 Actor.

On the Actor page, click on Create empty task

Creating Actor task

Next, click on Actions and then Schedule. A schedule configuration modal will pop up on the screen.

Scheduling an Actor

Finally, select how often you want the Actor to run and click Create.

Actor schedule task configuration

And that’s it! Your Actor will now automatically run at the specified time. You can find and manage your schedules in the Schedules tab.

To get started with scraping with Python on the Apify platform, you can use Python code templates. These will let you build scrapers quickly with templates for Requests, Beautiful Soup, Scrapy, Playwright, and Selenium.

Additional learning resources for web scraping with Python 📚

If you want to dive deeper into some of the libraries and frameworks we presented during this post, here is a curated list of articles on web scraping:

General web scraping

Python web scraping and data parsing

Frequently asked questions

Is Python good for web scraping?

Yes, Python is excellent for web scraping due to its powerful libraries like Requests, BeautifulSoup, Scrapy, and Playwright, which simplify the process of extracting data from websites.

What is the best Python web scraping library?

The "best" library depends on your needs: BeautifulSoup and LXML for simple parsing, Requests and HTTPX for HTTP requests, Selenium and Playwright for dynamic content, and Scrapy for large-scale web scraping projects.

Is web scraping legal?

Web scraping's legality depends on the data being scraped and how it's used. It can be legal if it extracts publicly available data and complies with the website’s terms and applicable regional laws, but it's crucial to consult legal advice for specific cases.

Can you get banned for scraping?

Yes, you can get banned for scraping if you violate a website's terms of service, send too many requests too quickly, or scrape protected or private data. This can lead to IP bans, account bans, or legal actions.

Percival Villalva
Developer Advocate on a mission to help developers build scalable, human-like bots for data extraction and web automation.
