How to scrape a website (ultimate guide for 2024)

Learn how to extract data from websites using Python and no-code tools. This guide covers the basics of web scraping, connecting to a web page, parsing HTML, and storing the scraped data.

A step-by-step guide to learning how to scrape a website. Learn how to connect to a web page, parse its HTML content, and extract data from it. We will cover how to scrape a website using a Python script and a no-code tool. By the end, you'll have a thorough understanding of the skills and steps required to do web scraping!

What is web scraping?

Web scraping is the process of extracting data from websites. You can do it manually, but it's typically performed using automated software or custom scripts. These tools connect to a site, access specific pages, parse their HTML content, select HTML elements, and extract data from them. The information you can scrape from a site includes text, images, links, and more. For a deeper understanding, refer to our in-depth piece on the definition of web scraping.

Prerequisites for web scraping

To perform web scraping, you usually need to meet the following prerequisites:

  • Knowledge of how the Web works, specifically HTTP and JavaScript rendering
  • Knowledge of the DOM, HTML, and CSS selectors
  • Understanding the difference between a static site and a dynamic site

Web scraping is generally performed using a script. Python is one of the best languages for scripting and it comes with several libraries for web scraping. So, it's going to be the programming language of reference for this article!

To perform web scraping in Python, you need Python 3 installed on your machine and basic familiarity with the language.

Since you'll see how to deploy your script to Apify, you'll also need a free Apify account and the Apify CLI installed.

How to scrape a website

Learn how to scrape a website using Python in a complete guided example. The target site of the sample scraping script will be the "Headphones" page of the Warehouse e-commerce site:

 "Headphones" page of the Warehouse e-commerce site

Note that you can easily adapt what you'll learn here to any other site.

This section will guide you through the process of scraping a website via the following steps:

  1. Inspect the target website
  2. Choose the scraping libraries
  3. Set up the Python project
  4. Connect to the site
  5. Parse the HTML content
  6. Analyze the DOM elements
  7. Select all items on the page
  8. Prepare the scraping logic
  9. Scrape the product information
  10. Store the scraped data
  11. Put it all together
  12. Deploy to Apify

Let's dive in!

Step 1: Inspect the target website

Before digging into coding, you should spend some time getting familiar with your target site.

Open your browser and connect to the website to be scraped. Visit the "Headphones" page, right-click on the background, and select the "Inspect" option. The DevTools will open:

"Inspect" option of DevTools

Reach the "Network" tab and reload the page. Now, take a look at the Fetch/XHR requests that the page makes:

Fetch/XHR requests made on page

You'll see that the page makes two requests dynamically in the browser. Expand them and notice how they return data related to your recently visited items and cart.

In other words, the product items aren't among the data fetched dynamically in the browser. This means the HTML page returned by the server already contains all the items. So, the target webpage isn't dynamic.

As a result, you can scrape the website with the following two tools:

  1. An HTTP client: To connect to the page and retrieve its HTML document
  2. An HTML parser: To parse the HTML document and extract data from it

You could still use a browser automation tool like Playwright, but doing so would only add unnecessary resource overhead.
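
If you want to verify this conclusion yourself, a quick experiment is to fetch the raw HTML without a browser and check that the product nodes are already there. Here's a minimal standalone sketch (the .product-item selector is the one you'll identify in Step 6):

import asyncio

from bs4 import BeautifulSoup
from httpx import AsyncClient


async def check_static() -> None:
    # fetch the HTML exactly as the server returns it (no JavaScript execution)
    async with AsyncClient() as client:
        response = await client.get(
            "https://warehouse-theme-metal.myshopify.com/collections/headphones"
        )

    soup = BeautifulSoup(response.text, "html.parser")
    # a non-zero count confirms the items are server-rendered,
    # so no browser automation is needed
    print(len(soup.select(".product-item")))


asyncio.run(check_static())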

Step 2: Choose the scraping libraries

You now know that you need an HTTP client and an HTML parser to extract data from the target site. The recommended scraping libraries are:

  • HTTPX: A modern and fast Python HTTP client that supports asynchronous requests as well as HTTP/1.1 and HTTP/2.
  • Beautiful Soup: A powerful library for parsing HTML and XML documents in Python. It provides an intuitive API to navigate the parse tree and extract data from it.

Note: Both packages appear in the list of the best Python web scraping libraries.
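
If you want to try the two libraries outside the Apify template created in the next step, you can install them in your Python environment with pip:

pip install httpx beautifulsoup4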

Step 3: Set up the Python project

Web scraping is not a one-time task. Online pages keep getting updated, so you need to periodically run your data extraction scripts. What if you wanted to automate this process or retrieve the scraped data on the fly via an API? This is where Apify comes into play!

Turning your scraping task into an Apify Actor gives you access to these benefits:

  • Easy deploy: Deploy your code with just a single CLI command.
  • Automate scheduling: Set up your Actor to run at regular intervals, so you always have up-to-date data without manual intervention.
  • Access via API: Get the data scraped by your script via API.
  • Data storage: Automatically export the scraped data in several formats, including JSON, CSV, XML, Excel, HTML Table, RSS, and JSONL.
  • Scalability: Run your Actor on Apify's infrastructure, which can handle large-scale scraping tasks without requiring you to manage the underlying servers.
  • Proxy integration: Avoid IP bans and limitations with simplified integration with rotating proxies.
  • Monitoring and logging: Take advantage of Apify's monitoring tools to track the performance of your scraping task.

To get started, install the Apify CLI and create a new Actor with:

apify create

Give your Actor a name, select Python as the programming language, and install the "BeautifulSoup + HTTPX" template.

The Apify CLI will initialize a Python project containing the Actor definition. Open this folder in your favorite Python IDE.

Your project directory will contain the following file structure:

Directory file structure

Explore the files in the src folder. Focus your attention on the async with Actor section in main.py. Right now, it contains sample Python scraping logic. Clear the section and get ready to replace it with custom scraping logic:

from bs4 import BeautifulSoup
from httpx import AsyncClient
from apify import Actor


async def main() -> None:
    async with Actor:
        pass  # your Actor scraping logic...

Don't forget that you can run your Actor locally for testing purposes with:

apify run

Step 4: Connect to the site

Use HTTPX to retrieve the HTML document associated with the target URL:

# target page to scrape data from
url = "https://warehouse-theme-metal.myshopify.com/collections/headphones"

try:
    # make a GET request to the target URL
    async with AsyncClient() as client:
        response = await client.get(url)
except Exception:
    Actor.log.exception(f"Cannot extract data from {url}.")

Under the hood, the get() method performs an async HTTP GET request to the specified URL. The server will then respond with the HTML document of the requested web page. In particular, you can find the raw HTML code of the page in the response.text attribute.

In case of a network error, the raised exception is intercepted and a special message is logged.
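
Keep in mind that a request can succeed at the network level but still return an error status code like 404 or 500. If you want to treat those responses as failures too, one option is to call the raise_for_status() method that HTTPX exposes on the response object:

try:
    # make a GET request to the target URL
    async with AsyncClient() as client:
        response = await client.get(url)
        # raise an exception on 4xx/5xx responses so they are
        # handled by the same except block as network errors
        response.raise_for_status()
except Exception:
    Actor.log.exception(f"Cannot extract data from {url}.")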

Step 5: Parse the HTML content

Parse the HTML document returned by the server with the BeautifulSoup() function:

soup = BeautifulSoup(response.text, "html.parser")

The first argument is the raw HTML string, while the second specifies the parser to use.

The soup variable now holds a BeautifulSoup object, which exposes methods to navigate the parse tree, select nodes, and scrape data.
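
As a quick optional sanity check, you can log the page's <title> to confirm that the document was parsed as expected:

# verify the parsing step by logging the page title
Actor.log.info(f"Page title: {soup.title.text}")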

Step 6: Analyze the DOM elements

Before jumping into the data retrieval logic, you need to inspect the DOM of the target page. Specifically, you have to understand how to select the HTML elements containing the data of interest.

So, open the target page in the browser, right-click on an item node, and select the "Inspect" option:

Inspect option of HTML elements

Here, you can notice that each item on the page is a .product-item HTML node that contains:

  • The brand name in the .product-item__vendor node
  • A title in the .product-item__title node
  • A price in the .price node
  • A rating score and review info in the .rating element
  • Stock information in the .product-item__inventory element

If you aren't familiar with this process, read our guide on inspecting elements with the DevTools.

Step 7: Select all items on the page

Use the Beautiful Soup select() method to apply the .product-item CSS selector to the page:

item_elements = soup.select('.product-item')

The item_elements list will now contain all the .product-item HTML elements on the page.
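
A common failure mode here is a selector typo or a page redesign: in both cases, select() silently returns an empty list. An optional guard like the one below makes that situation easy to spot in the logs:

# warn if the selector matched nothing, which usually means
# the selector is wrong or the page structure has changed
if not item_elements:
    Actor.log.warning("No product elements found on the page")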

Step 8: Prepare the scraping logic

The target page contains several items, and you want to scrape them all. Initialize a new list to store the scraped data:

items = []

Next, iterate over item_elements:

for item_element in item_elements:
    # scraping logic...

Time to retrieve data from each HTML item node and populate the items list accordingly.

Step 9: Scrape the product information

Use the methods provided by Beautiful Soup to select the nodes of interest inside each item element. Let's start by scraping an item's image!

First, select the image node:

image_element = item_element.select_one("img")

select_one() returns the first node that matches the specified CSS selector.

You can now find the image URL in the src attribute:

image_element["src"]

Beautiful Soup enables you to access the HTML attribute values of a node just as in a regular Python dictionary.
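
One caveat of the dictionary-style syntax is that it raises a KeyError when the attribute is missing. If you prefer a lookup that fails softly, Tag objects also provide a get() method that returns None (or a default value) instead:

# returns None instead of raising KeyError if "src" is missing
image_src = image_element.get("src")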

If you inspect the image node again, you will see that the URLs in src have this format:

"//warehouse-theme-metal.myshopify.com/cdn/shop/products/12405_976__2_2139873a-527f-4c6c-9e7c-be6aa0eacbe4.jpg?v=1559126290&width=1378"

Append the "https:" string to get the full URL:

image_url = "https:" + image_element["src"]
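
Plain string concatenation works here because the src values are protocol-relative. A slightly more general alternative is urljoin() from the Python standard library, which resolves relative and protocol-relative URLs against the page URL:

from urllib.parse import urljoin

# resolves "//host/path", "/path", and "path" alike against the page URL
image_url = urljoin(url, image_element["src"])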

For text elements, you can simply extract the data of interest as below:

brand_element = item_element.select_one(".product-item__vendor")
brand_name = brand_element.text

The text attribute returns the text information contained in the selected HTML node.

In some cases, you also have to perform some data cleaning operations:

price_element = item_element.select_one(".price")
price = price_element.text.replace("Sale price", "").strip()

This removes the "Sale price" string from the price text and trims all extra spaces or newlines. Discover more in our article on data cleaning in Python.

That's required because the price node contains a hidden text element:

Hidden text element in price node

The info of interest may also be spread across multiple nodes, as with the .rating element, where the star rating and the review caption live in separate child nodes.

To address this scenario, you need to access the nested nodes with some extra scraping logic:

rating_element = item_element.select_one(".rating")
rating_stars = rating_element.select_one(".rating__stars")["aria-label"]
rating_reviews = rating_element.select_one(".rating__caption").text

Extend what you learned here to extract the desired data from each item:

# scrape the image URL
image_element = item_element.select_one("img")
image_url = "https:" + image_element["src"]

# scrape the brand name
brand_element = item_element.select_one(".product-item__vendor")
brand_name = brand_element.text

# scrape the title
title_element = item_element.select_one(".product-item__title")
title = title_element.text

# scrape the price
price_element = item_element.select_one(".price")
price = price_element.text.replace("Sale price", "").strip()

# scrape the rating info
rating_element = item_element.select_one(".rating")
rating_stars = rating_element.select_one(".rating__stars")["aria-label"]
rating_reviews = rating_element.select_one(".rating__caption").text

# scrape the stock information
stock_info_element = item_element.select_one(".product-item__inventory")
stock_info = stock_info_element.text

Create a new item object with the scraped data and add it to items:

# create a new item object
item_data = {
    "image_url": image_url,
    "brand_name": brand_name,
    "title": title,
    "price": price,
    "rating_stars": rating_stars,
    "rating_reviews": rating_reviews,
    "stock_info": stock_info
}
# append the new item to the list
items.append(item_data)

Step 10: Store the scraped data

The items list contains all scraped items from the target page. All that remains is to pass it to the Actor dataset through the push_data() method:

await Actor.push_data(items)

You'll now be able to retrieve the scraped data via API. You'll also be able to export it in various formats directly on the Apify dashboard.
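
For example, after a run on the Apify platform, the dataset items endpoint of the Apify API returns the scraped records. The dataset ID and API token below are placeholders to replace with your own:

curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?token=<API_TOKEN>&format=json"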

Step 11: Put it all together

Here's the final code of the Python Actor to scrape a website:

from bs4 import BeautifulSoup
from httpx import AsyncClient
from apify import Actor

async def main() -> None:
    async with Actor:
        # target page to scrape data from
        url = "https://warehouse-theme-metal.myshopify.com/collections/headphones"

        try:
            # make a GET request to the target URL
            async with AsyncClient() as client:
                response = await client.get(url)

            # parse the HTML document
            soup = BeautifulSoup(response.text, "html.parser")

            # select all product elements on the page
            item_elements = soup.select(".product-item")

            # where to store the scraped data
            items = []

            # iterate over each item HTML element
            # and apply the scraping logic
            for item_element in item_elements:
                # scrape the image URL
                image_element = item_element.select_one("img")
                image_url = "https:" + image_element["src"]

                # scrape the brand name
                brand_element = item_element.select_one(".product-item__vendor")
                brand_name = brand_element.text

                # scrape the title
                title_element = item_element.select_one(".product-item__title")
                title = title_element.text

                # scrape the price
                price_element = item_element.select_one(".price")
                price = price_element.text.replace("Sale price", "").strip()

                # scrape the rating info
                rating_element = item_element.select_one(".rating")
                rating_stars = rating_element.select_one(".rating__stars")["aria-label"]
                rating_reviews = rating_element.select_one(".rating__caption").text

                # scrape the stock information
                stock_info_element = item_element.select_one(".product-item__inventory")
                stock_info = stock_info_element.text

                # create a new item object
                item_data = {
                    "image_url": image_url,
                    "brand_name": brand_name,
                    "title": title,
                    "price": price,
                    "rating_stars": rating_stars,
                    "rating_reviews": rating_reviews,
                    "stock_info": stock_info
                }
                # append the new item to the list
                items.append(item_data)

            # store the scraped data in the Actor dataset
            await Actor.push_data(items)
        except Exception:
            Actor.log.exception(f"Cannot extract data from {url}.")

Step 12: Deploy to Apify

Before deploying your Actor to Apify, update actor.json, the Actor configuration file. The template version of this file defines some input arguments and output specifications you don't need here, so override it with the following simpler definition:

{
  "actorSpecification": 1,
  "name": "website-scraper",
  "title": "Python Scraper",
  "description": "Scraping data using HTTPX and Beautiful Soup",
  "version": "1.0",
  "meta": {
    "templateId": "python-beautifulsoup"
  },
  "dockerfile": "./Dockerfile"
}

Deploy your script to Apify with just a single command:

apify push

Log into your Apify account, and access the My Actors page. You'll now see a new Actor:

New Actor in My Actors page

Click on it, and execute your Actor by clicking the "Start" button:

Execute your Actor by clicking the "Start" button

Note that you can edit the code of your Actor directly in the web IDE provided by Apify. Wait for the script execution to finish. In the "Storage" tab, you can now export the retrieved data in the format you prefer:

Export your data in your preferred format

Export it to CSV and open the file in Excel. That's what you'll see:

Exported CSV file in Excel

Et voilà! That is exactly the item information contained on the target page.

Next steps

This tutorial section has covered the fundamentals of how to scrape data from a website. Consider these advanced techniques to improve your script:

  • Crawling: Retrieve all items of a given category by scraping data from all pagination. For more guidance, check out our guide on how to do web crawling in Python.
  • URL parametrization: Customize your Actor so that it accepts the URL of the page to scrape as an input parameter (see the sketch after this list).
  • Proxy management: Integrate proxies into your Actor to avoid blocks and IP bans. Discover more in the official documentation.
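
As an example of URL parametrization, you can read the target URL from the Actor input instead of hardcoding it. Here's a minimal sketch using the Actor.get_input() method from the Apify SDK; the "url" input field name is just an example and would need to be declared in the Actor's input schema:

async def main() -> None:
    async with Actor:
        # read the Actor input; fall back to an empty dict if none is provided
        actor_input = await Actor.get_input() or {}
        # "url" is a hypothetical input field defined in the input schema
        url = actor_input.get(
            "url",
            "https://warehouse-theme-metal.myshopify.com/collections/headphones",
        )
        # ...the rest of the scraping logic stays the same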

Scraping websites without code

You just learned how to scrape a website with a Python script. While that's a great approach, it comes with some drawbacks, such as the need to keep the code up to date. Apify knows that struggle, which is why it offers Apify Store, a marketplace of ready-made scrapers. Currently, there are over 2,000 Actors available targeting various websites.

For example, assume you're interested in retrieving data from TripAdvisor. Typically, you would need to build a custom scraping script for that. With Apify, you can simply use an existing Actor, eliminating the need for custom development.

Visit Apify Store and search for the "tripadvisor" keyword:

Tripadvisor in Apify Store

Select the "Tripadvisor Scraper" option, and on the Actor public page, click the "Try for free" button:

Try Tripadvisor Scraper for free

The Actor will be added to the dashboard of your personal account. Configure the Actor as desired and click the "Save & Start" button to launch it:

Save and Start to launch the Actor

A few seconds later, you'll see the following data:

Tripadvisor Scraper run data

Wonderful! You just scraped data from Tripadvisor without a single line of code.

Thanks to Apify, you no longer need to be a developer to perform web scraping. However, if you're a developer, you can create useful Actors and monetize them to earn extra income.

Web scraping use cases

Web scraping is used in tons of industries as it can cover different use cases, including:

  • Price monitoring: Track competitor prices to adjust your own pricing strategy in real time or monitor prices to find great deals.
  • Market research: Gather data on consumer trends from social network sites for informed business decisions and sentiment analysis.
  • Lead generation: Extract contact information from social media and platforms like LinkedIn for sales and marketing campaigns.
  • Real estate listings: Aggregate property listings from various websites for comparison or analysis.
  • SEO analysis: Scrape keyword rankings, backlinks, and other SEO metrics to optimize your website's performance or track competitors.
  • Job listings: Automatically retrieve job postings from multiple websites to simplify the process of finding your new job.
  • News aggregation: Collect news articles from various sources for a curated newsfeed.
  • Academic research: Generate large datasets from online sources for data-driven studies and analyses.

Is web scraping legal?

Yes, as long as you target public data and comply with the site's Privacy Policy and Terms and Conditions, web scraping is 100% legal. To ensure ethical scraping, also make sure to respect the rules for automated crawlers defined in the site's robots.txt file.

In general, web scraping shouldn't be confused with hacking or DDoS attacks. Retrieving public information from a site using an automated script can't be considered data stealing or a malicious action. Refer to our detailed guide for a comprehensive understanding of web scraping legality.

Frequently asked questions

Let's answer some interesting questions on how to scrape a website.

How do I scrape a webpage?

To scrape a website, you need to:

  1. Retrieve the page content using an HTTP client or a browser automation tool
  2. Parse the HTML of the page
  3. Select the HTML nodes of interest and extract data from them
  4. Export the scraped data to a human-readable format, such as CSV or JSON

Is web scraping always free?

Yes, web scraping is free when it only involves launching a local script. However, you may incur additional costs for server maintenance, deployment, and proxy integration. These extra services are typically required for high-volume scraping or when targeting complex sites.

Can ChatGPT do web scraping?

No, ChatGPT can't directly perform web scraping but can support you during the process of building a scraping script. It can suggest tools, provide code, and help you understand how to scrape a website. Learn more in our guide on how to use ChatGPT for web scraping.

What is the difference between web crawling and web scraping?

Web crawling is the process of following links to discover new pages, while web scraping is about extracting data from web pages. The two work together: crawling relies on scraping to extract the URLs from the links on a page, and scraping uses crawling to discover all the pages to scrape. See the web crawling vs web scraping comparison.

What does it take to scrape a website?

To scrape a website, analyze whether it's static or dynamic, choose appropriate tools, and write a script to fetch and parse the HTML. This guide showed you how to connect to a site, extract data, and store it with Python and an Apify Actor, and also how to achieve the same result without a single line of code.

Antonello Zanini
Software engineer and technical writer with a focus on full-stack development of innovative web services.
