Web scraping is a way of automatically extracting information from a website's pages. If you've ever copied text from a web page and pasted it into a document, you were extracting data manually. Web scraping uses bots to do the same thing, but much faster and more efficiently. Web scrapers can extract huge amounts of information in seconds. Even better, the data is delivered in a structured format so that it can easily be used in spreadsheets, applications, and databases.
Sometimes a website provides an API or a way to download its data, but it has quotas or rate limiting, doesn’t provide complete data, or is just too complicated to use. In those cases, web scraping is often a faster and easier way to get the data.
But it’s not only about the extraction of data: you might also want to perform some actions based on the data. For example, you might want to automatically receive an email notification if something changes on the website, or you might want to automatically upload the data to your company CRM. We call this web automation and Apify is the perfect tool for that.
Is scraping a website legal?
Scraping merely automates a human task, so that you don’t need to gather information from websites manually. But collecting private or copyright-protected information can be risky, so it’s important to comply with the relevant regulations. To learn more about the legality of web scraping, check out our blog post on the subject.
Start web scraping with Apify
You definitely don’t need to be a software developer or be involved in web development to start scraping with Apify. But some basic knowledge of how the web works is always helpful. For example, it’s good to know what a URL is and have some basic HTML skills, or be familiar with the difference between formats like JSON or CSV. But don’t worry, in this tutorial we’ll guide you through the essentials.
The only thing you need to get started is to have an Apify account and a verified email address. You can create an account here. It only takes a minute and it’s free of charge (no credit card required). Then go to your email inbox and click on the verification link.
Apify provides a free plan that you can use for small web scraping and automation jobs, so you can get started without any commitment. For larger workloads or for access to additional services, you might need to upgrade. See our pricing page for details.
Your first web scraping task
Now that you have your Apify account ready, let’s start off with something nice and easy for your first use case: scrape the top article and URL from the CNN website's entertainment section. This is what the page looks like today, because it's Thanksgiving 🦃
In this tutorial, we’ll use Web Scraper, a basic tool for scraping websites using the Chrome web browser. Web Scraper is an Apify actor, a small program running on the Apify cloud platform. You can think of it as an app on your phone, except that it runs on Apify instead. By the way, you can find many pre-built actors in Apify Store.
So let’s start scraping. Follow these steps and you'll have your results in a few moments.
You might also like to check out our video guide to using Web Scraper before you go through the tutorial:
2. If you're already signed in to Apify, you'll be taken to Apify Console, with a new task created for the Web Scraper actor. This is where you can configure the scraper to get the data you want.
3. There are various settings, some of which are already prefilled. Before we change anything, let’s see how it works with the defaults. Simply hit the Start button at the bottom of the page and the actor will start scraping the apify.com website, extract the first 10 URLs, and show results within about a minute. If you don't see the button, click on the Input tab.
4. But we’re here to scrape from CNN, not the Apify homepage! So let’s change some settings. Click on the link to the actor to get back to the Input tab.
The most important setting is Start URLs. This will direct the actor to the web page you want to start scraping. Since we just want to scrape the top entertainment article from CNN, we don’t need to start at the https://cnn.com homepage, but can narrow our search to the entertainment section.
That’s our start URL! So enter it in the input for the actor. Just delete the existing URL for the Apify website. While you're at it, you should also delete these other marked fields. You can use the red X on the right or just delete the text.
So you should end up with input parameters like this:
5. The second most important input setting will tell the scraper what data to extract. If you run the scraper with just the above start URL, it won't extract any articles, but will just give you the title of the web page.
And you need to add them exactly where we've highlighted them in the Page function section. Don't forget the comma after pageTitle above mainArticle!
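For reference, a page function that returns the top article alongside the page title might look roughly like the sketch below. It follows the general shape of Web Scraper's default page function (`context.jQuery`, `context.request.url`, a returned object). The CSS selector `a.hypothetical-headline` and the field name `mainArticleUrl` are placeholders assumed for illustration, not the exact lines highlighted in this tutorial, so inspect the CNN page to find the selector that actually matches the top article.

```javascript
// Sketch of a Web Scraper page function. The selector below is a
// placeholder (an assumption): replace it with one that matches the page.
async function pageFunction(context) {
    // jQuery is only available when the "Inject jQuery" option is enabled.
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();

    // Hypothetical selector for the first headline link on the page.
    const topLink = $('a.hypothetical-headline').first();

    // The returned object becomes one record in the dataset.
    return {
        url: context.request.url,
        pageTitle, // note the comma before the added fields
        mainArticle: topLink.text().trim(),
        mainArticleUrl: topLink.attr('href'), // hypothetical field name
    };
}
```

Each returned property becomes a column in the results, so adding or renaming a field here changes what you see under the Dataset tab.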
6. Now just hit the green Start button at the bottom of the screen. Web Scraper will start running. Wait a minute or so until the status box in the top left corner changes from Running to Succeeded.
7. And that's it. Web Scraper has extracted some basic data for you. You can find your data under the Dataset tab.
Here's how the preview should look. Don't worry about the example data. That's just leftover input parameters from the default example.
You can also download the data in JSON, XML, CSV, Excel, and other formats.
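If you would rather fetch the results programmatically, the Apify API exposes dataset items at a URL that takes a `format` parameter. The helper below is a small sketch that only builds such a URL. The function name and the dataset ID are made up for illustration, and a private dataset will likely also need an API token, which is not shown here.

```javascript
// Build a download URL for a dataset's items in a given format.
// The dataset ID in the example is hypothetical: copy the real one from your run.
function datasetItemsUrl(datasetId, format = 'json') {
    const url = new URL(
        `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`,
    );
    url.searchParams.set('format', format);
    return url.toString();
}

console.log(datasetItemsUrl('HYPOTHETICAL_DATASET_ID', 'csv'));
// https://api.apify.com/v2/datasets/HYPOTHETICAL_DATASET_ID/items?format=csv
```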
Of course, there are many more ways to perform your scraping job. You can configure the scraper to collect more than one article title and URL, you can scrape from more than the first page of results, and there are lots of other settings for you to play with as you get more confident.
Remember that example.com leftover? You could use that to enqueue other pages or URLs to scrape after you've scraped CNN, such as Fox News...
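Conceptually, enqueueing means deciding which of the links found on a page should be visited next. The sketch below is a plain JavaScript illustration of that idea, not Web Scraper's actual implementation: it keeps links that start with an assumed URL prefix and skips any that were already seen. The function name and all URLs are hypothetical.

```javascript
// Pick the links worth visiting next: only those under a given prefix,
// and only once each. `seen` remembers URLs across calls.
function selectLinksToEnqueue(links, prefix, seen = new Set()) {
    const selected = [];
    for (const link of links) {
        if (!link.startsWith(prefix)) continue; // outside the section we care about
        if (seen.has(link)) continue; // already queued or scraped
        seen.add(link);
        selected.push(link);
    }
    return selected;
}

const found = [
    'https://example.com/news/a',
    'https://example.com/about',
    'https://example.com/news/a',
    'https://example.com/news/b',
];
console.log(selectLinksToEnqueue(found, 'https://example.com/news/'));
// [ 'https://example.com/news/a', 'https://example.com/news/b' ]
```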
There are dozens of libraries, tools, and services that let you scrape data from websites, and many of them lead to the same result. But Apify is different in three main ways:
Apify is not just a tool for scraping a website. It’s a full-featured cloud platform that lets you run any web scraping or automation job (we call the bots that run these tasks Actors), gives you unlimited storage for your crawling data, provides access to both residential and datacenter proxies to control the geographical origin of your bots, enables scheduling of your jobs at regular intervals, and much more.
On Apify Store, you can find many pre-built scrapers for popular websites such as Google Search, Amazon, or Instagram. If you find the right tool for your job, you can start downloading your data with just a few clicks. No coding is needed and small workloads won't even cost you anything.
Unlike other tools, Apify provides excellent integrations for other services. You can download extracted data in various structured formats like CSV, JSON, XML, or Excel. You can connect scrapers running on Apify with platforms such as Zapier to integrate them with your workflows. You can control everything on Apify using an API.
For example, with Apify, you can easily set up a task that regularly checks your competitors on Amazon.com and sends you an email if they change their prices, or one that notifies you when your restaurant receives a new review on Google Places.
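The core of such a price watch is a comparison between two scraping runs. Here is a minimal sketch of that step, assuming each record looks like `{ url, price }`. The record shape, function name, and sample data are all invented for illustration, and actually sending the email is left out.

```javascript
// Compare two scraping runs and report items whose price changed.
// Records are assumed to look like { url, price }; real datasets will differ.
function findPriceChanges(previousRun, currentRun) {
    const previousByUrl = new Map(previousRun.map((item) => [item.url, item.price]));
    const changes = [];
    for (const item of currentRun) {
        const oldPrice = previousByUrl.get(item.url);
        // Skip items that are new or unchanged.
        if (oldPrice === undefined || oldPrice === item.price) continue;
        changes.push({ url: item.url, oldPrice, newPrice: item.price });
    }
    return changes;
}

const yesterday = [
    { url: 'https://example.com/a', price: 19.99 },
    { url: 'https://example.com/b', price: 5.0 },
];
const today = [
    { url: 'https://example.com/a', price: 17.49 },
    { url: 'https://example.com/b', price: 5.0 },
    { url: 'https://example.com/c', price: 3.0 },
];
console.log(findPriceChanges(yesterday, today));
```

A notification would only go out when the returned list is non-empty, which keeps unchanged runs quiet.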
Some pages require more custom handling. For example, you might want to perform more complicated workflows, click buttons, etc. before extracting the data. For situations like that, use Puppeteer Scraper. Unlike Web Scraper, which does a lot of hand-holding, Puppeteer Scraper gives you all the power of Node.js and Puppeteer — the Node.js API for headless Chrome — ready for scraping at any scale.
There are times when none of the existing actors is quite right for your task. On Apify, you can build your own actor from scratch and perform any custom scraping or web automation workflow. This comes in handy if you need to integrate extracted data into your own system and be in full control of how and when it is stored. Have a look at Crawlee, our open-source library for web scraping and automation.