Web scraping is an increasingly popular way to get structured data from websites. For example, you can use it to find contact details on web pages or monitor prices on an online store. The number of web scraping tools has grown over the years and many of them have made web scraping accessible to beginners or those with limited programming skills. If that sounds like you, in this blog post you’ll learn how you can use Apify to scrape data from any website and get results in a few minutes.
There are dozens of libraries, tools and services that let you scrape data from websites, and many of them lead to the same result. But Apify is different in three main ways:
- Apify is not just a tool for web scraping. It’s a full-featured cloud platform that lets you run any web scraping or automation job (we call the bots that run these tasks actors), gives you unlimited storage for your crawling data, provides access to both residential and datacenter proxies to hide the origin of your bots, enables scheduling of your jobs at regular intervals and much more.
- On Apify Store, you can find many pre-built scrapers for popular websites such as Google Search, Amazon or Instagram. If you find the right tool for your job, you can start downloading your data with just a few clicks. No coding needed and for small workloads it’s free of charge.
- Unlike other tools, Apify provides excellent integrations for other services. You can download extracted data in various formats like CSV, JSON or Excel. You can connect scrapers running on Apify with platforms such as Zapier to integrate them with your workflows. You can control everything on Apify using an API.
For example, with Apify, you can easily set up a task that regularly checks your competitors on Amazon.com and sends you an email when they change their prices, or when your restaurant receives a new review on Google Places.
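To give a rough idea of what “control everything using an API” can look like for the export formats mentioned above, here is a minimal Python sketch that only builds a download URL for a dataset. The base URL, the endpoint path and the `format` query parameter reflect our understanding of the Apify API and should be treated as assumptions to check against the API reference. The dataset ID is a placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL of the Apify API; verify it against the API reference.
API_BASE = "https://api.apify.com/v2"

# Formats named in this post (Excel corresponds to "xlsx" in this sketch).
ALLOWED_FORMATS = {"json", "csv", "xlsx"}


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build a URL for downloading a dataset's items in the given format."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


if __name__ == "__main__":
    # "my-dataset-id" is a placeholder, not a real dataset.
    print(dataset_items_url("my-dataset-id", "csv"))
```

The sketch deliberately stops at building the URL. Downloading the data would also require authentication, which is left out here.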
Why use web scraping?
There are two main reasons to extract data from websites.
- The website doesn’t have an application programming interface (API) or doesn’t provide any way to download the data in a structured form. Although many popular websites and services nowadays have an API, the vast majority of websites do not.
- The website provides an API or a way to download its data, but it is limited in some way, doesn’t provide complete data or is just too complicated to use. In that case, web scraping can be a faster and easier way to get the data.
But it’s not only about the extraction of data: you might also want to perform some actions based on the data. For example, you might want to automatically receive an email notification if something changes on the website, or you might want to automatically upload the data to your company CRM. We call this web automation and Apify is the perfect tool for that.
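The “act on the data” idea can be sketched in a few lines of Python. The example below compares two snapshots of scraped prices and prepares the text of a notification. The product names, prices and function names are made up for illustration, and actually sending the email is left out.

```python
from typing import Dict, List, Tuple

PriceChange = Tuple[str, float, float]  # (product, old price, new price)


def detect_price_changes(
    previous: Dict[str, float], current: Dict[str, float]
) -> List[PriceChange]:
    """Return products whose price differs between two scraping runs."""
    changes = []
    for product, new_price in current.items():
        old_price = previous.get(product)
        # Products seen for the first time are not reported as changes.
        if old_price is not None and old_price != new_price:
            changes.append((product, old_price, new_price))
    return sorted(changes)


def format_notification(changes: List[PriceChange]) -> str:
    """Turn the detected changes into a plain-text message body."""
    return "\n".join(f"{name}: {old:.2f} -> {new:.2f}" for name, old, new in changes)


if __name__ == "__main__":
    # Hypothetical data from two consecutive runs of a scraper.
    yesterday = {"kettle": 29.99, "toaster": 45.00}
    today = {"kettle": 24.99, "toaster": 45.00, "blender": 60.00}
    print(format_notification(detect_price_changes(yesterday, today)))
```

In a real workflow, the two snapshots would come from the results of two scraper runs, and the message would be handed to whatever notification channel you use.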
Start web scraping with Apify
You definitely don’t need to be a software developer to start scraping with Apify. But some basic knowledge of how the web works is always helpful. For example, it’s good to understand what a URL is and have some basic HTML skills, or be familiar with the difference between formats like JSON or CSV. But don’t worry, in this tutorial we’ll guide you through the essentials.
The only thing you need to get started is to have an Apify account and a verified email address. You can create an account here. It only takes a minute and it’s free of charge (no credit card required). Then go to your email inbox and click on the verification link.
Apify provides a free plan that you can use for small web scraping and automation jobs, so you can get started without any commitment. For larger workloads or for access to additional services, you might need to upgrade. See our pricing page for details.
Your first web scraping task
In this tutorial, we’ll use an actor called Web Scraper (apify/web-scraper), which is a basic tool for scraping websites using the Chrome web browser. Remember, an actor is just a small program running on the Apify cloud platform. You can think of it as an app on your phone, except that it runs on Apify. By the way, you can find many pre-built actors in Apify Store, develop your own, or order a new one on Apify Marketplace.
So let’s start scraping. Follow these steps and you will have your results in a few moments:
Step 1. Go to the Web Scraper actor page and click “Use actor”.
Step 2. The Apify app automatically creates a new task for the actor, and opens a page where you can edit its configuration.
Step 3. There are various settings, and some of them are already prefilled. Before we change anything, let’s see how the actor works with these defaults. Simply hit the Run button at the bottom of the page and the actor will start scraping the apify.com website, extract the first 10 URLs and show results within about a minute.
Step 4. But we’re here to scrape news articles from CNN! So let’s go back to the INPUT tab and change some settings. The most important setting is Start URLs, which points to the web page where you want to start scraping. Since we want to scrape all articles from CNN.com about Stranger Things, we don’t need to start at the https://cnn.com homepage, but can narrow our search. Go to CNN.com and type “Stranger Things” in the search bar. You will be redirected to a new page with search results. Copy the URL of that page from your browser’s address bar.
That’s our start URL!
Step 5. The second most important input setting is Pseudo-URLs, which tells the scraper which web pages it should visit. You can enter special patterns there to make it visit only specific parts of the website. Web Scraper will find a lot of pages and links in the search results, but you’re not interested in links to other categories in the header or footer. You only want articles, so you need to specify that. When you click on any of the search results, you will see that the URL of each article ends with /index.html. This is the filter you need: a pattern that matches only URLs with this ending will narrow your scraping to articles. Enter the following pseudo URL:
Step 6. There will be other links ending with /index.html that aren’t part of our search results. In order to only look for links strictly within our search results, we can use the Link selector. All search result links belong to the cnn-search__result-thumbnail class. We can therefore limit our scrape to only these links. To do this, enter the following link selector:
Step 7. Hit the Save & Run button (or click Save and then hit Run if you prefer) to test that your input settings are valid. Wait a few minutes until the status box in the top-left corner changes from “Running” to “Succeeded”.
You can download your list of URLs in the Dataset tab, which you can find in the top-right corner. Scroll down and you will see options to download your data in various formats — Excel, HTML table or JSON.
And that’s it, you have your list of articles!
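If you are curious how the pseudo-URL filter from Step 5 might work under the hood, here is a small Python sketch. It assumes that a pseudo-URL is an ordinary URL in which the parts enclosed in square brackets are regular expressions. That is our reading of the syntax, so check the Web Scraper documentation before relying on it. The pattern and the links below are made up for illustration and are not the values used in this tutorial.

```python
import re
from typing import List

# Hypothetical pattern for illustration only; the bracketed part is a regex.
PSEUDO_URL = "https://www.cnn.com/[.+]/index.html"


def pseudo_url_to_regex(pseudo_url: str) -> "re.Pattern[str]":
    """Convert a pseudo-URL into a compiled regular expression.

    Text outside [brackets] is matched literally; text inside is used as a
    regex. Nested brackets are not supported in this sketch.
    """
    parts = []
    position = 0
    for match in re.finditer(r"\[([^\]]*)\]", pseudo_url):
        parts.append(re.escape(pseudo_url[position:match.start()]))
        parts.append(f"(?:{match.group(1)})")
        position = match.end()
    parts.append(re.escape(pseudo_url[position:]))
    return re.compile("".join(parts))


def filter_links(links: List[str], pseudo_url: str) -> List[str]:
    """Keep only the links that match the whole pseudo-URL pattern."""
    pattern = pseudo_url_to_regex(pseudo_url)
    return [link for link in links if pattern.fullmatch(link)]


if __name__ == "__main__":
    # Made-up links such as a scraper might find on a search results page.
    found = [
        "https://www.cnn.com/2019/07/04/entertainment/example-article/index.html",
        "https://www.cnn.com/entertainment",
        "https://www.example.com/other/index.html",
    ]
    for link in filter_links(found, PSEUDO_URL):
        print(link)
```

Only the first link survives the filter: the second does not end with /index.html and the third is on a different domain.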
Of course, there are many more ways to perform your scraping job. You can configure the scraper to collect articles from more than the first page of results and there are lots of other settings for you to play with as you get more confident. We’ve only scratched the surface. If you want to learn more, make sure you read our full tutorial for the Web Scraper actor.
What else is there on Apify?
Here’s a short overview of all the sections of the Apify app, so that you can get a better idea of how you can use it. Feel free to look around and try everything.
- Dashboard — the welcome page showing a list of your scraping jobs, links to tutorials, etc.
- Tasks — we’ve already used the Tasks tab above. Tasks stores settings for your actors, so that you can easily archive and reuse the settings, run them regularly, etc.
- Actors — the list of actors from Apify Store that you have recently used or the ones that you have built. You can start developing your own new actors from here.
- Schedules — we showed you how to scrape CNN for new articles. But as you noticed, you had to launch it manually. With schedules, you can make a task that you just created run regularly. You want to scrape CNN every 10 minutes? Every hour? Every day? No problem, just create a new schedule and set up the task.
- Proxy — sometimes the websites you want to scrape block access for bots if you want to download too much data, or they show content for a specific country. With Apify Proxy, you can bypass these protections by automatically rotating the IP address of your scraper, or target web content for a specific country.
- Storage — access all the data downloaded by your tasks and actors in one place, and export it to various formats.
- Account — manage your Apify account and subscription.
- Orders — we showed you how to build your own scraper, but we realize that scraping a complicated website can get tricky and you might not have the time. In this case, you can order a custom web scraping solution on Apify Marketplace from Apify-approved freelance developers around the world. Once you order a project, you will see it here in the Orders section
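To make the Schedules idea more concrete, here is a rough Python sketch that assumes a schedule is written as a standard five-field cron expression. This post does not say how schedules are configured, so treat that as an assumption. The matcher supports only `*`, `*/n` and plain numbers, which is a small subset of real cron syntax.

```python
from datetime import datetime

# Cron expressions for the intervals mentioned above (standard cron syntax).
SCHEDULES = {
    "every 10 minutes": "*/10 * * * *",
    "every hour": "0 * * * *",
    "every day": "0 0 * * *",
}


def field_matches(field: str, value: int) -> bool:
    """Check one cron field; supports '*', '*/n' and a single number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if a task with this schedule should run at `moment`."""
    minute, hour, day, month, weekday = expression.split()
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        # Cron counts Sunday as 0; isoweekday() counts Sunday as 7.
        and field_matches(weekday, moment.isoweekday() % 7)
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 20)
    for label, expression in SCHEDULES.items():
        print(label, cron_matches(expression, now))
```

At 12:20, only the “every 10 minutes” schedule matches. The hourly and daily ones would fire at 12:00 and 00:00 respectively.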
Advanced web scraping for developers
Although the Web Scraper (apify/web-scraper) actor is great for many web scraping use cases, there might be times when you’ll need a little more control or horsepower:
- When you need to scrape a lot of pages without dynamic content and need to do it fast, the Web Scraper actor might become too slow for you, because it uses the Chrome browser. For these kinds of websites we built Cheerio Scraper (apify/cheerio-scraper), which downloads and parses raw HTML using the Cheerio library. In other words, it’s really fast!
- Some pages require more custom handling. For example, you might want to perform more complicated workflows, click buttons, etc. before extracting the data. For situations like that, we built Puppeteer Scraper (apify/puppeteer-scraper). Unlike Web Scraper, which does a lot of hand-holding, Puppeteer Scraper gives you all the power of Node.js and Puppeteer — the Node.js API for headless Chrome — ready for scraping at any scale.
- There are times when none of the existing actors is quite right for your task. On Apify, you can build your own actor from scratch and perform any custom scraping or web automation workflow. This will come in handy if you need to integrate extracted data into your system and be in absolute control of how and when it is stored. Have a look at Apify SDK, our open-source Node.js library for web scraping and automation.
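To illustrate why parsing raw HTML is so much lighter than driving a browser, here is a minimal Python sketch that uses only the standard library. It is not Cheerio and not how any Apify actor is implemented. It simply collects links that carry a given CSS class from an HTML string, similar in spirit to the Link selector setting from the tutorial above. The class name and the markup are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ClassLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class: str) -> None:
        super().__init__()
        self.wanted_class = wanted_class
        self.links: List[str] = []

    def handle_starttag(
        self, tag: str, attrs: List[Tuple[str, Optional[str]]]
    ) -> None:
        if tag != "a":
            return
        attributes = dict(attrs)
        # An element can have several space-separated classes.
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and self.wanted_class in classes:
            self.links.append(href)


def extract_links(html: str, wanted_class: str) -> List[str]:
    """Parse raw HTML and return the matching links in document order."""
    parser = ClassLinkParser(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup; "result-link" is a made-up class name.
SAMPLE_HTML = """
<nav><a href="/entertainment">Entertainment</a></nav>
<div class="results">
  <a class="result-link featured" href="/2019/07/04/article-one/index.html">One</a>
  <a class="result-link" href="/2019/07/05/article-two/index.html">Two</a>
</div>
<footer><a href="/terms/index.html">Terms</a></footer>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "result-link"):
        print(link)
```

No browser is started and no JavaScript is executed, which is what makes this approach fast. The trade-off is that content rendered dynamically in the browser never shows up in the raw HTML.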
We hope you enjoyed this tutorial. If you have any questions, feel free to post a comment or email us at firstname.lastname@example.org.