You may already be familiar with this article, which does a fantastic job of explaining precisely what web scraping is and how to use Apify’s Web Scraper in a simple use case; however, if you aren’t already familiar with web scraping, here's a quick rundown.
What is web scraping?
Every website has data. Web scraping is a way of programmatically extracting this data and then storing it somewhere for our own use. Web scraping doesn’t always have to be about collecting the data from a website, though. We can also perform actions based on the data.
Why scrape?
We scrape a site when we need its data but have no other efficient or cost-effective way of obtaining it.
There are various tools to help you with your specific scraping tasks, regardless of scale. Apify offers a few beginner-friendly tools that get the job done. One of the goals of this article will be to help you decide for yourself which one is best for your purposes.
Three popular scraping tools on Apify Store
Getting started
Unlike many other options, no programming experience is required to start scraping with Apify. All you need is a free Apify account, which can be set up within a minute or two:
Sign up on the Apify sign-up page with your email address
Click on the verification link in your email inbox
Done! 👍
Choosing your scraper
You may remember this article being mentioned earlier, which covers just one of the main web scraping tools on the Apify platform (Web Scraper). This time, we’ll be covering Cheerio Scraper, which is similar to Web Scraper, although there are some significant differences:
Ability
Web Scraper: Loads dynamic content, since every page is rendered in a full browser.
Cheerio Scraper: Unable to load dynamic content on a page, and only gets the initial load.
Performance
Web Scraper: Less performant, but can still be optimized.
Cheerio Scraper: Very fast and effective (faster than Web Scraper). Thousands of requests can be sent within minutes.
Features
Web Scraper: Can be used to automate a website and perform actions such as typing or clicking. Can also execute JavaScript code on a page.
Cheerio Scraper: Cannot perform actions or execute code on a page.
This lays it out nicely, but the most significant differences to consider for our sample use case today are the “Ability” and “Performance” rows. In general, if the website you are scraping has content that must be loaded dynamically, use Web Scraper. Otherwise, use Cheerio Scraper.
Our scraping task
For our use case, we will be scraping the old Reddit website, which does not have any dynamic content. Because of this, we’ll be using Cheerio Scraper: a quick, lightweight way to extract data from static websites like this one.
Our goal is to visit the first page of r/cats and scrape the title of every single post. At the end of the scraping job, all of the titles will be pushed into a dataset.
Web scraping the r/cats subreddit with Cheerio Scraper
With a little bit of knowledge of basic HTML and the use of copy/paste, we’ll have our results within minutes 😎
Step-by-step guide to scraping with Cheerio Scraper
1. Make sure you’re logged into your Apify account, open up the Cheerio Scraper page, and click “Try for Free”.
Cheerio Scraper on the Apify platform
This will bring us straight to the page where we can start building our scraper.
2. On the “Input and Options” page, paste the link (https://old.reddit.com/r/cats) into “Start URLs”.
Cheerio Scraper input and options
You’re going to notice a bunch of different configuration options on this page, but don’t let that scare you away! We will only be using Start URLs and Page Function for our first scraper, so feel free to delete the other options.
Start URLs is a list of URLs that we input, which the scraper will request.
Page Function is where all of our scraping logic goes. We tell the scraper what data to grab on each page.
Currently, we are only using one URL, but later on, we’ll be able to scrape posts from thousands of subreddits with the very same scraper we’re currently creating! Cheerio web scraping FTW!
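For reference, here’s roughly what that input looks like in the JSON view of the “Input and Options” page. The first URL is the one we’re actually using; the second is purely illustrative of how you’d add more subreddits later:

{
    "startUrls": [
        { "url": "https://old.reddit.com/r/cats" },
        { "url": "https://old.reddit.com/r/dogs" }
    ]
}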
3. Locate our data.
Chrome DevTools is surprisingly (or unsurprisingly) one of the most utilized tools in the scraping world. It’s super simple to use and makes it easy to find what specific elements on the page look like within the DOM (just a fancy name for the HTML tree of the page). Also, if you already have Chrome, you’ve already got Chrome DevTools, so no install necessary!
Navigate to our r/cats subreddit page and right-click. You should see this:
Inspect r/cats subreddit with Chrome DevTools
Go ahead and click “Inspect” at the very bottom. A new window will open. There are a whole lot of tabs and buttons in there, but today we’re going to focus on just one: the element selector, the little cursor icon in the top-left corner of the DevTools panel.
Let’s click it, then hover over the title of any post we like.
Hover over titles on r/cats to find the CSS selector
If we did it correctly, Chrome DevTools will tell us exactly which “CSS Selector” we need to plug into our scraper to make it work (a.title). If we were scraping more data, we’d have to do this same process for different elements; however, since we’re just scraping the titles for now, we’re ready to go!
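If you’d like to double-check a selector before plugging it into the scraper, you can paste a quick one-liner into the DevTools “Console” tab. Something like this should print roughly one match per post on the page:

// Run in the DevTools Console while on the r/cats page.
// Counts all elements matching our selector.
document.querySelectorAll('a.title').length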
4. Tell the scraper what to extract.
Moving into our Page Function, we are going to write the logic required to grab each post title from the page. We’re going to use a bit of JavaScript now, but don’t be alarmed: the code for this scraper is very minimal and will all be provided here for you.
First, we will create an empty list. This is where each post title will be stored.
const postTitles = [];
postTitles now represents all of the titles that the scraper will spit out at the end.
Then, using the a.title selector we found earlier and the $ that we are provided within the Page Function, we can put together a simple script that will grab the text of each title and push it into our postTitles list.
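A minimal sketch of that script could look like this ($ here is the Cheerio function that Cheerio Scraper injects into the Page Function):

// Find every post title link on the page and store its text.
$('a.title').each((index, element) => {
    postTitles.push($(element).text());
});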
Finally, we want to “return” our postTitles so that they’ll be added to the dataset. We can do this by simply changing the already existing “return” to look like this:
return {
    postTitles
};
Our final Page Function should now look something like this:
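Here’s a minimal sketch of the whole thing (the exact template in the editor may differ slightly, but Cheerio Scraper passes $ in via its context argument):

// Cheerio Scraper calls this function once for every page it loads.
async function pageFunction(context) {
    const { $ } = context;

    // Collect the text of every post title on the page.
    const postTitles = [];
    $('a.title').each((index, element) => {
        postTitles.push($(element).text());
    });

    // Whatever we return here gets pushed to the dataset.
    return {
        postTitles
    };
}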
5. Run our scraper.
Click the big orange “Get Data” button at the bottom of the screen, and wait a few seconds. Soon, you’ll see that our scraper has succeeded and that the data is now available.
Cheerio Scraper log
Let’s view our results by clicking “1 result” and then “Preview.”
Dataset preview
This preview is in JSON format but can be downloaded and viewed in other formats as well (JSON, XML, CSV, and Excel are the formats Apify supports).
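If you’d rather fetch the results programmatically, each run’s dataset is also available through Apify’s API. Assuming you’ve copied the dataset ID from the run’s storage details, a URL along these lines returns the items in whichever supported format you ask for:

https://api.apify.com/v2/datasets/<DATASET_ID>/items?format=csv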
Some things to note
This very same scraping job could have been done using Web Scraper, but it would have been slightly less performant and used somewhat more computing power (only slightly, as our scraper for this example was very small and simple).
You can find more in-depth information about the Cheerio Scraper actor and all of its options on the Cheerio Scraper page itself, as well as on the Cheerio Scraper tutorial page.
For more information about why you should choose Apify for your scraping endeavors, refer to the end of the Web Scraper article that we mentioned earlier.
If you feel that you are ready for more advanced scraping tasks (building your own actors) or that this tutorial was too basic for you, refer to the documentation for the Apify SDK to get started with advanced web scraping.
Finally, a challenge for you 💪
Once you feel comfortable with the material we’ve covered in this article and (mostly) everything on the Cheerio Scraper tutorial page, feel free to complete this challenge. It is an expansion of what we’ve already built together.
Scrape the first page of multiple subreddits (hint: Start URLs).
Scrape not only the title of each post but also the author (hint: Chrome DevTools).