How to efficiently scrape any website with Cheerio Scraper

Matt Stephens

You may already be familiar with this article, which does a fantastic job of explaining precisely what web scraping is and how to use Apify’s Web Scraper in a simple use case; however, if you aren’t already familiar with web scraping, here's a quick rundown.

What is web scraping?

Every website has data. Web scraping is a way of programmatically extracting this data and then storing it somewhere for our own use. Web scraping doesn’t always have to be about collecting the data from a website, though. We can also perform actions based on the data.

Why scrape?

We scrape a site when we need its data but there is no other efficient or cost-effective way of obtaining it (for example, when the site offers no official API or data export).

Check out our Beginner's Guide to Web Scraping for a basic introduction to web scraping.

How does one build a web scraper?

There are various tools to help you with your specific scraping tasks, regardless of scale. Apify offers a few beginner-friendly tools that get the job done. One of the goals of this article will be to help you decide for yourself which one is best for your purposes.

Three popular scraping tools on Apify Store

Getting started

Unlike many other options, no programming experience is required to start scraping with Apify. All you need is a free Apify account, which can be set up within a minute or two.

Choosing your scraper

The article mentioned earlier covers just one of the main web scraping tools on the Apify platform (Web Scraper). This time, we’ll be covering Cheerio Scraper, which is similar to Web Scraper but differs in some significant ways:

| | Web Scraper | Cheerio Scraper |
| --- | --- | --- |
| Requests | Visits websites with a web browser, just like a normal human does. | Makes HTTP requests to the links it is provided, and doesn't act like a normal browser. |
| Ability | Can load dynamic websites that load content on scroll, on certain actions, etc. | Unable to load dynamic content on a page, and only gets the initial load. |
| Performance | Less performant, but still can be optimized. | Very fast and effective (faster than Web Scraper). Thousands of requests can be sent within minutes. |
| Features | Can be used to automate a website, and perform actions such as typing or clicking. Can also execute JavaScript code on a page. | Cannot perform actions or execute code on a page. |

This lays it out nicely, but the most significant differences to consider for our sample use case today are the “Ability” and “Performance” rows. In general, if the website you are scraping has content that must be loaded dynamically, use Web Scraper. Otherwise, use Cheerio Scraper.
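One rough way to apply this rule of thumb is to check whether text you can see in the browser is already present in the raw HTML the server returns (for example, via "View page source"). The sketch below is only a heuristic: the function name `pickScraper`, its parameters, and the sample HTML are all hypothetical, and it ignores complications such as HTML-escaped characters.

```javascript
// Hypothetical helper: suggests a scraper based on whether the text we want
// is already in the initial HTML response (no JavaScript rendering needed).
function pickScraper(rawHtml, visibleText) {
    // If the server-sent HTML already contains the text, an HTTP-only
    // scraper can see it. Otherwise the page probably builds it dynamically.
    return rawHtml.includes(visibleText) ? 'Cheerio Scraper' : 'Web Scraper';
}

// Made-up HTML snippets, for illustration only.
const staticPage = '<a class="title" href="/r/cats/1">My cat loves boxes</a>';
const dynamicPage = '<div id="root"></div><script src="app.js"></script>';

console.log(pickScraper(staticPage, 'My cat loves boxes'));  // Cheerio Scraper
console.log(pickScraper(dynamicPage, 'My cat loves boxes')); // Web Scraper
```

A check like this only tells us whether the data is in the initial load; it says nothing about whether we need to click or type on the page, which would still call for Web Scraper.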

Our scraping task

For our use case, we will be scraping the old Reddit website, which does not have any dynamic content. Because of this, we’ll be using Cheerio Scraper.

Our goal is to visit the first page of r/cats and scrape the title of every single post. At the end of the scraping job, all of the titles will be pushed into a dataset.

Web scraping the r/cats subreddit with Cheerio Scraper

With a little bit of knowledge of basic HTML and the use of copy/paste, we’ll have our results within minutes 😎

Step-by-step guide to scraping with Cheerio Scraper

1. Make sure you’re logged into your Apify account, open up the Cheerio Scraper page, and click “Try for Free”.

Cheerio Scraper on the Apify platform

This will bring us straight to the page where we can start building our scraper.

2. On the “Input and Options” page, paste the link (https://old.reddit.com/r/cats) into “Start URLs”.

Cheerio Scraper input and options

You’re going to notice a bunch of different configuration options on this page, but don’t let that scare you away! We will only be using Start URLs and Page Function for our first scraper, so feel free to delete the other options.

  • Start URLs is a list of URLs that we input, which the scraper will request.
  • Page Function is where all of our scraping logic goes. We tell the scraper what data to grab on each page.

Currently, we are only using one URL, but later on, we’ll be able to scrape posts from thousands of subreddits with the very same scraper we’re currently creating!
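For reference, the two settings we are keeping can be pictured as a plain object. This is a sketch of the scraper's input, assuming the field names `startUrls` and `pageFunction` from the actor's JSON input; the Page Function body here is only a placeholder, since the real logic is written in step 4.

```javascript
// Sketch of the two input fields used in this tutorial.
// Field names are assumed to match the actor's JSON input.
const input = {
    // Start URLs: each entry is an object with a `url` property.
    startUrls: [
        { url: 'https://old.reddit.com/r/cats' },
    ],
    // Page Function: passed to the scraper as a string of JavaScript.
    // Placeholder body only; the scraping logic comes later.
    pageFunction: `async function pageFunction(context) {
        return {};
    }`,
};

console.log(JSON.stringify(input.startUrls));
```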

3. Locate our data.

Chrome DevTools is surprisingly (or unsurprisingly) one of the most utilized tools in the scraping world. It’s super simple to use and makes it easy to find what specific elements on the page look like within the DOM (just a fancy name for the HTML tree of the page). Also, if you already have Chrome, you’ve already got Chrome DevTools, so no install necessary!

Navigate to our r/cats subreddit page and right-click. You should see this:

Inspect r/cats subreddit with Chrome DevTools

Go ahead and click “Inspect” at the very bottom. A new window will open. There are a whole lot of tabs and buttons there, but we’re going to be focusing on just one today: the element select tool (the arrow icon in the top-left corner of the DevTools panel).

Let’s click this, then hover over one of the titles on any post we like.

Hover over titles on r/cats to find the CSS selector

If we did it correctly, Chrome DevTools will tell us exactly which “CSS Selector” we need to plug into our scraper to make it work (a.title). If we were scraping more data, we’d have to do this same process for different elements; however, since we’re just scraping the titles for now, we’re ready to go!
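Before moving on, it can help to confirm that the selector matches what we expect. The sketch below uses the standard DOM method `querySelectorAll`; the helper name `previewTitles` is hypothetical. It accepts any document-like object, so on the r/cats page it could be pasted into the DevTools Console tab and called as `previewTitles(document)`.

```javascript
// Hypothetical helper: lists the text of every element matching `a.title`.
// `doc` is expected to behave like the browser's `document` object.
function previewTitles(doc) {
    const links = doc.querySelectorAll('a.title');
    // Array.from accepts a mapping function as its second argument.
    return Array.from(links, (link) => link.textContent.trim());
}

// In the DevTools console: previewTitles(document)
```

If the returned list contains the post titles and nothing else, the selector is a reasonable choice for the scraper.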

4. Tell the scraper what to extract.

Moving into our Page Function, we are going to write the logic required to grab each post title from the page. We’re going to use a bit of JavaScript now, but don’t be alarmed: the code for this scraper is very minimal and will all be provided here for you.

First, we will create an empty list. This is where each post title will be stored.


const postTitles = [];

postTitles starts out empty; by the end, it will hold all of the titles that the scraper spits out.

Then, using our selector (a.title) that we found earlier and the $ that we are provided within the Page Function, we can put together a simple script that will grab the text of each title and push it to our “postTitles” list.


// Visit every element matching a.title and store its text in postTitles.
$('a.title').each((_, element) => postTitles.push($(element).text()));

Finally, we want to “return” our postTitles so that they’ll be added to the dataset. We can do this by simply changing the already existing “return” to look like this:


return {
    postTitles,
};

Our final Page Function should now look something like this:

page function
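In text form, the assembled Page Function might look like the sketch below. It assumes the scraper's default `async function pageFunction(context)` wrapper, where `context.$` is the Cheerio handle for the loaded page; your editor may show additional boilerplate around it.

```javascript
async function pageFunction(context) {
    // `$` is the Cheerio object for the page that was just downloaded.
    const { $ } = context;

    // Empty list that will hold the text of every post title.
    const postTitles = [];

    // For each link matching our selector, store its text.
    $('a.title').each((_, element) => postTitles.push($(element).text()));

    // Whatever we return is saved to the dataset as one record.
    return {
        postTitles,
    };
}
```

Because everything is returned as one object, the page produces a single dataset record whose `postTitles` field is a list of strings.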

5. Run our scraper.

Click the big orange “Get Data” button at the bottom of the screen, and wait a few seconds. Soon, you’ll see that our scraper has succeeded and that the data is now available.

Cheerio Scraper log

Let’s view our results by clicking “1 result” and then “Preview.”

Dataset preview

This preview is in JSON format, but the dataset can also be downloaded and viewed in other formats. The formats Apify supports include JSON, XML, CSV, and Excel.
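The same data can also be fetched outside the web interface. As a sketch, assuming the Apify API's dataset items endpoint and its `format` query parameter, a download URL could be built as shown here. The helper name `datasetItemsUrl` and the dataset ID are placeholders, not values from this run.

```javascript
// Hypothetical helper: builds a URL for downloading dataset items.
// Assumes the endpoint https://api.apify.com/v2/datasets/{id}/items
function datasetItemsUrl(datasetId, format) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    // `format` selects the output type, e.g. json, xml, csv or xlsx.
    url.searchParams.set('format', format);
    return url.toString();
}

// "YOUR_DATASET_ID" is a placeholder; use the ID shown for your own run.
console.log(datasetItemsUrl('YOUR_DATASET_ID', 'csv'));
// https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=csv
```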

Some things to note

  • This very same scraping job could have been done using Web Scraper, but it would have been slightly less performant and used somewhat more computing power (only slightly, as our scraper for this example was very small and simple).
  • You can find more in-depth information about the Cheerio Scraper actor and all of its options on the Cheerio Scraper page itself, as well as on the Cheerio Scraper tutorial page.
  • If you’re still not sure whether web scraping is right for you or your business, check out the pros and cons of web scraping.
  • For more information about why you should choose Apify for your scraping endeavors, refer to the end of the Web Scraper article that we mentioned earlier.
  • If you feel that you are ready for more advanced scraping tasks (building your own actors) or that this tutorial was too basic for you, refer to the documentation for the Apify SDK to get started with advanced web scraping.

Finally, a challenge for you 💪

Once you feel comfortable with the material we’ve covered in this article and (mostly) everything on the Cheerio Scraper tutorial page, feel free to complete this challenge. It is an expansion of what we’ve already built together.

  1. Scrape the first page of multiple subreddits (hint: Start URLs).
  2. Scrape not only the title of each post but also the author (hint: Chrome DevTools).
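
As a nudge for the first part, here is one way the list of Start URLs could be generated rather than typed by hand. This is only a sketch: the helper name `buildStartUrls` and the example subreddit names are made up, and the output assumes that each Start URLs entry is an object with a `url` property.

```javascript
// Hypothetical helper: turns subreddit names into Start URL entries
// that point at the old Reddit site.
function buildStartUrls(subreddits) {
    return subreddits.map((name) => ({
        url: `https://old.reddit.com/r/${encodeURIComponent(name)}`,
    }));
}

// Example subreddit names, chosen only for illustration.
console.log(buildStartUrls(['cats', 'dogs']));
```

The second part is left to you: the process for finding the author's selector is the same one we used for the titles.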

