How to efficiently scrape any website with Cheerio Scraper

Learn to use Cheerio Scraper to extract data quickly and easily from the web.

What is web scraping?

You may already be familiar with this introduction to web scraping, which does a fantastic job of explaining precisely what web scraping is. However, if you aren’t already familiar with web scraping, here's a quick rundown.

Why scrape the web?

Every website has data. Web scraping is a way of programmatically extracting this data and then storing it somewhere for our own use. We scrape a site when we need its data but there is no other efficient or cost-effective way of obtaining it. Web scraping doesn’t always have to be about collecting data from a website, though. We can also perform actions based on the data.

How does one build a web scraper?

There are various tools to help you with your specific scraping tasks, regardless of scale. Apify offers a few beginner-friendly tools that get the job done. One of the goals of this article will be to help you decide for yourself which one is best for your purposes.

Three popular scraping tools on Apify Store

Getting started with Apify

Unlike with many other options, no programming experience is required to start scraping with Apify. All you need is a free Apify account, which can be set up within a minute or two.

Choosing your scraper

In this beginner's guide to web scraping, we covered just one of the main web scraping tools on the Apify platform (Web Scraper). In this article, we’ll be covering Cheerio Scraper, which is similar to Web Scraper, but there are some significant differences:

|             | Web Scraper | Cheerio Scraper |
| ----------- | ----------- | --------------- |
| Requests    | Visits websites with a web-browser, just like a normal human does. | Makes HTTP requests to the links it is provided, and doesn't act like a normal browser. |
| Ability     | Can load dynamic websites that load content on scroll, on certain actions, etc. | Unable to load dynamic content on a page, and only gets the initial load. |
| Performance | Less performant, but still can be optimized. | Very fast and effective (faster than Web Scraper). Thousands of requests can be sent within minutes. |
| Features    | Can be used to automate a website, and perform actions such as typing or clicking. Can also execute JavaScript code on a page. | Cannot perform actions or execute code on a page. |

This lays it out nicely, but the most significant differences to consider for our sample use case today are the “Ability” and “Performance” rows.

ℹ️
If the website you are scraping has content that must be loaded dynamically, use Web Scraper. Otherwise, use Cheerio Scraper.
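One rough way to tell which case applies is to download the page's raw HTML (roughly what Cheerio Scraper receives) and check whether the text you want is already in it. The sketch below is only a heuristic: `appearsInInitialHtml` is a hypothetical helper name, and the commented usage assumes Node.js 18+ with its built-in `fetch`.

```javascript
// Hypothetical helper: returns true if `needle` already appears in the raw HTML,
// which suggests the content does not need JavaScript to show up on the page.
function appearsInInitialHtml(html, needle) {
    return html.toLowerCase().includes(needle.toLowerCase());
}

// Example usage (assumes Node.js 18+ with the built-in fetch):
// const response = await fetch('https://old.reddit.com/r/cats');
// const html = await response.text();
// console.log(appearsInInitialHtml(html, 'a post title you can see in your browser'));
```

If the check comes back false for text that is clearly visible in your browser, the page is probably loading that content dynamically, and Web Scraper is the safer choice.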

Our scraping task

For our use case, we will be scraping the old Reddit website, which does not have any dynamic content. Because of this, we’ll be using Cheerio Scraper. Cheerio web scraping is a quick way to extract data from any website that delivers its content in the initial page load.

Our goal is to visit the first page of r/cats and scrape the title of every single post. At the end of the scraping job, all of the titles will be pushed into a dataset.

Web scraping the r/cats subreddit with Cheerio Scraper

With a little bit of knowledge of basic HTML and the use of copy/paste, we’ll have our results within minutes 😎

💡
If you want a detailed tutorial on using Cheerio for web scraping, check out this ultimate guide to Cheerio.

Step-by-step guide to scraping with Cheerio Scraper

1. Go to Apify's Cheerio Scraper

Make sure you’re logged into your Apify account, open up the Cheerio Scraper page, and click “Try for Free”.

Cheerio Scraper on the Apify platform

This will bring us straight to the page where we can start building our scraper.

2. Insert Start URLs

On the “Input and Options” page, paste the link (https://old.reddit.com/r/cats) into “Start URLs”.

Cheerio Scraper input and options

You’re going to notice a bunch of different configuration options on this page, but don’t let that scare you away! We will only be using Start URLs and Page Function for our first scraper, so feel free to delete the other options.

  • Start URLs is a list of URLs that we input, which the scraper will request.
  • Page Function is where all of our scraping logic goes. We tell the scraper what data to grab on each page.

Currently, we are only using one URL, but later on, we’ll be able to scrape posts from thousands of subreddits with the very same scraper we’re currently creating! Cheerio web scraping FTW!
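To illustrate the idea, Start URLs is just a list, so covering more subreddits means adding more entries. The snippet below builds such a list in JavaScript. `buildStartUrls` is a hypothetical helper, the extra subreddit names are only examples, and the `{ url }` object shape is an assumption about how the scraper's JSON input represents each Start URL, so compare it with the JSON view of your own input before relying on it.

```javascript
// Hypothetical helper: turn subreddit names into Start URL entries.
// Assumption: each Start URL is an object with a `url` property.
function buildStartUrls(subreddits) {
    return subreddits.map((name) => ({
        url: `https://old.reddit.com/r/${encodeURIComponent(name)}`,
    }));
}

// Example subreddit names; replace them with the ones you care about.
const startUrls = buildStartUrls(['cats', 'dogs', 'aww']);
console.log(JSON.stringify(startUrls, null, 2));
```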

3. Locate the data

Chrome DevTools is surprisingly (or unsurprisingly) one of the most utilized tools in the scraping world. It’s super simple to use and makes it easy to find what specific elements on the page look like within the DOM (just a fancy name for the HTML tree of the page). Also, if you already have Chrome, you’ve already got Chrome DevTools, so no install necessary!

Navigate to our r/cats subreddit page and right-click. You should see this:

Inspect r/cats subreddit with Chrome DevTools

Go ahead and click “Inspect” at the very bottom. A new window will open. There are a whole lot of tabs and buttons there, but we’re going to be focusing on just one today: the element-selection button (the arrow icon in the top-left corner of DevTools).

Let’s click this, then hover over one of the titles on any post we like.

Hover over titles on r/cats to find the CSS selector

If we did it correctly, Chrome DevTools will tell us exactly which “CSS Selector” we need to plug into our scraper to make it work (a.title). If we were scraping more data, we’d have to do this same process for different elements; however, since we’re just scraping the titles for now, we’re ready to go!
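Before moving on, the selector can be sanity-checked in the DevTools Console tab. The helper below is a small sketch (the name `previewTitles` is made up for this example). It uses the browser's standard `querySelectorAll` to list the text of every element matching `a.title`.

```javascript
// Hypothetical helper for the DevTools Console: list the text of all elements
// matching a CSS selector under `root` (usually the global `document`).
function previewTitles(root, selector = 'a.title') {
    return Array.from(root.querySelectorAll(selector), (el) => el.textContent.trim());
}

// In the browser console on the r/cats page:
// previewTitles(document);
```

If the returned list matches the titles you see on the page, the selector is a good fit for the scraper.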

4. Tell the scraper what to extract

Moving into our Page Function, we are going to write the logic required to grab each post title from the page. We’re going to use a bit of JavaScript now, but don’t be alarmed: the code for this scraper is very minimal and is all provided here for you.

First, we will create an empty list. This is where each post title will be stored.


const postTitles = [];

postTitles now represents all of the titles that the scraper will spit out at the end.

Then, using our selector (a.title) that we found earlier and the $ that we are provided within the Page Function, we can put together a simple script that will grab the text of each title and push it to our “postTitles” list.


$('a.title').each((_, element) => postTitles.push($(element).text()));

Finally, we want to “return” our postTitles so that they’ll be added to the dataset. We can do this by simply changing the already existing “return” to look like this:


return {
    postTitles,
};

Our final Page Function should now look something like this:

page function
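Putting the three snippets together, the complete Page Function can be sketched roughly as follows. This assumes the editor's default wrapper, `async function pageFunction(context)`, with Cheerio's `$` available on `context`. If your editor shows a different wrapper, keep it and only swap in the body.

```javascript
// Sketch of the complete Page Function, assuming the default wrapper
// where Cheerio's `$` is provided on the `context` object.
async function pageFunction(context) {
    const { $ } = context;

    // Empty list that will hold every post title found on the page.
    const postTitles = [];

    // Visit each element matching the selector and store its text.
    $('a.title').each((_, element) => postTitles.push($(element).text()));

    // The returned object is what gets saved to the dataset.
    return {
        postTitles,
    };
}
```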

5. Run Cheerio Scraper

Click the Start button at the bottom of the screen, and wait a few seconds. Soon, you’ll see that our scraper has succeeded and that the data is now available.

Cheerio Scraper log

Let’s view our results by clicking “1 result” and then “Preview.”

Dataset preview

This preview is in JSON format but can be downloaded and viewed in other formats as well (JSON, XML, CSV, and Excel are among the formats Apify supports).
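The preview shows a single result because our Page Function returns one object that holds the whole list of titles. If one row per title would be more convenient (for a CSV export, say), the list can be reshaped into one object per title. The sketch below covers only the reshaping. `toRows` is a hypothetical helper, and whether returning such an array from the Page Function yields one dataset item per element is an assumption to verify against the Cheerio Scraper documentation.

```javascript
// Hypothetical helper: turn a list of titles into one record per title.
function toRows(postTitles) {
    return postTitles.map((title, index) => ({ position: index + 1, title }));
}

// Placeholder titles, not real scraped data.
console.log(toRows(['Example title A', 'Example title B']));
```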

Some things to note

  • This very same scraping job could have been done using Web Scraper, but it would have been slightly less performant and used somewhat more computing power (only slightly, as our scraper for this example was very small and simple).
  • You can find more in-depth information about the Cheerio Scraper Actor and all of its options on the Cheerio Scraper page itself, as well as on the Cheerio Scraper tutorial page.
  • If you’re still not sure whether web scraping is right for you or your business, check out the pros and cons of web scraping.
  • For more information about why you should choose Apify for your scraping endeavors, refer to the end of the Web Scraper article that we mentioned earlier.
  • If you feel that you are ready for more advanced scraping tasks (building your own Actors) or that this tutorial was too basic for you, refer to the documentation for the Apify SDK to get started with advanced web scraping.

Finally, a challenge for you 💪

Once you feel comfortable with the material we’ve covered in this article and (mostly) everything on the Cheerio Scraper tutorial page, feel free to complete this challenge. It is an expansion of what we’ve already built together.

  1. Scrape the first page of multiple subreddits (hint: Start URLs).
  2. Scrape not only the title of each post but also the author (hint: Chrome DevTools).

Matt Stephens
Full-Stack developer, web-automation engineer & technical writer. I enjoy helping others and developing new solutions. I absolutely love Next.js and GraphQL. I also have fun working with data.
