What is web scraping?
You may already be familiar with this introduction to web scraping, which does a fantastic job of explaining precisely what web scraping is. However, if you aren’t already familiar with web scraping, here's a quick rundown.
Why scrape the web?
Every website has data. Web scraping is a way of programmatically extracting this data and storing it somewhere for our own use. We scrape a site when we need its data but have no other efficient or cost-effective way of obtaining it. Web scraping doesn’t always have to be about collecting the data from a website, though. We can also perform actions based on the data.
How does one build a web scraper?
There are various tools to help you with your specific scraping tasks, regardless of scale. Apify offers a few beginner-friendly tools that get the job done. One of the goals of this article will be to help you decide for yourself which one is best for your purposes.
Getting started with Apify
Unlike many other options, no programming experience is required to start scraping with Apify. All you need is a free Apify account, which can be set up within a minute or two:
- Create a free Apify account
- Click on the verification link in your email inbox
- Done! 👍
Choosing your scraper
In this beginner's guide to web scraping, we covered just one of the main web scraping tools on the Apify platform (Web Scraper). In this article, we’ll be covering Cheerio Scraper, which is similar to Web Scraper, but there are some significant differences:
| | Web Scraper | Cheerio Scraper |
| --- | --- | --- |
| Requests | Visits websites with a web browser, just like a normal human does. | Makes plain HTTP requests to the links it is provided; doesn't act like a normal browser. |
| Ability | Can load dynamic websites that load content on scroll, on certain actions, etc. | Unable to load dynamic content on a page; only gets the initial load. |
| Performance | Less performant, but can still be optimized. | Very fast and effective (faster than Web Scraper). Thousands of requests can be sent within minutes. |
| Features | Can be used to automate a website and perform actions such as typing or clicking. Can also execute JavaScript code on a page. | Cannot perform actions or execute code on a page. |
This lays it out nicely, but the most significant differences to consider for our sample use case today are the “Ability” and “Performance” rows.
Our scraping task
For our use case, we will be scraping the old Reddit website, which does not have any dynamic content. Because of this, we’ll be using Cheerio Scraper. Cheerio web scraping is a quick way to extract data from sites like this one that don’t rely on dynamic content.
Our goal is to visit the first page of r/cats and scrape the title of every single post. At the end of the scraping job, all of the titles will be pushed into a dataset.
With a little bit of knowledge of basic HTML and the use of copy/paste, we’ll have our results within minutes 😎
Step-by-step guide to scraping with Cheerio Scraper
1. Go to Apify's Cheerio Scraper
Make sure you’re logged into your Apify account, open up the Cheerio Scraper page, and click “Try for Free”.
This will bring us straight to the page where we can start building our scraper.
2. Insert Start URLs
On the “Input and Options” page, paste the link (https://old.reddit.com/r/cats) into “Start URLs”.
You’re going to notice a bunch of different configuration options on this page, but don’t let that scare you away! We will only be using Start URLs and Page Function for our first scraper, so feel free to delete the other options.
- Start URLs is a list of URLs that we input, which the scraper will request.
- Page Function is where all of our scraping logic goes. We tell the scraper what data to grab on each page.
Currently, we are only using one URL, but later on, we’ll be able to scrape posts from thousands of subreddits with the very same scraper we’re currently creating! Cheerio web scraping FTW!
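If you’d rather edit the input as raw JSON (the input screen offers a JSON view), the Start URLs field can be expressed as a list of URL objects. Here’s a minimal sketch, assuming Apify’s usual `{ "url": ... }` format; scaling up to more subreddits later is just a matter of appending entries:

```json
[
    { "url": "https://old.reddit.com/r/cats" }
]
```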
3. Locate the data
Chrome DevTools is surprisingly (or unsurprisingly) one of the most utilized tools in the scraping world. It’s super simple to use and makes it easy to see what specific elements on the page look like within the DOM (just a fancy name for the page's HTML tree). Also, if you already have Chrome, you’ve already got Chrome DevTools, so no install necessary!
Navigate to our r/cats subreddit page and right-click anywhere on the page to open the context menu.
Go ahead and click “Inspect” at the very bottom of that menu. A new DevTools window will open. There are a whole lot of tabs and buttons in there, but the one we’re going to focus on today is the element selection tool - the cursor icon in the top-left corner of the DevTools panel.
Let’s click that icon, then hover over the title of any post we like.
If we did it correctly, Chrome DevTools will tell us exactly which CSS selector (`a.title`) we need to plug into our scraper to make it work. If we were scraping more data, we’d repeat this process for each element; however, since we’re just scraping the titles for now, we’re ready to go!
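Before plugging the selector into the scraper, we can sanity-check it right in the DevTools Console. This is just an optional verification step using the standard DOM API:

```javascript
// Run in the DevTools Console on https://old.reddit.com/r/cats
// to confirm that a.title actually matches the post titles.
Array.from(document.querySelectorAll('a.title'))
    .map((el) => el.textContent)
    .slice(0, 5); // show the first five titles
```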
4. Tell the scraper what to extract
Moving into our Page Function, we are going to write the logic required to grab each post title from the page. We’re going to use a bit of JavaScript now, but don’t be alarmed - the code for this scraper is very minimal and will all be provided for you here.
First, we will create an empty list. This is where each post title will be stored.
```javascript
const postTitles = [];
```
`postTitles` now represents all of the titles that the scraper will spit out at the end.
Then, using the selector (`a.title`) we found earlier and the `$` function provided within the Page Function, we can put together a simple script that grabs the text of each title and pushes it to our `postTitles` list.
```javascript
$('a.title').each((_, element) => postTitles.push($(element).text()));
```
Finally, we want to return our `postTitles` so that they’ll be added to the dataset. We can do this by simply changing the already existing `return` statement to look like this:

```javascript
return {
    postTitles,
};
```
Our final Page Function should now look something like this:
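For reference, here’s a minimal sketch of the whole thing, assuming the default Cheerio Scraper boilerplate (where the Page Function receives a `context` object containing the Cheerio `$` function):

```javascript
async function pageFunction(context) {
    // Cheerio's $ function is provided by the scraper for each page.
    const { $ } = context;

    // Collect the text of every post title on the page.
    const postTitles = [];
    $('a.title').each((_, element) => postTitles.push($(element).text()));

    // Whatever we return here gets pushed to the dataset.
    return {
        postTitles,
    };
}
```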
5. Run Cheerio Scraper
Click the Start button at the bottom of the screen, and wait a few seconds. Soon, you’ll see that our scraper has succeeded and that the data is now available.
Let’s view our results by clicking “1 result” and then “Preview.”
This preview is in JSON format but can be downloaded and viewed in other formats as well (JSON, XML, CSV, and Excel are the formats Apify supports).
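Beyond the UI, the same dataset can also be fetched programmatically. Here’s a sketch using the Apify API’s dataset items endpoint; `DATASET_ID` is a placeholder for the ID shown on your run’s Storage tab, and private datasets would additionally need your API token:

```javascript
// Fetch the run's results as JSON from the Apify API.
// DATASET_ID is a placeholder; swap in ?format=csv for CSV output.
const response = await fetch(
    'https://api.apify.com/v2/datasets/DATASET_ID/items?format=json',
);
const items = await response.json();
console.log(items);
```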
Some things to note
- This very same scraping job could have been done using Web Scraper, but it would have been slightly less performant and used somewhat more computing power (only slightly, as our scraper for this example was very small and simple).
- You can find more in-depth information about the Cheerio Scraper Actor and all of its options on the Cheerio Scraper page itself, as well as on the Cheerio Scraper tutorial page.
- If you’re still not sure whether web scraping is right for you or your business, check out the pros and cons of web scraping.
- For more information about why you should choose Apify for your scraping endeavors, refer to the end of the Web Scraper article that we mentioned earlier.
- If you feel that you are ready for more advanced scraping tasks (building your own Actors) or that this tutorial was too basic for you, refer to the documentation for the Apify SDK to get started with advanced web scraping.
Finally, a challenge for you 💪
Once you feel comfortable with the material we’ve covered in this article and (mostly) everything on the Cheerio Scraper tutorial page, feel free to complete this challenge. It is an expansion of what we’ve already built together.
- Scrape the first page of multiple subreddits (hint: Start URLs).
- Scrape not only the title of each post but also the author (hint: Chrome DevTools; see the sketch below).
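If you get stuck on the second task, here’s a nudge: extracting authors follows the exact same pattern as the titles. This sketch assumes the author links match an `a.author` selector, which you should confirm in DevTools yourself:

```javascript
// Hypothetical addition to the Page Function for the challenge.
// The a.author selector is an assumption - verify it in DevTools.
const authors = [];
$('a.author').each((_, element) => authors.push($(element).text()));
```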