Why Reddit is one of the biggest social sharing sites on the internet and how you can use web scraping to extract useful data from subreddits.
What is Reddit?
Reddit bills itself as “the front page of the internet”. That’s a bold claim, but it is definitely true for a significant number of internet users, with the latest figures for 2021 showing that it has over 430 million monthly active users and over 100,000 active communities. Reddit was launched in 2005, but it is still popular and relevant, unlike many other early social media sites.
If you’re not familiar with Reddit, it’s basically a huge social sharing site composed of smaller communities called subreddits.
Subreddits follow a specific URL format, e.g. https://www.reddit.com/r/apify/
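As a small illustration of that URL format, here is a minimal Python sketch that builds a subreddit URL from its name and pulls the name back out of a URL. The helper names `subreddit_url` and `subreddit_name` are made up for this example and aren't part of any Reddit or Apify library.

```python
from typing import Optional
from urllib.parse import urlparse

REDDIT_BASE = "https://www.reddit.com"


def subreddit_url(name: str) -> str:
    """Build the URL for a subreddit from its name, e.g. 'apify' or 'r/apify'."""
    cleaned = name.strip().strip("/")
    # Accept names written with the usual "r/" prefix as well.
    if cleaned.lower().startswith("r/"):
        cleaned = cleaned[2:]
    return f"{REDDIT_BASE}/r/{cleaned}/"


def subreddit_name(url: str) -> Optional[str]:
    """Return the subreddit name from a URL, or None if the path isn't /r/<name>/."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "r":
        return parts[1]
    return None


print(subreddit_url("apify"))                              # https://www.reddit.com/r/apify/
print(subreddit_name("https://www.reddit.com/r/apify/"))   # apify
```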
Any user can post links, stories, pictures, or videos to these subreddits. Other users upvote or downvote the submissions. When something is upvoted, it becomes more visible. When something is downvoted, it becomes less visible.
Each subreddit also has moderators who make sure that the submissions are relevant to the topic of the subreddit, follow the rules, and aren’t just spam. Subreddits can have their own themes and some look dramatically different from others.
Some subreddits might seem a little strange, but they can still accumulate thousands of users.
Over the years, different subreddits have grown and flourished, reflecting how both the web and Reddit users have changed.
This interactive Reddit map gives you some idea of the scope of Reddit interests. Check it out for yourself and explore how subreddits connect.
Or just dive right in and explore 10 of the more accessible subreddits:
Why scrape Reddit?
Now that you have some idea of how many users are on Reddit and how diverse their interests are, you might be starting to think about how you could use web scraping to gather useful data from Reddit.
Here are just some of the reasons you might want to scrape Reddit:
- Keep track of how your brand or product is being discussed across the site. And if it isn’t being discussed, you might find out why—or discover who your competitors are.
- Connect with your users and make sure that their questions are being answered quickly and effectively.
- Watch for new trends and attitudes, and avert potential PR disasters. Reddit often acts as an incubator for ideas, and how Redditors behave and think usually precedes mainstream channels by months or even years.
- Stay ahead of potential profits or losses resulting from Reddit activity, such as the recent GameStop stock price surge. If you’re interested in a particular industry or ticker symbol, tracking mentions of it on Reddit might be a prudent move.
- Aggregate data, posts, images or videos from multiple subreddits and present them in new and interesting ways for your users.
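To make the brand-tracking idea more concrete, here is a rough Python sketch that counts how many scraped posts mention a keyword in each subreddit. It assumes each post is a dictionary with `subreddit`, `title`, and `body` keys. Those field names and the sample posts are invented for illustration, so check the actual output of whatever scraper you use and adjust accordingly.

```python
from collections import Counter


def count_mentions(posts, keyword):
    """Count posts mentioning `keyword` (case-insensitive), grouped by subreddit."""
    needle = keyword.lower()
    counts = Counter()
    for post in posts:
        # Field names are assumptions; missing fields are treated as empty text.
        text = f"{post.get('title', '')} {post.get('body', '')}".lower()
        if needle in text:
            counts[post.get("subreddit", "unknown")] += 1
    return counts


# Invented sample data, standing in for real scraped posts.
sample_posts = [
    {"subreddit": "webscraping", "title": "Apify vs. writing my own scraper", "body": ""},
    {"subreddit": "webscraping", "title": "Proxy question", "body": "I use Apify Proxy."},
    {"subreddit": "dogs", "title": "Best toys for puppies?", "body": "Looking for ideas."},
]

print(count_mentions(sample_posts, "apify"))  # Counter({'webscraping': 2})
```

A plain substring match like this is deliberately simple. It will also match the keyword inside longer words, so a real monitoring setup would probably want word-boundary matching or something smarter.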
What about the Reddit API?
Reddit has its own API designed to let developers interact in lots of useful ways with the Reddit site. It’s a great resource and every dev interested in scraping Reddit should be familiar with what it offers. So why should you use Apify to scrape Reddit rather than use the Reddit API?
Here are just some reasons you might not want to use the official API:
- Reddit requires you to be authenticated to scrape the Reddit website with their API. Apify doesn't even require you to have a Reddit account.
- Commercial use of Reddit's API requires special authorization. Apify doesn't enforce any restrictions on whether you are scraping Reddit for commercial or personal use.
- Reddit requires developers to register in order to get a token and use the official API. While we can't say whether Reddit ever refuses to give someone a token, they might if they don't like what you're doing with the scraped data. Apify is not interested in what you scrape and the data you collect from Reddit is yours to download and use however you like.
- Reddit has rules for how you use their API. If you scrape Reddit using our Reddit Scraper, you are not required to follow those rules (although we advise you to at least respect them and not abuse the site!).
How to scrape Reddit using Apify
1. Go to http://apify.com
2. You can log in using your email account, Google, or GitHub.
5. Once you’re on the page for Reddit Scraper, just click Try me and you’ll be sent to your Apify account, with a Task automatically created for you to start using the actor. An actor task is just a way for you to configure the tool to do what you want.
6. Let’s go to Reddit and do a search to see what data might be interesting. Maybe you like dogs, so let’s search for “dogs”. If you prefer cats, feel free to search for “cats” 😻 Whatever you search for, copy the resulting URL that you can see in the address bar.
7. Now go back to Apify and enter the URL you copied into the Start URLs field. In our case, it’s https://www.reddit.com/search/?q=dogs
8. You can also change some of the other input parameters. For instance, you can filter by date, specify whether you want to search posts or communities, or decide how you want your search to be sorted.
9. When you’re happy with how you’ve set up your scraping parameters, click the Save & Run button. The actor will start scraping and you’ll see that it has a status of Running.
10. It might take a few minutes to complete the scraping run, but you should soon see that the actor has Succeeded. You can then click the Dataset tab to see your results.
11. The dataset contains your data in lots of handy formats. You can open them by clicking on View or Download. You can share the data, use it in spreadsheets or projects, or upload it anywhere you like.
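If you download the dataset as JSON and want just a few columns in a spreadsheet, a short script can do the conversion. The sketch below assumes the download is a JSON array of objects. The `title` and `url` field names and the sample data are placeholders, so swap in whatever fields your dataset actually contains.

```python
import csv
import io
import json


def json_items_to_csv(json_text, fields):
    """Convert a JSON array of objects into CSV text with only the chosen fields."""
    items = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells; fields not listed are dropped.
        writer.writerow({field: item.get(field, "") for field in fields})
    return buffer.getvalue()


# Placeholder data standing in for a downloaded dataset file.
sample_json = json.dumps([
    {"title": "Good dogs", "url": "https://example.com/1", "extra": 1},
    {"title": "More dogs"},
])

print(json_items_to_csv(sample_json, ["title", "url"]))
```

In practice you would read `json_text` from the downloaded file and write the result to a `.csv` file instead of printing it.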
Now that you know how to scrape Reddit, you can play around with the settings and see what kind of data you can get.
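If you want to try several searches, you don't have to copy each URL from the address bar. Here is a minimal Python sketch that builds Reddit search URLs in the same shape as the one used in the steps above. The function name `reddit_search_url` is made up for this example.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.reddit.com/search/"


def reddit_search_url(query: str) -> str:
    """Build a Reddit search URL for `query`, URL-encoding spaces and symbols."""
    return f"{SEARCH_BASE}?{urlencode({'q': query})}"


start_urls = [reddit_search_url(q) for q in ["dogs", "cats", "golden retriever"]]
for url in start_urls:
    print(url)
# https://www.reddit.com/search/?q=dogs
# https://www.reddit.com/search/?q=cats
# https://www.reddit.com/search/?q=golden+retriever
```

Each of the resulting URLs can be pasted into the Start URLs field.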
Note that you will usually need to use a proxy to scrape Reddit, or the actor will get blocked. Your free Apify account comes with a free trial of Apify Proxy, so that should help you get started.
If you want to scrape Reddit but don’t want to do it yourself, just tell us what you need and we’ll work with you to create a custom scraping solution.