Why Reddit is one of the biggest social sharing sites on the internet and how you can use web scraping to extract useful data from subreddits.
What is Reddit?
Reddit bills itself as “the front page of the internet.” That’s a bold claim, but it is definitely true for a significant number of internet users, with the latest figures for 2021 showing that it has over 430 million monthly active users and over 100,000 active communities. Reddit was launched in 2005, but it is still popular and relevant, unlike many other early social media sites.
If you’re not familiar with Reddit, it’s basically a huge social sharing site composed of smaller communities called subreddits.
Subreddits follow a specific URL format, e.g. https://www.reddit.com/r/apify/
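If you want to work with subreddit URLs in a script, the format above is easy to build and to parse. Here is a minimal Python sketch. The helper names `subreddit_url` and `subreddit_from_url` are our own, made up for illustration, and the pattern assumes subreddit names contain only letters, digits, and underscores.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: subreddit names are made of letters, digits, and underscores.
SUBREDDIT_PATH = re.compile(r"^/r/([A-Za-z0-9_]+)/?")


def subreddit_url(name: str) -> str:
    """Build a subreddit URL in the format shown above."""
    return f"https://www.reddit.com/r/{name}/"


def subreddit_from_url(url: str) -> Optional[str]:
    """Return the subreddit name from a Reddit URL, or None if there isn't one."""
    match = SUBREDDIT_PATH.match(urlparse(url).path)
    return match.group(1) if match else None


print(subreddit_url("apify"))
print(subreddit_from_url("https://www.reddit.com/r/apify/"))
```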
Any user can post links, stories, pictures, or videos to these subreddits. Other users upvote or downvote the submissions. When something is upvoted, it becomes more visible. When something is downvoted, it becomes less visible.
Each subreddit also has moderators who make sure that the submissions are relevant to the topic of the subreddit, follow the rules, and aren’t just spam. Subreddits can have their own themes and some look dramatically different from others.
Some subreddits might seem a little strange, but they can still accumulate thousands of users.
Over the years, different subreddits have grown and flourished, reflecting how both the web and Reddit users have changed.
This interactive Reddit map gives you some idea of the scope of Reddit interests. Check it out for yourself and explore how subreddits connect.
Or just dive right in and explore 10 of the more accessible subreddits:
Why scrape Reddit?
Now that you have some idea of how many users are on Reddit and how diverse their interests are, you might already be thinking about how you could use web scraping to gather some useful data from the site.
Here are just some of the reasons you might want to scrape Reddit:
- Keep track of how your brand or product is being discussed across the site. And if it isn’t being discussed, you might find out why—or discover who your competitors are.
- Connect with your users and make sure that their questions are being answered quickly and effectively.
- Watch for new trends and attitudes, and avert potential PR disasters. Reddit often acts as an incubator for ideas, and how Redditors behave and think usually precedes mainstream channels by months or even years.
- Make sure that you keep ahead of potential profits or losses resulting from Reddit activity like the GameStop stock price surge. If you’re interested in a particular industry or ticker symbol, tracking mentions of it on Reddit might be a prudent move.
- Aggregate data, posts, images or videos from multiple subreddits and present them in new and interesting ways for your users.
The good news is that scraping Reddit data is not all that difficult, even if you've never extracted data from websites before. Just follow this guide and watch our short YouTube tutorial for a visual walkthrough.
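To make the first use case above more concrete, here is a rough Python sketch of brand tracking: counting how many scraped posts mention a keyword, grouped by subreddit. The posts are made-up sample data, and the field names `subreddit`, `title`, and `body` are assumptions for illustration, not the actual output schema of any particular scraper.

```python
from collections import Counter


def count_mentions(posts, keyword):
    """Count posts per subreddit whose title or body contains the keyword."""
    needle = keyword.lower()
    counts = Counter()
    for post in posts:
        # Simple case-insensitive substring match across title and body.
        text = f"{post.get('title', '')} {post.get('body', '')}".lower()
        if needle in text:
            counts[post.get("subreddit", "unknown")] += 1
    return counts


# Made-up sample data; the field names are hypothetical.
sample_posts = [
    {"subreddit": "gadgets", "title": "Acme widget review", "body": ""},
    {"subreddit": "gadgets", "title": "Which brand?", "body": "I went with ACME."},
    {"subreddit": "dogs", "title": "Best leash?", "body": "Any tips welcome."},
]

print(count_mentions(sample_posts, "acme"))
```

A plain substring match like this will also catch the keyword inside longer words, so for short brand names or ticker symbols you would probably want stricter matching.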
Is it legal to scrape Reddit?
Web scraping is legal as long as you respect regulations such as the GDPR and the CCPA, which cover personal data protection. It’s also important to scrape only publicly available content that is not protected by copyright. To learn more about the legality of web scraping, check out our blog post on the subject.
What about the Reddit API?
Reddit has its own API designed to let developers interact in lots of useful ways with the Reddit site. It’s a great resource and every dev interested in scraping Reddit should be familiar with what it offers. So why should you use Apify to scrape Reddit rather than use the Reddit API?
Here are just some reasons you might not want to use the official API:
- Reddit requires you to be authenticated to get data through its API. Apify doesn't require you to even have a Reddit account.
- The use of Reddit's API for commercial use requires special authorization. Apify doesn't enforce any restrictions on whether you are scraping Reddit for commercial or personal use.
- Reddit requires developers to register in order to get a token and use the official API. While we can't say whether Reddit ever refuses to give someone a token, they might if they don't like what you're doing with the scraped data. Apify is not interested in what you scrape and the data you collect from Reddit is yours to download and use however you like.
- Reddit has rules for how you use their API. If you scrape Reddit using our Reddit Scraper, you are not required to follow those rules (although we advise you to at least respect them and not abuse the site!).
So if you have been wondering whether an unofficial Reddit API exists and whether there's an easy way to use it, this step-by-step tutorial is for you. We’ll use a free ready-made tool called Free Reddit Scraper to get the Reddit data. For high-scale Reddit scraping and unlimited data extraction, you can use our powerful Reddit Scraper. The steps of this tutorial can be easily replicated for both of these scraping tools. Let's get to it!
How to scrape Reddit using Apify
Find your actor in Apify Store
1. Go to the Free Reddit Scraper page and click the green Try for free button.
2. Now you're on the Apify sign-up page. If you don’t have an Apify account yet, you can easily sign up with your Gmail account, another email address, or your GitHub account.
3. Now you’re in your web scraping workspace - Apify Console. This is where you’ll create scraping tasks for your scraping tools, like the Free Reddit Scraper. So the first thing you need to do is to tell the Reddit Scraper what data you want to get from Reddit.
4. Let’s go to the Reddit website and do a search to see what data might be interesting. Hopefully, you're a dog person, so let’s search for “dogs”. If you prefer cats, feel free to search for “cats” 😻 Whatever you search for, copy the resulting URL from the address bar at the top of your browser.
5. Now go back to Apify and enter the URL you copied into the Start URLs field. In our case, it’s https://www.reddit.com/search/?q=dogs.
6. Alternatively, you can scrape Reddit by search term instead of inserting a URL. No need to go to Reddit for this. Just type "dogs" into the Search Term field. You can filter scraping results by date, specify whether you want to search posts, users, or communities, and decide how you want your search to be sorted: by relevance, number of comments, trending, etc.
7. When you’re happy with how you’ve set up your scraping parameters, click the Start button. The actor will start scraping and you’ll see that it has a status of Running.
8. It might take a few minutes to complete the scraping run, but you should soon see that the actor has ☑️ Succeeded. You can then click the Dataset tab to see your results.
9. The dataset contains your data in lots of handy formats. You can open them by clicking on 👁 Preview or ⤵️ Download. You can share the data, use it in spreadsheets or projects, or upload it anywhere you like.
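If you would rather prepare the input from steps 5 and 6 in a script than type it into the form, the idea can be sketched in Python like this. The search URL format comes from the example in step 5. The input field names `startUrls` and `searchTerm` are hypothetical stand-ins for the Start URLs and Search Term fields, so check the actor's input documentation for the real names before relying on them.

```python
import json
from urllib.parse import urlencode


def reddit_search_url(term: str) -> str:
    """Build a Reddit search URL like the one copied in step 5."""
    return "https://www.reddit.com/search/?" + urlencode({"q": term})


def build_run_input(term=None, start_url=None):
    """Assemble a run input from either a search term or a start URL.

    The keys below are hypothetical names, not a confirmed input schema.
    """
    if (term is None) == (start_url is None):
        raise ValueError("Provide exactly one of term or start_url")
    if start_url is not None:
        return {"startUrls": [{"url": start_url}]}
    return {"searchTerm": term}


url = reddit_search_url("dogs")
print(url)
print(json.dumps(build_run_input(start_url=url), indent=2))
print(json.dumps(build_run_input(term="dogs"), indent=2))
```

Using `urlencode` means that search terms with spaces or special characters are escaped for you.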
Now that you know how to scrape Reddit, you can play around with the settings and see what kind of data you can get.
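Once you have downloaded a dataset as JSON, you may want to keep only a few columns for a spreadsheet. Here is a minimal Python sketch of that step using only the standard library. The sample items and their field names (`title`, `url`, `upVotes`) are invented for illustration and may not match what your run returns.

```python
import csv
import io
import json


def items_to_csv(items, columns):
    """Return CSV text containing only the chosen columns of each item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({col: item.get(col, "") for col in columns})
    return buffer.getvalue()


# Made-up stand-in for a downloaded JSON export; field names are hypothetical.
exported = """[
  {"title": "Good dog", "url": "https://example.com/1", "upVotes": 12},
  {"title": "Puppy tips", "url": "https://example.com/2"}
]"""

items = json.loads(exported)
print(items_to_csv(items, ["title", "url"]))
```

In a real script you would read the downloaded file with `json.load` instead of parsing an inline string.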
Note that you will usually need to use a proxy to scrape Reddit or the actor will get blocked. Your free Apify account comes with a free trial of Apify Proxy, so that should help you get started.
If you want to scrape Reddit but don’t want to do it yourself, just tell us what you need and we’ll work with you to create a custom scraping solution.