The official Instagram API allows you to programmatically access your own comments and posts on Instagram. However, the API doesn’t allow you to get a list of posts made by other people, the comments and photos on those posts, or a list of posts with a particular hashtag.
In this article, you’ll learn how we built the new Instagram Scraper (jaroslavhejlek/instagram-scraper) actor on Apify to scrape this data from the public Instagram website. You can try the new actor right away, free of charge.
Where to find the data
Some of Instagram is accessible even without logging in, but not as much as in the past. Once you log in, you can browse through hashtags, profiles, places, and posts. This is very encouraging, because if you can do something manually in a web browser, you can automate it on Apify 😉
Data available publicly without login
If you check the website in an incognito browser window, you’ll quickly find that there are some features that you can access freely and some that are either blocked or require you to log in. Here is what you can find without a login (note that this may have changed):
1) Search
You can search for profiles, hashtags, and places, and Instagram will return the top 100 posts.
There is even a nice internal API endpoint that can be used to get the results in JSON format.
The endpoint’s context query parameter serves as the filter, and it can contain a place, user, or hashtag. The only limitation is that the endpoint returns just 100 results. If you need more, you need to enter a more detailed filter.
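As a rough illustration of how such a request could be put together, here is a minimal sketch. The endpoint URL is not given in this article, so it is passed in by the caller. The query parameter name and the literal context values 'place', 'user', and 'hashtag' are assumptions made for the example, not confirmed details of Instagram’s internal API.

```javascript
// Sketch only: builds a search URL for a caller-supplied endpoint.
// ASSUMPTIONS: the `query` parameter name and the literal context values
// below are hypothetical; only the `context` parameter is named in the text.
const ALLOWED_CONTEXTS = ['place', 'user', 'hashtag'];

function buildSearchUrl(endpoint, { context, query }) {
    if (!ALLOWED_CONTEXTS.includes(context)) {
        throw new Error(`Unknown context: ${context}`);
    }
    const url = new URL(endpoint);
    // URLSearchParams takes care of escaping spaces and special characters.
    url.searchParams.set('context', context);
    url.searchParams.set('query', query);
    return url.toString();
}
```

Because the endpoint returns at most 100 results, a caller would typically run several narrower queries rather than one broad one.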
2) Posts from Profiles/Hashtags/Places
When you open any public Instagram page that contains posts (e.g. a profile, hashtag, or place), Instagram returns an HTML page with the first few posts preloaded (probably using React server-side rendering). Then, as you scroll down the page, Instagram loads more posts using XHR requests to its GraphQL endpoint. The endpoint is protected with a token, so it’s not really possible to call it directly, and we need to scroll the page instead. Fortunately, infinite scrolling is easy to automate using headless Chrome with Puppeteer.
In my tests, I didn’t find a limit on how many posts can be loaded using the infinite scroll. There probably is one, but I was able to load even a thousand posts.
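The scrolling loop can be sketched roughly as follows. This is a simplified illustration, not the actor’s actual code: it assumes page is a Puppeteer Page that has already been opened on a profile, hashtag, or place, and the default 'article a' selector is a placeholder rather than Instagram’s real markup. The function name and its options are made up for the example.

```javascript
// Sketch: keep scrolling until no new posts appear or a cap is reached.
// ASSUMPTIONS: `page` is a Puppeteer Page; `postSelector` is a placeholder.
async function scrollUntilDone(page, options = {}) {
    const {
        postSelector = 'article a',
        maxPosts = 1000,
        maxIdleRounds = 3, // stop after this many scrolls with no new posts
        waitMs = 1000,     // time given to each XHR request to finish
    } = options;

    let previousCount = 0;
    let idleRounds = 0;

    while (idleRounds < maxIdleRounds) {
        // Count the posts currently rendered in the DOM.
        const count = await page.$$eval(postSelector, (els) => els.length);
        if (count >= maxPosts) return count;

        idleRounds = count > previousCount ? 0 : idleRounds + 1;
        previousCount = count;

        // Scrolling to the bottom makes the page request the next batch.
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
    return previousCount;
}
```

The idle-round counter is there because a single scroll that yields nothing may just mean a slow response, while several in a row suggest the end of the feed or a block.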
3) Comments on posts
Every Instagram post has publicly visible comments and shows a Load more comments button if there are more comments that can be shown.
Clicking the button fires an XHR request to Instagram’s GraphQL endpoint. Again, we can easily automate this using Puppeteer’s page.click() function and then extract the content of the comments from the web page.
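A minimal sketch of this click-and-extract loop might look like the following. It assumes page is a Puppeteer Page already showing a post. Both selectors are supplied by the caller because the real ones depend on Instagram’s current markup, and the function name and options are hypothetical.

```javascript
// Sketch: click the "load more" button until it disappears, then read comments.
// ASSUMPTIONS: `page` is a Puppeteer Page; both selectors come from the caller.
async function loadAllComments(page, { buttonSelector, commentSelector, maxClicks = 50, waitMs = 1000 }) {
    for (let clicks = 0; clicks < maxClicks; clicks++) {
        // page.$ resolves to null once the button is no longer in the DOM.
        const button = await page.$(buttonSelector);
        if (!button) break;

        await page.click(buttonSelector);
        // Give the XHR request time to finish and the new comments time to render.
        await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
    // Collect the trimmed text of every comment currently on the page.
    return page.$$eval(commentSelector, (els) => els.map((el) => el.textContent.trim()));
}
```

The maxClicks cap keeps a post with a very long comment thread from holding the browser open indefinitely.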
Data available only after login
Unfortunately, certain content can only be accessed if you’re logged in to your Instagram account, for example:
- List of followers
- List of people a user follows
Although it would be possible to log in to Instagram automatically in order to access this data, this approach is risky, since it can lead to Instagram banning your account. Sure, you could create a fake Instagram account and use that instead, but that’s beyond the scope of this article.
Over time, Instagram has been increasingly limiting the data you can access without login, so you’ll need to test to see what you can scrape.
Creating an Apify actor to scrape the data
To build and bundle the web scraper for Instagram, I’ve created a new actor on Apify. Actors are cloud programs that accept input, perform their job and generate some output. They can be run manually in the app, using the API or scheduler.
The actor is written in Node.js and uses the Apify SDK. As input, it takes an Instagram search query or a list of direct profile URLs. It then runs the search and scrapes page details, posts, or comments from the search results and the direct URLs. All the resulting data is stored in structured form in a dataset, from which you can download it in formats such as JSON, XML, Excel, or CSV.
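To make the input handling more concrete, here is a small sketch of how such input could be turned into a list of work items. The field names search, directUrls, and resultsType are hypothetical and not necessarily the actor’s real input schema. The URL patterns reflect the layout of Instagram’s public website as commonly seen, which is an assumption here and may change.

```javascript
// Sketch: validate actor-style input and plan what to scrape.
// ASSUMPTIONS: field names and URL path patterns are illustrative only.
const RESULT_TYPES = ['details', 'posts', 'comments'];

function classifyUrl(url) {
    const { hostname, pathname } = new URL(url);
    if (!/(^|\.)instagram\.com$/.test(hostname)) return null;

    const parts = pathname.split('/').filter(Boolean);
    if (parts[0] === 'explore' && parts[1] === 'tags' && parts[2]) return 'hashtag';
    if (parts[0] === 'explore' && parts[1] === 'locations' && parts[2]) return 'place';
    if (parts[0] === 'p' && parts[1]) return 'post';
    if (parts.length === 1) return 'profile';
    return null;
}

function planRequests(input) {
    const { search = '', directUrls = [], resultsType = 'posts' } = input;
    if (!RESULT_TYPES.includes(resultsType)) {
        throw new Error(`Unsupported resultsType: ${resultsType}`);
    }
    if (!search.trim() && directUrls.length === 0) {
        throw new Error('Provide a search query or at least one direct URL');
    }

    const requests = [];
    if (search.trim()) {
        requests.push({ kind: 'search', query: search.trim(), resultsType });
    }
    for (const url of directUrls) {
        const pageType = classifyUrl(url);
        if (!pageType) throw new Error(`Unsupported URL: ${url}`);
        requests.push({ kind: 'page', url, pageType, resultsType });
    }
    return requests;
}
```

Each planned request would then be handed to the scraping code, and every scraped item pushed to the dataset.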
The actor is published in Apify Store as Instagram scraper (jaroslavhejlek/instagram-scraper) and you can use it free of charge, although you will need to use residential proxies on Apify Proxy. The source code is available on GitHub — pull requests and ideas for improvement are more than welcome!
Update: in 2021, Instagram changed the rules and now you always need to use a proxy for scraping 😖 The free trial of Apify Proxy given to every new Apify user won’t be enough, as you’ll need to use a residential proxy. But email firstname.lastname@example.org and tell them that you want a free trial of residential proxies so that you can scrape Instagram and they’ll sort you out!
If you run the scraper on Apify without residential proxies, there’s a good chance that Instagram will block access and not return any data, so we strongly recommend using Apify Proxy. You can also run the actor on your local computer — everything should work fine there.