Most seasoned Apify users and members of the scraping world are aware of the power of proxies. Here at Apify, proxies play a huge role in helping us appear as different users, in different locations, on different machines when making hundreds - and sometimes thousands - of requests in a single scraping job. This is done by switching (or "rotating") through a list of proxies every few requests. Making our bots appear human, or as if they are in a different location, helps prevent them from getting blocked. This can even help encourage digital freedoms in countries with restrictive internet policies.
To abstract away the complexities of this task for the ever-growing web scraping community, we developed Apify Proxy, which can automagically rotate through dozens of proxy groups and their corresponding proxies. Some of these proxy groups are free, and some (the super-reliable ones) are premium.
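To make "rotating" concrete, here's a minimal sketch of the idea in JavaScript. It's not Apify Proxy's actual implementation - the proxy list and rotation interval are purely illustrative:

```js
// A minimal sketch of round-robin proxy rotation - not Apify Proxy's actual
// implementation. The proxy URLs and rotation interval are purely illustrative.
const proxies = [
    'http://203.0.113.1:8080',
    'http://203.0.113.2:3128',
    'http://203.0.113.3:80',
];

const ROTATE_EVERY = 5; // switch to the next proxy every few requests
let requestCount = 0;

function nextProxy() {
    // Integer-divide the request counter so each proxy serves ROTATE_EVERY
    // requests before the list advances (wrapping back to the start).
    const index = Math.floor(requestCount++ / ROTATE_EVERY) % proxies.length;
    return proxies[index];
}
```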
All of this is absolutely fantastic, but then I started to ask myself 🤔
"What about all of the free public proxies available on various websites such as Geonode? Those could be utilized with this intelligent proxy-rotating logic as well! The more the merrier, right?"
And so, I began my quest to build a proxy scraper and proxy group entirely from public proxies. I learned two things during this process:
There are a whole lot of public proxies out there 🎉
With one Google search for "free proxy list," you'll quickly discover that there are a great many free public proxies available for use. Sites like Geonode and free-proxy-list.net are among the many that provide large lists of proxies, updated and checked daily.
Proxies from these sites vary in quality based on anonymity, protocol, and speed; however, you can find some really good ones in there!
A whole lot of those proxies don't work 😞
The big bummer is that most of those proxies simply don't work. Either they're down (permanently or temporarily), or they're much too slow to use in any sort of serious project. Because of this, anyone searching through free proxy list websites might have to search for quite a while before finding even one reliable, usable proxy.
Enter: Proxy Scraper 💪
Proxy Scraper is a new free public actor on the Apify platform which takes both of these important factors into account and performs two main tasks in every run (a rough sketch of the testing step follows the list):

1. Scrape all currently available proxies from 17 different free proxy websites and APIs.
   - Each site request is optimized to return the highest-quality proxy results.
2. Once all sites have been scraped, test each proxy by using it to send a request to a user-specified target website.
   - If a request fails (for any reason), the proxy being tested is removed from the final output list.
   - All duplicate proxies are also removed from the list.
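For the curious, here's a rough sketch of what the testing step might look like in JavaScript. It's not the actor's actual source - the HTTP client, timeout, and function names are all illustrative:

```js
// A rough sketch of the testing step - not the actor's actual source code.
// The axios client, timeout value, and function names are all illustrative.
import axios from 'axios';

const TARGET_URL = 'https://example.com'; // the user-specified target website

async function proxyWorks({ host, port }) {
    try {
        // Route a test request through the proxy; any failure disqualifies it.
        await axios.get(TARGET_URL, {
            proxy: { host, port },
            timeout: 10_000, // a proxy that's too slow is as useless as a dead one
        });
        return true;
    } catch {
        return false;
    }
}

async function filterProxies(scraped) {
    // Drop duplicates first (keyed by host:port), then keep only the
    // proxies that successfully completed a request.
    const unique = [...new Map(scraped.map((p) => [`${p.host}:${p.port}`, p])).values()];
    const results = await Promise.all(unique.map(proxyWorks));
    return unique.filter((_, i) => results[i]);
}
```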
With an incredibly simple configuration process, fast runtimes, and reliable output, Proxy Scraper is the best way to quickly obtain a list of working public proxies. It makes retrieving data from free proxy websites much more accessible, as it removes the obnoxious need to manually check each and every one.
How can you use Proxy Scraper to find working public proxies?
Since the end output of the actor is a list of usable proxies, you can use it for any of the many use cases proxies serve. However, there are some specific ways this actor can be utilized:
1. Calling the actor via API within your own project
Within your own project, you can call the Proxy Scraper actor via Apify's API. Once you call the actor and retrieve its output dataset, it's completely up to you what you do with the proxies! One straightforward option is client.actor('mstephen190/proxy-scraper') with the Apify API JavaScript client (recommended if you want a JavaScript API but aren't going to use the SDK for anything else).
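For example (the token is a placeholder, and calling without input assumes the actor's defaults are fine for you):

```js
// Run the actor and fetch its output dataset with the Apify JavaScript client.
// The token is a placeholder; calling without input uses the actor's defaults.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start an actor run and wait for it to finish.
const run = await client.actor('mstephen190/proxy-scraper').call();

// The working proxies end up in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} working proxies`);
```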
2. Running the actor on a schedule
As a step up from #1, you can further automate your proxy retrieval process by setting up a new schedule in your Apify account, configured to call Proxy Scraper at an interval (every hour, every day, etc.) with your own specified input.
We're currently using this method at Apify to run the actor every 30 minutes and save the proxies to a named key-value store, which is used for the FREEPROXIES proxy group in Apify Proxy (which will be available very soon, for free). Here is an example of how we configured Proxy Scraper for our schedule:
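Since the exact configuration lives in the Apify Console, here's just a sketch of the input. Only pushToKvStore and the default store name free-proxy-store are mentioned in this post; the other field names are assumptions, so check the actor's input schema:

```js
// A sketch of the input we might use on such a schedule. Only pushToKvStore
// and the default store name "free-proxy-store" appear in this post; the
// other field names are assumptions, so check the actor's input schema.
const input = {
    pushToKvStore: true,               // write results to a named key-value store
    kvStoreName: 'free-proxy-store',   // assumed field name; matches the default store name
    testTarget: 'https://example.com', // assumed field name; site used to test each proxy
};
```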
It's important to note that, in order to reap the full rewards of running this actor on a schedule, you should set pushToKvStore to true within the "Storages" configuration. This means that every time your custom schedule runs, the previous proxy data in your named key-value store will be overwritten with the most up-to-date working public proxies. You can learn more about the difference between named and unnamed storages in Apify's Storage Documentation.
IMPORTANT: When you choose to push to a key-value store, it will always be a named store. The default name is free-proxy-store. Additionally, by using the key-value store, you are given access to the data not only in JSON format, but also in text/plain format.
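If you'd like to read the stored proxies back programmatically, a sketch with the JavaScript client could look like this (the record key and username are assumptions - check your store in the Apify Console for the actual key):

```js
// Read the stored proxies back with the Apify JavaScript client. The record
// key 'OUTPUT' and 'YOUR_USERNAME' are placeholders/assumptions.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Named stores can be referenced as 'username/store-name'.
const store = client.keyValueStore('YOUR_USERNAME/free-proxy-store');

const record = await store.getRecord('OUTPUT'); // assumed record key
console.log(record?.value);
```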
What's next for Proxy Scraper?
In the future, Proxy Scraper will ideally scrape proxies from more than just the 17 sources currently being used. At the moment, it finds anywhere from 20 to 60 reliable proxies out of the 2,500 it scrapes in every run (which really drives home the point that most of them don't work). Scraping more proxies would mean more overall results.
Later versions will feature more fleshed-out dataset results. Rather than just returning each proxy's host and port, the protocol and country will also be part of every result object.
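For illustration, a future result object might look something like this (field names are assumptions about a version that doesn't exist yet):

```js
// What a future result object might look like - protocol and country are the
// planned additions; all field names here are assumptions.
const exampleResult = {
    host: '203.0.113.42', // documentation-range IP, purely illustrative
    port: 8080,
    protocol: 'http', // planned addition
    country: 'US',    // planned addition
};
```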