We're Apify - a full-stack web scraping and browser automation cloud platform for developers. An essential part of that is Apify Proxy, which improves the performance of your web scrapers by smartly rotating datacenter and residential IP addresses.
For web scraping developers, proxies need no introduction. So, I'll spare you the typical "What are proxies?" preamble. What interests us here is why and how you need to rotate those proxies when extracting data from the web.
What are rotating proxies?
Rotating proxies are dynamic proxy servers that change the source IP address for each new request or after a set number of requests. They're used to evade detection, manage request rates, and access restricted content without facing blocks or throttling.
Rotating proxies and web scraping
If you're serious about web scraping on a medium to large scale, rotating proxies are indispensable. To avoid IP blocks and bypass CAPTCHAs, you need to spread requests across multiple IP addresses and handle rate-limiting, which is a common anti-scraping technique where websites limit the number of requests from a single IP.
Do you need to locate and compile lots of public data from a popular e-commerce or social media website? Good luck doing that without IP rotation!
We're not talking about collecting private data you're not supposed to access, but publicly available data for legitimate web scraping use cases: data analysis, market research, price monitoring, and whatever else you can think of.
No matter how legal or ethical your reasons, if you're trying to collect vast amounts of publicly available information from any large or complex website, you're almost certain to get blocked by website protections if you don't rotate proxies.
Different types of rotating proxy
There are two main types of proxy for web scraping: datacenter proxies (shared, dedicated, and other kinds) and residential proxies (including static and mobile proxies). Let's find out more about these and when to use them.
Datacenter proxies
Datacenter proxies are a fast and cheap way to mask your real IP address. Your request to a website will go through a server in a data center, and the target website will see the data center’s IP address instead of yours. This makes them easier to identify than residential proxies, which are installed on end-user devices like mobile phones, laptops, or televisions.
Datacenter proxies are faster, cheaper, and more stable than other proxy types, so you should always use them first. Only use residential proxies when there's no other way - when datacenter proxies are heavily blocked by the website you want to scrape.
A single request won't tell you whether datacenter proxies will work for your target site. We recommend testing them with at least 50 requests per IP.
To find out whether a datacenter proxy will do the trick, you should try scraping the target website using a datacenter proxy provider. To do that, you'll have to create an account with such a provider (we recommend these proxy services) and then find a connection URL in their documentation.
For example, for Apify, the connection URL looks like this:
http://auto:p455w0rd@proxy.apify.com:8000
A proxy with a URL like this is usually called a super proxy because it can automatically rotate IP addresses (notice auto in the URL), change location, and more.
Other providers might give you access in a different way, using proxy server IP addresses, which would look like this:
http://username:p455w0rd@59.82.191.190:8000
That usually gives you access to a single specific IP address (59.82.191.190 in this example). No automatic rotation and no location changes. To make use of such proxies, you need to get lots of them and rotate them in your applications.
Here are a few examples of using a proxy with common HTTP clients. We're using an Apify Proxy URL, but the setup looks very much the same for any other provider.
curl "https://example.com" \
--proxy "http://auto:p455w0rd@proxy.apify.com:8000"
import { gotScraping } from 'got-scraping';
const response = await gotScraping({
url: 'https://example.com',
proxyUrl: 'http://auto:p455w0rd@proxy.apify.com:8000',
});
import requests

# Proxy credentials belong in the proxy URL itself; the auth argument of
# requests.get() authenticates against the target site, not the proxy.
proxy_servers = {
    'http': 'http://auto:p455w0rd@proxy.apify.com:8000',
    'https': 'http://auto:p455w0rd@proxy.apify.com:8000',
}

response = requests.get('https://example.com', proxies=proxy_servers)
With the examples and providers above, you can test whether you can scrape your target website using a datacenter proxy. If it doesn't work or works only for a few requests, don't worry. It's still possible to make it happen. You can find out how in Datacenter proxies: when to use them and how to make the most of them.
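If you want a quick sanity check along the lines of the 50-requests-per-IP rule of thumb above, a rough sketch like this (using got-scraping and an Apify-style proxy URL; the target page is a placeholder) will tell you how many requests actually get through:
import { gotScraping } from 'got-scraping';

const proxyUrl = 'http://auto:p455w0rd@proxy.apify.com:8000';
let succeeded = 0;

for (let i = 0; i < 50; i++) {
    try {
        const { statusCode } = await gotScraping({
            url: 'https://example.com',
            proxyUrl,
            throwHttpErrors: false,
        });
        // Treat anything other than 200 OK as a likely block or challenge page.
        if (statusCode === 200) succeeded++;
    } catch {
        // Connection errors also count as failures for this rough check.
    }
}

console.log(`${succeeded} of 50 requests succeeded through the proxy`);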
Residential, static, and mobile proxies
Datacenter proxies are easier and cheaper to start with, but they're also the most likely to get blocked and blacklisted. For websites with advanced bot detection mechanisms, you'll have to use either residential or mobile proxies. These are also less likely to be served a CAPTCHA.
Residential proxies use IP addresses sourced from ISPs, linking them to real-world locations and devices across various regions. This connection to genuine residential networks renders these proxies virtually undetectable as non-human traffic. Such proxies are particularly valuable for rotating IP addresses during data extraction. However, the cost associated with acquiring these IPs from ISPs means that residential proxies can be expensive to maintain.
Specialized variations of residential proxies include static and mobile proxies, each carrying a higher price tag.
Static proxies maintain a constant IP address tied to a residential location, which provides consistency for tasks requiring persistent identification, like managing accounts on social media platforms.
Mobile proxies, which utilize cellular networks like 3G, 4G, or 5G, offer a dynamic solution for geolocation changes. This enhances their authenticity and makes them less prone to detection by sophisticated anti-scraping technologies.
Nonetheless, because residential and mobile proxies are hosted on real human devices, they can be switched off at any time. That increases the number of connection errors and makes them considerably more expensive to run.
The solution to this problem? Proxy rotation.
So, let's get into how you should use rotating proxies.
How to use rotating proxies
Web scrapers with proxy rotation switch the IP addresses they use to access websites. This makes the requests seem less suspect, as they appear to be coming from different users rather than one user accessing a thousand pages.
Here's an example of adding an authenticated proxy with Got-Scraping (an HTTP client for Node.js built on Got).
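A minimal sketch, assuming Apify Proxy's username parameters (groups-RESIDENTIAL and country-US) carry the configuration and p455w0rd stands in for your real proxy password:
import { gotScraping } from 'got-scraping';

// groups-RESIDENTIAL and country-US are Apify Proxy username parameters;
// other providers configure residential pools and geolocation differently.
const response = await gotScraping({
    url: 'https://example.com',
    proxyUrl: 'http://groups-RESIDENTIAL,country-US:p455w0rd@proxy.apify.com:8000',
});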
In this example, only proxies from the US (country-US) will be chosen, and groups-RESIDENTIAL means only residential proxies will be selected.
While this example is based on Apify's super proxy, other providers' configurations are very similar. Using such a configurable proxy is much more convenient than juggling conventional IP lists.
IP session management and smart proxy rotation
While proxies can shift your geographical footprint rapidly, they don't inherently make your scraping activity look human. The key to more advanced scraping lies in IP session management.
IP sessions allow you to maintain the same IP address for a controlled number of requests, mirroring actual user behavior more closely. Unlike constant IP hopping, which can raise red flags for sophisticated site defenses, IP sessions provide a balance by reusing an IP just enough before transitioning. This strategy prevents single IPs from being overly exposed and potentially blacklisted due to excessive or suspect requests.
Effective management involves evenly distributing the load across your IP pool, retiring IPs that encounter rate limits or errors, and crafting combinations of IPs, headers, and cookies that simulate real user sessions.
This process can be intricate and requires the use of advanced tools or libraries like Crawlee or Got-Scraping, which are designed to streamline these aspects of web data extraction.
By setting up proxy sessions, you can "lock" an IP address for a while, giving you more granular control over your scraping activities. It also ensures your requests are spread out across sessions, reducing the chance of hitting rate limits or being flagged as suspicious.
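As a rough illustration, Apify Proxy lets you pin a session by adding a session- parameter to the proxy username (the session name below is made up, and other providers expose "sticky sessions" through their own parameters):
import { gotScraping } from 'got-scraping';

// Requests that reuse the same session name keep the same IP address
// for as long as that IP remains available.
const response = await gotScraping({
    url: 'https://example.com',
    proxyUrl: 'http://session-my_session_1,country-US:p455w0rd@proxy.apify.com:8000',
});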
Got-Scraping facilitates the creation of browser-realistic headers and the integration of proxy sessions, while Crawlee offers an even broader set of features. It can scrape using both HTTP requests and headless browsers and automatically handles proxy sessions, headers, and cookies using its SessionPool class:
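A minimal sketch of that with Crawlee's CheerioCrawler might look like this (the proxy URL and target page are placeholders):
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://auto:p455w0rd@proxy.apify.com:8000'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // With the session pool enabled, Crawlee rotates sessions, retires blocked
    // ones, and keeps cookies tied to each session automatically.
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, $ }) {
        console.log(`Title of ${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);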
What I want you to take from this article
Rotating proxies are a basic requirement for any serious web scraping project.
However, advanced web scraping requires implementing smart IP rotation and session management. That will make your scraper's activities appear more human and less likely to trigger anti-scraping defenses.
Crawlee simplifies and automates these advanced strategies.
I don't have to spell it out for you, do I?
Check out the Crawlee documentation and give it a go!