How anti-scraping protections work and how they get bypassed

Websites have protections to stop malicious bots. But not all bots are created equal. Find out about the methods used to avoid being blocked while scraping the web legally and ethically.


The internet is the greatest repository of human knowledge in existence. At the beginning of 2021, there were approximately 4.66 billion active internet users worldwide, accounting for 59.5 percent of the global population.

Browsing and collecting data from the web has become an integral part of our lives, to the extent that we rarely stop to think how time-consuming and inefficient this activity can be when carried out manually.

Luckily, there are tools, such as web scraping bots, that we can use to automate most of these activities and drastically increase their efficiency. However, despite their clear benefits, bots can also be controversial.

This article will explore some of the techniques used to bypass anti-bot protections and avoid being blocked while legally scraping data from the web.

Good bots vs. bad bots


Bots already account for a significant part of the web traffic on the internet.

Good bots often perform helpful tasks on a website that aren't detrimental to the user's experience on the page. Examples of "good bots" are search engine bots, site monitoring bots, chatbots, etc. In short, good bots are those authorized by the website.

Statista - Distribution of bot and human web traffic worldwide in 2019 and 2020

Additionally, good bots will follow the rules described in the robots.txt file, a set of directives for bots accessing the hosted website or application. For example, if a website doesn't want a certain page to show up in Google search results, it can add a rule to the robots.txt file, and Google's crawler bots won't index that page. However, the rules in robots.txt are not "laws" and cannot be enforced, meaning that bad bots often disregard them.

robots.txt file example from google.com
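As a minimal illustration, a robots.txt file might contain rules like the following (the paths and sitemap URL are hypothetical):

```
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /internal-search/

Sitemap: https://www.example.com/sitemap.xml
```

The first group applies to all bots, while the second targets Google's crawler specifically. Again, nothing technically stops a bot from ignoring these rules.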

The so-called "bad bots" are not just bots that conduct harmful activity against a website and its users. Rather, any bot that the website owner does not authorize is categorized as a "bad bot," including web scrapers and automation bots.

Unfortunately, some bots really are created for dishonorable purposes. Just like any other tool in the hands of ill-intentioned individuals, bots can be used for unethical and illegal activities, like coordinating brute-force attacks and stealing users' private information.

Despite being deemed "bad bots", web scrapers are not intended to cause harm. Instead, they are often used for conducting perfectly legal activities such as automating repetitive workflows and extracting publicly available data.

Nevertheless, the abuse of web scraping and automation bots forces many service providers to adopt strict protective measures to prevent malicious bots from wreaking havoc on their servers.

As a side effect of these protective methods, well-intentioned bots also end up blocked from accessing websites, which makes the development of bots more challenging and expensive.

How servers identify bots


The first step in overcoming anti-scraping protections is understanding how bots are detected. Service providers use a variety of techniques to detect bots, monitoring traffic for non-human behavioral patterns and collecting data to build statistical models that can identify them.

I am not a robot

IP rate limit

Bots, unlike human users, are capable of sending a large number of requests from a single IP address in a short period of time. Websites can easily monitor this unnatural occurrence and, if the number of requests exceeds a specified limit, the website can block the suspicious IP address or require a CAPTCHA test.

The "IP rate limiting" bot-mitigation method restricts the network traffic that can be generated by a unique IP address, reducing the strain on web servers and blocking the activity of potentially malicious bots. This method is particularly effective at stopping web scraping, DDoS, and brute-force attacks.
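To make this concrete, here is a minimal sketch of the kind of sliding-window rate limiter a server might run per IP address. The 100-requests-per-minute limit is an arbitrary assumption for illustration:

```javascript
// Minimal sliding-window rate limiter, keyed by IP address.
// The window size and request limit are illustrative assumptions.
const WINDOW_MS = 60 * 1000; // 1-minute window
const MAX_REQUESTS = 100;    // max requests per IP per window

const requestLog = new Map(); // ip -> array of request timestamps

function isAllowed(ip, now = Date.now()) {
  const timestamps = requestLog.get(ip) || [];
  // Drop timestamps that have fallen out of the window.
  const recent = timestamps.filter((t) => now - t < WINDOW_MS);
  recent.push(now);
  requestLog.set(ip, recent);
  return recent.length <= MAX_REQUESTS;
}
```

A request that pushes an IP over the limit would typically be answered with an HTTP 429 response or a CAPTCHA challenge instead of the page content.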

HTTP request analysis

HTTP requests can be defined as the way in which web browsers ask for the information they need to load a website.

Each HTTP request sent from a client to a web server carries a series of encoded data containing information about the client requesting the resource, such as HTTP headers and the client's IP address.

The information contained in an HTTP request can be crucial for identifying bots, as even the order of the HTTP headers can tell whether the request comes from a real web browser or a script.

The most commonly known header element that aids in detecting bots is the user-agent, which specifies what type of browser the client is using and its version.
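As a toy illustration of header analysis, a server-side check might compare the order of incoming header names against the order a real browser typically sends them in. The reference order below is a simplified assumption, not an exact browser signature:

```javascript
// Toy header-order check: verify that the headers a client sent appear
// in the same relative order a real browser would send them.
// The reference list is a simplified assumption, not an exact signature.
const BROWSER_HEADER_ORDER = [
  'host',
  'connection',
  'user-agent',
  'accept',
  'accept-encoding',
  'accept-language',
];

function looksLikeBrowser(sentHeaderNames) {
  const names = sentHeaderNames.map((h) => h.toLowerCase());
  // Keep only the headers we have a reference position for...
  const known = names.filter((n) => BROWSER_HEADER_ORDER.includes(n));
  // ...and compute the order a browser would have sent them in.
  const expected = BROWSER_HEADER_ORDER.filter((n) => known.includes(n));
  return known.join(',') === expected.join(',');
}
```

A plain HTTP script that sets headers in an unusual order (or omits common ones entirely) would fail a check like this even with a convincing user-agent string.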


User behavior analysis

Unlike previous methods, behavior analysis does not aim to identify bots in real-time but instead collects user behavior data over longer periods to identify specific patterns that may only be apparent once sufficient information is available.

The collected information can contain data such as the order in which pages are visited, how long the user stays on each page, mouse movements, and even how fast forms are filled in. If enough evidence indicates that the user’s behavior is not human, the client's IP address can be blocked or submitted to an anti-bot test.
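As a simplified sketch of such analysis, a behavioral check might flag a session whose requests arrive at suspiciously regular intervals, a pattern human browsing rarely produces. The variance threshold is an arbitrary assumption:

```javascript
// Toy behavioral check: humans browse with irregular timing, while naive
// bots often fire requests at near-constant intervals. The variance
// threshold is an arbitrary assumption for illustration.
function looksAutomated(requestTimestamps, minVariance = 250000) {
  if (requestTimestamps.length < 3) return false; // not enough evidence
  const gaps = [];
  for (let i = 1; i < requestTimestamps.length; i++) {
    gaps.push(requestTimestamps[i] - requestTimestamps[i - 1]);
  }
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const variance =
    gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length;
  return variance < minVariance; // near-constant gaps look bot-like
}
```

Real systems combine many such signals (mouse movement, scroll behavior, form-fill speed) over long periods rather than relying on any single heuristic.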

Browser fingerprinting

Browser fingerprinting is the term used to describe the tracking techniques employed by websites to collect information on the users accessing their servers.

Most modern website functions require the use of scripts, which can silently work in the background to collect extensive data about the user's device, browser, operating system, installed extensions, time zone, etc. This collected data, when combined, forms a user's unique online “fingerprint,” which can then be traced back to the same user across different websites and browsing sessions.

Fingerprinting is an effective method used to identify bots, and, to make things even more complicated, websites tend to employ various bot mitigation techniques. Nevertheless, even the most ingenious anti-scraping methods can be bypassed, as we will see further in this article.

Web scraping: how to crawl without getting blocked
Guide on how to solve or avoid anti-scraping protections.

How to bypass protections


Now that we know how websites identify bots and use countermeasures to prevent bots from accessing them, we can explore how these protections are in turn evaded by bots.

Robot bypassing protections

Simulating browser headers

As previously mentioned, anti-scraping protections check HTTP request headers to evaluate whether requests are coming from a real browser; if not, the suspicious IP address can be blocked.

To avoid this protection, a bot must send a header structure that matches its declared user-agent (check your user-agent).

One simple way of bypassing this restriction is to start a browser with a predefined user-agent header, like in the Puppeteer example below.

const puppeteer = require('puppeteer');

// Inside an async function:
const browser = await puppeteer.launch({
	headless: true,
	args: [
    	`--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36`
	]
})


Since the request headers now match those of a real browser, the protection is reluctant to block the request, as it wants to preserve human traffic and avoid false positives. However, you must be aware of other protections, such as the IP rate limit, which can still get the bot's IP address blocked if it surpasses the request limit.

Rotating IP addresses

Proxies are the holy grail of scraping. There are two main types of proxies: datacenter and residential proxies.

As their name suggests, datacenter proxies are hosted in datacenters and usually share an IP range. Residential proxies, on the other hand, are hosted on home machines and devices.

Both types have their pros and cons. For instance, residential proxies are less likely to be served a CAPTCHA. However, because these proxies are hosted on real users' devices, which can be switched off at any time, they suffer more connection errors and cost more than the datacenter option.

Did you know that CAPTCHAs are bad UX and can be easily bypassed? 🤔

Why CAPTCHAs are bad UX and how they get bypassed

Datacenter proxies are stable and cheaper. Nevertheless, the disadvantage is that some IP ranges can be publicly available and automatically blacklisted by protections. For example, if the proxy server is hosted on Amazon Web Services (AWS), the bot might be immediately identified because the AWS IP ranges are known.

The solution to this problem is to rotate IP addresses from which requests are sent to the target websites. This can be done by using a pool of proxy servers and assigning each request to a proxy server from the pool, thus making it look like the requests are coming from different users.
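A minimal rotation scheme can be as simple as cycling through a pool round-robin. The proxy URLs below are placeholders:

```javascript
// Minimal round-robin proxy rotation. The proxy URLs are placeholders.
const PROXY_POOL = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
  'http://proxy3.example.com:8000',
];

let nextIndex = 0;

function nextProxy() {
  const proxy = PROXY_POOL[nextIndex];
  nextIndex = (nextIndex + 1) % PROXY_POOL.length;
  return proxy;
}

// Each request would then be routed through nextProxy(), e.g. via an
// HTTP client's proxy option or Puppeteer's --proxy-server argument.
```

Production setups usually go further, retiring proxies that start returning errors or CAPTCHAs and weighting the rotation accordingly.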

The effectiveness of this method depends on various factors, such as the number of web pages that are being scraped, the sophistication of the scraping protection, and the number and type of proxies used. If too many requests are sent from a single proxy in too short a period of time, the proxy might get “burned,” which means all further requests from it are blocked.

The quality and quantity of proxies in a proxy pool can heavily influence the success rate of the scraping bot. That is why Apify Proxy provides access to an extensive pool of residential and datacenter IP addresses to find the right balance between performance and cost.

Want to dive deeper into the most advanced anti-scraping mitigation techniques using the Apify SDK?

Check this article and never get blocked again 👨‍💻

Bypassing IP rate limiting protection

Rotating IP addresses is one way that IP rate limits are bypassed, but it isn't the only method available.

Limiting the number of pages scraped concurrently on a single site, and adding intentional delays between requests, is another way of keeping the bot's request rate under the limit and avoiding being blocked.

Apify Actors are designed to reduce the workload on websites being scraped. To lower the concurrency when using the Apify SDK, just pass the maxConcurrency option to your crawler's setup. Alternatively, if you use Actors from Apify Store, you can usually set the maximum concurrency in the Actor's input.

Reducing blocking with shared IP address emulation

IP address rotation and emulation of browser HTTP signatures can be effective for most web scraping tasks, but large-scale crawls will eventually get blocked. Using more proxies might help, but the costs will also increase substantially. Another option is shared IP address emulation.

Shared IP address emulation can dramatically increase the effectiveness of large-scale scraping. The technique relies on the fact that websites know multiple users can legitimately sit behind a single IP address.

Shared IP address emulation vs. IP rotation
Performance comparison - IP rotation vs. Shared IP address emulation

For example, requests from mobile devices are usually routed through only a handful of IP addresses, while users behind a single corporate firewall might all share the same IP address. By emulating and managing these user sessions per IP address, it is possible to prevent aggressive blocking by websites.

To make it work, a single user session has to always be routed via the same IP address. A website can identify such user sessions based on cookies, authentication tokens, or a browser HTTP signature/fingerprint.
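The pinning itself can be sketched as a simple map from session ID to proxy: once a session is assigned an IP address, every later request for that session reuses it. Session IDs and proxy URLs here are placeholders:

```javascript
// Sketch of session-to-IP pinning: each user session is assigned a proxy
// once and keeps it for all subsequent requests. URLs are placeholders.
const PROXIES = [
  'http://mobile-proxy-1.example.com:8000',
  'http://mobile-proxy-2.example.com:8000',
];

const sessionToProxy = new Map();

function proxyForSession(sessionId) {
  if (!sessionToProxy.has(sessionId)) {
    // New session: assign the least-loaded proxy so sessions spread evenly.
    const counts = PROXIES.map(
      (p) => [...sessionToProxy.values()].filter((v) => v === p).length,
    );
    const index = counts.indexOf(Math.min(...counts));
    sessionToProxy.set(sessionId, PROXIES[index]);
  }
  return sessionToProxy.get(sessionId);
}
```

Because every request for a given session (its cookies, tokens, and fingerprint) leaves from the same IP address, the traffic looks like many distinct users sharing a few gateways rather than one bot rotating addresses.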

You can easily benefit from this functionality by using the Apify SDK's SessionPool class. It can be combined with other Apify tools such as Actors and Apify Proxy, but it also works outside the Apify ecosystem.

The Onion Router (Tor)

The (in)famous Tor is free and open-source software for enabling anonymous communication.

Tor routes traffic through multiple connected proxy nodes, and the exit IP address changes every time the connection circuit is refreshed. In other words, Tor can act as a free proxy. Since web scraping is a legal activity, there is no need to use Tor for its original purpose of hiding the user's identity and avoiding being tracked.

Tor browser logo

In any case, there are two major downsides to using Tor for web scraping purposes.

The first is that the list of Tor exit node IP addresses is publicly available, which means these addresses can be easily blocked.

The second downside is ethical. Tor was originally designed to protect people's privacy and make independent media available to people in countries where the government censors and controls the media. Therefore, using Tor for bot-related activities, not just scraping, can get its IP addresses blacklisted, which also blocks the real users who depend on it from accessing webpages.

The legality of bypassing website protections


Defining bots as being either "good" or "bad" can be misleading. Web scraping and browser automation activities are completely legal, provided that they respect the boundaries of personal data and intellectual property regulations. Bypassing website protections is not illegal, as long as your reasons for doing so are also legal.

However, keeping scraping solutions from getting blocked can be extremely time-consuming and challenging.

Scraping for data is often just a small piece of a more complex task you need to accomplish. Diverting your focus from your main goal to collect data efficiently can be a major drawback to your project. To avoid this, you can submit a custom solution request to Apify and let us handle all the web scraping challenges while freeing up your time to spend on things that matter.

Percival Villalva
Developer Advocate on a mission to help developers build scalable, human-like bots for data extraction and web automation.

Get started now

Step up your web scraping and automation