It has been said that web scraping (now a widespread automated method of data extraction) is as old as the web itself. That's not entirely accurate, though. The World Wide Web was created in 1989, the first web browser in 1991, and the first web crawler in 1993. That crawler, the first web robot, was known as the Wanderer (or World Wide Web Wanderer, for those who love alliteration), and its purpose was to measure the size of the web. Later that year, JumpStation, the first search engine to rely on a crawler, was unleashed upon the world.
At the turn of the century, the first web API and API crawler were created. And shortly after the 2004 launch of Beautiful Soup, a popular HTML parser written in the Python programming language, web scraping as we know it today was born.
Since then, web scraping has gone from being an outlier to part of the technological stack of just about any business that deals with big data. Real estate, e-commerce, marketing and media, research and education, AI and machine learning: data is the axis upon which these worlds turn. Without web scraping, there's no way people working and moving in these fields could retrieve and store the unfathomable amount of digital information needed to make smart decisions or feed the tools of their industry.
Is web scraping legal?
Although web scraping has become integral to so many businesses, a common question is whether it's legal. The short answer is 'yes', but that's not to say there are no limits. Just as driving is legal, but exceeding the speed limit is not, so too is web scraping legal, provided you don't breach laws regarding things like personal data or copyright. If you want to dive into more detail about the legal limits and ethical principles of web scraping, you can find out more in the articles below.
I began with the most common definition of web scraping: an automated method of web data extraction. But as ubiquitous as that definition is, it's an oversimplification. The data extraction part is pretty easy. The tricky part is all the hurdles you have to overcome to get to the data in the first place. A simple manual copy-and-paste job (the most primitive form of web scraping) is easy enough, but today developers and businesses need to extract and process vast quantities of data from thousands or even millions of web pages a day. For that, you need bots to open websites and retrieve the content relevant to your purpose. As data generation continues to increase exponentially and website security measures improve, getting your bot minions to access and copy that information is becoming more and more challenging.
So, is web scraping difficult? The answer is yes and no. Collecting data from the web is simple, but getting web scraping bots to open sites and extract data on a large scale over an extended period of time requires a fair bit of cunning wizardry. So, with that said, let's take a quick look at the most common challenges of web data extraction today.
Websites often take protective measures against bots as a safeguard against malware and other foul malevolent creatures of the web. As a consequence, you're likely to get blocked when scraping if you send a large number of requests from a single IP address in a short span of time. Alternatively, you might receive a CAPTCHA test (and we all know how much bots suck at those).
Why might this happen? Because human beings aren't capable of sending hundreds of requests from a single IP address very quickly. Such inhuman behavior will inevitably lead to your bot getting detected. It doesn't matter that your bot works for good rather than evil. Websites don't usually discriminate: a bot is a bot. It's even possible that your IP address gets blocked before you begin if it has been marked or blacklisted for past activities.
A solution to this problem is a cloud-based web scraper that sends each request with a different IP address, and for that, you'll need a proxy provider.
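For illustration, here's a minimal sketch of proxy rotation using the Crawlee library (covered in more detail later in this article); the proxy URLs are placeholders you'd replace with the ones from your provider:

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs - substitute the ones supplied by your proxy provider
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:password@proxy-1.example.com:8000',
        'http://user:password@proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration, // requests are distributed across the rotating proxies
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);
```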
Anti-scraping protections check HTTP request headers to determine whether the requests are coming from a real browser. If they're not, they'll be marked as suspicious, and the IP address will get blocked.
To bypass this protection, a bot's headers must be consistent with the user-agent it claims to be. A simple way of ensuring that is to start the browser with a predefined user-agent header, as in this minimal Puppeteer sketch:
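```js
import puppeteer from 'puppeteer';

// Example user-agent string - substitute whichever browser profile you want to emulate
const userAgent =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

const browser = await puppeteer.launch({
    args: [`--user-agent=${userAgent}`], // start Chrome with the predefined user-agent
});
const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.evaluate(() => navigator.userAgent)); // confirms the override
await browser.close();
```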
With the recent advances in online privacy, user-agent strings are being phased out because they contain information that can be used to identify people. Nonetheless, you can still get all the same information and much more with browser fingerprinting.
Browser fingerprinting is a technique that can be used to identify a web browser. It involves collecting information about the browser and the device it's running on and creating a unique "fingerprint" based on this information. This fingerprint can then be used to track a user's activity across different websites and devices and can also identify whether a browser is a bot or a real user. That’s why changing browser fingerprints when scraping significantly reduces blocking.
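To make that concrete, here's a sketch of a few of the signals a fingerprinting script might read in the browser. These are all standard web APIs; real fingerprinting scripts combine dozens more, such as canvas and WebGL rendering or installed fonts:

```js
// A handful of the signals a browser fingerprint is typically built from
const signals = {
    userAgent: navigator.userAgent,
    language: navigator.language,
    hardwareConcurrency: navigator.hardwareConcurrency, // number of CPU cores
    screenResolution: `${screen.width}x${screen.height}`,
    colorDepth: screen.colorDepth,
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    touchSupport: 'ontouchstart' in window,
};
// Hashing values like these together yields a near-unique identifier
console.log(signals);
```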
The most effective solution is to change and customize browser fingerprints automatically with an open-source web scraping library like Crawlee.
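In Crawlee's browser crawlers, fingerprint generation is switched on by default and can be customized. A minimal sketch, assuming a Playwright-based crawler:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // the default for browser crawlers
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // constrain the generated fingerprints to these profiles
                browsers: ['chrome', 'firefox'],
                devices: ['desktop'],
                operatingSystems: ['windows', 'linux'],
            },
        },
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
```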
‘Tools’ is a very broad term when it comes to web scraping and could refer to libraries and frameworks, HTTP clients and parsers, or pre-built web scrapers that require little to no coding knowledge to use. Let’s go through a few of these.
Requests
Requests is a popular Python library that lets you easily send HTTP requests. You need to download the HTML with an HTTP client like Requests before you can parse it with something like Beautiful Soup. With Requests, you don’t need to manually add query strings to your URLs or form-encode POST data.
Beautiful Soup
The preferred choice of web scraping tool for beginners, Beautiful Soup is a Python library for extracting HTML and XML elements from a web page with just a few lines of code. It’s great for tackling simple tasks with speed.
Scrapy
Powerful but notoriously difficult to learn, Scrapy is a full-fledged web scraping framework that can use the previously mentioned libraries to extract data from web pages. It allows you to scrape multiple pages in parallel and export the data into a format of your choice.
Selenium
Primarily developed for web testing, Selenium is a browser automation tool that’s widely used for web scraping. It’s especially popular among Pythonistas, but it’s also fully implemented and supported in JavaScript (Node.js), Ruby, Java, Kotlin, and C#. Selenium uses the WebDriver protocol to control headless browsers, and its ability to render JavaScript on a web page makes it helpful for scraping dynamic pages (most modern websites use JS to load content dynamically).
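As a taste, here's a minimal Node.js sketch using the selenium-webdriver package (assuming a local Chrome installation):

```js
import { Builder, By, until } from 'selenium-webdriver';

const driver = await new Builder().forBrowser('chrome').build();
try {
    await driver.get('https://example.com');
    // Wait for dynamically rendered content before reading it
    await driver.wait(until.elementLocated(By.css('h1')), 10_000);
    const heading = await driver.findElement(By.css('h1')).getText();
    console.log(heading);
} finally {
    await driver.quit();
}
```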
Playwright
A more versatile version of Puppeteer (which is only good for JavaScript and TypeScript), Playwright provides cross-browser and cross-language support and has rapidly become a favored web scraping tool due to its versatility, ease of use, and auto-awaiting function.
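A minimal sketch of what scraping with Playwright looks like in Node.js:

```js
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
// Playwright auto-waits for the selector before reading its text
const heading = await page.textContent('h1');
console.log(heading);
await browser.close();
```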
Learn more about how to scrape the web with Playwright:
Tools for web scraping with JavaScript and TypeScript
Got Scraping
Got Scraping is a modern package extension of the Got HTTP client used to send browser-like requests to a server. This turns the scraping bot into a cunning ninja that blends in with website traffic, making it less likely to get detected and blocked. Got Scraping addresses common drawbacks in modern web scraping by offering built-in tools to avoid website anti-scraping protections.
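A minimal sketch; the headerGeneratorOptions shown are optional and simply constrain the browser-like headers that Got Scraping generates automatically:

```js
import { gotScraping } from 'got-scraping';

const { body, statusCode } = await gotScraping({
    url: 'https://example.com',
    // Optional: shape the automatically generated browser-like headers
    headerGeneratorOptions: {
        browsers: [{ name: 'chrome' }],
        devices: ['desktop'],
    },
});

console.log(statusCode); // the request went out with realistic browser headers
```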
Cheerio
Cheerio is an HTML and XML parser for JavaScript that loads markup from a string and returns an object you can query to extract data. Because it runs directly in Node.js without a browser, it tends to be faster and more efficient than browser-based web scraping tools. However, because it doesn’t render pages the way a browser does, it can’t execute JavaScript, produce visual rendering, apply CSS, or load external resources.
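A minimal sketch of parsing an HTML string (fetched beforehand with an HTTP client such as Got Scraping) with Cheerio:

```js
import * as cheerio from 'cheerio';

const html = '<ul><li class="item">One</li><li class="item">Two</li></ul>';
const $ = cheerio.load(html); // parse the HTML string
// Query it with familiar jQuery-style selectors
const items = $('li.item').map((_, el) => $(el).text()).get();
console.log(items); // ['One', 'Two']
```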
Puppeteer
Puppeteer is a Node.js library that provides a high-level API to manipulate headless Chrome programmatically, and can also be configured to use a headful browser. Like Playwright, its ability to emulate a real browser allows it to render JavaScript. It can scrape single-page applications, take screenshots and generate PDFs of pages, and automate manual user interactions.
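For instance, a short sketch that loads a JavaScript-rendered page and captures both a screenshot and a PDF:

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
// networkidle2 waits until the page has (mostly) finished loading dynamic content
await page.goto('https://example.com', { waitUntil: 'networkidle2' });
await page.screenshot({ path: 'page.png' });
await page.pdf({ path: 'page.pdf' }); // PDF generation requires headless mode
await browser.close();
```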
Crawlee
Crawlee is an open-source Node.js web scraping and automation library that excels at dealing with modern website anti-bot defenses and offers a complete collection of tools for data extraction and browser automation. It builds on top of many of the aforementioned web scraping tools to enhance performance and seamlessly integrate storage and proxy rotation. Crawlee works on any system and can be used as a standalone or as a serverless microservice on the Apify platform.
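A minimal Crawlee crawler, closely following the shape of its quick-start example:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, pushData }) {
        const title = await page.title();
        await pushData({ url: request.loadedUrl, title }); // saved to the default dataset
        await enqueueLinks(); // follow links discovered on the page
    },
    maxRequestsPerCrawl: 20, // keep the demo small
});

await crawler.run(['https://crawlee.dev']);
```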
Pre-built web scraping tools
Speaking of Apify, the platform also provides a range of ready-made scrapers that make developers’ lives easier and, in many cases, require no coding skills at all. There are two types of such web scrapers: universal scrapers and site-specific scrapers.
Universal web scrapers
Universal scrapers can extract data from any website. They provide boilerplate code to save you deployment time, so you don’t have to build your own scrapers from scratch. These scrapers are:
Site-specific web scrapers
There are over 1,000 web scraping and automation tools in Apify Store, but don’t worry: I’m not going to list them all. Below are just a few of the most popular, which you can try out for free. All of them are designed and configured to extract data from specific websites, meaning they’re easy to use even if you lack coding skills.
I can’t really leave the subject of web scraping without referring to AI, since the web is the most convenient source for creating and curating datasets to feed AI models, and web scraping is the quickest and most efficient way of getting the data needed for artificial intelligence and machine learning. Apify also has a range of GPT and AI-enhanced tools for lots of use cases, including:
If you want to find out how to get started with building your own scrapers or how to make the most of the Apify platform, Web Scraping Academy is a free series of courses that will take you from a humble novice to a web scraping wizard.