When scraping very large or complex websites, a normal scraping workflow won't get the job done. Read on to find out how to overcome the challenges of large-scale scraping.
The internet is an ever-expanding universe, and as it continues to grow, so does the amount of important data you might need to extract for a wide variety of reasons. If you didn’t know it already, web scraping is the fastest and most efficient method of extracting publicly available web data and getting it into a structured format that can be used for analysis.
That being said, there are times when the amount of data and the speed at which it needs to be collected are more than the average web scraping tool can handle. If you want to extract data from a thousand or even tens of thousands of web pages, normal scraping will get the job done. But what if we are talking about millions of pages? That requires large-scale scraping.
Large-scale scraping is extracting data from huge or complex websites. If we are doing large-scale scraping, we could be extracting millions of pages monthly, weekly, or even daily. This requires a different workflow. So we are going to show you how large-scale scraping works and how to overcome the challenges of scraping large or complex websites.
Before we do that, a word of advice…
Is large-scale scraping ethical?
Be aware of what the target website can handle. There’s a big difference between scraping a large website like Amazon and scraping the site of a small, local business. A website that is not used to heavy traffic may not be able to cope with a large number of requests sent by bots. Not only will this skew the company’s user statistics, but it could also cause the website to run slower and even crash. So, play nice and don’t overburden your target website. If in doubt, do a little online research to find out how much traffic the website receives.
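One simple way to avoid overburdening a small site is to enforce a minimum delay between requests. The sketch below is only an illustration: the `Throttle` class, its parameter names, and the two-second interval are assumptions for this example, not settings recommended by any particular website or tool.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage: call throttle.wait() before every request to the same site.
throttle = Throttle(min_interval=2.0)
```

A fixed delay is the bluntest option. For a site whose capacity you don't know, starting slow and watching response times is a safer default than guessing a high request rate.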
How normal web scraping works
So, how can you know a web scraping task requires large-scale scraping measures? We’ll answer that question by beginning with a normal web scraping workflow.
Step 1. Open the homepage of the target website
For this example, we will choose the website Fashionphile.
Step 2. Enqueue top-level categories
Let’s click on the bags category and choose shop all bags.
We can see that the total number of bag items on the site is 21,487. So, we know that the maximum number of items we want to scrape is 21,487.
If you scroll to the end of the page, you will see the total number of pages for the bags category: 359. So, 359 pages contain 21,487 items.
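These two numbers are a useful sanity check before you enqueue anything. The short sketch below assumes 60 items per page, a figure inferred from the totals above rather than stated by the site, and `pages_needed` is a hypothetical helper name.

```python
import math


def pages_needed(total_items: int, items_per_page: int) -> int:
    """Number of listing pages required to show every item."""
    return math.ceil(total_items / items_per_page)


# 60 items per page is an assumption inferred from 21,487 items on 359 pages.
print(pages_needed(21_487, 60))
```

If the page count you compute matches the one shown in the pagination, the whole category is reachable by simply walking through the pages.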
Step 3. Scrape each product detail
You can now scrape product details such as brand names, bag colors, price ranges, and so on. For example, Louis Vuitton bags priced between $1,000 and $2,000.
Step 4. Run on a single server
With this information, you can run an actor on the Apify platform to extract the data you are looking for.
So, why does this not work for very large or complex websites?
Why you need large-scale scraping
There are three challenges in dealing with very large websites, such as Amazon.
- There is a limit to the number of pages shown in pagination
- A single server is not big enough
- Default proxies might not scale
Pagination limits solution
The pagination limit is usually between 1,000 and 10,000 products. Here’s a three-step solution to this limitation:
- Use search filters and go to subcategories
- Split them into price ranges (e.g. $0-10, $10-100)
- Recursively split the ranges in half (e.g. split $0-10 price range into $0-5 and $5-10)
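The recursive splitting step can be sketched as follows. This is a minimal illustration under stated assumptions: prices are handled as integer cents, ranges are half-open (`low` included, `high` excluded) so boundary items are not counted twice, and `count_items` is a hypothetical callback standing in for whatever request tells you how many results a filtered search returns.

```python
from typing import Callable, List, Tuple

PriceRange = Tuple[int, int]  # (low, high) in cents, high is exclusive


def split_price_ranges(
    low: int,
    high: int,
    count_items: Callable[[int, int], int],
    page_limit: int,
) -> List[PriceRange]:
    """Return price ranges that each hold at most page_limit items."""
    count = count_items(low, high)
    if count == 0:
        return []  # nothing to scrape in this range
    if count <= page_limit or high - low <= 1:
        # Small enough to paginate fully, or the range cannot be split further.
        return [(low, high)]
    mid = (low + high) // 2
    return (
        split_price_ranges(low, mid, count_items, page_limit)
        + split_price_ranges(mid, high, count_items, page_limit)
    )


# Hypothetical in-memory catalog used in place of a real website.
prices_cents = [100, 250, 250, 400, 900, 950, 999]


def count_in_catalog(low: int, high: int) -> int:
    return sum(low <= p < high for p in prices_cents)


ranges = split_price_ranges(0, 1000, count_in_catalog, page_limit=3)
print(ranges)
```

Note that a range one cent wide can still exceed the limit if many products share the same price. In that case you need a second filter, such as brand or color, to split further.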
Solution to limited memory and CPU
Since there is a limit to how much bigger you can make a single server (vertical scaling), you need to add more servers (horizontal scaling). This means you have to split runs across many different servers, which will run in parallel. This is how to do it:
- Collect products and redistribute them among servers
- Create servers as needed
- Merge the results back into a single dataset, and then unify and deduplicate using the Merge, Dedup & Transform Datasets actor
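The split, run, and merge pattern above can be sketched on a single machine. In this illustration, threads stand in for separate servers, `scrape_batch` is a hypothetical function that scrapes a list of URLs, and deduplicating on a `url` field is an assumption. It is not the Merge, Dedup & Transform Datasets actor itself, only the same idea in miniature.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Item = Dict[str, str]


def chunk(urls: List[str], n_workers: int) -> List[List[str]]:
    """Round-robin split so every worker gets a similar share."""
    return [urls[i::n_workers] for i in range(n_workers)]


def merge_and_dedup(batches: Iterable[List[Item]], key: str = "url") -> List[Item]:
    """Merge all batches into one list, keeping the first item seen per key."""
    seen: Dict[str, Item] = {}
    for batch in batches:
        for item in batch:
            seen.setdefault(item[key], item)
    return list(seen.values())


def run_distributed(
    urls: List[str],
    scrape_batch: Callable[[List[str]], List[Item]],
    n_workers: int,
) -> List[Item]:
    """Scrape the batches in parallel, then merge and deduplicate the results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        batches = list(pool.map(scrape_batch, chunk(urls, n_workers)))
    return merge_and_dedup(batches)


# Hypothetical scraper that fabricates a record instead of fetching a page.
def fake_scrape_batch(urls: List[str]) -> List[Item]:
    return [{"url": u, "title": f"Product at {u}"} for u in urls]


results = run_distributed(["a", "b", "a", "c"], fake_scrape_batch, n_workers=2)
print(results)
```

Deduplication matters because the same product often appears in several categories or price ranges, so different servers can end up scraping it more than once.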
That leaves the third challenge: proxies. Your choice of proxy affects your web scraping costs. Datacenter proxies are cheaper but are likely to get banned if you are scraping at a large scale, while residential proxies are expensive. So the best solution is a combination of datacenter proxies, residential proxies, and external API providers.
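One way to combine proxy types is to try the cheapest option first and fall back to a more expensive one only when a request is blocked. The sketch below assumes that ordering, which is one possible design rather than a rule. The `Blocked` exception, `fetch_with_fallback`, and the fetcher functions are all hypothetical names, and real fetchers would send the request through an actual proxy.

```python
from typing import Callable, Sequence, Tuple

Fetcher = Callable[[str], str]


class Blocked(Exception):
    """Raised by a fetcher when the target site bans or blocks the request."""


def fetch_with_fallback(
    url: str, tiers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each proxy tier in order and return (tier_name, html) from the first success."""
    for name, fetch in tiers:
        try:
            return name, fetch(url)
        except Blocked:
            continue  # this tier was blocked, move on to the next one
    raise Blocked(f"all proxy tiers were blocked for {url}")


# Hypothetical fetchers that simulate a banned datacenter IP.
def via_datacenter(url: str) -> str:
    raise Blocked("datacenter IP banned")


def via_residential(url: str) -> str:
    return f"<html>{url}</html>"


tier, html = fetch_with_fallback(
    "https://example.com/p/1",
    [("datacenter", via_datacenter), ("residential", via_residential)],
)
print(tier)
```

Recording which tier succeeded for each request lets you see how often you are paying for the expensive option and adjust the mix.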
Let’s sum up
Large-scale scraping is complicated, so remember:
➡️ Plan before you start
➡️ Minimize the load on web servers
➡️ Extract only valid information
➡️ Apify has extensive experience in overcoming the challenges posed by large-scale scraping. If large-scale data extraction is what you need, get in touch for a custom-made solution.