What is ethical scraping and how do you do it?

Is web scraping ethical?

Just because something is legal, that doesn’t necessarily make it ethical. People do unethical things within the confines of the law all the time. And just as some things are legal in some instances and illegal in others, so can some things be done ethically or unethically. Web scraping is one of those things.

Let’s be clear that the legality of web scraping doesn't fall within the scope of this article. If you want to learn more about that subject, we recommend you read “Is web scraping legal?”, which is a comprehensive treatment of the issue.

How do you ethically scrape a website?

To scrape websites ethically, you need to follow some principles. We’ll provide you with a guide to ethical web scraping, but only part of it applies to web scraper users. The reason is that while web scraping ethics is not a new topic, rarely does anyone say anything about the ethics of website owners allowing or prohibiting web scraping. It’s presented as a one-sided argument, with the onus placed firmly upon the person doing the web scraping, which doesn't give us the complete picture.

It's time to balance things out and offer some guidelines for ethical web scraping with 5 principles – 2 for those doing the scraping and 3 for those being scraped.

1. Don’t overburden the target website

There are two guiding principles for those who use web scrapers, but we could sum them up in one overarching rule: do no harm. This is, after all, the second code of programming ethics.

One way web scraping can harm a company or its website is by sending requests at an unreasonable rate. Don’t get mistaken for a DDoS attack; do some research to find out what the target website can handle so you don’t cause it to have functionality issues. There’s a big difference between scraping an enormous website like Google or Amazon and scraping the site of a small, local business. Websites not used to a lot of traffic may not be able to cope with many requests sent by bots. Sending too many can skew the company’s user statistics and cause the website to run slower or even crash. So play nice: pace your requests according to what the website can manage, space them out, and consider scheduling your scraping tasks at off-peak hours.
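As a rough illustration of spacing requests out, here is a minimal Python sketch. The function name fetch_politely, the 2-second default delay, and the injectable fetch and sleep parameters are assumptions made for this example, not part of any particular scraping tool; a suitable delay depends entirely on what the target site can handle.

```python
import time
import urllib.request


def fetch_politely(urls, delay_seconds=2.0, fetch=None, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between consecutive requests.

    delay_seconds is an illustrative default; choose a value that suits
    the size and capacity of the site you are scraping.
    """
    if fetch is None:
        # Default fetcher: a plain HTTP GET with a timeout.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()

    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Sending requests sequentially with a fixed pause is the simplest approach; a small site may need a longer pause than a large one, and running the whole job at off-peak hours reduces the impact further.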

2. Respect the creators of the data you scrape

Everyone is generating data, and some of it is personal information, such as contact details. Tread carefully when it comes to this kind of data. Even if the information is publicly available, that doesn't mean you should extract or keep such data if you don't have a legitimate reason to do so. Treat other people's data as you would have them treat yours.

If data is public, scraping it is usually not an issue. However, information that requires a login to access is generally not public. Using or sharing such data for commercial purposes without permission could be a legal violation. So, be sure to check whether it’s legal to extract that data before you proceed.

If you’re unsure whether you need permission to scrape a website’s data, the best way to find out is to read its terms of use. If the website prohibits automated web scraping outright, consider contacting the webmaster to explain why you want to scrape their data and ask permission. Alternatively, you could send a descriptive user agent string with your requests, one that identifies your scraper and says how to reach you, to make your web scraping intentions transparent and enable the website owner to communicate with you.
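A transparent user agent string could look something like the following Python sketch. The bot name, information URL, and email address are placeholders, and the "name/version (+url; email)" layout is a common convention rather than a requirement; build_request is a hypothetical helper written for this example.

```python
import urllib.request


def build_request(url, bot_name, info_url, contact_email):
    """Create a request whose User-Agent says who is scraping and how to reach them."""
    user_agent = f"{bot_name}/1.0 (+{info_url}; {contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder values for illustration only.
request = build_request(
    "https://example.com/products",
    bot_name="ExampleResearchBot",
    info_url="https://example.org/bot",
    contact_email="bot@example.org",
)
```

Passing this request object to urllib.request.urlopen sends the header with the request, so the website owner can see in their logs who is scraping and get in touch.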

Remember that the data you collect isn’t yours. Consider what you’re using the information for, and keep only the data necessary for your purpose. Ask yourself whether what you plan to do with the data will bring added value. Where possible and relevant, credit your sources if you share the data you’ve collected.
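Keeping only the data you need can be as simple as filtering each scraped record before you store it. The sketch below assumes records are plain Python dictionaries; the field names and the keep_needed_fields helper are hypothetical.

```python
def keep_needed_fields(record, needed_fields):
    """Return a copy of the record containing only the fields the project needs."""
    return {key: value for key, value in record.items() if key in needed_fields}


# Hypothetical scraped record that happens to include personal data.
scraped = {
    "product": "Kettle",
    "price": "24.99",
    "reviewer_email": "jane@example.com",
}

# A price-monitoring project needs the product and price, not the email.
stored = keep_needed_fields(scraped, {"product", "price"})
```

Listing the fields you keep, rather than the fields you drop, means any unexpected personal data on a page is discarded by default.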

3. Honor the open web

The ‘open web’ is a broad term that can include anything from technical concepts such as open source code and standards to democratic ideas like free expression and digital inclusion. But the idea that connects these is that the web was created by and for users, not big corporations, governments, and select gatekeepers. The web is for everyone, and the information it contains (with some exceptions, such as intellectual property) should not be the exclusive property of companies and institutions. If you’re a website owner, accept that web scraping is a reality of the open web.

4. Don’t seek to monopolize data

The fourth principle of web scraping ethics follows from the third. It’s not ethical to acquire data produced by users and claim it as one’s property. Unfortunately, it's not uncommon for big businesses to prohibit others from scraping their website for data that they themselves acquired through web scraping. The second principle – respect the creators of the data you scrape – applies to website owners as much as it does to those who wish to scrape those sites: the information you have is not yours any more than it is the property of people who have scraped it. Don’t claim publicly available data as your property to get an advantage over your competition.

5. Don’t block scrapers without good reason

The final principle logically follows the third and fourth principles. If you honor the open web and recognize that the data you possess is not your exclusive property, don’t block web scrapers unless you have good reason to do so. One reason could be that you must protect users' privacy if people are trying to extract personal data for unethical purposes. Another reason might be the skewing of your data statistics in cases of large-scale scraping. In such instances, consider temporarily blocking the scraper before banning it permanently. If someone identifies themselves through a user agent string or sends you a request to extract your data, respond and communicate with the developer. If you need to block them, explain why.
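For website owners, a temporary block can be sketched in Python as a request counter that refuses a noisy client for a limited time instead of banning it outright. The class name TemporaryBlocker, the fixed-window counting approach, and the default limits are all assumptions chosen for illustration; real deployments usually do this in a web server, proxy, or firewall.

```python
import time


class TemporaryBlocker:
    """Fixed-window request counter that blocks noisy clients for a limited time."""

    def __init__(self, max_requests=100, window_seconds=60,
                 block_seconds=600, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self.clock = clock
        self.windows = {}        # client_id -> (window_start, request_count)
        self.blocked_until = {}  # client_id -> time at which the block expires

    def allow(self, client_id):
        """Return True if the request may proceed, False if it is refused."""
        now = self.clock()
        expires = self.blocked_until.get(client_id)
        if expires is not None and now < expires:
            # Still inside a temporary block: refuse, but do not extend it.
            return False
        start, count = self.windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start counting afresh.
            start, count = now, 0
        count += 1
        if count > self.max_requests:
            # Over the limit: block for a while rather than permanently.
            self.blocked_until[client_id] = now + self.block_seconds
            self.windows.pop(client_id, None)
            return False
        self.windows[client_id] = (start, count)
        return True
```

A client that exceeds the limit is refused until the block expires and then gets a clean slate, which leaves room to contact the developer or explain the block before considering anything permanent.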

What is an example of ethical web scraping?

Ethical web scraping is not just about the means but also the intention. There are many good examples of web scraping used for ethical outcomes. One of the most impressive is Spotlight, a tool used to investigate human trafficking. Spotlight turns data from escort sites into a resource for law enforcement to identify children trafficked on those sites. This achievement is due partially to Apify’s web scraping tools.

Another great example of ethical web scraping comes from Charles University, Prague. Its faculty of formal and applied linguistics also used Apify’s web scrapers to collect dialectal data from social media platforms to create a neural machine translation model for Syrian and Moroccan migrants and refugees in Europe.

Yet another very recent example comes from the European Commission and its EU directive on consumer protection, which safeguards against companies tricking consumers into buying products at fake discounts. The web scraping companies TopMonks and Apify developed a bespoke tool to help the European Commission and national authorities of the EU Member States monitor compliance with the directive.

Conclusion

While the principle ‘do no harm’ underpins ethical web scraping, it would be a mistake to end the discourse on web scraping ethics there. We must apply a similar underlying principle to website owners: don’t be greedy. Data is information, information is power, and with great power comes great responsibility. Don’t try to keep all the power for yourself, but make sure you use and share data responsibly.

SOAX interviews Apify's COO, Ondra Urban, about ethical web scraping. Listen to Ondra on Spotify talk about his journey from law to coding, how he ended up at Apify, and his take on the challenges and approaches to responsible web scraping.
