Top 11 open source web crawlers - and one powerful web scraper

Free software libraries, packages, and SDKs for web crawling? Or is it a web scraper that you need?


Hey, we're Apify. You can build, deploy, share, and monitor your scrapers and crawlers on the Apify platform. Check us out.

The amount of data online hit 40 zettabytes in 2020, and with one zettabyte equal to a billion terabytes, that's a staggering amount of information at our disposal. Organizations and companies need to harness big data for insights into their markets, yet an estimated 80 percent of this data is unstructured. To use it effectively, you need it in a machine-readable format; in other words, you need structured data.

Other than internal statistics, research, and organizational databases, an incredible source of data is the web itself. Extracting that data goes by two names: web crawling and web scraping. What’s the difference? A web crawler, often used by search engines, systematically browses websites, follows the links between pages, and extracts their content in a relatively indiscriminate manner. A web scraper, on the other hand, extracts information from a website based on a script, often tailored to a specific website and its corresponding page elements. It’s great for transforming unstructured data into structured databases of information. Many of these web crawlers and website scrapers are open-source, meaning they're free to use, and you can tweak them however you like.

✌️
To learn more about the differences between the two, have a read through our web crawling vs. web scraping blog post ➜

What is a web crawler used for?

Web crawlers help you index web pages, locate content, and gather data from public websites. Crawlers also follow the links within a website and try to work out a schema of how its pages are interconnected. This crawling helps you analyze a website from a wider perspective and makes it easier for online tools such as search engines to display a simplified version of it in the form of search results.

[Image: Search engines use web crawling for indexing web pages]
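To make this concrete, here's a minimal sketch of the core crawling loop in Python, using the requests and Beautiful Soup libraries (the latter is covered below). The start URL, page limit, and breadth-first strategy are arbitrary illustrative choices, not a production design:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_pages=20):
    """Breadth-first crawl: visit a page, collect its links, repeat."""
    visited = set()
    queue = [start_url]
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        # Parse the page and enqueue every link found on it
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            queue.append(urljoin(url, anchor["href"]))
    return visited

print(crawl("https://quotes.toscrape.com/"))  # demo site, chosen for illustration
```

Real crawlers add politeness (robots.txt, rate limits), deduplication, and error handling on top of this loop, but the link-following core stays the same.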

What are open-source web crawlers?

When software or an API is open-source, its code is available to the general public for free, and you can even modify and optimize it to suit your needs. The same goes for open-source website scrapers and web crawlers: you can download and use them without paying and fine-tune them for your use case.

Crawlers and scrapers are tools for automating data extraction at scale: instead of manually copying an e-shop's product list, for example, a crawler does it for you. Extracting public data is generally legal, but you still need to be careful not to collect sensitive data such as personal information or copyrighted content.

⚖️
Find out more about how the law sees online data extraction ➜

Top 11 open-source web crawlers

Now that we know what web crawlers are and what they’re used for, let's explore some of the most popular open-source crawling tools available online:

1. Scrapy

Language: Python | GitHub: 45k+ stars | https://scrapy.org/

Scrapy is the most popular open-source web crawling framework, and it's also well suited to large-scale web scraping. Because it's asynchronous, requests aren't made one at a time but in parallel, resulting in very efficient crawling.
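As a taste of what Scrapy code looks like, here's a minimal spider sketch in the style of the examples in Scrapy's documentation; it targets the quotes.toscrape.com demo site, chosen purely for illustration:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link; Scrapy schedules requests asynchronously
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Save it as quotes_spider.py and, in recent Scrapy versions, run scrapy runspider quotes_spider.py -O quotes.json to get the results as a JSON file.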

💸
Find out how a retail data company saved 90% on web scraping costs by migrating their scrapers from Scrapy to Apify ➜

2. Pyspider

Language: Python | GitHub: 15k+ stars | http://docs.pyspider.org/

A powerful open-source spider (crawler) package written in Python. Compared to other crawling tools, Pyspider provides not only data extraction functionality but also a script editor, task monitor, project manager, and result viewer.
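For a sense of how a Pyspider script is structured, here's a sketch in the style of the sample project the framework generates in its script editor; the seed URL and timing values are placeholders:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)  # re-run the seed request once a day
    def on_start(self):
        self.crawl("https://scrapy.org/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # treat results as fresh for 10 days
    def index_page(self, response):
        # Queue every outgoing link for detail extraction
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc("title").text(),
        }
```

You edit and debug scripts like this in Pyspider's web UI, where the task monitor and result viewer show each crawl as it runs.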

3. Webmagic

Language: Java | GitHub: 10k+ stars | https://webmagic.io/en/

A scalable crawler framework that simplifies the development of a crawler. It covers the whole life cycle of a crawler, from downloading and URL management to content extraction.

4. Crawlee

Language: Node.js | GitHub: 7k+ stars | https://crawlee.dev/

Crawlee is an open-source web scraping and automation library built specifically for the development of reliable crawlers. Its default anti-blocking features help you disguise your bots as real human users, decreasing the chances of your crawlers getting blocked.

Want to try building a scraper with Crawlee? Follow this video tutorial and learn how to scrape Amazon:

5. Node Crawler

Language: Node.js | GitHub: 6k+ stars | http://node-crawler.org/

A popular and powerful package for crawling websites with Node.js. It's built on Cheerio and comes with many options for customizing the way you crawl or scrape the web, including limiting the number of requests and the interval between them.

6. Beautiful Soup

Language: Python | GitHub: n/a | https://www.crummy.com/software/BeautifulSoup/

Beautiful Soup is an open-source Python library used for parsing HTML and XML documents. Once it has built a parse tree, extracting data from the page is much easier. Although not as speedy as Scrapy, it's mainly praised for its ease of use and its community support for when issues arise.
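Here's a minimal sketch of Beautiful Soup in action, paired with the requests library to fetch a page (the target URL is the same demo site used above, chosen for illustration):

```python
import requests
from bs4 import BeautifulSoup

# Download a page and build a parse tree from its HTML
html = requests.get("https://quotes.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")

# Navigating the tree is then straightforward
print(soup.title.string)                    # the page title
for author in soup.select("small.author"):  # CSS selectors work too
    print(author.get_text())
```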

7. Nokogiri

Language: Ruby | GitHub: 5k+ stars | https://nokogiri.org/

Like Beautiful Soup, Nokogiri is also great at parsing HTML and XML documents, but via the programming language Ruby, which is well suited to beginners in web development. Nokogiri relies on native parsers such as libxml2 (C) and Xerces (Java).

8. Crawler4j

Language: Java | GitHub: 4k+ stars | https://github.com/yasserg/crawler4j

An open-source Java web crawler with a simple interface for crawling the web. Its advantages include the ability to build a multi-threaded crawler, but its drawback is excessive memory use.

9. MechanicalSoup

Language: Python | GitHub: 2k+ stars | https://mechanicalsoup.readthedocs.io/en/stable/

A Python library for automating interaction with websites, built on the aforementioned Beautiful Soup and inspired by the Mechanize library. It's great for handling cookies and following redirects, hyperlinks, and forms on a website.
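A brief sketch of MechanicalSoup filling in and submitting a form; the login page belongs to the quotes.toscrape.com demo site, and the credentials are dummy values for illustration only:

```python
import mechanicalsoup

# StatefulBrowser keeps cookies between requests and follows redirects
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://quotes.toscrape.com/login")

# Select, fill in, and submit the login form on the page
browser.select_form("form")
browser["username"] = "demo"  # dummy credentials, for illustration only
browser["password"] = "demo"
browser.submit_selected()

print(browser.get_url())  # the URL we were redirected to after login
```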

10. Apache Nutch

Language: Java | GitHub: 2k+ stars | https://nutch.apache.org/

An extensible open-source web crawler often used in fields like data analysis. It can fetch content through protocols such as HTTPS, HTTP, or FTP and extract textual information from document formats like HTML, PDF, RSS, and Atom.

11. Heritrix

Language: Java | GitHub: 2k+ stars | https://heritrix.readthedocs.io/en/latest/

Developed by the Internet Archive, Heritrix is an open-source crawler designed mainly for web archiving. It collects extensive information, such as domains, exact site hosts, and URI patterns, but needs a little tuning when handling bigger tasks.

Last, but not least…

In 2015, when we started Apify, we had only one product: the Apify Crawler. Since then, we've launched Apify Store, which offers hundreds of ready-to-use scrapers and crawlers (we call them actors) capable of extracting data from various websites in a matter of minutes. But despite all these specialized scrapers, we still wanted to offer our users an easy-to-use, open-source website scraping and crawling tool. And so we created Web Scraper. This actor, based on Puppeteer, a Node.js library, uses JavaScript to extract data from the web. With the help of our video tutorial, you just input the websites you want to crawl and specify the content you wish to extract from each page.

So, no need to overthink it: go ahead and try it for free!

Dávid Lukáč
Marketing coordinator at Apify. I am passionate about media studies and love looking at the world through a camera lens.
