Web scraping with JavaScript vs. Python in 2022

Percival Villalva

Python and JavaScript are two of the most popular programming languages. Find out which one you should use for web scraping in 2022.


The internet is an ocean of information that is often not easily accessible through an API, which may offer only limited access to the data or may not exist at all.

In this context, web scraping is the art of leveraging the power of automation to open the web and extract structured web data at scale. The data collected can then be used for countless applications, such as training machine learning algorithms, price monitoring, market research, lead generation, and more.

JavaScript and Python are two of the most popular and versatile programming languages. Both languages are at the forefront of innovation in web scraping, boasting a vast selection of frameworks and libraries that offer tools to overcome even the most complex scraping scenarios.

This article will analyze some of the latest web scraping libraries and frameworks available for each language and discuss the best scraping use cases for Python and JavaScript.

Why JavaScript?

JavaScript is currently the most used programming language in the world. Its popularity is due primarily to its flexibility. JavaScript is used for web development, building web servers, game development, mobile apps, and, of course, web scraping.

Most used programming languages among developers
Source: Stack Overflow - 2021 Developer Survey - © Statista 2022

JavaScript is rightfully referred to as the language of the web. About 97.8% of all websites use it as their client-side programming language. Not surprisingly, some of the most advanced web scraping and browser automation libraries are also written in JavaScript, making it even more attractive for those who want to extract data from the web.

Additionally, JavaScript boasts a large and vibrant community. There's plenty of information available online, so you can easily find help whenever you feel stuck in a project.

Running JavaScript on the server with Node.js

Node.js is an open-source JavaScript runtime that enables JavaScript to be used on the server-side to build fast and scalable network applications.

Node.js is well known for its performance and speed. Its efficiency comes from its single-threaded, event-driven architecture: it executes JavaScript code in the main thread while offloading input/output operations to other threads.

On top of that, Node.js uses V8, the open-source, high-performance JavaScript and WebAssembly engine originally written for Google Chrome. V8 compiles JavaScript into machine code at runtime using a JIT (just-in-time) compiler, significantly improving execution speed.

Thanks to Node.js capabilities, the JavaScript ecosystem has a variety of highly efficient web scraping libraries such as Got, Cheerio, Puppeteer, and Playwright.


Why Python?

Python, like JavaScript, is an extremely versatile language. Python can be used for developing websites and software, task automation, data analysis, and data visualization. Its easy-to-learn syntax contributed greatly to Python's popularity among many non-programmers such as accountants and scientists, for automating everyday tasks, organizing finances, and conducting research.

Python is the king of data processing. Data extracted from the web can be easily manipulated and cleaned using Python's Pandas library and visualized using Matplotlib. This makes web scraping a powerful skill in any Pythonista's toolbox.
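For instance, here is a minimal sketch of that workflow. The titles and point counts are made-up sample data standing in for scraped results:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render charts off-screen, no display required
import matplotlib.pyplot as plt

# Made-up sample data standing in for scraped results
scraped = [
    {"title": "Show HN: My side project", "points": 120},
    {"title": "Ask HN: Career advice?", "points": 85},
    {"title": "A deep dive into V8", "points": 240},
]

# Load the scraped data into a DataFrame and sort by popularity
df = pd.DataFrame(scraped).sort_values("points", ascending=False)
print(df)

# Visualize the results with Matplotlib and save the chart to disk
df.plot.barh(x="title", y="points", legend=False)
plt.tight_layout()
plt.savefig("points.png")
```

With real scraped data, the same few lines scale to cleaning, deduplicating, and charting thousands of records.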

Python is the dominant programming language in machine learning and data science. These fields benefit heavily from having access to large data sets to train algorithms and create prediction models. Consequently, Python boasts some of the most popular web scraping libraries and frameworks, such as BeautifulSoup, Selenium, Playwright, and Scrapy.

HTTP clients and HTML parsers


HTTP clients are a central piece of web scraping. Almost every web scraping tool uses an HTTP client behind the scenes to query the website server you are trying to collect data from.

Parsing, on the other hand, means analyzing text and converting it into a structure a program can work with. For example, the browser parses HTML into a DOM tree.

Following this same logic, HTML parsing libraries such as Cheerio (JavaScript) and BeautifulSoup (Python) parse data directly from web pages so you can use it in your projects and applications.
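To make the idea concrete, here's what bare-bones HTML parsing looks like with Python's built-in html.parser module. This is only a minimal sketch; the libraries covered below offer far more convenient APIs for the same job:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p>Read <a href="https://example.com">this article</a></p>')
print(parser.links)  # ['https://example.com']
```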

Got and Got Scraping - HTTP client for JavaScript

GOT HTTP client logo


Got Scraping is a modern extension of the Got HTTP client. Its primary purpose is to send browser-like requests to the server, which lets the scraping bot blend in with regular website traffic and makes it less likely to be detected and blocked.

It excels at addressing common drawbacks in modern web scraping by offering built-in tools to avoid website anti-scraping protections.

As an example, the code below uses Got Scraping to retrieve the Hacker News website HTML body and print it in the terminal.

const { gotScraping } = require('got-scraping');

gotScraping
    .get('https://news.ycombinator.com/')
    .then(({ body }) => console.log(body));

Requests - HTTP client for Python

Python Requests
Requests logo

Requests is an HTTP library for Python. The goal of the project is to make HTTP requests simpler and more human-friendly, hence its tagline, "Requests: HTTP for Humans".

Requests is a widely popular Python library, so popular that it has even been proposed that Requests be distributed with Python by default.

To highlight the differences between Got Scraping and Requests, let's retrieve the Hacker News website's HTML body and print it in the terminal, this time using Requests.

import requests

response = requests.get('https://news.ycombinator.com/')

print(response.text)

Cheerio - HTML and XML parser for JavaScript

Cheerio
Illustrative image from Apify's Cheerio Scraper

Cheerio is a fast and flexible implementation of core jQuery designed to run on the server-side, working with raw HTML data.

To exemplify how Cheerio parses HTML, let's use it together with Got Scraping to extract data from Hacker News.

import { gotScraping } from 'got-scraping';
import cheerio from 'cheerio';

const response = await gotScraping('https://news.ycombinator.com/');
const html = response.body;

// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
// Select all the elements with the class name "athing"
const entries = $('.athing');
// Loop through the entries
for (const entry of entries) {
    const element = $(entry);
    // Write each element's text to the terminal
    console.log(element.text());
}

After the script finishes running, you should see data from the most recent news items printed in your terminal.

Need help understanding the code? Find out more about querying data with Cheerio and CSS selectors on Apify's web scraping academy. 👨‍💻

However, Cheerio does have some limitations. It does not interpret HTML the way a browser does, so it is unable to:

  • Execute JavaScript

  • Produce visual rendering

  • Apply CSS or load external resources

If your use case requires any of these functionalities, you will need browser automation software like Puppeteer or Playwright, which we will explore further in this article.

If you're interested in using Cheerio, check out this step-by-step guide to scraping any website with Cheerio Scraper.

Beautiful Soup - HTML and XML parser for Python

Beautiful Soup is a Python library used to extract HTML and XML elements from a web page with just a few lines of code, making it the right choice to tackle simple tasks with speed. It is also relatively easy to set up, learn, and master, which makes it the ideal web scraping tool for beginners.

BeautifulSoup
BeautifulSoup logo

To exemplify BeautifulSoup's features and compare its syntax and approach to its Node.js counterpart, Cheerio, let's scrape Hacker News and print to the terminal the most upvoted article.

from bs4 import BeautifulSoup
import requests

response = requests.get("https://news.ycombinator.com/news")
yc_web_page = response.text

soup = BeautifulSoup(yc_web_page, "html.parser")
articles = soup.find_all(name="a", class_="titlelink")
article_texts = []
article_links = []
for article_tag in articles:
    text = article_tag.getText()
    article_texts.append(text)
    link = article_tag.get("href")
    article_links.append(link)

article_upvotes = [int(score.getText().split()[0]) for score in soup.find_all(name="span", class_="score")]

largest_number = max(article_upvotes)
largest_index = article_upvotes.index(largest_number)

print(article_texts[largest_index])
print(article_links[largest_index])

BeautifulSoup offers an elegant and efficient way of scraping websites using Python. However, there are a few significant drawbacks to Beautiful Soup, such as:

  • Slow at scale. Beautiful Soup's limitations become apparent when scraping large datasets. Its performance can be improved with multithreading, but that adds another layer of complexity to the scraper, which might put off some users. In this regard, Scrapy is noticeably faster than Beautiful Soup thanks to its asynchronous networking.

  • Unable to scrape dynamic web pages. Beautiful Soup only parses static HTML. It cannot execute JavaScript, so it cannot scrape content that websites render dynamically.

To understand how to apply Beautiful Soup to real-life projects, make sure to check our "How to scrape data in Python using Beautiful Soup" tutorial.

Browser automation tools

Web automation

Browsers are a way for people to access and interact with the information available on the web. Nevertheless, a human is not always a requirement for this interaction to happen. Browser automation tools can mimic human actions and automate a web browser to perform repetitive and error-prone tasks.

The role of browser automation tools in web scraping is intimately related to their ability to render JavaScript code and interact with dynamic websites.

As previously discussed, one of the main limitations of HTML parsers is that they are not able to scrape dynamic web pages. However, by combining the power of web automation software with HTML parsers, we are able to go beyond simple automation and render JavaScript to extract data from complex web pages.

Selenium

Selenium is primarily a browser automation tool developed for web testing that has also found off-label use as a web scraper. It uses the WebDriver protocol to control a browser and perform actions such as clicking buttons, filling in forms, and scrolling.

Selenium is popular in the Python community, but it is also fully implemented and supported in JavaScript (Node.js), Python, Ruby, Java, Kotlin, and C#.

Selenium logo
Selenium logo

Because of its ability to render JavaScript on a web page, Selenium can help scrape dynamic websites. This is a handy feature, considering that many modern websites, especially in e-commerce, use JavaScript to load their content dynamically.

As an example, let's scrape Amazon to get information about Douglas Adams' book, The Hitchhiker's Guide to the Galaxy. The script below initializes a browser instance controlled by Selenium, lets the page render its JavaScript, and extracts the data we want.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Insert the website URL that we want to scrape
url = "https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0" \
      "?_encoding=UTF8&qid=1642536225&sr=8-1"

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)

# Create a dictionary with the scraped data
book = {
    "book_title": driver.find_element(By.ID, 'productTitle').text,
    "author": driver.find_element(By.CSS_SELECTOR, '.a-link-normal.contributorNameID').text,
    "edition": driver.find_element(By.ID, 'productSubtitle').text,
    "price": driver.find_element(By.CSS_SELECTOR, '.a-size-base.a-color-price.a-color-price').text,
}

# Print the dictionary contents to the console
print(book)

Despite its advantages, Selenium was not designed to be a web scraper and, because of that, has some noticeable limitations:

  • Steep learning curve. Compared to Beautiful Soup, Selenium requires a more complex setup and more experience to master.

  • Inefficiency. Scraping vast amounts of data with Selenium is slow, making it unsuitable for large-scale tasks.

Puppeteer - JavaScript browser automation tool

Developed and maintained by Google, Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome programmatically. It can also be configured to run a full, non-headless browser.

Puppeteer logo
Puppeteer logo

Puppeteer's ability to emulate a real browser allows it to render JavaScript and overcome many of the limitations of the tools mentioned above. Some of the examples of its features are:

  • Crawl a Single Page Application and generate pre-rendered content.

  • Take screenshots and generate PDFs of pages.

  • Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.

To demonstrate some of Puppeteer's capabilities, let's go to the Hacker News website, scrape the titles and ranks of the latest news and take a screenshot of the page.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com');

 
  const latestNews = await page.evaluate(() => {
    // Get all the entries on the page, identified here by the "athing" class name
    const entries = document.body.querySelectorAll('.athing');

    const entryItems = [];

    for (let i = 0; i < entries.length; i++) {
      // Query for the entry's title element
      const title = entries[i].querySelector('td.title > a');

      // Push the entry's position + title to the entryItems array
      entryItems.push(`${i + 1}: ${title.innerText}`);
    }
    // Return the entryItems array
    return entryItems;
  });

  // Write the entryItems array to the console
  console.log(latestNews);

  await page.screenshot({ path: 'Y-Combinator.png' });
  await browser.close();
})();

By the end of the script's run, we get an entryItems array with the latest news published on Hacker News and their respective rankings.

Puppeteer can mimic most human interactions in a browser. The ability to control a browser programmatically greatly expands the realm of possibility of what is achievable using this library. Besides web scraping, Puppeteer can be used for workflow automation and automated testing.

Playwright - JavaScript and Python browser automation tool

Playwright is a Node.js library developed and maintained by Microsoft.

A significant part of Playwright's developer team is composed of the same engineers that worked on Puppeteer. Because of that, both libraries have many similarities, lowering the learning curve and reducing the hassle of migrating from one library to another.

Playwright logo
Playwright logo


One of the major differences is that Playwright offers cross-browser support, being able to drive Chromium, WebKit (Safari's browser engine), and Firefox, while Puppeteer only supports Chromium.

Additionally, Playwright's API is available in TypeScript, JavaScript, Python, .NET, and Java.

To highlight some of Playwright's core features as well as its similarities with Puppeteer and differences with Selenium, let's go back to Amazon's website and once again collect information from Douglas Adams' The Hitchhiker's Guide to the Galaxy.

Playwright JavaScript version:

const playwright = require('playwright');

(async () => {
  const browser = await playwright.webkit.launch();
  const page = await browser.newPage();
  await page.goto('https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1642536225&sr=8-1');

  const book = {
      bookTitle: await (await page.$('#productTitle')).innerText(),
      author: await (await page.$('.a-link-normal.contributorNameID')).innerText(),
      edition: await (await page.$('#productSubtitle')).innerText(),
      price: await (await page.$('.a-size-base.a-color-price.a-color-price')).innerText(),
  };

  console.log(book);
  await page.screenshot({ path: 'book.png' });
  await browser.close();
})();

Playwright Python version:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.webkit.launch()
    page = browser.new_page()
    page.goto("https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0"
              "?_encoding=UTF8&qid=1642536225&sr=8-1")

    # Create a dictionary with the scraped data
    book = {
        "book_title": page.query_selector('#productTitle').inner_text(),
        "author": page.query_selector('.a-link-normal.contributorNameID').inner_text(),
        "edition": page.query_selector('#productSubtitle').inner_text(),
        "price": page.query_selector('.a-size-base.a-color-price.a-color-price').inner_text(),
    }

    print(book)

    page.screenshot(path="book.png")
    browser.close()

Despite being a relatively new library, Playwright is rapidly gaining adherents in the developer community. With its modern features, cross-browser and multi-language support, and ease of use, it can be said that Playwright has already surpassed its older sibling, Puppeteer.

Web scraping and automation frameworks


Apify SDK

Apify SDK is an open-source, scalable web crawling, scraping, and automation library for JavaScript. It offers a complete collection of tools for every automation and scraping use case, like the CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler.

Apify SDK
Apify SDK logo

Apify SDK not only shares many of the features of the previously mentioned tools but builds on top of them to enhance performance and seamlessly integrate storage, export of results, and proxy rotation. It works on any system and can be used as standalone or run as a serverless microservice on the Apify platform.

Here are some of the key features that enable Apify SDK to automate any web workflow:

  • Supports all major browsers, including Chrome, Firefox, and WebKit.
  • Manages lists and queues of URLs to crawl, running crawlers in parallel at maximum system capacity to ensure efficiency and scalability.
  • Apify SDK seamlessly integrates with Apify Proxy, a proxy service that uses machine learning to rotate and select the optimal IP address for the specific target website.

The example below demonstrates how to use CheerioCrawler in combination with RequestQueue to recursively scrape the Hacker News website.

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://news.ycombinator.com/' });

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, $ }) => {
            console.log(`Processing ${request.url}...`);

            const data = $('.athing').map((index, post) => {
                return {
                    title: $(post).find('.title > a').text(),
                    rank: $(post).find('.rank').text(),
                    href: $(post).find('.title > a').attr('href'),
                };
            }).toArray();

            // Store the results to the default dataset.
            await Apify.pushData(data);

            // Find a link to the next page and enqueue it if it exists.
            await Apify.utils.enqueueLinks({
                $,
                requestQueue,
                selector: '.morelink',
                baseUrl: request.loadedUrl,
            });
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();
});

The crawler starts with a single URL, finds links to the following pages, enqueues them, and continues until no more desired links are available. The results are then stored on your disk.

🤖 Here are some useful links to help you get started with Apify SDK:

Getting started with Apify SDK

Crawl multiple URLs - Example

Apify SDK Github

Apify Documentation

Scrapy

Scrapy is a full-featured web scraping framework and is the go-to choice for large-scale scraping projects in Python.

Scrapy logo
Scrapy logo

Scrapy is written with Twisted, a popular event-driven networking framework, which gives it some asynchronous capabilities. For instance, Scrapy doesn't have to wait for a response when handling multiple requests, contributing to its efficiency.

In the example below, we use Scrapy to crawl IMDB's best movies list and retrieve the title, year, duration, genre, and rating of each listed movie.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']

    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'

    def start_requests(self):
        yield scrapy.Request(url='https://www.imdb.com/search/title/?groups=top_250&sort=user_rating', headers={
            'User-Agent': self.user_agent
        })

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"), callback='parse_item', follow=True, process_request='set_user_agent'),
        Rule(LinkExtractor(restrict_xpaths="(//a[@class='lister-page-next next-page'])[2]"), process_request='set_user_agent')
    )

    def set_user_agent(self, request):
        request.headers['User-Agent'] = self.user_agent
        return request

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='title_wrapper']/h1/text()").get(),
            'year': response.xpath("//span[@id='titleYear']/a/text()").get(),
            'duration': response.xpath("normalize-space((//time)[1]/text())").get(),
            'genre': response.xpath("//div[@class='subtext']/a[1]/text()").get(),
            'rating': response.xpath("//span[@itemprop='ratingValue']/text()").get(),
            'movie_url': response.url,
        }


Despite being an efficient and complete scraping framework, Scrapy has one significant drawback: it lacks user-friendliness. It requires a lot of setup and prerequisite knowledge. Moreover, to scrape dynamic websites, Scrapy has to be integrated with Splash, making the learning curve to master this framework even steeper.

Apify Python API Client

Regardless of your choice of framework for web scraping in Python, you can take your application to the next level by using Apify's Python API Client to integrate your scrapers with the Apify Platform. Here are some of its features:

  • Host your code in the cloud and tap into the computational power of Apify's servers.

  • Schedule actor runs. Schedules allow you to run actors and tasks regularly or at any time you specify.

  • Integration with existing scrapers. The Python API client allows scraped data stored in Apify datasets to be processed and visualized using popular Python libraries such as Pandas and Matplotlib.

Ready to start web scraping with Python?🐍
Check out these articles on how to get started with Apify's Python Client:

Apify API client for Python

How to scrape data in Python using Beautiful Soup

How to process data in Python using Pandas

What about performance?

Both JavaScript and Python are excellent choices for web scraping. The best fit for you will largely depend on your knowledge, background, and use case.

Most of the tools discussed in this article can meet the demands of small and medium-sized projects. Nonetheless, if you plan on a large-scale, enterprise-level project, two tools stand out for their high performance: Apify SDK and Scrapy.

However, there are substantial differences even between these two powerful tools, and in the long run, your choice of tooling can heavily impact both a project's efficiency and its cost. A prime example is described in Daltix's success story: the retail data agency saved 90% of its web scraping costs simply by moving from Scrapy to Apify SDK.

Conclusion


In short, the right choice of language and framework will depend on the requirements of your project and your programming background.

JavaScript is the language of the web. Thus, it offers an excellent opportunity for you to use only one language to understand the inner workings of a website and scrape data from it. This will make your code cleaner and ease the learning process in the long run.

On the other hand, Python might be your best choice if you are also interested in data science and machine learning. These fields benefit greatly from access to large sets of data. By mastering Python, you can obtain the necessary data through web scraping, process it, and then apply it directly to your project.

But choosing your preferred programming language doesn't have to be a zero-sum game. You can combine JavaScript and Python to get the best of both worlds.

At Apify Store you can try hundreds of existing web scraping solutions powered by the Apify SDK for free. As a next step, you can use output data from those ready-made solutions to take advantage of Python's extensive collection of data manipulation libraries by using Apify's Python API Client.

If you are still not sure how to get started with web scraping, you can request a custom solution to outsource the entire project to us, and we will take care of everything for you 😉

Finally, don't forget to join Apify's community on Discord to connect with other web scraping and automation enthusiasts. 🚀


Useful Resources

Want to learn more about web scraping and automation? Here are some recommendations for you:

👨‍💻 Web Scraping Academy

🔎 What is the difference between web scraping and crawling?

💬 Join our Discord


