Headless browsers: what are they and how do they work?

Websites trying to detect headless browsers vs. headless browsers trying to evade detection is a cat-and-mouse game that probably won’t end any time soon. We have a couple of solutions to keep you ahead of that game.


We're Apify. You can build, deploy, share, and monitor any scrapers on the Apify platform. Check us out.

If you already know the basics, skip to web scraping with headless browsers and find out how to evade headless browser detection.

What is a headless browser?

A headless browser is a web browser without a GUI (or graphical user interface, if you hate acronyms). Headless browsers are prevalent in web scraping because they can render JavaScript and be programmed to behave like a human user, which helps avoid blocking. They're also popular for cross-browser testing because the browser renders everything programmatically rather than via the UI. That means you can run browsers on machines that have neither physical displays nor virtual displays (such as Xvfb) connected to them.

How a headless browser works

When a headless browser loads a web page, it sends a request to the web server, receives the HTML document in response, parses and renders the page, and executes any JavaScript code. In this sense, it’s no different from a standard browser. The difference is instead of rendering web pages on a screen, a headless browser provides access to web page content and functionality through a command-line interface (CLI) or application programming interface (API) to perform an action on the page.

In real-world scenarios, headless browsers are not controlled directly via their APIs, which are often complex and unintuitive, but through specialized libraries that wrap the headless browser's API in a format that's easier to work with. This is also a frequent point of confusion: people often mistake the libraries (sometimes called drivers) for the headless browsers themselves. More on that below when we talk about examples.

Are headless browsers faster?

Since headless browsers don’t have the overhead of rendering a web page on a screen, they tend to be faster than GUI-based browsers when running tests or rendering large numbers of pages. The difference is not staggering, though. Even a headless browser still needs to download all the assets, parse JavaScript, and render HTML – it just doesn't do it visually.

Interestingly, in web scraping scenarios, you may sometimes get better performance with a headful (opposite of headless) browser. That's because headful browsers are easier to program to behave as real human users. Many features of headful browsers need to be mimicked in their headless counterparts to make them look authentic. It's not impossible, but sometimes it might be faster to use a headful browser instead of fiddling with headless browser fingerprints.

What are examples of headless browsers?

Earlier, we said that there are headless browsers themselves, and then libraries used to control them, sometimes called drivers, which are often confused with the browsers. Let's take a look at which is which.

Examples of popular headless browsers include Headless Chrome and Splash, but almost any modern browser can be run headless now. The most popular libraries for controlling headless browsers are Puppeteer, Playwright, and Selenium.


Headless browsers

  • Chromium
    Chromium is probably the most popular browser that can run headless. It wasn't the first headless browser, but it was the first full-featured one. And since many current browsers, such as Edge and Brave, are based on Chromium, you can run those headless as well.
  • Google Chrome
    Since Chrome is built on Chromium, it can run headless as well. It's a little bulkier, and headless features reach it later than Chromium, but it's easier to mimic a real user with it.
  • Firefox
    Firefox can also run headless. Thanks to its native privacy settings, it's a great choice for blending in with other traffic or testing privacy features.
  • WebKit (Apple Safari)
    Even WebKit, the open-source browser engine behind Apple Safari, can run headless. It's especially useful for testing whether your website works well in Safari.
  • Splash
    A headless web browser written in Python with an HTTP API, Lua scripting support, and a built-in Python (Jupyter)-based IDE. It's not as popular as the others because it's not based on a real browser, but it's fast and can get the job done in simple scenarios.

Libraries that control headless browsers (drivers)

  • Playwright
    An open-source library built by Microsoft to automate Chromium, WebKit, and Firefox browsers with a unified API. It's available in many programming languages, including JavaScript (Node.js), Python, Java, and more. In our opinion, Playwright is by far the best library to run headless browsers these days.
  • Selenium
    An open-source suite of tools to automate web browsers across multiple platforms. Selenium is the king in terms of usage and community: it's old and it's huge. Compared with Playwright, though, it can sometimes feel slow and clunky.
  • Puppeteer
    An open-source Node.js library that automates Chromium and Chrome. Puppeteer is maintained by people close to the Chromium team. Playwright is essentially a more recent, improved copy of Puppeteer.

There are other libraries that helped shape the headless browser scene, such as PhantomJS and NightmareJS, but they have long been unmaintained. They deserve an honorable mention, but we wouldn't recommend using them.

What is the difference between Chrome and Chrome Headless?

Even though it may seem like it, Headless Chrome is not a browser. It's a feature of Google Chrome that allows you to run the Chrome browser headless, meaning that it doesn’t have a GUI and cannot be interacted with directly. Instead, it can be controlled programmatically using the DevTools protocol.

Luckily, you don't have to learn the DevTools protocol itself. Popular libraries like the above-mentioned Playwright and Puppeteer provide intuitive APIs to control headless Chrome. For automated testing, check out Playwright Test, which is a tailor-made library for headless browser testing. For web scraping with headless browsers, the most feature-packed open-source library is Crawlee, which uses both Puppeteer and Playwright in the background.
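To give a sense of what those libraries abstract away: under the hood, the DevTools protocol is JSON messages exchanged with the browser over a WebSocket. Below is a rough sketch of the message shape only; we don't open a real connection here, and libraries like Puppeteer handle this plumbing for you.

```javascript
// Sketch of a Chrome DevTools Protocol command. In real usage,
// this JSON is sent over a WebSocket to the browser's debugging
// endpoint, and the browser replies with a message carrying the
// same `id`. Puppeteer and Playwright do this for you.
const navigateCommand = {
  id: 1,                     // client-chosen request id
  method: 'Page.navigate',   // CDP "domain.command" name
  params: { url: 'https://example.com' },
};

// The wire format is plain JSON.
console.log(JSON.stringify(navigateCommand));
```

A single high-level call like Puppeteer's `page.goto()` translates into commands like this, plus the bookkeeping of waiting for the matching response and lifecycle events.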


Web scraping with headless browsers

When web scraping with a headless browser, you can do much more in comparison to making HTTP requests for static content. You can programmatically do anything a human could do with a browser (click elements, take screenshots, type into text areas), and since headless browsers are capable of loading the JavaScript contained on a website, dynamic content can be rendered, interacted with, and scraped.
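To illustrate why JavaScript rendering matters, consider a page whose visible content is filled in by a script after load. The page below is a made-up example: a plain HTTP scraper only downloads the initial markup, while a headless browser would execute the script and see the final content.

```javascript
// A made-up page whose visible content is created by JavaScript.
// This string is what a plain HTTP client downloads: note that
// the <div> is empty in the static markup.
const rawHtml = `
<html>
  <body>
    <div id="price"></div>
    <script>
      document.getElementById('price').textContent = '$42.00';
    </script>
  </body>
</html>`;

// A naive extraction against the static HTML comes back empty.
const staticMatch = rawHtml.match(/<div id="price">([^<]*)<\/div>/);
console.log(JSON.stringify(staticMatch[1])); // ""

// A headless browser would execute the inline script first, so the
// same query against the *rendered* DOM would return '$42.00'.
```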

There are also tools similar to headless browsers that don't involve a real browser at all and instead emulate one entirely in code. One such tool is the JSDOM library, which you can use to parse a web page and interact with it via JavaScript, much as you would with a headless browser. JSDOM is faster and less resource-intensive than a headless browser, but it does not implement all the features of a browser, so some websites simply won't work with it, or won't show all their content.

If you want to learn how to work with two of the most popular libraries for controlling headless browsers, Apify Academy has a free Puppeteer and Playwright course. Check it out!

How to detect headless browsers

Web scraping can be done ethically and legally, but that does not prevent website owners from employing anti-scraping measures to protect their data from being extracted. Understandably, they only want real users using real web browsers on their websites. Headless browsers are one of the ways to get your bots to emulate real users, which is why websites try to detect bots acting under the guise of a headless browser.

Websites detect that you’re using headless Chrome or a similar headless browser by finding small discrepancies in your browser’s behavior. Here are the most common strategies:

1. They check the user-agent header your browser is sending
2. They check other HTTP headers and header consistency
3. They check your browser's JavaScript web APIs, such as the Navigator object
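The first and third checks can be sketched in plain JavaScript. This is a simplified illustration, not a production-grade detector: real anti-bot systems combine many more signals (including the header-consistency checks, which are omitted here), and `looksHeadless` is a hypothetical helper name of our own.

```javascript
// Simplified sketch of headless-browser detection, illustrating
// only the user-agent and JavaScript web API checks above.
function looksHeadless(userAgent, navigatorLike) {
  // 1. The user-agent header: headless Chrome identifies itself
  //    with a "HeadlessChrome" token by default.
  if (/HeadlessChrome/.test(userAgent)) return true;

  // 3. JavaScript web APIs: automation frameworks set
  //    navigator.webdriver to true, and headless browsers often
  //    report an empty plugin list.
  if (navigatorLike.webdriver === true) return true;
  if ((navigatorLike.plugins || []).length === 0) return true;

  return false;
}

// Default headless Chrome gives itself away via the user-agent.
console.log(looksHeadless(
  'Mozilla/5.0 ... HeadlessChrome/108.0.0.0 Safari/537.36',
  { webdriver: false, plugins: [1] },
)); // true

// A regular browser with plugins and no webdriver flag passes.
console.log(looksHeadless(
  'Mozilla/5.0 ... Chrome/108.0.0.0 Safari/537.36',
  { webdriver: false, plugins: [1, 2, 3] },
)); // false
```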

Can you make a headless browser undetectable?

There are two ways you can solve the above problems and make Chrome Headless or any other headless browser undetectable.

  1. Modify the user-agent
  2. Change your browser fingerprint

Modifying the user-agent is fast and easy, and it can work against naive or old anti-scraping protections, but in most cases, you'll have to pair it with updating the fingerprint as well. Neither method is entirely bulletproof, though.

What is a user-agent?

The user agent is a piece of information sent by the browser to the server with every HTTP request. It identifies the browser, its version, and the operating system it is running on. Headless Chrome sets a special user agent string to identify itself as a headless browser, and a website can use this string to determine if it’s being accessed from a headless browser. The string will look something like this:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/108.0.0.0 Safari/537.36

With the recent advances in privacy online, user-agent strings are being phased out, because they contain information that can be used to identify people. While it is a noble effort, you can still get all the same information and much more thanks to browser fingerprinting.

What is browser fingerprinting?

Browser fingerprinting is a technique that can be used to identify a web browser. It involves collecting information about the browser and the device it's running on, such as the version of the browser, the operating system, the language settings, and the installed plugins, and creating a unique "fingerprint" based on this information. This fingerprint can then be used to track a user's activity across different websites and devices, even if the user is using private browsing mode or has deleted their cookies. It can also identify whether a browser is a bot or a real user. That’s why changing browser fingerprints when doing browser-based scraping significantly reduces blocking.
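To make the idea concrete, here is a minimal sketch of the principle: collect a handful of browser attributes and reduce them to a single identifier. The attribute list and the FNV-1a hash are our own simplification; real fingerprinting gathers far more signals (canvas, WebGL, fonts, audio, and so on).

```javascript
// Minimal sketch of browser fingerprinting: combine a few
// browser/device attributes into a single stable identifier.
function fingerprint(attrs) {
  // Serialize the attributes in a stable order.
  const material = [
    attrs.userAgent,
    attrs.language,
    attrs.platform,
    attrs.screenResolution,
    (attrs.plugins || []).join(','),
  ].join('||');

  // FNV-1a: a simple non-cryptographic 32-bit hash.
  let hash = 0x811c9dc5;
  for (let i = 0; i < material.length; i++) {
    hash ^= material.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

const chromeOnMac = {
  userAgent: 'Mozilla/5.0 ... Chrome/108.0.0.0 Safari/537.36',
  language: 'en-US',
  platform: 'MacIntel',
  screenResolution: '2560x1440',
  plugins: ['PDF Viewer', 'Chrome PDF Viewer'],
};

// The same attributes always produce the same fingerprint, which
// is what makes cross-site tracking possible even without cookies.
console.log(fingerprint(chromeOnMac) === fingerprint({ ...chromeOnMac })); // true

// Changing a single attribute will (almost always) yield a
// different fingerprint, which is why rotating fingerprints
// helps scrapers avoid being grouped together and blocked.
console.log(fingerprint({ ...chromeOnMac, language: 'de-DE' }));
```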

Modifying the user-agent

Here’s an example of how to bypass Headless Chrome detection by setting a user-agent string in Puppeteer headless mode. This will override the default headless Chrome user-agent string:

import puppeteer from 'puppeteer';

const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// We set the custom user-agent before navigating to the page.
await page.setUserAgent(userAgent);

await page.goto('https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html');
await page.screenshot({ path: 'result.png' });

await browser.close();

Changing your browser fingerprint

Changing browser fingerprints resolves both the problem of checking the HTTP header and the browser’s JS web APIs. Here’s one way to solve the latter problem with Puppeteer:

await page.evaluateOnNewDocument(() => {
  // Overwrite the `plugins` property to use a custom getter.
  Object.defineProperty(navigator, 'plugins', {
    // This just needs to have `length > 0` for the current test,
    // but we could mock the plugins too if necessary.
    get: () => [1, 2, 3, 4, 5, 6, 7, 8],
  });
});

Still, changing browser fingerprints can be tedious, and there's an easier way that solves both of the problems mentioned above at once. You can automate the task with an open-source web-scraping library called Crawlee. Fingerprint generation is enabled by default and available in Crawlee's PlaywrightCrawler and PuppeteerCrawler classes.

If you need to narrow down the fingerprints used, you can customize the generation algorithm. Below are examples of doing that with PlaywrightCrawler and PuppeteerCrawler.

PlaywrightCrawler:

PlaywrightCrawler is a class that encapsulates management of browsers with Playwright in web-scraping scenarios and adds many useful features like automatic retries, proxy and fingerprint rotation, URL queues and more.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // You can customize parameters of the fingerprint
    // or leave it to Crawlee to automatically choose
    // and rotate the fingerprints for you.
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{
                    name: 'edge',
                    minVersion: 96,
                }],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
    // ...
});


PuppeteerCrawler:

PuppeteerCrawler does the same thing, but with Puppeteer. You can easily switch between the libraries simply by swapping the two class names. Their interfaces are identical.

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // You can customize parameters of the fingerprint
    // or leave it to Crawlee to automatically choose
    // and rotate the fingerprints for you.
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome', 'firefox'],
                devices: ['mobile'],
                locales: ['en-US'],
            },
        },
    },
    // ...
});

Switch your crawlers from HTTP to headless browsers in 3 lines of code. Crawlee builds on top of Puppeteer and Playwright and adds its own anti-blocking features and human-like fingerprints.

Don't stop here

We hope these tips help you stay headless and that this article has whetted your appetite for more information about web scraping with headless browsers.

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
