Running Puppeteer in Docker
In a standard software engineering team, code testing is one of the procedures followed before shipping products to production. Unit testing, end-to-end (E2E) testing, and automating user actions are different ways to test the usability and functionality of your web application. Puppeteer shines in all these areas.
Puppeteer is a Node.js library that allows you to interact with headless and headful Chrome or Chromium browsers to perform various tasks, such as taking screenshots, generating PDFs from web pages, and automating file downloads and uploads.
However, running Puppeteer consistently across different environments can be challenging. Docker, a containerization tool, helps you create isolated environments that run consistently across any infrastructure. Therefore, running Puppeteer in Docker simplifies the setup process and ensures that it works perfectly in every environment, from a developer's machine to the production server.
What is a container?
If you're not familiar with containers in programming, think of them as large, strong boxes used for shipping big items from one place to another. Manufacturers and wholesalers who purchase goods in bulk from different parts of the world load them into a container and ship them to their destination. When the container arrives, the items inside it are the same as those that were loaded unless they were tampered with during transport.
With that understanding, a container, in the programming context, refers to a lightweight, standalone package that contains everything (resources, dev dependencies, code, settings, etc.) needed to run an application. Having many containers on a single system does not cause conflict because they're isolated from each other, which makes them more efficient, faster, and consistent across different environments.
How does Docker work?
Docker simplifies the process of managing containers and allows developers to package an application and its dependencies into a single unit (a container) that can then be shipped and run on other machines without errors. In other words, Docker lets you develop, ship, and run containerized applications.
Docker containers are portable, ensuring that an application runs the same way in different environments.
Docker Image vs. Container
The two terms are easy to confuse, but they're different and closely related. A Docker image is an executable package that includes everything needed to run an application, while a container is the execution environment created from a Docker image.
According to Usama Jamil:
The Docker image is the blueprint, while the Docker container is the live, operational version of that blueprint that brings your application to life in a controlled and predictable manner.
-- Python on Docker: how to dockerize Python applications
Setting up Docker
- You can install Docker on your Windows, Linux, or Mac device by following the instructions for your operating system in the Docker documentation.
- Verify your installation by running `docker --version` in your terminal. You can also run `docker run hello-world`, which downloads a `hello-world` test image and runs it in a container; if successful, it prints “Hello from Docker!” (see the example below).
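A quick check might look like this:

```bash
docker --version        # prints the installed Docker version
docker run hello-world  # pulls the test image and prints "Hello from Docker!"
```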
Basic Docker commands
- Pulling an image: `docker pull image-name`, where `image-name` is the Docker image you want to pull from Docker Hub.
- Running a container: `docker run image-name`.
- Managing containers: `docker ps` lists all running containers; `docker stop container-id` stops the container whose ID you pass in the command; `docker rm container-id` removes the container whose ID you pass in the command.
- Building a Docker image: `docker build -t your-image-name .` builds an image from the Dockerfile in your current working directory (a typical session combining these commands is sketched below).
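For example, a typical session might look like this (the image name and container ID are placeholders):

```bash
docker pull node:20.11.1    # download an image from Docker Hub
docker ps                   # list running containers
docker stop <container-id>  # stop a running container by its ID
docker rm <container-id>    # remove a stopped container
docker build -t my-image .  # build an image from the Dockerfile in the current directory
```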
Now that you're familiar with Puppeteer and Docker, you can move on to creating and running Puppeteer in a Docker container.
Creating the Dockerfile
The Dockerfile is a script that contains the commands and dependencies that will assemble the Docker image. Below is the step-by-step procedure for creating a Dockerfile for running Puppeteer.
- Creating the base image: You can start with a `node` image as the base.
- Dependencies: Install `puppeteer`. If any other libraries your Puppeteer package depends on need installing, include those commands as well.
- Application: Copy your application files into the image.

Run `npm init -y` to initialize Node in your project folder. By now, a `package.json` file will have been created for you. Modify it to look like the sample below. (You can change the author to your name if you like.)
{
"name": "docker-ppt",
"version": "1.0.0",
"description": "a simple project for testing how to run puppeteer in a docker container",
"main": "puppeteer-script.js",
"type": "module",
"scripts": {
"test": "echo \\"Error: no test specified\\" && exit 1"
},
"keywords": [
"puppeteer",
"docker",
"container"
],
"author": "Apify",
"license": "ISC"
}
The next step is to create a Puppeteer script that you will run in the Docker container to confirm the successful setup of the container. Create a file in your code editor, puppeteer-script.js
, and put the code below in it.
Puppeteer script
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
const page = await browser.newPage();
try {
// Navigate to the page
await page.goto('https://blog.apify.com', { waitUntil: 'load', timeout: 60_000 }); // 60 seconds timeout
// Take a screenshot
await page.screenshot({ path: 'apify.jpeg', fullPage: true });
console.log('Puppeteer script executed successfully.');
} catch (error) {
console.error('Error running Puppeteer script:', error);
} finally {
// Close the browser
await browser.close();
}
The final step is to create a `Dockerfile` in the project directory. Create a file named `Dockerfile` and put the code below in it.
Dockerfile
FROM node:20.11.1
# Install the latest Chrome dev package and necessary fonts and libraries
RUN apt-get update \
    && apt-get install -y wget gnupg \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /usr/share/keyrings/googlechrome-linux-keyring.gpg \
    && echo "deb [arch=amd64 signed-by=/usr/share/keyrings/googlechrome-linux-keyring.gpg] https://dl-ssl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-khmeros fonts-kacst fonts-freefont-ttf libxss1 dbus dbus-x11 \
        --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && groupadd -r apify && useradd -rm -g apify -G audio,video apify
# Determine the path of the installed Google Chrome
RUN which google-chrome-stable || true
# Switch to the non-root user
USER apify
# Set the working directory
WORKDIR /home/apify
# Copy package.json and package-lock.json
COPY --chown=apify:apify package*.json ./
# Configure Puppeteer to skip the Chromium download and use the installed
# Google Chrome instead (adjust the path if `which google-chrome-stable`
# reported a different location). These variables must be set before
# `npm install` runs so the download is actually skipped.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/google-chrome-stable
# Install Puppeteer without downloading bundled Chromium
RUN npm install puppeteer --no-save
# Copy your Puppeteer script into the Docker image
COPY --chown=apify:apify . .
# Set the command to run your Puppeteer script
CMD ["node", "puppeteer-script.js"]
Breakdown of the Dockerfile
- `FROM node:20.11.1`: Specifies the base image for the Docker container, a specific Node.js version 20 image.
- The `RUN` line updates the package lists for the Debian-based image, installs `wget` and `gnupg` (needed for adding the Google Chrome repository), adds the Google Chrome signing key, adds the Chrome repository to the system's software sources, and then installs Google Chrome along with various fonts and libraries necessary for Chrome to run correctly. It also installs `dbus` and `dbus-x11`, the system message buses and interfaces Chrome requires. The `--no-install-recommends` option minimizes the size of the installation by skipping optional packages.
- `&& rm -rf /var/lib/apt/lists/*`: Removes the package lists downloaded by `apt-get update` to keep the Docker image size small.
- `&& groupadd -r apify && useradd -rm -g apify -G audio,video apify`: Creates a non-root user named `apify` (with a group also named `apify`) and adds the user to the `audio` and `video` groups. Running as a non-root user is a security best practice for applications inside Docker containers.
- `RUN which google-chrome-stable || true`: Attempts to locate the `google-chrome-stable` executable on the system's PATH and prints its location. The `|| true` ensures this command doesn't cause the build to fail if the executable isn't found. This step is purely for debugging and doesn't affect the Docker image.
- `USER apify`: Switches the user context to the `apify` user for subsequent commands and when running the container.
- `WORKDIR /home/apify`: Sets the working directory inside the container to `/home/apify`.
- `COPY --chown=apify:apify package*.json ./`: Copies `package.json` and `package-lock.json` (if they exist) into the working directory, setting ownership to the `apify` user.
- `ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true PUPPETEER_EXECUTABLE_PATH=/usr/bin/google-chrome-stable`: Sets environment variables that tell Puppeteer to skip its Chromium download and to use the installed Google Chrome as its browser executable. These are set before the install step so they take effect when Puppeteer's install script runs.
- `RUN npm install puppeteer --no-save`: Installs Puppeteer without saving it to the `package.json` dependencies and without downloading Chromium, since the image is configured to use the installed Google Chrome.
- `COPY --chown=apify:apify . .`: Copies the rest of the application's files into the working directory, setting ownership to the `apify` user (see the note below).
- `CMD ["node", "puppeteer-script.js"]`: Runs the Puppeteer script with Node.js when the container starts.
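One optional refinement, which is an assumption on my part rather than part of the original setup: if you've run `npm install` locally, the `COPY . .` step will also copy your host's `node_modules` folder into the image, overwriting the modules installed in the container. A `.dockerignore` file in the project root avoids that:

```
# .dockerignore (hypothetical addition): keep host artifacts out of the image
node_modules
apify.jpeg
```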
Building and running the Docker image
- To build your Puppeteer Docker image, start the Docker Desktop application installed earlier if you had it closed. You can then use your code editor's terminal to run the `docker build -t docker-ppt .` command; `docker-ppt` can be any name of your choice. After a successful build, you can run the image.
- To run the image, run the `docker run docker-ppt` command in the terminal (the full sequence is shown below).
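Put together, and assuming the image name used above, the build-and-run sequence looks like this:

```bash
# Build the image from the Dockerfile in the current directory
docker build -t docker-ppt .

# Run a container from the image; on success the script logs:
docker run docker-ppt
# -> Puppeteer script executed successfully.
```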
With all the setup done, the Docker environment will be consistent regardless of where you are running the container.
Output
- After the container has run, you should see a success message in the terminal. You can then open Docker Desktop, click on the Containers menu, click on the container name, and open the Logs tab. You'll see the same message you saw in your terminal (either success or failure).
- The second step is to check for the screenshot in the container's file system. Click on the Files tab, and the container's file tree will load. Click on `home`, select the user (`apify` in this example), scroll down a bit, and you'll find the screenshot (or copy it out from the command line, as shown below).
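A minimal sketch of the command-line route, assuming the container ID reported by `docker ps -a`:

```bash
# Find the ID of the container that just ran
docker ps -a

# Copy the screenshot from the container's working directory to the host
docker cp <container-id>:/home/apify/apify.jpeg ./apify.jpeg
```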
You have successfully created and run Puppeteer in a Docker container. You can find all the code in this GitHub repository.
Performance and issues
To optimize performance, consider running Puppeteer with the `--no-sandbox` flag in Docker, but be aware of the security implications. Also, if you encounter issues, check that all the necessary dependencies are installed in your Dockerfile and that your Puppeteer script references paths and URLs correctly.
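One useful debugging aid when things go wrong inside the container: Puppeteer's `dumpio` launch option pipes the browser process's own stdout and stderr into your Node process, so Chrome errors (such as missing system libraries) show up in `docker logs`. A minimal sketch:

```javascript
import puppeteer from 'puppeteer';

// dumpio forwards the browser's own output, which helps diagnose
// crashes caused by missing system libraries inside the container.
const browser = await puppeteer.launch({
    headless: 'new',
    dumpio: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
```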
The challenges of memory consumption
A critical issue in a containerized environment like Docker is the unpredictable memory consumption of different websites when they're automated with tools like Puppeteer: the memory footprint can vary significantly from one site to another.
For instance, automating tasks on a resource-heavy site like Airbnb might consume several gigabytes of memory per open tab, whereas a more lightweight site like Wikipedia might only use a fraction of that.
Processing multiple pages concurrently in a scenario where the pages of one site are consuming much more memory than the pages of another can lead to significant discrepancies in resource allocation. This variability makes it difficult to optimize resource usage and maintain stability, especially when scaling to handle multiple pages or tasks simultaneously.
The solution to memory consumption: Crawlee and AutoscaledPool
Crawlee is a web scraping and browser automation library developed by Apify, and one of its core features is the AutoscaledPool. It is designed to efficiently manage multiple resource-intensive tasks that can be executed simultaneously. The pool dynamically adjusts the number of tasks it runs based on the system's CPU and memory availability and the state of the JavaScript event loop. Below is a detailed explanation of its key aspects and functionalities:
- Task management: `AutoscaledPool` only starts new tasks if there is enough CPU power and memory available and the JavaScript event loop is not overloaded. It relies on the `Snapshotter` class for information about CPU and memory usage, which can come from local system resources or the Apify cloud infrastructure.
- Core functions: To use `AutoscaledPool`, you need to implement three functions: `runTaskFunction`, `isTaskReadyFunction`, and `isFinishedFunction`. These define the tasks to run, whether more tasks are ready to be processed, and whether the pool should finish running.
- Running the pool: The pool starts when you call `AutoscaledPool.run()`, then periodically checks whether more tasks are ready or whether it should finish. It manages optimal concurrency and continues until all tasks are completed or an error occurs.
- Concurrency control: `AutoscaledPool` has properties to control concurrency, including:
  - `minConcurrency`: the minimum number of tasks running in parallel.
  - `maxConcurrency`: the maximum number of tasks running in parallel.
  - `desiredConcurrency`: an estimate of the number of parallel tasks the system can currently support.
  - `currentConcurrency`: the number of parallel tasks currently running.
- Aborting the pool: The `abort()` method can be used to stop the pool. It resolves the promise returned by `run()` but does not guarantee that all running tasks are aborted.
- Example usage: The documentation provides an example of instantiating and running the pool with the specified task functions; a minimal sketch follows below.
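Here is a minimal sketch, assuming the `crawlee` package is installed and using the three required functions named above (the URL list and the per-task work are placeholders):

```javascript
import { AutoscaledPool } from 'crawlee';

// Placeholder work queue: one task per URL.
const urls = ['https://blog.apify.com', 'https://apify.com'];
let nextIndex = 0;

const pool = new AutoscaledPool({
    minConcurrency: 1,
    maxConcurrency: 10,
    // Claim the next URL and do the (placeholder) work for it.
    runTaskFunction: async () => {
        const url = urls[nextIndex++];
        console.log(`Processing ${url}...`);
        // ...launch Puppeteer and process `url` here
    },
    // More tasks are ready while unclaimed URLs remain.
    isTaskReadyFunction: async () => nextIndex < urls.length,
    // The pool should finish once every URL has been claimed.
    isFinishedFunction: async () => nextIndex >= urls.length,
});

await pool.run();
```

The pool scales the number of concurrent `runTaskFunction` calls up and down between the two concurrency bounds based on the resource snapshots described above.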
This feature is particularly useful in scenarios where resource-intensive tasks need to be managed efficiently, ensuring optimal use of system resources without overloading the CPU or memory.
Conclusion and next steps
By following this guide, you've learned how to set up Puppeteer in a Docker container. This setup ensures that your automated browser tasks and tests run consistently across different environments.
Now you can experiment with more complex Puppeteer scripts and Docker configurations to enhance your web development and testing capabilities.