When paired with a proxy, Puppeteer can be practically unstoppable; however, configuring Puppeteer with headless Chrome to authenticate a proxy that requires a username and password can be tricky.
Normally, when using a proxy that requires authentication in a non-headless browser (specifically Chrome), you're prompted to enter your credentials in a popup dialog that looks like this:
The problem with running in headless mode is that this dialog never appears, since a headless browser has no UI. This means you have to take other avenues to authenticate your proxy. Perhaps you've tried doing this (which doesn't work):
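Something along these lines (a rough sketch, with a hypothetical proxy host and placeholder credentials):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    // Trying to pass the credentials directly inside the --proxy-server URL.
    // Chromium ignores the "username:password@" part of this flag, so requests
    // routed through the proxy are rejected as unauthenticated.
    const browser = await puppeteer.launch({
        args: ['--proxy-server=http://myUsername:myPassword@proxy.example.com:8000'],
    });

    const page = await browser.newPage();
    await page.goto('https://example.com'); // this navigation fails - the proxy never gets valid credentials

    await browser.close();
})();
```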
Don't worry, we've tried it too. It doesn't work because Chromium doesn't offer a command-line option for passing in proxy credentials. Not to worry, though! Today, we'll show you four different (and very simple) methods that'll help you authenticate your proxy and be right on your way:
1. Using the authenticate() method on the Puppeteer page object:
For two years now, Puppeteer has shipped a baked-in solution for authenticating a proxy: the authenticate() method. Nowadays, this is the most common way to do it in vanilla Puppeteer.
There are two key things to note with this method (both shown in the sketch after this list):
The proxy URL must be passed into the --proxy-server flag within the args array when launching Puppeteer.
The authenticate() method takes an object with both "username" and "password" keys.
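Putting those two pieces together, here's a minimal sketch (the proxy host, port, and credentials below are placeholders):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    // 1. Pass the proxy URL (without credentials) via the --proxy-server flag.
    const browser = await puppeteer.launch({
        args: ['--proxy-server=http://proxy.example.com:8000'],
    });

    const page = await browser.newPage();

    // 2. Provide the credentials before navigating anywhere.
    await page.authenticate({ username: 'myUsername', password: 'myPassword' });

    await page.goto('https://example.com');
    console.log(await page.title());

    await browser.close();
})();
```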
2. Using the proxy-chain NPM package:
The proxy-chain package is an open-source package developed and maintained by Apify that takes a different approach: it lets you easily "anonymize" an authenticated proxy. You do this by passing your proxy URL (authentication details included) to the proxyChain.anonymizeProxy() method, then using its return value in the --proxy-server argument when launching Puppeteer, as sketched below.
An important thing to note when using this method is that after closing the browser, it is a good idea to call the closeAnonymizedProxy() method to forcibly close any connections that may still be pending.
This package performs both basic HTTP proxy forwarding and HTTP CONNECT tunneling to support protocols such as HTTPS and FTP. It also supports many other features, so it is worth looking into for other use cases.
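Here's a rough sketch of how the pieces fit together (again, the proxy URL and credentials are placeholders):

```javascript
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
    const originalProxyUrl = 'http://myUsername:myPassword@proxy.example.com:8000';

    // Spins up a local, unauthenticated proxy that forwards traffic to the
    // authenticated one and returns its URL (e.g. http://127.0.0.1:<port>).
    const anonymizedProxyUrl = await proxyChain.anonymizeProxy(originalProxyUrl);

    const browser = await puppeteer.launch({
        args: [`--proxy-server=${anonymizedProxyUrl}`],
    });

    const page = await browser.newPage();
    await page.goto('https://example.com');
    await browser.close();

    // Forcibly close any connections that may still be pending.
    await proxyChain.closeAnonymizedProxy(anonymizedProxyUrl, true);
})();
```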
3. Within ProxyConfigurationOptions in the Apify SDK:
The Apify SDK is the most modern and efficient way to write scalable automation and scraping software in Node.js using Puppeteer, Playwright, and Cheerio. If you aren't familiar with it, check out the docs here.
Within the ProxyConfigurationOptions object that you pass to the Apify.createProxyConfiguration() method, there is an option named proxyUrls. This is simply an array of custom proxy URLs to rotate through. Though it is an array, you can still pass just a single proxy URL.
Pass your proxy URL with authentication details into this array, then pass the proxyConfiguration into the options of PuppeteerCrawler, and your proxy will be used by the crawler.
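A minimal sketch of that setup, assuming the Apify SDK's PuppeteerCrawler and a placeholder authenticated proxy URL:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // A single authenticated proxy URL still goes in as an array.
    const proxyConfiguration = await Apify.createProxyConfiguration({
        proxyUrls: ['http://myUsername:myPassword@proxy.example.com:8000'],
    });

    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://example.com' },
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        proxyConfiguration,
        handlePageFunction: async ({ request, page }) => {
            console.log(`Title of ${request.url}: ${await page.title()}`);
        },
    });

    await crawler.run();
});
```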
The massive advantage of using the Apify SDK for proxies, as opposed to the first method, is that you can provide multiple custom proxies, and their rotation will be handled automatically.
4. Setting the Proxy-Authorization header:
If all else fails, setting the Proxy-Authorization header on each of your crawler's requests is an option; however, it has its drawbacks: it only works with HTTP websites, not HTTPS websites.
As with the first method, the proxy URL needs to be passed into the --proxy-server flag within args. The second step is to set an extra auth header on the page object using the setExtraHTTPHeaders() method.
It is important to note that your authorization details must be base64 encoded. This can be done with the Buffer class in Node.js.
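For example, a minimal sketch with a placeholder proxy and credentials:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        args: ['--proxy-server=http://proxy.example.com:8000'],
    });

    const page = await browser.newPage();

    // The Basic scheme expects "username:password" encoded as base64.
    const credentials = Buffer.from('myUsername:myPassword').toString('base64');

    await page.setExtraHTTPHeaders({
        'Proxy-Authorization': `Basic ${credentials}`,
    });

    // Plain HTTP only: the extra header isn't attached to HTTPS CONNECT requests.
    await page.goto('http://example.com');

    await browser.close();
})();
```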
Once again, this method only works for HTTP websites, not HTTPS websites.
A bit about proxies
Proxies have many uses, but in the scraping and automation world, one of the biggest reasons to use them is to deal with the various protections that certain sites deploy. For example, some sites can detect whether an IP address belongs to a server or to a genuine human being on their computer or phone, and block the requests coming from servers. The right proxies can be a lifesaver in these situations, as they make bots appear human and keep them from getting blocked. Learn more about this topic here.
If you’re looking for a proxy for web scraping or web automation purposes, make sure to check out Apify Proxy, an HTTP proxy service that gives you access to a large pool of datacenter and residential IP addresses, monitors the health of the IP pool, and intelligently rotates the IPs.