How to make headless Chrome and Puppeteer use a proxy server with authentication
We have released a new open-source package called proxy-chain on NPM to enable running headless Chrome and Puppeteer over a proxy server that requires authentication.
The addition of headless mode to Google Chromium and the launch of the corresponding Node.js API called Puppeteer by Google earlier this year have made it extremely simple for developers to automate actions on the web, such as filling in forms or saving screenshots of web pages. However, there is a catch. If you want the web browser to use a proxy server that requires authentication, you’re out of luck.
To make Chromium use a custom proxy server, pass the --proxy-server command-line option with the proxy URL as its value, in the form chrome --proxy-server=<proxy URL>.
Note that chrome must be an alias of your Chromium executable (see how to do this). You need to use Chromium rather than Chrome, as Chrome doesn’t support the --proxy-server option in the non-headless (headful?) mode.
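When Chromium is launched through Puppeteer rather than from the shell, the same flag goes into the args launch option. Here is a minimal sketch, assuming the puppeteer package is installed. The address proxy.example.com:8000 is a placeholder, and both helper names are made up for illustration:

```javascript
// Builds the Chromium flag for a given proxy URL (hypothetical helper).
function proxyServerArg(proxyUrl) {
    return `--proxy-server=${proxyUrl}`;
}

// Launches headless Chromium behind the proxy (hypothetical helper).
// puppeteer is required lazily so the helper above works without it.
async function launchWithProxy(proxyUrl) {
    const puppeteer = require('puppeteer');
    return puppeteer.launch({ args: [proxyServerArg(proxyUrl)] });
}

// Usage (placeholder address, proxy without authentication):
// const browser = await launchWithProxy('http://proxy.example.com:8000');
```

This only covers a proxy that needs no credentials, which is the case the rest of this post tries to get around.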
If the proxy server requires authentication (by responding with the 407 Proxy Authentication Required HTTP status code), the browser will show a dialog prompting you to enter a username and password.
However, if you start Chromium in headless mode, there is no such dialog, because, you know, the browser has no windows. Chromium provides no command-line option to pass the proxy credentials, and neither Puppeteer’s API nor the underlying Chrome DevTools Protocol (CDP) provides any way to pass them to the browser programmatically. It turns out there is no simple way to force headless Chromium to use a specific proxy username and password. (Edit from 2018-01-09: Puppeteer’s page.authenticate() function also seems to address this problem; see the note at the bottom.)
To work around this limitation of Chromium, you can set up an open local proxy server that will forward data to an upstream authenticated proxy, and then let Chromium use the local open proxy. For example, such a proxy chain can be created using Squid and its cache_peer configuration directive. The Squid configuration file (squid.conf) might look as follows:
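As a rough sketch only, a configuration along these lines would set up such a chain. The upstream host proxy.example.com, port 8000, and the credentials bob:password123 are placeholders, and the predefined localhost ACL is assumed to be available in your Squid version:

```
# squid.conf (sketch; host, port, and credentials are placeholders)

# Local open proxy that Chromium will connect to
http_port 3128

# Forward requests to the authenticated upstream proxy
cache_peer proxy.example.com parent 8000 0 no-query default login=bob:password123

# Never contact origin servers directly, always go through the parent
never_direct allow all

# Accept connections from the local machine only
http_access allow localhost
http_access deny all
```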
Now the proxy should be running locally on port 3128 so that it can be used by Chromium:
chrome --proxy-server=http://localhost:3128
This approach becomes tedious if you want to use it programmatically from your code or if you need to change proxies on the fly. In such a case, you’ll need to dynamically update the Squid configuration or launch a separate Squid instance for each proxy.
Here at Apify, we went down that path during our work on the apify NPM package and it turned out to be a nightmare. The Squid processes would spontaneously hang or not start at all, each platform behaved differently, etc.
To hack around this, we developed a new NPM package called proxy-chain and released it as open-source on GitHub. With it you can easily “anonymize” an authenticated proxy and then launch headless Chromium with Puppeteer using the following Node.js code:
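A minimal sketch of that flow, assuming the puppeteer and proxy-chain packages are installed. The function name, the upstream proxy URL, the target page, and the output path are placeholders chosen for illustration:

```javascript
// Takes a screenshot of targetUrl through an authenticated upstream proxy.
// Sketch only: assumes `npm install puppeteer proxy-chain`.
async function screenshotViaAuthenticatedProxy(upstreamProxyUrl, targetUrl, outputPath) {
    // Required lazily so that nothing is loaded until the function is called.
    const puppeteer = require('puppeteer');
    const proxyChain = require('proxy-chain');

    // Starts a local open proxy that forwards to the authenticated upstream
    // proxy and resolves to its URL, something like "http://127.0.0.1:45678".
    const localProxyUrl = await proxyChain.anonymizeProxy(upstreamProxyUrl);

    let browser;
    try {
        browser = await puppeteer.launch({
            args: [`--proxy-server=${localProxyUrl}`],
        });
        const page = await browser.newPage();
        await page.goto(targetUrl);
        await page.screenshot({ path: outputPath });
    } finally {
        if (browser) await browser.close();
        // Shuts down the local proxy and force-closes pending connections.
        await proxyChain.closeAnonymizedProxy(localProxyUrl, true);
    }
    return localProxyUrl;
}

// Usage (placeholder credentials and hosts):
// await screenshotViaAuthenticatedProxy(
//     'http://bob:password123@proxy.example.com:8000',
//     'https://www.example.com',
//     'example.png'
// );
```

The try/finally block is a design choice of this sketch: it makes sure the local proxy is closed even if navigation or the screenshot fails.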
The proxy-chain package performs both basic HTTP proxy forwarding and HTTP CONNECT tunneling to support protocols such as HTTPS and FTP. The package has many more features that we are using for our upcoming projects, so stay tuned on Twitter.
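To illustrate what CONNECT tunneling through an authenticated upstream proxy involves, here is a simplified sketch built only on Node.js’s http module. It is not the proxy-chain implementation: the function names and the shape of the upstream object are assumptions, and plain HTTP forwarding is left out.

```javascript
const http = require('http');

// Builds the Proxy-Authorization header value for HTTP Basic authentication.
function basicProxyAuth(username, password) {
    return 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64');
}

// Creates (but does not start) a local open proxy that tunnels CONNECT
// requests through an authenticated upstream proxy.
// `upstream` is a hypothetical { host, port, username, password } object.
function createTunnelingProxy(upstream) {
    const server = http.createServer((req, res) => {
        // Plain HTTP forwarding is omitted from this sketch.
        res.writeHead(501);
        res.end('Only CONNECT is handled in this sketch');
    });

    server.on('connect', (clientReq, clientSocket, head) => {
        // Ask the upstream proxy for a tunnel to the same target,
        // adding the credentials the browser could not supply.
        const upstreamReq = http.request({
            host: upstream.host,
            port: upstream.port,
            method: 'CONNECT',
            path: clientReq.url, // e.g. "www.example.com:443"
            headers: {
                Host: clientReq.url,
                'Proxy-Authorization': basicProxyAuth(upstream.username, upstream.password),
            },
        });

        upstreamReq.on('connect', (upstreamRes, upstreamSocket, upstreamHead) => {
            if (upstreamRes.statusCode !== 200) {
                clientSocket.end('HTTP/1.1 502 Bad Gateway\r\n\r\n');
                upstreamSocket.destroy();
                return;
            }
            clientSocket.write('HTTP/1.1 200 Connection Established\r\n\r\n');
            // Replay bytes that arrived early, then pipe in both directions.
            if (upstreamHead.length) clientSocket.write(upstreamHead);
            if (head.length) upstreamSocket.write(head);
            upstreamSocket.pipe(clientSocket);
            clientSocket.pipe(upstreamSocket);
            upstreamSocket.on('error', () => clientSocket.destroy());
            clientSocket.on('error', () => upstreamSocket.destroy());
        });

        upstreamReq.on('error', () => clientSocket.end('HTTP/1.1 502 Bad Gateway\r\n\r\n'));
        upstreamReq.end();
    });

    return server;
}

// Usage (placeholder upstream proxy):
// createTunnelingProxy({ host: 'proxy.example.com', port: 8000,
//     username: 'bob', password: 'password123' }).listen(3128);
```

Once the tunnel is established, the local proxy only copies bytes between the two sockets, which is why this approach works for HTTPS without the proxy ever seeing the decrypted traffic.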
We’re looking forward to hearing your feedback. If you find any problem with the package, please submit an issue on GitHub or preferably create a pull request :)
If you’re looking for a proxy for web scraping, make sure to check out Apify Proxy, an HTTP proxy service that gives you access to both datacenter and residential IP addresses, enables intelligent IP address rotation, and much more.
Happy crawling with proxies!
Edit (2018-01-09): A few people pointed out that Puppeteer has a new function called page.authenticate(), which internally uses the CDP’s AuthChallengeResponse object to pass credentials to the Chrome login dialog. Indeed, this approach seems to work, although one might think of situations where the function will not be adequate. For example, what if you want to open a web page that needs basic authentication via the authenticated proxy?
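For comparison, a minimal sketch of the page.authenticate() approach, assuming the puppeteer package is installed. The helper names and the proxy URL are placeholders for illustration:

```javascript
// Splits a proxy URL with embedded credentials into the part Chromium
// accepts on the command line and the credentials (hypothetical helper).
function splitProxyUrl(proxyUrl) {
    const url = new URL(proxyUrl);
    return {
        server: `${url.protocol}//${url.host}`,
        username: decodeURIComponent(url.username),
        password: decodeURIComponent(url.password),
    };
}

// Opens targetUrl through the authenticated proxy (hypothetical helper).
// The caller is responsible for closing the returned browser.
async function openPageViaProxy(proxyUrl, targetUrl) {
    const puppeteer = require('puppeteer'); // required lazily
    const { server, username, password } = splitProxyUrl(proxyUrl);

    const browser = await puppeteer.launch({ args: [`--proxy-server=${server}`] });
    const page = await browser.newPage();
    // Supplies the credentials used to answer authentication challenges.
    await page.authenticate({ username, password });
    await page.goto(targetUrl);
    return { browser, page };
}

// Usage (placeholder credentials and hosts):
// const { browser } = await openPageViaProxy(
//     'http://bob:password123@proxy.example.com:8000',
//     'https://www.example.com'
// );
// await browser.close();
```

Because page.authenticate() takes a single username and password, the sketch has no obvious place for a second set of credentials, which is the limitation described above.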