Booking.com provides accommodation all over the world, so there’s lots of data available on the site. The user interface is quite friendly for a human user, but getting that data in a machine-processable format is not a simple task, since there is no official Booking.com API. That’s where our new Booking.com actor comes in.
The actor is capable of extracting the data in several ways, depending on the starting conditions and expected comprehensiveness of the output.
To use the actor, just navigate to its library page and click Try actor. Or read on for a short guide to getting the most out of scraping Booking.com.
The actor can be started using an API, or directly from the Apify platform using a graphical user interface.
The input supports most of the same attributes as the official Booking.com page. In raw JSON format, the same input would look like this:
The actor is usually started using at least the search input attribute, but it is also possible to start it with direct Booking.com URLs using a startUrls input attribute containing an array of URLs. For more information on this option, check out the README.
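To make the two starting options concrete, here is a minimal sketch in Python that builds an input from either a search term or a list of URLs. The attribute names search and startUrls come from this article. The build_input helper, the example values and the exact shape of each startUrls entry are assumptions for illustration, so check the README for the authoritative format.

```python
import json
from typing import Any, Dict, List, Optional


def build_input(search: Optional[str] = None,
                start_urls: Optional[List[str]] = None,
                **options: Any) -> Dict[str, Any]:
    """Build an actor input from a search term, direct URLs, or both.

    build_input is a hypothetical helper, not part of the actor.
    """
    if not search and not start_urls:
        raise ValueError("Provide a search term or at least one start URL")
    actor_input: Dict[str, Any] = dict(options)
    if search:
        actor_input["search"] = search
    if start_urls:
        # The article describes startUrls as an array of URLs; the exact
        # shape of each entry is an assumption here.
        actor_input["startUrls"] = list(start_urls)
    return actor_input


# Placeholder values, chosen only for illustration.
by_search = build_input(search="Prague")
by_urls = build_input(
    start_urls=["https://www.booking.com/searchresults.html?ss=Prague"]
)

print(json.dumps(by_search, indent=2))
print(json.dumps(by_urls, indent=2))
```

The resulting dictionaries can be serialized to JSON and pasted into the raw input editor, or sent when starting the actor through the API.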
The simplest way to use the actor is to scrape data from the hotel list pages only, i.e. the crawler will not navigate to the detail pages. This is also the fastest possible way of scraping data from Booking.com, since the number of pages to extract is significantly reduced. Unless you need more details about each hotel, this is the most efficient option.
This type of scraping can be enabled by setting the simple input attribute or selecting the Extract hotel list pages only (no detail) option in the graphical interface.
For simpler cases, this is usually enough. Data will be extracted for all of the hotels in the list, and from the subsequent pagination pages. If the maxPages attribute is set, only that number of pagination pages will be navigated to. Data for the first hotel in this list looks like this:
The latitude and longitude data is not visibly present on the page, but it is possible to extract it from the page HTML.
If the simple scrape is not enough, leave the simple input attribute unset to use the more comprehensive option. The crawler will then navigate to every single hotel detail page and extract data from it. This increases the necessary crawling time accordingly (roughly 15 times longer), but the results will contain more data.
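A rough way to see where a slowdown of that order can come from is to count the pages the crawler has to load in each mode. The sketch below is only back-of-the-envelope arithmetic: the pages_to_load helper and the figure of 15 hotels per list page are assumptions for illustration, not measured values from the actor.

```python
import math


def pages_to_load(hotels: int, hotels_per_list_page: int, simple: bool) -> int:
    """Estimate how many pages are loaded for a given number of hotels."""
    list_pages = math.ceil(hotels / hotels_per_list_page)
    if simple:
        # Simple mode: only the list (pagination) pages are visited.
        return list_pages
    # Detailed mode: every hotel adds one detail page on top of the list pages.
    return list_pages + hotels


# Assumed numbers: 150 hotels, 15 hotels per list page.
simple_pages = pages_to_load(150, 15, simple=True)
detailed_pages = pages_to_load(150, 15, simple=False)

print(simple_pages, detailed_pages, detailed_pages / simple_pages)
```

With these assumed numbers the detailed scrape loads 160 pages instead of 10, which is in the same range as the slowdown mentioned above.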
The detailed output now includes information about breakfast (not present for this specific hotel), check-in and check-out times, and a more detailed address. It was not possible, however, to get data about any of the rooms. This is because Booking.com only displays room information when specific dates of stay are set, i.e. to get the room data, you need to set the checkIn and checkOut input attributes.
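The conditions for getting room data can be summed up in a small check. This is a sketch only: the attribute names simple, checkIn and checkOut come from this article, while the room_data_expected helper and the date format used in the example are assumptions.

```python
from typing import Any, Dict


def room_data_expected(actor_input: Dict[str, Any]) -> bool:
    """Return True if this input can be expected to yield room data.

    Hypothetical helper: rooms appear only on detail pages, and only
    when both dates of stay are set.
    """
    if actor_input.get("simple"):
        # List-only scraping never visits the detail pages.
        return False
    return bool(actor_input.get("checkIn")) and bool(actor_input.get("checkOut"))


# The date format below is an assumption for illustration.
with_dates = {"search": "Prague", "checkIn": "2024-06-01", "checkOut": "2024-06-03"}
without_dates = {"search": "Prague"}

print(room_data_expected(with_dates))
print(room_data_expected(without_dates))
```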
This is what the room information on the page looks like:
As you can see in the table, the first three rooms share the first cell. When the data is extracted from this table, there will be an array of rooms and the shared data will be copied to each of the applicable rooms. Output data for the second room will look like this (it is one of the elements in the output rooms array):
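The copying of shared cells described above can be sketched as follows. This is not the actor's actual code: the expand_shared_cells helper and the field names roomType, bedType and price are hypothetical, and each table row is modeled as a dictionary in which the shared cell appears only on the first row of its group.

```python
from typing import Any, Dict, List

# Hypothetical names for the columns that can span several rows.
SHARED_KEYS = ("roomType", "bedType")


def expand_shared_cells(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy shared cell values into every row that belongs to the group."""
    rooms: List[Dict[str, Any]] = []
    shared: Dict[str, Any] = {}
    for row in rows:
        # A row that carries a shared cell starts a new group.
        if any(key in row for key in SHARED_KEYS):
            shared = {key: row[key] for key in SHARED_KEYS if key in row}
        # Rows without the shared cell inherit it from the group.
        rooms.append({**shared, **row})
    return rooms


table_rows = [
    {"roomType": "Double Room", "bedType": "1 double bed", "price": 100},
    {"price": 110},
    {"price": 120},
    {"roomType": "Suite", "bedType": "1 king bed", "price": 200},
]

rooms = expand_shared_cells(table_rows)
print(rooms[1])
```

Here the second room ends up with the room type and bed type of the first row plus its own price, so every element of the rooms array is complete on its own.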
Scraping more than 1000 results
Booking.com utilizes many anti-scraping mechanisms, one of them being that it will only display a maximum of 1000 results for any given search. This is obviously a problem for scraping, but it is possible to (mostly) overcome this limitation. Our actor does this by using various criteria filters to limit the number of results and then combining data from all of the limited searches into one, keeping only unique results.
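The combining step can be illustrated with a short sketch. It assumes that each filtered search returns a list of hotel records and that a hotel is identified by its url field. Both the merge_unique helper and that field name are hypothetical, and the actor's real implementation may differ.

```python
from typing import Any, Dict, Iterable, List


def merge_unique(result_sets: Iterable[List[Dict[str, Any]]],
                 key: str = "url") -> List[Dict[str, Any]]:
    """Merge results of several filtered searches, keeping the first copy of each hotel."""
    seen = set()
    merged: List[Dict[str, Any]] = []
    for results in result_sets:
        for hotel in results:
            identifier = hotel[key]
            if identifier in seen:
                # The same hotel can match more than one filter.
                continue
            seen.add(identifier)
            merged.append(hotel)
    return merged


# Made-up results of two filtered searches that overlap in one hotel.
first_search = [{"url": "hotel-a", "name": "A"}, {"url": "hotel-b", "name": "B"}]
second_search = [{"url": "hotel-b", "name": "B"}, {"url": "hotel-c", "name": "C"}]

merged = merge_unique([first_search, second_search])
print([hotel["url"] for hotel in merged])
```

As long as each filtered search stays under the 1000-result cap, the merged list can be larger than what any single search would return.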
The primary drawback of this approach is that when you start the crawler using startUrls, they cannot contain any of the filters, since the crawler will simply replace them. This means that if you want to start the crawler using your own filters in the URL, you will be limited to a maximum of 1000 results.
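Before starting a run with startUrls, it can help to check whether a URL already carries filters. The sketch below uses only Python's standard urllib.parse module. The find_filter_params helper is hypothetical, and the parameter name nflt is an assumption about how filters appear in Booking.com URLs, so adjust the list to whatever your own URLs contain.

```python
from typing import List, Sequence
from urllib.parse import parse_qs, urlsplit

# Assumed name(s) of the query parameters that carry search filters.
FILTER_PARAMS = ("nflt",)


def find_filter_params(url: str,
                       filter_params: Sequence[str] = FILTER_PARAMS) -> List[str]:
    """Return the filter parameters present in the URL's query string."""
    query = parse_qs(urlsplit(url).query)
    return [name for name in filter_params if name in query]


# Placeholder URLs, for illustration only.
filtered_url = "https://www.booking.com/searchresults.html?ss=Prague&nflt=class%3D4"
plain_url = "https://www.booking.com/searchresults.html?ss=Prague"

print(find_filter_params(filtered_url))
print(find_filter_params(plain_url))
```

A non-empty result means the URL contains filters of your own, which is the case where the 1000-result limit described above applies.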