Import data easily with Apify’s new Google Sheets actor

Lukáš Křivka

Integrate your Apify actors and Google spreadsheets to upload lists of URLs, save crawled data for later processing and lots more

Google Sheets is by far the most commonly used tool for handling Excel-like tables on the web. It's easy to use, has powerful features, and is fast and accessible from anywhere. No wonder almost everybody uses it to store and process some kind of data. Apify users are no exception.

Unfortunately, until now, we've lacked first-class Google Sheets integration, so our users could not unlock its full potential. If a user wanted to upload a list of URLs to be crawled or save crawled data for later processing, there were basically two choices:

  • External integration software such as Zapier
  • Existing hacky and incomplete solutions available on Apify

Zapier works nicely and is super friendly, but it has a major disadvantage: you can only read or upload a limited number of rows. It also creates a dependency on an external service, which is not something we want to force upon our users.

The problem with the existing solutions on Apify is that they require the spreadsheet to be public for both reading and uploading, and they need excessive boilerplate if you want to upload.

We knew we had to do better. So we’ve developed a completely new actor that will manage everything you could ever need from Google Sheets integration. It allows you to read or write thousands of rows in just a few seconds, works behind proper authentication and authorization (you don’t need to share your Google credentials at all) and enables rich functionality that ensures every use case is covered. You can read detailed documentation in the Apify Library, but let’s start learning about it right now by actually using it.

Just one note: If this tutorial seems too long and detailed, feel free to skip any parts. You only need to go through the Initial setup and Authorization chapters to get the basics. Don’t worry if you just want to use it as an API — only the initial setup is required in our app.

Initial setup

First, let’s go to the Apify Library and search for “google sheets”.

As you can see, our library search tool will already suggest something for you to pick, so let’s go with its suggestion and click on the actor.

You then get to the Google Sheets actor detail page, where all the important stuff is explained. Let’s click on “Try actor”. By the way, you need to have an account on Apify, so sign up for our free account if you haven’t already, and let’s test it.

You will get redirected to the actors section in our app. This is the place where you can run and integrate the actor and review the logs and results. You are now in the console tab, where you can set up the input for the actor and run it. Take a quick look at the input options, but don’t worry about them for now. For most use cases, you just need the basics. Let’s focus on the first three fields.

Okay, so let's try to upload some data! Before we do that, we need two things: the spreadsheet and the data. So create an empty spreadsheet and paste its id (you can find it in the spreadsheet's URL; it will look like this: 1oz8YzfE5gVw84qWAywlugH0ddrkc1FyAe1aEO4TFTjA) into the "Spreadsheet id" field.
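If you ever need to grab the id programmatically, it can be pulled out of the URL with a short helper. This is just a sketch, and the `getSpreadsheetId` name is made up here:

```javascript
// Sketch: extract the spreadsheet id from a Google Sheets URL.
// The id is the long token between "/d/" and the next slash.
const getSpreadsheetId = (url) => {
    const match = url.match(/\/spreadsheets\/d\/([a-zA-Z0-9_-]+)/);
    return match ? match[1] : null;
};

const url = 'https://docs.google.com/spreadsheets/d/1oz8YzfE5gVw84qWAywlugH0ddrkc1FyAe1aEO4TFTjA/edit#gid=0';
console.log(getSpreadsheetId(url)); // 1oz8YzfE5gVw84qWAywlugH0ddrkc1FyAe1aEO4TFTjA
```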

Now we need the data. If you've used any Apify crawlers or actors before, you can use data you already have. If not, don't worry. Let's quickly crawl something easy so we can test the spreadsheet. This simple Hacker News actor will help us. The cool thing is that we don't even need to run the crawler; we can just go to "Example run" and look at the dataset. Feel free to review the data in any of the provided formats, but what we actually want is the dataset id.

So take the id and paste it into the "Dataset or crawler execution id" field. You can leave the mode on "append". And that's everything we need to set in the input. However, there's still one thing to do before we can finally upload the data. I promise this is really the final step, and you only have to do it once for your account.
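Put together, the input we've filled in so far corresponds roughly to this JSON. This is a sketch: the exact field names come from the actor's input schema and may differ, so double-check them in the actor's API tab.

```json
{
    "mode": "append",
    "spreadsheetId": "1oz8YzfE5gVw84qWAywlugH0ddrkc1FyAe1aEO4TFTjA",
    "datasetOrExecutionId": "YOUR_DATASET_ID"
}
```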

Authentication and first run

Since we wanted to make everything secure for you, the Google Sheets actor requires authentication and authorization. Basically, you need to prove to Google that you are who you say you are by logging in and allowing Apify to work with your spreadsheet. The details are explained in this knowledge base article.

This is the only time you need to run it directly. Once you authorize it, future uses will be authorized automatically and you can use the actor programmatically. So just press “Run” and watch the log.

The log will prompt you to open a live view tab. Open it and press Authorize. If you are running it locally, open localhost:3000.

It will open a new tab in your browser where you need to log in to Google with the account where your spreadsheet is saved. Allow Apify to access your spreadsheet and you will get to a page displaying a long token.

Copy it and paste it into the input field in the live view tab, then press "submit".

And that's it! You are now authorized for life! Well, almost: unless you stop using this actor for a full six months, you will not need to authorize again. If you want to use different Google accounts, you will have to authorize the actor once for each of them, and each needs a unique "Google OAuth tokens store" input.

You can always disable Apify’s access to your spreadsheet either by disallowing it directly in your Google account or by deleting the key-value store, e.g. “lukas-2nd-email”, where the tokens are stored.

The actor should finish in seconds, so watch for the status of the actor run to change to "SUCCEEDED", then look at your spreadsheet. Voila.

Raw Data upload

Don’t care about crawling and just want to use the Google Sheets actor as an API to easily import your data behind proper authentication? Google’s API is very complicated and requires a lot of reading and developer experience. The Google Sheets actor gives you the ability to use it out of the box with only a little setup!

You can pass raw data in JSON format as part of the input.

Raw data can be passed either as an array of objects, where each object represents one data row, or as an array of arrays, where the first array is the header row and the following ones are data rows. For more details, please read about raw data formatting in the docs.
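To make the two formats concrete, here is a small sketch (field names borrowed from the Hacker News example) showing the same rows both ways, plus the conversion between them:

```javascript
// The same two rows expressed as an array of arrays (header row first)...
const asArrays = [
    ['title', 'score'],
    ['Show HN: My project', 120],
    ['Ask HN: A question', 85],
];

// ...and converted into the equivalent array of objects.
const asObjects = asArrays.slice(1).map((row) =>
    Object.fromEntries(asArrays[0].map((key, i) => [key, row[i]])));

console.log(asObjects[0]); // { title: 'Show HN: My project', score: 120 }
```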

You can easily upload large amounts of data using our API.
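For example, starting the actor from your own code could look roughly like the following sketch against the Apify v2 run endpoint. The actor id and input field names here are assumptions, so check the actor's API tab for the exact values:

```javascript
// Sketch: start the Google Sheets actor via the Apify API (Node 18+,
// built-in fetch). The token, actor id and input fields are placeholders.
const runActor = async (token, input) => {
    const url = `https://api.apify.com/v2/acts/lukaskrivka~google-sheets/runs?token=${token}`;
    const res = await fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // e.g. { mode: 'append', spreadsheetId: '...', rawData: [...] }
        body: JSON.stringify(input),
    });
    return res.json(); // the created run object
};
```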

Appending and saving a backup

Let's explore some other options. First, let's shuffle the data a bit so we have something new. Go back to the Hacker News actor and click on "Try actor". It will add the actor to your account so you can run it whenever you want. Press "Run" to get live data, wait for the status "SUCCEEDED", copy the id of the dataset and paste it into the "Dataset or crawler execution id" field like before.

Now if we were to press run, it would just add the data as additional rows to the spreadsheet. That might be enough for some use cases, but let's go further. When appending new rows, we usually want to skip duplicates. For this reason, we have two input fields: Deduplicate by equality and Deduplicate by field. Deduplicate by equality would not work for us, since the same post will change its rank and number of comments over time. But we can deduplicate the posts by a particular field. The link field makes the most sense, since it is unique for each post.
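The assumed semantics of Deduplicate by field can be sketched in a few lines: keep the first row seen for each value of the chosen field (here link) and drop later duplicates. This is only an illustration of the idea, not the actor's actual implementation:

```javascript
// Sketch of deduplication by a single field (assumed behavior).
const dedupeByField = (rows, field) => {
    const seen = new Set();
    return rows.filter((row) => {
        if (seen.has(row[field])) return false; // duplicate, drop it
        seen.add(row[field]);
        return true;
    });
};

const rows = [
    { link: 'https://a.com', rank: 1 },
    { link: 'https://b.com', rank: 2 },
    { link: 'https://a.com', rank: 5 }, // same post with a new rank
];
console.log(dedupeByField(rows, 'link').length); // 2
```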

Let’s add that to the input and also check on Create backup. We will get into what that means right after.

Since some time may have passed since the initial dataset was created, there may not be anything to deduplicate, but you can try it a few more times later with newer data. Because we checked the Create backup option, the run also saved a backup file, which is basically a JSON file with the rows as arrays. If we are not sure that we are doing the correct transformation, we can easily restore the spreadsheet by using load backup mode and passing in the id of the key-value store where the backup was saved.

Google Sheets has, of course, its own version history, so feel free to use that instead.

Other modes (replace, modify, read)

So far we’ve only tried the append and load backup modes. Let me briefly explain the other three modes you can use.

  • Replace behaves basically the same way as append, but the sheet is cleared before inserting the new rows. This is useful when you want to have only the most recent data.
  • Modify applies the deduplication options or the transform function (see below) to the data directly in your spreadsheet, so you can process the data in place.
  • Read allows you to retrieve the data from the spreadsheet and saves it into the default key-value store as the OUTPUT record. This way you can call the Google Sheets actor from anywhere else and use the data for any of your needs, all without having to publish the spreadsheet!
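Reading that OUTPUT record from another script is then a single request. The following is a sketch against the Apify v2 key-value-store records endpoint, with a made-up store id placeholder:

```javascript
// Sketch: build the URL of the OUTPUT record saved by read mode.
const outputUrl = (storeId) =>
    `https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT`;

// In a real script (Node 18+): fetch it and parse the rows.
const readSpreadsheetData = async (storeId) => {
    const res = await fetch(outputUrl(storeId));
    return res.json(); // an array of row objects, same shape as was pushed
};

console.log(outputUrl('MY_STORE_ID'));
// https://api.apify.com/v2/key-value-stores/MY_STORE_ID/records/OUTPUT
```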

As you can see, we get the data back in the same format as it was pushed.

Integration into crawlers

Let’s get to more real use cases. The most basic use case is to have a scheduled scraping job upload the data into a spreadsheet. Since we don’t want to pass the input directly in the source code if it gets complicated, let’s create a separate task to hold the settings of the Google Sheets actor.

Go to the console tab of the actor and press Create task.

You will get redirected to the Tasks section. A task is basically a saved actor input setup. Now let's prepare the input here before we call it from the source code. Switch to replace mode and keep the Spreadsheet id, but remove the dataset id; that's the only thing that changes over time, since every actor run has a different dataset (unless you use named datasets).

Then go to the settings of the task and just name it so you can easily identify it.

This task is ready to use! Let's open any actor that saves data into a dataset. If you don't have one, you can copy the Hacker News actor so you can edit its source code.

Now go to the source and at the end of the source code, we want to call the spreadsheet-test task.

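One way to wire this up is to trigger the task over the Apify API at the end of the actor run. This is a sketch: the task name and input field name are placeholders, so check them against your own setup:

```javascript
// Sketch: trigger the saved "spreadsheet-test" task, passing the
// current run's dataset id so the task knows which data to upload.
const runTaskUrl = (taskId, token) =>
    `https://api.apify.com/v2/actor-tasks/${taskId}/runs?token=${token}`;

const uploadToSpreadsheet = async (datasetId) => {
    await fetch(runTaskUrl('your-username~spreadsheet-test', process.env.APIFY_TOKEN), {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // input field name assumed from the console label
        body: JSON.stringify({ datasetOrExecutionId: datasetId }),
    });
};

console.log(runTaskUrl('your-username~spreadsheet-test', 'TOKEN'));
```

The Apify SDK also offers helpers for calling tasks directly from actor code, so if you are writing the actor with the SDK you may not need to hit the API yourself.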

That’s it. If we schedule this actor, we will have fresh data uploaded to our spreadsheet immediately after each run.

Advanced (transformations)

Finally, we'll explore how to transform our data using a bit of JavaScript. So far we've only deduplicated the data, but very often we want to carry out more complicated workflows. Fortunately, the Transform function gives us a flexible tool to achieve most of what we can think of.

You simply pass in a JavaScript function that receives an object with arrays of data from the spreadsheet and the dataset and returns a new array that will be imported into the spreadsheet or output to the key-value store, depending on the mode you choose. The array of data from the spreadsheet is called spreadsheetData and the array of data from the dataset is called datasetData. Inside each array are objects with fields and values in the same structure as returned by the read mode.

Let’s say that we are only interested in the most successful posts on Hacker News and we want to monitor the rankings for a long time. We’ll set up a scheduler to run the Hacker News actor daily and we’ll call the Google Sheets actor in append mode at the end of the source code as shown above. Then we will set a transform function that will check both the data we have in the spreadsheet and from the daily crawled dataset and will pick the top 20 most successful posts (measured by score) and sort them. In JavaScript, this is really trivial.

And here we go:
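A transform function for this scenario might look like the following sketch. The spreadsheetData and datasetData names follow the description above; link and score are fields of the Hacker News dataset:

```javascript
// Sketch: merge old and new rows, dedupe by link (newer rows win),
// then keep the 20 highest-scoring posts, sorted by score.
const transformFunction = ({ spreadsheetData, datasetData }) => {
    const byLink = new Map();
    for (const row of spreadsheetData.concat(datasetData)) {
        byLink.set(row.link, row); // datasetData overwrites duplicates
    }
    return [...byLink.values()]
        .sort((a, b) => b.score - a.score) // highest score first
        .slice(0, 20); // top 20 only
};
```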

One important thing to keep in mind when using the transform function: the data you are working with is flattened, and nested field names are concatenated with slashes. That means every nested object (and arrays count as objects here) is flattened too. For example, the price property that was pushed to the dataset in this object:

{
    options: [
        {
            price: 200
        }
    ]
}

will have to be accessed like this:

oneOfObjects['options/0/price']
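If you want to reproduce the flattening locally (say, to test a transform function), a rough equivalent looks like this. It is a sketch of the assumed behavior matching the example above, not the actor's actual code:

```javascript
// Sketch: flatten nested objects/arrays into slash-joined keys.
const flatten = (obj, prefix = '') =>
    Object.entries(obj).reduce((acc, [key, value]) => {
        const path = prefix ? `${prefix}/${key}` : key;
        return value !== null && typeof value === 'object'
            ? { ...acc, ...flatten(value, path) } // recurse into nesting
            : { ...acc, [path]: value };
    }, {});

const flat = flatten({ options: [{ price: 200 }] });
console.log(flat['options/0/price']); // 200
```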

That's it! It would be great if you could play with the different options and give us some feedback, because we want to improve our public actors as much as possible. For more details, take a look at the library detail page of the Google Sheets actor or at GitHub.


