TL;DR: You can now run Scrapy spiders written in Python on the Apify platform, using the new Scrapy Executor (apify/scrapy-executor) actor. This is big news, because you can finally enjoy all the advantages of the Apify platform with your Scrapy scripts.
This post was written by Vojta Drmota in September 2019.
Running a Scrapy spider on Apify
Running a simple single-file Scrapy spider on Apify is as easy as copy-pasting a single source file. Here are the steps:
1. Locate the source code of your Scrapy spider. No worries if you don’t have it at hand; you can use the example source code provided below.
2. Head over to the Scrapy Executor (apify/scrapy-executor) actor.
3. Click the Try actor button.
4. Paste your Scrapy source code into the Scrapy spider input field. Optionally, select a proxy to hide the origin of your spider.
5. Click Run and watch the log of your Scrapy spider.
6. You’re done! Your Scrapy spider now has a new home on the Apify platform.
Here’s an example of what the full source code might look like:
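(The target site and CSS selectors below are only illustrative, not part of any official example; any simple single-file spider works the same way.)

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # A simple single-file Scrapy spider that can be pasted into the
    # Scrapy spider input field as-is.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Extract each quote and its author from the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        # Follow the pagination link, if there is one.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```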
Storing data on Apify
As you may have noticed, although the Scrapy spider successfully ran on the Apify platform, it didn’t store any data there. On Apify, actors typically store crawling results in a storage called Dataset. Datasets are useful for structured data, such as a list of products from an e-commerce site. The stored data can be exported to formats such as JSON, XML or CSV. Another important data storage on Apify is a Key-value store, which is useful for storing files, such as screenshots or PDFs.
To store data from your Scrapy spider in the Dataset or Key-value store, you can use the apify Python package. First, import the package by adding the following command to the top of your Scrapy source code:
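```python
import apify
```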
After that, you can push the results to the Dataset associated with the actor run by calling:
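```python
apify.pushData(item)
```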
Here, item is an object containing the data scraped from the web page. Once your Scrapy spider has finished running, you will find the data in the Apify Dataset associated with the actor run.
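One way to put this together in a single-file spider is to push each item as it is scraped. The following sketch assumes the same illustrative site and selectors as above:

```python
import apify
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
            # Store the scraped item in the Dataset associated with this actor run.
            apify.pushData(item)
            yield item
```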
Here is a complete reference of the data storage methods in the apify Python package:
```python
# Push data to the Apify dataset
apify.pushData(item)

# Add a new record to the Key-value store
apify.setValue(key, value)

# Retrieve a record from the Key-value store
apify.getValue(key)

# Delete a record from the Key-value store
apify.deleteValue(key)
```
Note that the apify Python package works for both local and cloud actor runs.
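As an illustration of the Key-value store methods, a spider could stash the raw HTML of a response under a key of its choosing and read it back later in the same run (the key name here is made up):

```python
# Save the raw HTML of a Scrapy response under an illustrative key...
apify.setValue('page.html', response.text)

# ...and read it back later in the same run.
html = apify.getValue('page.html')

# Remove the record once it is no longer needed.
apify.deleteValue('page.html')
```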
Running multi-file Scrapy spiders on Apify
If you are experienced with developing Scrapy spiders, chances are that your source code is divided into multiple files in order to leverage Scrapy’s item pipeline, advanced settings, or additional tools. Running these spiders is also straightforward: you just need to clone the Scrapy Executor actor’s source code from GitHub, move your source code files into it, and then build the actor on Apify using a CLI tool so that it can be executed there.
For more details, see the Scrapy Executor actor’s README.
We’re excited to see how you’ll make use of the Apify platform with your Scrapy spiders. If you have any questions or any feedback to share, don’t hesitate to contact us at firstname.lastname@example.org.
For the latest updates and product developments, follow @apify on Twitter.