Running Scrapy spiders written in Python on Apify

Administrator
TL;DR: You can now run Scrapy spiders written in Python on the Apify platform, using the new Scrapy Executor (apify/scrapy-executor) actor. This is big news: you can finally enjoy all the advantages of the Apify platform with your Scrapy scripts.

This post was written by Vojta Drmota in September 2019.

If you’re more comfortable writing web crawlers in Python than JavaScript, you can now easily deploy Scrapy crawlers to the Apify cloud. This lets you enjoy all the benefits Apify offers over other hosted web scraping providers, such as integrations with Zapier and a thousand other web apps, sub-minute scheduling of jobs, scalable data storage and export, and our affordable proxy service.

Scrapy is a Python framework intended to make writing web spiders in Python easier (note that in Scrapy, web crawlers are called “spiders”). In some ways it is similar to the Apify SDK, but there are several key differences. The main one is the programming language each is intended for: Scrapy is a Python package, while the Apify SDK is a Node.js/JavaScript library. Another difference is how they handle target websites that need to run JavaScript: Scrapy depends on the Splash headless browser, which can be easily detected, while the Apify SDK uses a full-featured Chrome browser, which is far more complicated to detect.

Running a Scrapy spider on Apify

Running a simple single-file Scrapy spider on Apify is as easy as copy-pasting a single source file. Here are the steps:

  1. Locate the source code of your Scrapy spider. No worries if you don’t have it now; you can use the example source code provided below.
  2. Head over to the Scrapy Executor (apify/scrapy-executor) actor.
  3. Click the Try actor button.
  4. Paste your Scrapy source code into the Scrapy spider input field. Optionally, select a proxy to hide the origin of your spider.
  5. Click Run and see the log of your Scrapy spider:
Log of a Scrapy spider running on Apify

  6. You’re done! Your Scrapy spider now has a new home on the Apify platform.

Here’s a full example of the source code:

Storing data on Apify

As you may have noticed, although the Scrapy spider successfully ran on the Apify platform, it didn’t store any data there. On Apify, actors typically store crawling results in a storage called Dataset. Datasets are useful for structured data, such as a list of products from an e-commerce site. The stored data can be exported to formats such as JSON, XML or CSV. Another important data storage on Apify is a Key-value store, which is useful for storing files, such as screenshots or PDFs.

To store data from your Scrapy spider in the Dataset or Key-value store, you can use the apify Python package. First, import the package by adding the following command to the top of your Scrapy source code:

import apify

After that, you can push the results to the Dataset associated with the actor run by calling:

apify.pushData(item)

where item is an object with the data scraped from the web page. Once your Scrapy spider has finished running, you will find the data in Apify’s Dataset associated with the actor run.
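As a sketch, a parse callback might push each item as soon as it is scraped. In the example below the import is guarded so the same file still runs in environments where the apify package is not installed; the field names are just illustrative:

```python
try:
    from apify import pushData  # available when running on the Apify platform
except ImportError:
    pushData = None  # e.g. running the spider somewhere without the package

def store_item(item):
    # Push the item to the Dataset when on Apify; always return it,
    # so the caller can still yield it to Scrapy's own pipeline.
    if pushData is not None:
        pushData(item)
    return item

# Example item, shaped like what a parse() callback might scrape
item = store_item({"title": "Example product", "price": "19.99"})
```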

Here is a complete reference of the data storage methods in the apify Python package:

# Push data to the Apify Dataset
apify.pushData(item)
# Add a new record to the Key-value store
apify.setValue(key, value)
# Retrieve a record from the Key-value store
apify.getValue(key)
# Delete a record from the Key-value store
apify.deleteValue(key)

Note that the apify Python package works for both local and cloud actor runs.
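For instance, a spider could save a small run summary through the Key-value store. The sketch below guards the import and falls back to an in-memory dict, so it also runs in environments where the apify package is absent; the key name is an arbitrary example:

```python
try:
    from apify import setValue, getValue  # available on the Apify platform
except ImportError:
    setValue = getValue = None

_local_store = {}  # in-memory fallback for runs without the apify package

def save_record(key, value):
    # Write to the Key-value store, or to the local fallback dict
    if setValue is not None:
        setValue(key, value)
    else:
        _local_store[key] = value

def load_record(key):
    # Read back from the Key-value store, or from the local fallback dict
    if getValue is not None:
        return getValue(key)
    return _local_store.get(key)

save_record("RUN-SUMMARY", {"pages_crawled": 42, "items_scraped": 10})
```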

Running multi-file Scrapy spiders on Apify

If you are experienced with developing Scrapy spiders, chances are that your source code is divided into multiple files in order to leverage Scrapy’s item pipeline, advanced settings, or additional tools. Running these spiders is also straightforward: clone the Scrapy Executor actor’s source code from GitHub, move your source code files into it, and then build the actor on Apify using the Apify CLI so that it can be executed there.

For more details, see the Scrapy Executor actor’s README.

Conclusion

We’re excited to see how you’ll make use of the Apify platform with your Scrapy spiders. If you have any questions or any feedback to share, don’t hesitate to contact us at support@apify.com.

For the latest updates and product developments, follow @apify on Twitter.
