Technology moves forward at breakneck speed: high-performing solutions from just a couple of years ago may already be obsolete. Tech companies can easily fall behind their competitors because of technical debt, which, if not addressed quickly, can even drive a company out of business.
Overcoming technical debt is not easy, as it usually involves embracing new technologies and even migrating your current solutions to an entirely different tech stack. These are big decisions that can take time to implement and potentially define the future of your business.
Understandably, such a step involves much deliberation, especially when you have doubts about whether the change will bring enough improvement to justify the invested time and resources.
Let's look at a practical, real-life example of a retail data analytics company, Daltix, that decided to take the leap and address its web scraping technical debt by shifting from Scrapy to Apify SDK.
Hitting the structural limits of Scrapy
Daltix is a Belgium/Portugal-based retail data provider extracting data from hundreds of e-commerce websites daily. They employ web scraping to collect online retail data, giving major retailers the ability to analyze near real-time data in order to make strategic decisions for their businesses.
To carry out its scraping activities, Daltix initially used a custom-made Python framework built on top of Scrapy, the popular web scraping library. This solution enabled them to scale from 250 thousand to 2 million scraped resources a day. At that point, they started hitting structural limits with the framework, severely hindering the company's plans for future growth.
On top of that, Daltix began facing more blocking from websites, as advanced anti-scraping methods, such as browser and TLS fingerprinting, were becoming increasingly common, and the Python web scraping ecosystem could not keep up with these changes.
As a result, they were expending enormous amounts of engineering effort in the ongoing maintenance of their spiders and coming across websites that they simply could not scrape. To address these problems, Daltix, in 2019, decided to switch to Apify's open-source web scraping and automation Node.js library, Apify SDK.
Apify SDK vs. Scrapy
With so much work involved in transitioning from one solution to another, the obvious question is: was the transition worth the effort? In short, the migration more than paid off.
Over time, the Daltix team observed a drastic improvement across all performance metrics:
- Daltix saw a sharp drop in the resources needed to run its scrapers, cutting Amazon EC2 costs by close to 90% and reducing the time taken to collect retail data by 60%.
- The improvements translated into over 9,000 EC2 hours saved per month.
- With increased scalability, Daltix boosted their scraping from 2 million to >5 million resources per day and expected these numbers to double again in the near future.
- Despite scaling up their scraping activity, they now need 30% less engineering input to manage the process.
> "The combination of Node, Apify SDK, and Puppeteer dramatically improved the effectiveness and number of anti-scraping countermeasures that Daltix was able to deploy."
>
> CTO of Daltix
Enhanced scalability with Node.js
The scalability of an application is its ability to handle an increasing number of requests without degrading performance. This is an important factor to consider when developing scrapers: you can easily find yourself processing huge amounts of information, and if your application is not optimized for such volumes, data collection becomes a major bottleneck for your business.
The two major cornerstones behind Node’s scalability are its event-based model and non-blocking input/output.
Because the Apify SDK is a Node.js library, it enjoys all the benefits described above, making it the ideal choice for high-scale scraping activities.
On the other hand, Python does not support asynchronous programming by default. To work around this limitation, Scrapy is built on Twisted, an event-driven networking framework, which gives it some asynchronous capabilities. However, Daltix's team constantly faced difficulties due to Twisted being a legacy library that is hard to code for and debug. It was only in 2022 that Scrapy gained non-experimental partial support for modern Python asyncio coroutines.
On top of that, Scrapy's GitHub repository was drowning in issues and open pull requests, which diminished Daltix's confidence in the health of the project. The wider Python ecosystem itself was also still in flux, transitioning to version 3 of the language and settling on a modern approach to writing async code.
In contrast, in Apify SDK Daltix's team found a modern, optimized, and easily extendable codebase, as well as an SDK team willing to keep an open communication channel, giving Daltix's engineers the opportunity to influence the future of the library.
Bypassing anti-scraping protections
Another major problem Daltix engineers faced while using Scrapy was its limited ability to circumvent the increasingly sophisticated anti-bot measures employed by modern websites.
One reason for this was that Python's various HTTP request libraries were stagnating and performing poorly against anti-bot protected sites. There were also no promising, well-maintained HTTP/2-aware clients.
By contrast, the Node.js ecosystem was bursting with novel technologies and HTTP clients specialized in sending browser-like requests, such as GOT Scraping, enabling bots to blend in with the website's traffic and avoid blocking.
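As a rough illustration of what "browser-like requests" means, the sketch below hand-rolls a Chrome-style header set. This is only the underlying idea: libraries such as GOT Scraping generate and order these headers automatically, and the specific values here are illustrative placeholders, not what got-scraping actually emits.

```javascript
// Illustration of browser-like request headers: plain HTTP clients send
// sparse, obviously non-browser headers, which anti-bot systems flag.
// Libraries like got-scraping automate generating realistic header sets
// such as this one. Values below are illustrative, not authoritative.
function browserLikeHeaders(host) {
  return {
    'Host': host,
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept':
      'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Upgrade-Insecure-Requests': '1',
  };
}

const headers = browserLikeHeaders('example.com');
console.log(Object.keys(headers).length); // 6 header fields
```

In practice you would not hand-maintain these values; the point of specialized clients is that they keep header names, values, and ordering consistent with real browser versions as those evolve.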
On top of that, Apify SDK ships with integrated tools to mitigate anti-scraping methods, such as countering the browser fingerprinting used by websites and rotating proxies to prevent IP blocking.
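To make the proxy-rotation idea concrete, here is a minimal round-robin sketch of what the SDK automates for you (in the real Apify SDK this is handled by its proxy configuration tooling rather than code like this, and the proxy URLs below are placeholders):

```javascript
// Round-robin proxy rotation sketch: consecutive requests leave from
// different proxy addresses, so no single IP accumulates enough
// traffic to get blocked.
class ProxyRotator {
  constructor(proxyUrls) {
    this.proxyUrls = proxyUrls;
    this.index = 0;
  }

  // Return the next proxy URL in the pool, wrapping around at the end.
  next() {
    const url = this.proxyUrls[this.index % this.proxyUrls.length];
    this.index += 1;
    return url;
  }
}

const rotator = new ProxyRotator([
  'http://proxy-1.example:8000',
  'http://proxy-2.example:8000',
  'http://proxy-3.example:8000',
]);

console.log(rotator.next()); // http://proxy-1.example:8000
console.log(rotator.next()); // http://proxy-2.example:8000
```

Production-grade rotation also retires proxies that start getting blocked and keeps sessions pinned to an IP when a site ties cookies to it, which is exactly the bookkeeping that integrated tooling saves you from writing.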
Advanced browser automation tools
In 2019, Daltix engineers attended a conference held by Zyte (formerly Scrapinghub), the company responsible for developing and maintaining Scrapy. To their disappointment, Scrapy's lead developers were still pushing for Splash, a headless browser scriptable in the Lua language, also developed by Zyte.
This announcement was badly received by the Daltix team, as both of Python's most popular browser automation tools at the time, Splash and Selenium, were trending towards obsolescence and increasingly ineffective against well-defended targets. At the same time, Google's Node.js browser automation library, Puppeteer, was the clear winner in terms of performance and future prospects, while neither Scrapy nor the broader Python ecosystem had any viable alternatives.
> "This new tool undoubtedly offers the Python ecosystem a much-needed second wind in the web scraping and browser automation landscapes. Nonetheless, the reality remains unchanged, as the two major browser automation tools, Playwright and Puppeteer, keep defining the future of dynamic website scraping primarily in Node.js."
>
> Lead Engineer on Data Collection
What's the future for Apify SDK and Daltix?
At this point, it is evident why the Apify SDK stood out for Daltix as the perfect alternative to Scrapy. In addition to its modern integrated features, Apify SDK is also part of the flourishing Node.js ecosystem, which is at the forefront of innovation in the web scraping space.
On top of that, Apify's modern codebase was easy to follow. More importantly, its straightforward, clean design made the code easy to customize, an attractive foundation on which to build Daltix's new framework.
Finally, Daltix's team considers the transition to Apify an important contributor to their current success and a sound investment in the company's future, as Daltix now moves ahead with its plans to expand further into Europe.
Join our Discord server to get direct access to our technical support team. We will be happy to answer any questions and help you migrate your existing scrapers to the Apify SDK and platform.
If you want to start experiencing all the powerful features Apify offers, head over to the Apify SDK docs and try our Python API Client to get started. Alternatively, you can request a custom solution and let us handle all the work for you.