Technology moves forward at breakneck speed: high-performing solutions from just a couple of years ago may already be obsolete. Tech companies can easily fall behind their competitors because of technical debt, which, if not addressed quickly, can even drive a company out of business.
Overcoming technical debt is not easy, as it usually involves embracing new technologies and even migrating your current solutions to an entirely different tech stack. These are big decisions that can take time to implement and potentially define the future of your business.
Understandably, such a step involves much deliberation, especially when you have doubts about whether the change will bring enough improvement to justify the invested time and resources.
Let's look at a practical, real-life example of a retail data analytics company, Daltix, that decided to take the leap and address its web scraping technical debt by shifting from Scrapy to Apify SDK.
Hitting the structural limits of Scrapy
Daltix is a Belgium/Portugal-based retail data provider extracting data from hundreds of e-commerce websites daily. They employ web scraping to collect online retail data, giving major retailers the ability to analyze near real-time data in order to make strategic decisions for their businesses.
To carry out its scraping activities, Daltix initially used a custom-made Python framework built on top of Scrapy, the popular web scraping library. This solution enabled them to scale from 250 thousand to 2 million scraped resources a day. At that point, they started hitting structural limits with the framework, severely hindering the company's plans for future growth.
On top of that, Daltix began facing more blocking from websites, as advanced anti-scraping methods, such as browser and TLS fingerprinting, were becoming increasingly common, and the Python web scraping ecosystem could not keep up with these changes.
As a result, they were expending enormous amounts of engineering effort in the ongoing maintenance of their spiders and coming across websites that they simply could not scrape. To address these problems, Daltix, in 2019, decided to switch to Apify's open-source web scraping and automation Node.js library, Apify SDK.
Apify SDK vs. Scrapy
With so much work involved in transitioning from one solution to another, the obvious question is: was the transition worth the effort? In short, the migration more than paid off.
Over time, the Daltix team observed a drastic improvement across all performance metrics:
- Daltix saw a sharp drop in the resources needed to run its scrapers, cutting Amazon EC2 costs by close to 90% and reducing the time taken to collect retail data by 60%.
- The improvements translated into over 9,000 EC2 hours saved per month.
- With increased scalability, Daltix boosted their scraping from 2 million to >5 million resources per day and expected these numbers to double again in the near future.
- Despite scaling up their scraping activity, they now need 30% less engineering input to manage the process.
> "The combination of Node, Apify SDK, and Puppeteer dramatically improved the effectiveness and number of anti-scraping countermeasures that Daltix was able to deploy."
>
> CTO of Daltix
Enhanced scalability with Node.js
The scalability of an application is its ability to handle an increasing number of requests without degrading performance. This is an important factor to consider when developing scrapers: you can easily find yourself processing huge amounts of information, and if your application is not optimized for such volumes, data collection becomes a major bottleneck for your business.
The two major cornerstones behind Node’s scalability are its event-based model and non-blocking input/output.
Because the Apify SDK is a Node.js library, it enjoys all the benefits described above, making it the ideal choice for high-scale scraping activities.
On the other hand, Python does not support asynchronous programming by default. To work around this limitation, Scrapy is built on Twisted, an event-driven networking framework, which gives it some asynchronous capabilities. However, Daltix's team constantly faced difficulties due to Twisted being a legacy library that is hard to code for and debug. It was only in 2022 that Scrapy gained non-experimental partial support for modern Python asyncio coroutines.
On top of that, Scrapy's GitHub repository was drowning in issues and open pull requests, which diminished Daltix's confidence in the health of the project. The wider Python ecosystem itself was also still in flux, transitioning to version 3 of the language and settling on a modern approach to writing async code.
In contrast, in Apify SDK Daltix's team found a modern, optimized, and easily extendable codebase, as well as an SDK team willing to keep an open communication channel, giving Daltix's engineers the opportunity to influence the future of the library.
Bypassing anti-scraping protections
Another major problem Daltix engineers faced while using Scrapy was its limited ability to circumvent the increasingly sophisticated anti-bot measures employed by modern websites.
One reason for this was that Python's various HTTP request libraries were stagnating and performing poorly against anti-bot protected sites. There were also no promising, well-maintained HTTP/2-aware clients.
By contrast, the Node.js ecosystem was bursting with novel technologies and HTTP clients specialized in sending browser-like requests, such as GOT Scraping, enabling bots to blend in with the website's traffic and avoid blocking.
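As a rough illustration of what "browser-like requests" means, the sketch below hand-rolls a Chrome-style header set. This is only the underlying idea: libraries such as GOT Scraping generate and order these headers automatically, and the specific values here are illustrative placeholders, not what got-scraping actually emits.

```javascript
// Illustration of browser-like request headers: plain HTTP clients send
// sparse, obviously non-browser headers, which anti-bot systems flag.
// Libraries like got-scraping automate generating realistic header sets
// such as this one. Values below are illustrative, not authoritative.
function browserLikeHeaders(host) {
  return {
    'Host': host,
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept':
      'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Upgrade-Insecure-Requests': '1',
  };
}

const headers = browserLikeHeaders('example.com');
console.log(Object.keys(headers).length); // 6 header fields
```

In practice you would not hand-maintain these values; the point of specialized clients is that they keep header names, values, and ordering consistent with real browser versions as those evolve.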
On top of that, Apify SDK ships with integrated tools to mitigate anti-scraping methods, such as countering the browser fingerprinting used by websites and rotating proxies to prevent IP blocking.
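To make the proxy-rotation idea concrete, here is a minimal round-robin sketch of what the SDK automates for you (in the real Apify SDK this is handled by its proxy configuration tooling rather than code like this, and the proxy URLs below are placeholders):

```javascript
// Round-robin proxy rotation sketch: consecutive requests leave from
// different proxy addresses, so no single IP accumulates enough
// traffic to get blocked.
class ProxyRotator {
  constructor(proxyUrls) {
    this.proxyUrls = proxyUrls;
    this.index = 0;
  }

  // Return the next proxy URL in the pool, wrapping around at the end.
  next() {
    const url = this.proxyUrls[this.index % this.proxyUrls.length];
    this.index += 1;
    return url;
  }
}

const rotator = new ProxyRotator([
  'http://proxy-1.example:8000',
  'http://proxy-2.example:8000',
  'http://proxy-3.example:8000',
]);

console.log(rotator.next()); // http://proxy-1.example:8000
console.log(rotator.next()); // http://proxy-2.example:8000
```

Production-grade rotation also retires proxies that start getting blocked and keeps sessions pinned to an IP when a site ties cookies to it, which is exactly the bookkeeping that integrated tooling saves you from writing.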
Advanced browser automation tools
In 2019, Daltix engineers attended a conference held by Zyte (formerly Scrapinghub), the company responsible for developing and maintaining Scrapy. To their disappointment, Scrapy's lead developers were still pushing for Splash, a headless browser scriptable in the Lua language, also developed by Zyte.
This announcement was badly received by the Daltix team, as both of Python's most popular browser automation tools at the time, Splash and Selenium, were trending towards obsolescence and increasingly ineffective against well-defended targets. At the same time, Google's Node.js browser automation library, Puppeteer, was the clear winner in terms of performance and future prospects, while neither Scrapy nor the broader Python ecosystem had any viable alternatives.
> "This new tool undoubtedly offers the Python ecosystem a much-needed second wind in the web scraping and browser automation landscapes. Nonetheless, the reality remains unchanged, as the two major browser automation tools, Playwright and Puppeteer, keep defining the future of dynamic website scraping primarily in Node.js."
>
> Lead Engineer on Data Collection
What's the future for Apify SDK and Daltix?
At this point, it is evident why the Apify SDK stood out for Daltix as the perfect alternative to Scrapy. In addition to its modern integrated features, Apify SDK is also part of the flourishing Node.js ecosystem, which is at the forefront of innovation in the web scraping space.
On top of that, Apify's modern codebase was easy to follow. More importantly, its straightforward, clean design made the code easy to customize, an attractive foundation on which to build Daltix's new framework.
Finally, Daltix's team considers the transition to Apify an important contributor to their current success and a sound investment in the company's future, as Daltix now moves ahead with its plans to expand further into Europe.
Join our Discord server to get direct access to our technical support team. We will be happy to answer any questions and help you migrate your existing scrapers to the Apify SDK and platform.
If you want to start experiencing all the powerful features Apify offers, head over to the Apify SDK docs and try our Python API Client to get started. Alternatively, you can request a custom solution and let us handle all the work for you.