Web scraping with AWS Lambda vs. Apify: what are the differences?

In the tech world, can David compete with Goliath? Let’s find out how Apify fares as an AWS Lambda alternative for web scraping.

When it comes to cloud computing software, it’s tempting to think of Amazon Web Services as untouchable. After all, AWS offers a broader range of services than any other IaaS provider (over 180 services, according to the AWS pricing page). So why look for AWS alternatives? One simple answer is that there is too much to choose from on AWS. As more and more businesses move to the cloud, the sheer number of services and pricing options can be overwhelming. If you’re interested in a particular niche, it’s much easier and more cost-efficient to start with a specialized platform.

To really take advantage of the full scale of Amazon Web Services, a business might have to organize an AWS training course for its IT team or even hire consultants. Gartner, which published its Magic Quadrant for Cloud Infrastructure and Platform Services report in October 2022, not only comments on AWS’s complex pricing structure but also notes:

“AWS’ extensive portfolio of services requires expertise to implement… while it’s easy to get started, optimal use - especially keeping up with new service innovations and best practices, and managing costs - may challenge even highly agile, expert IT organizations, including AWS partners.”

More importantly, every AWS alternative has strengths in particular fields that appeal to businesses focused on those areas. One example is Apify, an AWS alternative for a specific area of cloud computing: web scraping and browser automation.

So let’s find out more about one of the many AWS services that can be used for web scraping, namely AWS Lambda, and how Apify compares as a web scraping platform.

What is AWS Lambda?

AWS Lambda was introduced in 2014, eight years after Amazon’s primary cloud computing platform, Amazon EC2, which was released in 2006 and remains one of Amazon’s most popular services. While EC2 is an Infrastructure-as-a-Service (IaaS), AWS Lambda is a Function-as-a-Service (FaaS). FaaS is a cloud-computing service that lets you deploy and run code in response to events without managing the infrastructure typically required to host microservices or build a service from the ground up.

Lambda lets you run applications within Lambda’s standard runtime environment and resources. AWS has a list of events that can trigger a Lambda function, which runs synchronously or asynchronously depending on the trigger type. The simplest triggers are API calls made via Amazon API Gateway. Other triggers include code commits, cloud system monitors, CI/CD pipelines, Amazon Kinesis Data Streams, AWS IoT Events, and Amazon CloudWatch Events.
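
To make the API Gateway case concrete, here is a minimal sketch of a Python handler. It assumes an API Gateway proxy integration, where query parameters arrive under `queryStringParameters`. The `url` parameter and the response body are hypothetical, and the actual scraping step is left out.

```python
import json


def lambda_handler(event, context):
    """Entry point that Lambda calls each time the trigger fires."""
    # With a proxy integration, query parameters arrive under
    # "queryStringParameters", which is None when the request has none.
    params = event.get("queryStringParameters") or {}
    url = params.get("url")  # hypothetical parameter naming the page to scrape

    if not url:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "missing 'url' query parameter"}),
        }

    # A real scraper would fetch and parse the page here.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"requested": url}),
    }
```

The function only runs when a request arrives, which is the "always available but not always running" model that FaaS is built around.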

When it comes to supported programming languages, AWS Lambda functions can be written in Java, Go, PowerShell, Node.js, C#, Python, and Ruby, and they run on the Amazon Linux operating system.

When to use AWS Lambda vs. Amazon EC2

EC2 provides virtual servers, called instances, for running applications on AWS infrastructure. These instances run until you manually stop them or schedule a task to shut them down. That makes EC2 suitable for long-running applications, with the caveat that an application takes comparatively long to start.

AWS Lambda provides a serverless architecture that enables you to run a piece of code in the cloud after an event trigger is activated. Lambda functions are small, scalable, and inexpensive, which lets you focus on writing code instead of configuring infrastructure. A Lambda function is always available but not always running. It starts only when an event it is linked to triggers it.

The run time of a Lambda function is limited to 15 minutes, with a maximum memory of 10 GB for a running function. That means if you require long-running applications, you might want to consider AWS Lambda alternatives, such as EC2 or Apify. If you don’t have a high number of regular requests, you might want to consider Lambda as an alternative to EC2.
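
One common way to live with the 15-minute limit is to check how much time is left and hand back any unfinished work. The sketch below assumes the Python runtime, whose context object exposes `get_remaining_time_in_millis()`. The `urls` event field, the `scrape_page` placeholder, and the 30-second safety margin are hypothetical choices for illustration.

```python
SAFETY_MARGIN_MS = 30_000  # assumed buffer for returning cleanly before the timeout


def scrape_page(url):
    """Hypothetical placeholder for the real per-page scraping work."""
    return {"url": url}


def lambda_handler(event, context):
    pending = list(event.get("urls", []))  # hypothetical input field
    results = []

    # Keep working only while enough of the time budget remains.
    while pending and context.get_remaining_time_in_millis() > SAFETY_MARGIN_MS:
        results.append(scrape_page(pending.pop(0)))

    # Anything left in "remaining" has to be picked up by another invocation.
    return {"results": results, "remaining": pending}
```

The caller is responsible for re-invoking the function with the `remaining` URLs, which is the kind of bookkeeping a long-running job avoids.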

What is the Apify platform?

Apify is a serverless computing platform built to serve large-scale, high-performance web scraping and automation needs. It provides ‘Actors’ (serverless microservices), queues, result storages, proxies, scheduling, and integrations accessible through a web interface or API.

Apify allows developers to build and run applications in the cloud without having to manage servers. The Apify platform allocates machine resources on demand, enabling the applications that run on it to scale up. Typically, serverless functions are not designed for long-running tasks, and it isn't easy to move serverless functions around. Apify, however, has solved this problem. Actors - so-called because they perform actions from a script (in the coding sense) - use containers. These containers help maintain environment parity and application consistency during distribution. This combination of containers and the Apify platform, which gives actors direct access to data storage, task creation, scheduling, integrations, and the Apify API, makes Apify an excellent choice for web scraping and automation.

What languages does Apify support?

Most actors on the Apify platform are written in TypeScript or JavaScript, but Apify also makes it easy to run anything you can wrap in a Docker container, including Python code and libraries such as Puppeteer, Playwright, and Selenium.

AWS vs. Apify: the advantages and disadvantages

Run time

Lambda is intended more for event-driven or request-driven use cases, with better support for infrastructure-as-code style deployments. Apify is better suited to long-running jobs because it places no limit on run time.

AWS Lambda functions can be configured to run for up to 15 minutes per execution, while Apify places no limit on the run time of web scraping and browser automation jobs. In this sense, Apify shares EC2’s advantage over Lambda. Web scraping, which can involve crawling websites with thousands of pages, calls for long-running batch operations that take an input, perform a task, and produce an output. That means Apify has the edge when it comes to large web scraping operations.
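
As a rough illustration of what the limit means for a large crawl, the arithmetic can be sketched as follows. The figures (10,000 pages, 2 seconds per page, pages fetched one after another) are assumptions chosen for the example, not measurements.

```python
import math

LAMBDA_MAX_SECONDS = 15 * 60  # per-execution limit cited above


def invocations_needed(pages, seconds_per_page, limit_seconds=LAMBDA_MAX_SECONDS):
    """Number of time-limited executions needed to crawl `pages` sequentially."""
    pages_per_invocation = limit_seconds // seconds_per_page
    if pages_per_invocation < 1:
        raise ValueError("a single page does not fit within the time limit")
    return math.ceil(pages / pages_per_invocation)


# Assumed workload: 10,000 pages at 2 seconds each.
print(invocations_needed(10_000, 2))  # 450 pages fit per execution, so 23 executions
```

With those assumed figures, the crawl has to be split across 23 executions and stitched back together, whereas a single job without a time limit would simply run for about five and a half hours.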

Storage, Docker containers, and RAM

Apify provides integrated scraping and automation-specific storages and utilities, like datasets and request queues, and it also offers simpler scheduling for actors. Apify can run actors with up to 32 GB of RAM, while Lambda provides a maximum of 10 GB of RAM.

With Apify’s actors, you can run anything that can run inside a Docker container. With Lambda, you can also run any container, but the containers have to integrate the Lambda runtime API.

If your application only needs to store data for the duration of a single function invocation, AWS Lambda, with its ephemeral storage, is a great option. If your application needs durable, persistent storage, Apify makes it easier to manage as it provides that storage automatically.
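
For the single-invocation case, a minimal sketch of using scratch space might look like this. It relies on Lambda exposing its ephemeral storage as the `/tmp` directory and assumes `tempfile.gettempdir()` resolves there, as it typically does on Linux. The file name and helper names are hypothetical.

```python
import json
import os
import tempfile


def save_scratch(items, filename="scrape-results.json"):
    """Write intermediate results to the scratch directory and return the path."""
    # Assumption: on Lambda this resolves to /tmp, the ephemeral storage area.
    path = os.path.join(tempfile.gettempdir(), filename)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(items, fh)
    return path


def load_scratch(path):
    """Read results back within the same invocation."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Anything written this way should be treated as disposable: results that need to outlive the invocation have to be copied to a durable store before the function returns.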

Integrations

Lambda provides better integration with other AWS services, while Apify has better GitHub and other version control system (VCS) integrations. Most Apify actors have a well-defined input and output, making them easy to integrate with other apps using the Apify API. With AWS, you have an unstructured raw API whose interface differs from one app to another.
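
To illustrate what a well-defined input looks like in practice, the sketch below builds (but does not send) an HTTP request that starts an actor run with a JSON input. The endpoint path reflects our understanding of the Apify API v2 and should be checked against the current API reference. The actor ID, the `startUrls` input field, and the token value are placeholders for illustration.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"  # assumed base URL of the Apify API


def build_run_request(actor_id, actor_input, token):
    """Build a POST request that starts an actor run with the given JSON input."""
    return urllib.request.Request(
        f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder actor ID and input; the fields an actor accepts depend on the actor.
request = build_run_request(
    "apify~web-scraper",
    {"startUrls": [{"url": "https://example.com"}]},
    token="YOUR_API_TOKEN",
)
# Sending it would be: urllib.request.urlopen(request)
```

Because the input is plain JSON, the same pattern works from any app that can make an HTTP request.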

Web scraping on AWS Lambda vs. the Apify platform

How to build web scrapers on the Apify platform

There are three ways to do web scraping with Apify:

1. Use Apify’s web scrapers

You can build a scraper using one of Apify’s universal scrapers for extracting data from any website (e.g., Web Scraper, Playwright Scraper) or one of the site-specific scrapers (e.g., Google Maps Scraper, Twitter Scraper), and configure it for your use case with your choice of start URLs, page function, and other settings.

2. Crawlee and the Apify SDK

Alternatively, you can use Apify’s web scraping and browser automation library, Crawlee, to build your scraper and use Apify’s tool kit, the Apify SDK, to turn it into an actor on the Apify platform. Crawlee provides you with the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats. With its rich configuration options, you can tweak almost any aspect of Crawlee to suit your needs.

3. Use your own solution

A third option is to migrate your existing solutions to the Apify platform using the Apify SDK and even make your code available for other users on Apify Store. This way, you can use the Apify platform to deliver your web scraping or automation solutions without having to think about infrastructure.
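
To give a feel for what a crawling library takes off your hands in the second option, here is a minimal sketch of the underlying loop: a request queue, link extraction, and a results list. It uses only the Python standard library and is not Crawlee's API. The `crawl` function, its `fetch` callback, and the `max_pages` cap are hypothetical names introduced for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one site; `fetch(url)` must return HTML text."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])  # plays the role of a request queue
    seen = {start_url}          # avoids visiting the same URL twice
    dataset = []                # plays the role of result storage

    while queue and len(dataset) < max_pages:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        dataset.append({"url": url, "links_found": len(parser.links)})

        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay on the starting host and enqueue each URL only once.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return dataset
```

Passing `fetch` in as a callback keeps the sketch testable without network access. A production crawler also needs retries, concurrency, and persistent storage, which is the work a dedicated library and platform are meant to cover.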

Conclusion

Bigger doesn’t necessarily mean better. Going with the industry standard (AWS) isn’t always the best call if you have specific requirements, such as long run times, suitable integrations, data storage, and scheduling for web scraping and automation purposes. The Apify platform and its actors provide notable capabilities that make Apify a real contender for your web scraping and browser automation projects.

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
