Crawlee for Python tutorial (ultimate beginner’s guide)
This hands-on tutorial walks you through building web scrapers using Crawlee for Python. We'll work with both static and dynamic websites, along with practical applications of this versatile library.
Crawlee is Apify’s open-source web scraping library, available for Node.js and now also for Python.
What sets Crawlee apart is its all-in-one approach to web scraping. Built on top of BeautifulSoup and Playwright, it allows you to easily switch between these libraries, depending on your needs. Whether you’re crawling links, scraping data, or storing it in machine-readable formats, Crawlee simplifies the process by handling the technical details for you.
All the code we will write during this article is available on GitHub if you would like to clone it and test things out yourself. The main branch contains a Zappos.com scraper built with Crawlee’s BeautifulSoupCrawler and the mintmobile-playwright branch contains the MintMobile Scraper built with Crawlee’s PlaywrightCrawler.
Now, we all know the best way to learn is by doing. So, let’s roll up our sleeves and see what Crawlee can do.
Installation and setup
Using the Crawlee CLI with pipx
The fastest way to bootstrap a project with Crawlee for Python is by running the command below:
pipx run crawlee create my-crawler
You’ll need pipx installed on your computer for this. If you don’t have it yet, you can find installation instructions on the Pipx GitHub page. If you prefer using pip, you can refer to the setup instructions in the Crawlee documentation.
Once you run the command, the Crawlee CLI will appear. Just select the library you need, and if you’re following this tutorial, choose BeautifulSoup.
You can open the directory using your code editor of choice. You will see a structure similar to the one below:
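A freshly generated project (named my-crawler, as in the CLI command above) typically looks roughly like this; the exact files may vary between template versions:

my-crawler/
├── README.md
├── pyproject.toml
└── my_crawler/
    ├── __init__.py
    ├── __main__.py
    └── routes.py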
Lastly, you’ll need to install poetry to manage the dependencies and run the crawler.
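If you don’t have Poetry yet, one common way to get it is through pipx; then install the project’s dependencies from inside the project directory:

pipx install poetry
poetry install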
As you explore the files, you’ll notice some boilerplate code already set up to crawl the Crawlee documentation. It’s fully functional, and you can run it with the command poetry run python -m crawlee-python-demo if you’re curious. However, in this article, we’ll focus on replicating a real project scenario by adapting this pre-generated code to scrape our target website, Zappos.com.
Initial configuration
The __main__.py file serves as the control center for our scraper. Here, you can choose the underlying library (BS4 or Playwright), set the number of requests per crawl, and manage the URLs in the request queue.
For now, let’s keep everything as it is and update the starting URL to our target website, as shown in the code below:
# __main__.py
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        # Remove the max_requests_per_crawl option to scrape all products.
    )

    await crawler.run(
        [
            'https://www.zappos.com/men/OgL9CsABAuICAhgH.zso',
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())
But hold on, we’re not done. We haven’t written any scraping logic yet. If you’ve been paying close attention, you might have noticed that the request_handler is pointing to a router, which is defined in another file called routes.py. That’s where all our scraping logic will go. So, let’s get started with that file and build our scraper.
Exploring the target website - Zappos.com
I will assume you’re already familiar with web scraping and have built a few scrapers before. So, you know the drill: we need to explore the page and figure out what steps our scraper should take to extract the data we’re after.
Once our scraper accesses the initial URL, it will encounter this selection of men’s hiking products available on Zappos.com.
Our goal is to crawl all the products in the “Men’s Hiking Products” section and extract their data. To achieve this, we need to complete two main tasks: first, enqueue all the listing pages and then crawl each product on those pages.
One helpful feature of this page is the URL’s page parameter, ?p={n}, along with the total number of pages (in this case, 8) displayed at the bottom. With this information, we can use Crawlee to paginate through the website.
Handling website pagination
Now, let’s go back into the code and head over to the routes.py file to implement this scraping logic using Crawlee.
# routes.py
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.basic_crawler.router import Router
from crawlee.models import Request

router = Router[BeautifulSoupCrawlingContext]()


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Default request handler."""
    # Select all 'a' elements with a title attribute containing the word "Page"
    elements = context.soup.select('#searchPagination div:nth-child(2) > span > a[title*="Page"]')

    # Select the last 'a' element from the list and get its text
    last_page_text = elements[-1].text if elements else None

    if last_page_text:
        last_page_number = int(last_page_text)

        # Enqueue a request for each listing page and label it as 'listing'
        await context.add_requests(
            [
                Request.from_url(context.request.loaded_url + f'?p={i}', label='listing')
                for i in range(0, last_page_number)
            ]
        )


@router.handler('listing')
async def listing_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Listing request handler."""
    print(f'Processing {context.request.loaded_url}')

    # Enqueue all product links on the current page and label them as 'product'
    await context.enqueue_links(
        selector='a[itemprop="url"]', label='product'
    )
There’s a lot to break down here, starting with the default_handler. This handler manages the pagination logic of our scraper. We extract the last pagination number text displayed on the website and store it in the last_page_text variable. With this value, we use Crawlee’s add_requests function to queue up requests for each page.
Secondly, you’ll notice that we add a listing label to each request. Labels in Crawlee are useful for targeting these specific URLs in another handler – in this case, the listing_handler – where we define what actions to take on each page.
Finally, note the use of Crawlee’s enqueue_links function in the listing_handler. If no selector is specified, this versatile function can even identify and queue all links on the page. For our purposes, specifying a selector is all we need. Each link that matches the provided selector will be added to the request queue and labeled as product.
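As a quick illustration of that default behavior, a hypothetical handler (with a made-up all_links label, not used by our scraper) could enqueue every link on the page like this:

@router.handler('all_links')
async def all_links_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Hypothetical handler: without a selector, enqueue_links() queues every link it finds."""
    await context.enqueue_links()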
With the product links now labeled as product, we can create a product_handler to define the scraping logic for these pages. But before we get into that, let’s first take a look at the product page and identify the specific information we want to extract.
Our targets are the product’s brand, name, and current price. You’re welcome to expand the scraper later to grab more data from the page, but for our example, these key details will do the trick. So, without further ado, let’s code the product_handler.
# ...rest of the code


@router.handler('product')
async def product_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Product request handler."""
    # Extract necessary elements
    brand_element = context.soup.select_one('h1 > div > span[itemprop="brand"] > a[itemprop="url"]')
    name_element = context.soup.select_one('h1 > div > span[itemprop="brand"] + span')
    price_element = context.soup.select_one('span[itemprop="price"]')

    # Push the product data to Crawlee's data storage
    await context.push_data(
        {
            'url': context.request.loaded_url,
            'brand': brand_element.text if brand_element else None,
            'name': name_element.text if name_element else None,
            'current_price': price_element.attrs['content'] if price_element else None,
        }
    )
For those experienced with web scraping, this code should feel straightforward and familiar. However, you might notice the push_data method sending the scraped data somewhere. That “somewhere” is Crawlee’s data storage.
Now that our code is complete, let’s run it to see where our data ends up and then explore Crawlee’s storage capabilities. Use the command below to get the crawler running.
poetry run python -m crawlee-python-demo
Storage – datasets
After running the crawler for the first time, you'll notice a new storage directory in your project. This directory contains three key components: datasets, where the scraped results are stored; key_value_stores, which can store a variety of file types; and request_queues, which keeps track of the requests made by the crawler.
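On disk, the layout looks roughly like this (file names are illustrative; the default dataset stores one numbered JSON file per item pushed with push_data):

storage/
├── datasets/
│   └── default/
│       ├── 000000001.json
│       └── ...
├── key_value_stores/
│   └── default/
└── request_queues/
    └── default/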
In our Zappos.com scraper, we used the push_data method to store our scraped data in the datasets storage. We’ll explore the key_value_stores in the upcoming “Crawlee + Playwright” section of this article.
Export data to CSV
The storage is great, but what if you want to export your data to a single file in a format like CSV or JSON? With Crawlee, that’s not only possible but also easy to do. Just head to the __main__.py file and add this line of code: await crawler.export_data('output.csv').
# __main__.py
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        # Remove the max_requests_per_crawl option to scrape all products.
    )

    await crawler.run(
        [
            'https://www.zappos.com/men/OgL9CsABAuICAhgH.zso',
        ]
    )

    # Export data to CSV, JSON, or any other supported format
    await crawler.export_data('output.csv')


if __name__ == '__main__':
    asyncio.run(main())
Zappos.com scraper final code (GitHub)
Crawlee + Playwright – MintMobile Scraper
In this section, we'll adapt our existing BeautifulSoup scraper to use Playwright and scrape a dynamic website, MintMobile.
While you could start a new project and select Playwright instead of BeautifulSoup, that approach might not always be practical, especially in real-world scenarios where a website could change from static content to dynamic. Rebuilding a scraper from scratch in such cases would be time-consuming.
Crawlee allows you to change your BeautifulSoup scraper into a Playwright scraper with a few minor changes. Let's walk through this process.
Installation
First, we’ll need to install Crawlee together with its Playwright extra.
pip install 'crawlee[playwright]'
Next, install the Playwright dependencies.
playwright install
And that’s it! Now, let’s review the code changes we need to make.
Switching from BeautifulSoupCrawler to PlaywrightCrawler
Starting with the __main__.py file, we only need to make a couple of changes:
Import PlaywrightCrawler instead of BeautifulSoupCrawler.
Modify the main() function to use PlaywrightCrawler along with its options, as shown in the sketch below.
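Here’s a sketch of the adapted __main__.py after those changes; the browser_type and headless options mirror the PlaywrightCrawler setup we’ll use again later in the proxy section:

# __main__.py
import asyncio

from crawlee.playwright_crawler.playwright_crawler import PlaywrightCrawler

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    crawler = PlaywrightCrawler(
        # Launch a headless Firefox browser to render dynamic content
        browser_type='firefox',
        headless=True,
        request_handler=router,
    )

    await crawler.run(
        [
            'https://phones.mintmobile.com/',
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())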
Next, we’ll adapt the routes and start building our scraper.
Updating routes.py and handling pagination
First, let’s take a look at the MintMobile web page to understand what we want to accomplish with this scraper. Before getting into the code, we’ll focus on getting a general idea of what we need to scrape.
The screenshot above shows our starting URL for the MintMobile phones website. The pagination logic here is quite similar to what we encountered when scraping Zappos.com. So, our first task is to adapt the existing pagination logic to work on MintMobile and set up the listing_handler to enqueue all the products on each page. Since we’re dealing with dynamically generated content, we’ll also adjust the code to use Playwright instead of BeautifulSoup.
# routes.py
# 1. Modified to import PlaywrightCrawlingContext
from crawlee.playwright_crawler import PlaywrightCrawlingContext
from crawlee.basic_crawler.router import Router
from crawlee.models import Request

# We will use the KeyValueStore in the next code snippet
from crawlee.storages import KeyValueStore

# 2. Adjusted the router to use PlaywrightCrawlingContext
router = Router[PlaywrightCrawlingContext]()


# 3. Updated all the handlers and the syntax to align with Playwright's context
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Default request handler."""
    # Select the last pagination element and read its 'data-page' attribute
    last_page_element = await context.page.query_selector('a.facetwp-page.last')
    last_page = await last_page_element.get_attribute('data-page') if last_page_element else None

    if last_page:
        last_page_number = int(last_page)

        # Enqueue a request for each listing page and label it as 'listing'
        await context.add_requests(
            [
                Request.from_url(context.request.loaded_url + f'?_paged={i}', label='listing')
                for i in range(1, last_page_number + 1)
            ]
        )


@router.handler('listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Listing request handler."""
    print(f'Processing {context.request.loaded_url}')

    # Enqueue all product links on the current page and label them as 'product'
    await context.enqueue_links(
        selector='a.button.product_type_variable', label='product'
    )
As you can see, the code logic closely resembles what we previously had for Zappos. Besides updating the CSS selectors to scrape MintMobile, we made three key changes to adapt the code for Playwright:
Imported PlaywrightCrawlingContext
Updated the Router to use PlaywrightCrawlingContext
Adjusted all handlers and syntax to align with Playwright’s context.
Next, we’ll finalize the scraping logic by coding the product_handler and exploring another Crawlee storage option, the KeyValueStore.
Scraping logic and saving items to the KeyValueStore
In this section, we’ll write the product_handler to scrape the name, price, and image of each crawled product. Additionally, we’ll use Playwright to capture a screenshot of each product’s page and save it to Crawlee’s KeyValueStore (KVS). Unlike datasets, the KVS can store almost any type of file, including the PNG screenshots we’ll capture with Playwright.
Here is the code:
# ...rest of the code


@router.handler('product')
async def product_handler(context: PlaywrightCrawlingContext) -> None:
    """Product request handler."""
    # Extract necessary elements
    page_title = await context.page.title()
    image_element = await context.page.query_selector('img.wp-post-image')
    name_element = await context.page.query_selector('h1.product_title')
    price_element = await context.page.query_selector('p.price > span > bdi')

    # Open the default key-value store.
    kvs = await KeyValueStore.open()

    # Capture the screenshot of the page using Playwright's API.
    screenshot = await context.page.screenshot()

    # Store the screenshot in the key-value store.
    await kvs.set_value(
        key=page_title,
        value=screenshot,
        content_type='image/png',
    )

    # Push the product data to Crawlee's data storage
    await context.push_data(
        {
            'url': context.request.loaded_url,
            'image': await image_element.get_attribute('src') if image_element else None,
            'name': await name_element.text_content() if name_element else None,
            'price': await price_element.text_content() if price_element else None,
        }
    )
In the code above, we first extract the relevant data from the page. Next, we open the default KeyValueStore (KVS), take a screenshot of the page, and save it to the KVS. Finally, we push the scraped data to the dataset. After running the code with the command poetry run python -m crawlee-python-demo, your storage should resemble the one in the picture below but with data for all the products in the store.
MintMobile Scraper Final Code (GitHub)
Proxies
Proxies are essential for modern web scraping, as they help scrapers avoid blocking and enable reliable data extraction at scale. In this section, we’ll explore Crawlee’s proxy configuration, including its advanced tiered proxies.
To demonstrate how to integrate proxies into your crawlers, we’ll build on the MintMobile scraper we just created and add proxies to it. The examples in this section provide a general overview of using proxies with Crawlee. For a more in-depth look at features like IP rotation, session management, and proxy inspection, check out the proxy section of the Crawlee documentation.
Proxy configuration
Setting up proxy configuration in Crawlee is simple and requires only minor changes to your code. The example below demonstrates how to configure Crawlee to use proxies for the PlaywrightCrawler, but the same approach applies to the BeautifulSoupCrawler as well.
# __main__.py
import asyncio

from crawlee.playwright_crawler.playwright_crawler import PlaywrightCrawler
# Import the ProxyConfiguration class from the Crawlee package
from crawlee.proxy_configuration import ProxyConfiguration

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    # Add your list of proxy URLs
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-example-1.com/',
            'http://proxy-example-2.com/',
        ]
    )

    crawler = PlaywrightCrawler(
        browser_type='firefox',
        headless=True,
        request_handler=router,
        # Pass the proxy configuration to the crawler
        proxy_configuration=proxy_configuration,
    )

    await crawler.run(
        [
            'https://phones.mintmobile.com/',
        ]
    )

    await crawler.export_data('output.csv')


if __name__ == '__main__':
    asyncio.run(main())
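As mentioned above, the same approach works with the BeautifulSoupCrawler; here’s a minimal sketch, assuming the proxy_configuration object is built exactly as in the example above:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

# Reuse the proxy_configuration built above; everything else stays the same.
crawler = BeautifulSoupCrawler(
    request_handler=router,
    proxy_configuration=proxy_configuration,
)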
Tiered proxies
Great, now you’ve got the basics of using proxies down. But Crawlee can do even better than that. Building scalable crawlers with HTTP proxies is all about finding that sweet spot between keeping costs low and avoiding blocking. Some websites might let you scrape without a proxy, while others might force you to use affordable options like datacenter proxies or pricier residential proxies in more challenging cases.
Crawlee makes this process easier by allowing you to set up multiple tiers of proxy URLs. It starts with the cheapest option and only switches to higher, more reliable tiers if it runs into blocks. Within each active tier, Crawlee rotates through proxies in a round-robin style, keeping your scraper running smoothly and efficiently.
# __main__.py
import asyncio

from crawlee.playwright_crawler.playwright_crawler import PlaywrightCrawler
# Import the ProxyConfiguration class from the Crawlee package
from crawlee.proxy_configuration import ProxyConfiguration

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    # Define proxy tiers, ordered from cheapest to most expensive
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
            # Lower tier, cheaper, preferred as long as they work
            ['http://cheap-datacenter-proxy-1.com/', 'http://cheap-datacenter-proxy-2.com/'],
            # Higher tier, more expensive, used as a fallback
            ['http://expensive-residential-proxy-1.com/', 'http://expensive-residential-proxy-2.com/'],
        ]
    )

    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,  # Pass the proxy configuration to the crawler
        browser_type='firefox',
        headless=True,
        request_handler=router,
        max_requests_per_crawl=5,
    )

    await crawler.run(
        [
            'https://phones.mintmobile.com/',
        ]
    )

    await crawler.export_data('output.csv')


if __name__ == '__main__':
    asyncio.run(main())
As you can see, the configuration remains almost the same. The only change is adding the tiered_proxy_urls option to ProxyConfiguration and listing the proxy URLs from the cheapest to the most expensive, in that order.
Learn more and join Crawlee's community
Congratulations if you’ve made it this far and built both projects included in this tutorial! You should now have a solid grasp of Crawlee’s basics and some of the innovative features it offers for web scraping. However, there’s still much more to explore with Crawlee, and the library is constantly evolving, so you can look forward to even more powerful capabilities in the future.
If you want to help shape Crawlee and connect with other web scraping enthusiasts, join the Crawlee Discord and give the project a star on GitHub to show your support.
Finally, don’t miss this recording of the Crawlee for Python webinar!