Python cache: a complete guide

Everything you need to know about caching in Python.


Hi! We're Apify, a full-stack web scraping and browser automation platform. Our SDK for Python is a great toolkit to help simplify the process of making scrapers to collect web data. This tutorial aims to give you a solid understanding of Python caching for storing that data.

What is Python caching?

Caching is an optimization technique that stores frequently accessed data in a temporary location called a cache for faster retrieval. This significantly improves application performance and reduces the load on primary data sources.

Python offers several approaches to implementing caching:

  1. Data structures: Python's built-in dictionaries can be effectively used as simple caches.
  2. Decorators: The @lru_cache decorator from the functools module provides a convenient way to cache function return values using the Least Recently Used (LRU) eviction policy.
  3. External caching services: Python applications can integrate with external caching services like Memcached and Redis for more advanced caching features and capabilities.

Hardware vs. software

Caching occurs at both the hardware and software levels. Hardware caching uses dedicated components to store data temporarily. The most common example is the CPU cache, which has multiple levels (L1, L2, L3, and sometimes L4) of increasing size but decreasing speed.

Python implements caching at the software level. This typically means storing data in RAM, often using data structures like dictionaries or tools like functools.lru_cache; it doesn't directly manage dedicated hardware caches like L1 or L2.

Cache hits and misses

When data needs to be accessed, the application first searches for it in the cache. If the data is found there, it is directly returned without executing further instructions. This is known as a cache hit.

If the data is not present in the cache (a cache miss), it is loaded from the underlying storage and then copied into the cache for future use. This ensures the cache stays updated with frequently accessed data.

In other words, if the data is not found in the first level of cache (L1), the system will search for it in the next level (L2), and so on. The more cache levels a system has to check, the longer it takes to complete a request. This can lead to an increased cache miss rate, especially if the system needs to look into the main database to fetch the requested data.
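
As a rough illustration of hits and misses at the application level (the function and loader below are made up for this example), a simple dictionary cache can track its own statistics:

cache = {}
stats = {"hits": 0, "misses": 0}

def get_user(user_id, load_from_db):
    """Return a user record, counting cache hits and misses."""
    if user_id in cache:
        stats["hits"] += 1        # cache hit: served from memory
        return cache[user_id]
    stats["misses"] += 1          # cache miss: load from the source and store it
    cache[user_id] = load_from_db(user_id)
    return cache[user_id]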

Cache expiration and eviction

Caches have limited capacity, and when this limit is reached, we need strategies to decide which data to keep and which to discard. This is where cache expiration and eviction policies come in. Cache expiration removes an entry from the cache after a defined lifetime or if it is not accessed after a certain period, preventing stale data.

On the other hand, cache eviction focuses on removing specific items to make room for new or more relevant data, typically removing the data that is least likely to be accessed again soon. Example policies include LRU, LFU, FIFO, and Random.
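
To make the expiration side concrete, here's a minimal sketch using a plain dictionary and timestamps (the 60-second TTL and helper names are illustrative, not from any library):

import time

TTL_SECONDS = 60          # illustrative lifetime for each cached entry
_cache = {}               # maps key -> (value, time it was stored)

def cache_set(key, value):
    _cache[key] = (value, time.time())

def cache_get(key):
    entry = _cache.get(key)
    if entry is None:
        return None                       # never cached
    value, stored_at = entry
    if time.time() - stored_at > TTL_SECONDS:
        del _cache[key]                   # expired: treat it as a miss
        return None
    return value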


Implementing Python cache

There are multiple ways to implement caching in Python, such as using a dictionary, creating a manual decorator, using methods from the built-in functools module, using third-party modules like cachetools or integrating with external services like Redis or Memcached. Let's explore these methods and their proper usage.

Imagine you're building a newsreader app that gathers news from various sources. When you launch the app, it downloads the news content from the server and displays it.

Now, consider what happens if you navigate between a couple of articles. The app will fetch the content from the server again each time unless you cache it. This constant retrieval puts unnecessary pressure on the server hosting the news.

A smarter approach is to store the content locally after downloading each article. Then, the next time you open the same article, the app can retrieve it from the local copy instead of contacting the server again.

There are many other scenarios where an application makes multiple identical requests to a server. By using Python caching effectively, you can optimize application performance and memory usage.

Python caching using a dictionary

You can create a cache using a Python dictionary because reading data from a dictionary is fast (O(1) time).

In the newsreader example, instead of fetching data from the server every time, you can check whether the content is already in your cache. If it is, you retrieve it from the cache rather than contacting the server again. You can use the article's URL as the key and its content as the value.

Here's an example of how this caching technique might be implemented. We've defined a global dictionary to serve as the cache, along with two functions:

  1. get_content_from_server: This function fetches data from the server only if it's not already in the cache.
  2. get_content: This function first checks for the data in the cache. If it's found, the cached data is retrieved and returned. If not, the function calls get_content_from_server to fetch the data from the server. The fetched data is then stored in the cache for future use and returned.
import requests
import time

# A simple cache dictionary to store fetched content for reuse
cache = {}

# Function to fetch content from the server using the provided URL
def get_content_from_server(url):
    print("Fetching content from server...")
    response = requests.get(url)
    return response.text

# Function to get content, checking the cache first before fetching from the server
def get_content(url):
    print("Getting content...")
    if url not in cache:
        # If not in the cache, fetch content from the server and store it in the cache
        cache[url] = get_content_from_server(url)
    return cache[url]

# Main block of code
if __name__ == "__main__":
    # First Pass
    print("1st Pass:")
    start = time.time()
    get_content("<https://books.toscrape.com/>")
    print("Total time: ", time.time() - start)

    # Second Pass
    print("\\\\n2nd Pass:")
    start = time.time()
    get_content("<https://books.toscrape.com/>")
    print("Total time: ", time.time() - start)

And here’s the output:

Python caching using a dictionary

Notice that "Getting content..." is printed twice, while "Fetching content from server..." is printed only once. You'll also observe a significant difference in time. This occurs because, after initially accessing the article, its URL and content are stored in the cache dictionary, which naturally takes time. The second time, since the item is already in the cache, the code doesn't need to fetch it from the server again, resulting in a much faster retrieval.

Python caching using a manual decorator

In Python, a decorator is a function that takes another function as an argument and returns a modified function. This allows you to alter the behavior of the original function without directly changing its source code.

One common use case for decorators is implementing caching in recursive functions. Recursive functions often call themselves with the same input values, leading to redundant calculations.

Let's begin by creating a function that takes a URL as input, sends a request to that URL, and subsequently returns the response text.

import requests

def fetch_html_data(url):
    response = requests.get(url)
    return response.text

Let's memoize this function using a Python decorator.

def memoize(func):
    cache = {}
    def inner_cached(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return inner_cached

@memoize
def fetch_html_data_cached(url):
    response = requests.get(url)
    return response.text

We define a memoize decorator that creates a cache dictionary to store the results of previous function calls. By adding @memoize above the fetch_html_data_cached function, we ensure that it makes only a single network request for each distinct URL and then stores the response in the cache for subsequent requests.

Here's how the inner_cached function, which is part of the memoize decorator, works:

  1. It first determines whether the current input arguments have already been cached.
  2. If a cached result exists for the given arguments, it is immediately returned, avoiding redundant network calls.
  3. If no cached result is found, the code calls the original fetch_html_data_cached function to fetch the data from the network.
  4. The retrieved response is then stored in the cache before being returned.
import time
import requests

# Function to retrieve HTML content from a given URL
def fetch_html_data(url):
    # Send a GET request to the specified URL and return the response text
    response = requests.get(url)
    return response.text

# Memoization decorator to cache function results
def memoize(func):
    cache = {}

    # Inner function to store and retrieve data from the cache
    def inner_cached(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return inner_cached

# Memoized function to retrieve HTML content with caching
@memoize
def fetch_html_data_cached(url):
    # Send a GET request to the specified URL and return the response text
    response = requests.get(url)
    return response.text

# Make 10 requests using the normal function
start_time = time.time()
for _ in range(10):
    fetch_html_data("<https://books.toscrape.com/>")
print("Time taken (normal function):", time.time() - start_time)

# Make 10 requests using the memoized function
start_time = time.time()
for _ in range(10):
    fetch_html_data_cached("https://books.toscrape.com/")
print("Time taken (memoized function):", time.time() - start_time)

Here’s the output:

Python caching manual decorator output

We're making 10 requests with each function, and the time difference is already significant. What happens with hundreds of requests? The gap would grow enormously. This is how Python caching saves us time.

Caching considerations

Caching offers speed but has a finite size: we cannot cache data indefinitely. Think of a library table: the number of books you can keep on it is limited. Similarly, when caching data, we must consider the memory footprint (the amount of space the cache occupies in memory).

Without memory management, an application would keep storing new items in memory, leading to:

  • Cache growth: The cache's memory footprint would expand, potentially consuming excessive memory resources.
  • Performance degradation: Memory pressure can slow down the application or even cause it to crash.
  • Scalability challenges: Large caches can hinder the efficient scaling of the application.

Before moving to caching strategies, have a look at a few timing considerations:

  • Access time: Results for previously computed arguments should be retrieved quickly, ideally in O(1) time.
  • Insertion time: New data should be inserted into the cache, preferably in O(1) time (depending on implementation, some may take O(n) time, so choose wisely).
  • Deletion time: When the cache reaches its capacity, data must be removed according to the chosen caching strategy.
1. Least Recently Used (LRU)

The Least Recently Used (LRU) cache works based on the principle that the data most recently accessed is likely to be reaccessed soon. To ensure that the most relevant data remains available, the LRU cache evicts the least recently accessed items first. It offers a good balance between performance and memory usage, making it suitable for a wide variety of tasks.
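
To see how an LRU policy can be implemented by hand, here's a minimal sketch built on collections.OrderedDict (the class name and maxsize are illustrative; Python's @lru_cache does this for you):

from collections import OrderedDict

class SimpleLRUCache:
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                     # cache miss
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used item

cache = SimpleLRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" is now the most recently used entry
cache.put("c", 3)       # evicts "b", the least recently used entry
print(cache.get("b"))   # None: "b" was evicted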

2. Most Recently Used (MRU)

The MRU policy replaces the cache element that was used most recently. This is based on the assumption that the least recently used entries are the most likely to be reused. MRU can be considered the opposite of the LRU.

3. Least Frequently Used (LFU)

The LFU policy removes the item from the cache that has been used the least since its initial entry. This is based on the principle that entries with a high number of hits are more likely to be reused in the future.

4. First-In/First-Out (FIFO)

When storage capacity reaches its limit, the oldest data will be discarded first to make room for newer entries, which are more likely to be reused.

5. Last-In/First-Out (LIFO)

Like a stack data structure, it evicts the most recent entries. This type is ideal when access to the oldest data is more important.

💡
KEY TAKEAWAYS

➡️ Choose strategies that align with your application's data access patterns. Give priority to frequently accessed, static data.

➡️ Use efficient data structures like hash tables for fast access and insertion. Avoid O(n) operations that slow down as data size increases.

➡️ In addition, ensure that cached data is invalidated when the source changes so that data integrity is maintained.

Python caching using the @lru_cache decorator

The functools library in Python provides a variety of decorators for caching, one of the most common being the @lru_cache decorator. Similar to other caching solutions, @lru_cache uses a dictionary to store function calls and their corresponding results. This enables efficient retrieval of results when a function is called with the same arguments again.

However, managing cache size is crucial to avoid memory issues. The @lru_cache decorator offers a maxsize attribute that controls the maximum number of cached entries. By default, maxsize is set to 128. If you set maxsize to None, the cache will grow indefinitely, potentially leading to memory problems when storing a large number of distinct calls.

To use the @lru_cache decorator, you can create a new function for extracting HTML content and place the decorator name above the function definition. Ensure that you import the functools module before using the decorator.

from functools import lru_cache

@lru_cache(maxsize=16)
def fetch_html_content_lru(url):
    response = requests.get(url)
    return response.text

In the example above, the fetch_html_content_lru function is memoized using the @lru_cache decorator, with the cache limited to 16 items. Once the cache is full, each new entry causes the decorator to evict the least recently used item to make room for it.

Here’s the complete code:

from functools import lru_cache
import time
import requests

# Function to fetch the HTML content from a given URL
def fetch_html_content(url):
    response = requests.get(url)
    return response.text

# Memoized version using LRU Cache
@lru_cache(maxsize=16)
def fetch_html_content_lru(url):
    response = requests.get(url)
    return response.text

# Measure the time taken for the original function
start_time = time.time()
fetch_html_content("<https://books.toscrape.com/>")
print("Time taken (original function):", time.time() - start_time)

# Measure the time taken for the memoized function (LRU cache)
start_time = time.time()
fetch_html_content_lru("<https://books.toscrape.com>")
print("Time taken (memoized function with LRU cache):", time.time() - start_time)

Here’s the output:

Cache decorator

The time difference for the above code is not significant. However, with a larger number of repeated calls, the difference grows substantially. To illustrate this, let's calculate fibonacci(40).

from functools import lru_cache
import time

def calculate_fibonacci(n: int) -> int:
    if n == 0 or n == 1:
        return 1
    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)

@lru_cache(maxsize=16)
def calculate_fibonacci_lru(n: int) -> int:
    if n == 0 or n == 1:
        return 1
    return calculate_fibonacci_lru(n-1) + calculate_fibonacci_lru(n-2)

start_time = time.time()
calculate_fibonacci(40)
print("Time taken (Without LRU Cache):", time.time() - start_time)

start_time = time.time()
calculate_fibonacci_lru(40)
print("\\\\nTime taken (With LRU Cache):", time.time() - start_time)
print("Cache  info: ", calculate_fibonacci_lru.cache_info())

Here’s the result:

Terminal cache info time taken

You can see the significant difference in time. Now let's break down cache_info():

  • hits=38 is the number of calls that @lru_cache returned directly from memory because they were already cached.
  • misses=41 is the number of calls that weren't in the cache and had to be computed.
  • maxsize=16 is the maximum size of the cache, as defined by the maxsize attribute of the decorator.
  • currsize=16 is the current size of the cache, indicating that it's full.
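
If you ever need to reset these statistics or free the memory, functions decorated with @lru_cache also expose a cache_clear() method:

calculate_fibonacci_lru.cache_clear()         # empty the cache
print(calculate_fibonacci_lru.cache_info())   # hits=0, misses=0, currsize=0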

There is another important decorator provided by the functools module called @cached_property. This decorator has been available in Django for many years, but it was only added to the Python language itself in version 3.8, released in October 2019.

Similar to @lru_cache, @cached_property caches the results of expensive function calls. However, it differs in that it can only be applied to methods, which are functions that belong to an object. Additionally, it can only be used on methods that have no parameters other than self.

This decorator is particularly useful for computationally expensive properties, as it saves the cost of recomputing the property's value on each access.

Here's an example demonstrating the usage of @cached_property:

from functools import cached_property

class Circle:
    def __init__(self, radius):
        self.radius = radius

    @cached_property
    def area(self):
        print("Calculating area...")
        return 3.14159 * self.radius**2

circle = Circle(5)
print(circle.area)
print(circle.area)
print(circle.area)
print(circle.area)

Here’s the output:

Caching terminal cached property

In the provided example, the area property is decorated with cached_property. The first time circle.area is accessed, the area calculation is performed and the result is stored in a cache (as shown by the message "Calculating area..."). Subsequent accesses to circle.area retrieve the cached value, which is why the message isn't printed again.
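
Because @cached_property stores the computed value on the instance, you can invalidate it by deleting the attribute; the next access recomputes it:

del circle.area       # drop the cached value
print(circle.area)    # "Calculating area..." is printed again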

Let's compare Python dictionaries and @lru_cache for different use cases. First, take a look at the image below, which shows the number of function calls made for calculating Fibonacci(40) using Python dictionaries and @lru_cache.

cache function calls

As you can see, when using a Python dictionary, a total of 83 calls were made. In contrast, using the @lru_cache decorator, only 45 calls were needed. Almost half!

In general, if your primary goal is to cache function results for performance optimization, @lru_cache is often the preferred choice due to its simplicity and efficiency. However, if you need greater control over data storage or have very large caches, dictionaries are a more suitable option.

Here's a feature-by-feature comparison:

  • Purpose: Python dictionaries are for general data storage and retrieval; @lru_cache is for caching function results.
  • Control: with a dictionary, you manage everything explicitly; with @lru_cache, the decorator takes care of it.
  • Persistence: dictionary data stays until you remove it; @lru_cache evicts entries based on least recent use.
  • Size limit: dictionaries have no limit; @lru_cache is limited by the maximum cache size (maxsize).

Here’s the code:

import cProfile
from functools import lru_cache

cache = {}
def fibonacci_cache_dict(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n in cache:
        return cache[n]
    cache[n] = fibonacci_cache_dict(n - 1) + fibonacci_cache_dict(n - 2)
    return cache[n]

@lru_cache
def fibonacci_cache_lru(n):
    if n == 0 or n == 1:
        return 1
    return fibonacci_cache_lru(n - 1) + fibonacci_cache_lru(n - 2)

def main():
    # fibonacci_cache_dict(40)
    fibonacci_cache_lru(40)

if __name__ == "__main__":
    cProfile.run("main()")

Python caching using third-party library cachetools

Cachetools is a Python module that provides various memoizing collections and decorators, including variants of the Python standard library's @lru_cache function decorator, along with cache classes such as LFUCache, TTLCache, and RRCache.

To use the cachetools module, first install it using pip:

$ pip install cachetools

Cachetools provides five main components:

  1. cached
  2. LRUCache
  3. TTLCache
  4. LFUCache
  5. RRCache
1. Cached

In cachetools, the cached function serves as a decorator. It accepts a cache object as an argument, and this object manages the storage and retrieval of cached data.

from cachetools import cached, LRUCache
import time
import requests

# API endpoint for fetching TODOs
endpoint = "<https://jsonplaceholder.typicode.com/todos/>"

@cached(cache={}))
def fetch_data(id):
    start_time = time.time()

    # Requesting different IDs
    response = requests.get(endpoint + f"{id}")
    print("\\\\nFetching data, time taken: ", time.time() - start_time)

    # Checking if the response status code is 200 (OK) before extracting the title
    return response.json().get("title") if response.status_code == 200 else None

# Example calls to fetch_data with different IDs
print(fetch_data(1))  # Fetching and caching data for ID 1
print(fetch_data(2))  # Fetching and caching data for ID 2
print(fetch_data(3))  # Fetching and caching data for ID 3

# Repeating some calls to demonstrate caching behavior
print(fetch_data(1))  # Retrieving cached data for ID 1
print(fetch_data(2))  # Retrieving cached data for ID 2
print(fetch_data(3))  # Retrieving cached data for ID 3

Here, the cache={} argument initializes the cache as an empty dictionary to store subsequent results.

data fetched from the cache

Have you noticed something in the image above? You're right: in step 4, all the data is fetched from the cache. For IDs 1, 2, and 3, the fetch_data function executes and prints the message "Fetching data, time taken". However, when we request the data for the same IDs again, it's served from the cache, so no message is printed.

2. LRUCache

The LRUCache class is used within the cached decorator. It accepts a maxsize parameter and evicts the least recently used items first to make space when needed.

from cachetools import cached, LRUCache
import time
import requests

# API endpoint for fetching TODOs
endpoint = "<https://jsonplaceholder.typicode.com/todos/>"

# Using the LRUCache decorator function with a maximum cache size of 3
@cached(cache=LRUCache(maxsize=3))
def fetch_data(id):
    start_time = time.time()

    # Requesting different IDs
    response = requests.get(endpoint + f"{id}")
    print(f"\\\\nFetching data for ID {id}, time taken: ", time.time() - start_time)

    # Checking if the response status code is 200 (OK) before extracting the title
    return response.json().get("title") if response.status_code == 200 else None

# Example calls to fetch_data with different IDs
print(fetch_data(1))  # Fetching and caching data for ID 1
print(fetch_data(2))  # Fetching and caching data for ID 2
print(fetch_data(3))  # Fetching and caching data for ID 3

# Repeating some calls to demonstrate caching behavior
print(fetch_data(1))  # Retrieving cached data for ID 1
print(fetch_data(2))  # Retrieving cached data for ID 2
print(fetch_data(4))  # Fetching and caching data for a new ID (4)

print(fetch_data(3))  # ID 3 was evicted (least recently used), so it's fetched from the server again

Here’s the output:

LRUCache decorator fetching data
3. TTLCache

TTLCache, or "Time to Live" cache, takes two parameters: maxsize and ttl. maxsize works the same way as in LRUCache, while ttl defines how long an entry stays in the cache before it expires.

from cachetools import cached, TTLCache
import time
import requests

endpoint = "<https://jsonplaceholder.typicode.com/todos/>"

@cached(cache=TTLCache(maxsize=3, ttl=20))
def fetch_data(id):
    start_time = time.time()

    response = requests.get(endpoint + f"{id}")
    time.sleep(id)
    print(f"\\\\nFetching data for ID {id}, time taken: ", time.time() - start_time)

    return response.json().get("title") if response.status_code == 200 else None

print(fetch_data(1))
print(fetch_data(2))
print(fetch_data(3))

print("\\\\nI'm waiting...")
time.sleep(18)

print(fetch_data(1))
print(fetch_data(2))
print(fetch_data(3))

Here’s the output:

TTLCache fetching data

In the code above, the TTL (Time to Live) is set to 20 seconds, which means data will remain in the cache for 20 seconds before being cleared.

For IDs 1, 2, and 3, the execution initially involves a 6-second wait (1+2+3 seconds). Following this execution, there's an additional 18-second wait, totaling 24 seconds (18+6). Since 24 seconds exceeds the TTL of 20 seconds, the cache entries for IDs 1, 2, and 3 are cleared.

Consequently, when subsequent calls are made for IDs 1, 2, and 3, the data will be fetched from the server again, as it's no longer available in the cache.

4. LFUCache

LFU (Least Frequently Used) cache is a caching technique that tracks how often items are accessed. It discards the least frequently used items to make space when necessary. LFU cache takes one parameter: maxsize, which specifies the maximum number of items it can hold.

from cachetools import cached, LFUCache
import time
import requests

# API endpoint for fetching TODOs
endpoint = "https://jsonplaceholder.typicode.com/todos/"

# Using the LFUCache decorator function with a maximum cache size of 3
@cached(cache=LFUCache(maxsize=3))
def fetch_data(id):
    start_time = time.time()

    # Requesting different IDs
    response = requests.get(endpoint + f"{id}")
    print("\\\\nFetching data, time taken: ", time.time() - start_time)

    # Checking if the response status code is 200 (OK) before extracting the title
    return response.json().get("title") if response.status_code == 200 else None

# Example calls to fetch_data with different IDs
print(fetch_data(1))  # Fetching and caching data for ID 1
print(fetch_data(2))  # Fetching and caching data for ID 2
print(fetch_data(3))  # Fetching and caching data for ID 3

# Repeating some calls to demonstrate caching behavior
print(fetch_data(1))  # Retrieving cached data for ID 1
print(fetch_data(3))  # Retrieving cached data for ID 3
print(fetch_data(4))  # Fetching and caching data for a new ID (4)

print(fetch_data(2))  # ID 2 was evicted (least frequently used), so it's fetched from the server again

Here’s the output:

LFUCache fetching data
5. RRCache

This class randomly selects items and discards them to make space when necessary. By default, it uses the random.choice() function to select items from the list of cache keys. The class accepts one parameter, maxsize, which specifies the maximum cache size, similar to the LRUCache class.

from cachetools import cached, RRCache
import time
import requests

# API endpoint for fetching TODOs
endpoint = "<https://jsonplaceholder.typicode.com/todos/>"

# Using the RRCache decorator function with a maximum cache size of 3
@cached(cache=RRCache(maxsize=2))
def fetch_data(id):
    start_time = time.time()

    # Requesting different IDs
    response = requests.get(endpoint + f"{id}")
    print(f"\\\\nFetching data for ID {id}, time taken: ", time.time() - start_time)

    # Checking if the response status code is 200 (OK) before extracting the title
    return response.json().get("title") if response.status_code == 200 else None

print(fetch_data(3))
print(fetch_data(2))
print(fetch_data(3))
print(fetch_data(1))
print(fetch_data(3))
print(fetch_data(2))

Here’s the output:

RRCache fetching data

Python caching using Redis

Multiple external in-memory cache services, such as Memcached and Redis, can be integrated with Python to cache data. These external caches are extremely powerful, offer a wide variety of features, and take care of all the complications of creating and maintaining the cache.

Redis is an in-memory data store that can be used as a caching engine. Since it keeps data in RAM, Redis delivers it very quickly. Memcached is another popular in-memory caching system. Both are significantly faster than traditional databases and in-memory caching libraries.

They are designed for high-speed data access and efficient handling of large request volumes, making them suitable for improving the speed and scalability of web applications. Many people agree that Redis is superior to Memcached in most circumstances.

In this section, we'll explore how to implement caching with Python and Redis. To interact with Redis from Python, you'll first need to install redis-py, the Python client library that provides a user-friendly API for communicating with Redis servers. redis-py is a well-established and robust library that lets you interact with Redis directly through Python function calls.

$ pip install redis

The two fundamental commands for interacting with Redis are SET (or SETEX) and GET:

  • SET is used to assign a value to a key. You can optionally specify an expiration time using the EX argument.
  • GET is used to retrieve the value associated with a given key.

If you don't have a Redis server running, you can start one locally with Docker:

docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
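
As a quick sketch of these two commands from Python (assuming a Redis server is reachable on localhost:6379), redis-py exposes them as set() and get():

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("greeting", "hello", ex=10)   # SET with a 10-second expiration (equivalent to SETEX)
print(r.get("greeting"))            # "hello" until the key expires, then None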

Here’s the code:

# Import necessary libraries
import requests
import redis  # for using Redis as a cache
import time

# Initialize a Redis client connecting to localhost and default port 6379
redis_client = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

# Function to get data from an API with caching in Redis
def get_api_data(api_url):
    # Check if data is already cached in Redis
    cached_data = redis_client.get(api_url)

    if cached_data:
        print(f"Data for {api_url} found in cache!")
        return cached_data
    else:
        try:
            # Request the API
            response = requests.get(api_url)

            # Check for any HTTP errors in the response
            response.raise_for_status()

            # Cache the API response in Redis with a timeout of 10 seconds
            redis_client.setex(api_url, 10, response.text)
            print(f"Data for {api_url} fetched from API and cached!")
            return response.text

        # Handle request exceptions, if any
        except requests.RequestException as e:
            print(f"Error fetching data from {api_url}: {e}")
            return None

# List of API endpoints to be fetched
api_endpoints = [
    "https://jsonplaceholder.typicode.com/todos/1",
    "https://jsonplaceholder.typicode.com/posts/2",
    "https://jsonplaceholder.typicode.com/users/3",
    "https://jsonplaceholder.typicode.com/todos/1",
    "https://jsonplaceholder.typicode.com/posts/2",
    "https://jsonplaceholder.typicode.com/users/3",
]

# Loop through the API endpoints, fetching and caching data with a 2-second delay
for endpoint in api_endpoints:
    time.sleep(2)
    data = get_api_data(endpoint)

In the code above, we have a couple of URLs from which to fetch data. We'll make a call for each URL with a sleep time of 2 seconds between calls. The function get_api_data first checks for the data of the URL in the Redis cache using the GET command. If it's there, it pulls it from the cache. If it's not, it fetches the data from the server and saves it to the Redis cache using the SETEX command with an expiration time of 10 seconds.

Here’s the output:

Python caching using Redis

The above output shows that the data is successfully fetched and saved to the cache. Subsequent attempts to access data for the same URL result in retrieval from the Redis cache, as expected.

Now, let's consider what happens if we change the expiration time to 5 seconds and rerun the code. In this case, each time a request is made, the data will be fetched from the server instead of the cache. This occurs because:

  1. When a URL is accessed for the first time, the data is retrieved from the server and stored in the cache with an expiration time of 5 seconds.
  2. After 5 seconds, the cached data expires and is automatically cleared from the cache.
  3. If another request is made for the same URL, the cache is empty, so the data will be fetched again from the server.

Impact of caching on application performance and memory usage

Caching plays a crucial role in optimizing application performance and memory usage. It works by storing frequently accessed data in a faster location, such as RAM. This means when the application needs that data again, it can retrieve it from the cache in milliseconds, significantly faster than retrieving it from the main source in seconds or even minutes.

By eliminating the need for frequent trips back to the source (database, server, etc.) for every request, caching reduces the workload on backend systems, resulting in faster loading times, smoother responsiveness, and a better overall user experience, in exchange for some extra memory to hold the cache.

Calculating the 40th Fibonacci number without caching involves redundantly performing the same computations over and over. By caching intermediate results, you can significantly reduce the computation time.

Below, we've defined two functions: fibonacci() and fibonacci_cache(). The first function does not use caching, while the second one does. The fibonacci() function works fine, but it's computationally expensive because it recalculates values of n multiple times.

On the other hand, fibonacci_cache() stores the results of calculations using the LRU eviction policy, preventing redundant computations in the recursive process.

import time
from functools import lru_cache

def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

@lru_cache(maxsize=16)
def fibonacci_cache(n):
    if n < 2:
        return n
    return fibonacci_cache(n - 1) + fibonacci_cache(n - 2)

if __name__ == "__main__":
    print("Time taken, without cache: ")
    start = time.time()
    fibonacci(40)
    print(time.time() - start)

    print("\\\\nTime taken, with cache: ")
    start = time.time()
    fibonacci_cache(40)
    print(time.time() - start)

Let's take a look at the visuals to understand how the fibonacci() function computes the results. Notice the high number of redundant calculations (highlighted in red) needed to calculate Fibonacci(6). These repeated calls significantly increase the computation time.

Fibonacci Number Recursive Implementation
Source: Viblo

To optimize performance, we're using the @lru_cache decorator from the functools module. This decorator implements a cache based on the Least Recently Used (LRU) strategy, automatically discarding data that hasn't been accessed recently. From this point, any previously calculated data will be efficiently retrieved from the cache rather than being recalculated.

Fibonacci Number Recursive Implementation (Memoized)
Source: Viblo

Here's the output for Fibonacci(40), showcasing the significant difference in execution time with and without caching:

Fibonacci caching execution time

Rules and best practices for caching

Caching is a technique that trades space for time: it uses extra memory (the cache) in exchange for faster data retrieval. Used effectively, this trade-off can boost performance significantly. That said, caching might not be the only way to improve the performance of an application.

When implementing caching in an application, developers must identify the functions where caching will actually help. Look for functions that receive the same inputs repeatedly, have long execution times, or frequently access data that rarely changes. If a function is rarely called, runs very quickly, or works with data that changes frequently, it shouldn't be cached.

Following these guidelines can help you implement caching effectively.

1. Identify performance bottlenecks

Within an application, certain functions naturally take longer to execute than others. This can be due to heavy calculations, processing, or repeated retrieval of the same data from slow sources such as databases or external APIs. These functions are considered performance bottlenecks and can significantly impact overall application speed.

To optimize performance through caching, it's essential to identify these bottlenecks first. Developers can pinpoint bottleneck functions in two primary ways:

  1. Manually examine the code to evaluate the execution time of individual lines within each function.
  2. Use Python profilers to measure the execution time of individual functions automatically.

For example, the fetch_content() function below downloads the same page three times in a row, making it an obvious caching candidate:

import requests

def fetch_content(url):
    response = requests.get(url)
    return response.text

if __name__ == "__main__":
    fetch_content("<https://books.toscrape.com/>")
    fetch_content("<https://books.toscrape.com/>")
    fetch_content("<https://books.toscrape.com/>")

However, if the data being fetched is updated frequently, caching becomes ineffective as it may result in providing outdated information to the application. Therefore, it's crucial to determine when to invalidate the cache and retrieve fresh data.

2. Ensure caching provides a significant performance boost

Ensure that retrieving data from the newly introduced cache is faster than executing the original function directly. In this example, the fetch_content() function has a cache that stores the URL as the key and the content as the value.

import requests

cache = {}

def fetch_content(url):
    if url not in cache:
        cache[url] = requests.get(url).text
    return cache[url]

if __name__ == "__main__":
    fetch_content("<https://books.toscrape.com/>")
    fetch_content("<https://books.toscrape.com/>")
    fetch_content("<https://books.toscrape.com/>")

If the time required to check the cache and retrieve content is similar to the time needed to make a direct request, implementing caching may not yield significant performance benefits. As a developer, it's crucial to ensure that caching measurably improves application performance.
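
One simple way to sanity-check the benefit, as a sketch reusing the fetch_content() function defined above, is to time the first (uncached) and second (cached) calls with time.perf_counter():

import time

start = time.perf_counter()
fetch_content("https://books.toscrape.com/")     # first call: goes to the network
uncached = time.perf_counter() - start

start = time.perf_counter()
fetch_content("https://books.toscrape.com/")     # second call: served from the dictionary
cached = time.perf_counter() - start

print(f"Uncached: {uncached:.4f}s, cached: {cached:.6f}s")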

3. Manage the memory footprint of the application

Developers should carefully manage the amount of main memory used by their applications. This is known as the application's memory footprint.

Python caches data in the main memory. Using a larger cache or storing unnecessary information can increase the memory footprint. Developers should be mindful of the cache's data structures and store only essential information.

Consider the following example:

Suppose there is a college alumni portal that displays information about its graduates. Each alumnus has various details stored in the database, such as ID, name, email address, physical address, and phone number.

College alumni portal graduate information

However, the dashboard displays only the names of the alumni. Therefore, when building the application, it's more efficient to cache only the alumni names rather than their entire information.

First, we'll store the complete alumni information in the cache and calculate its total size. We'll then store only the names in a separate data structure named cache_optimized. Finally, we'll use a total_size function to calculate the size of each cache and compare their memory footprints.

from sys import getsizeof
from itertools import chain
import requests

cache = {}
cache_optimized = {}

url = "<https://jsonplaceholder.typicode.com/users/>"

# Store only the name of the Alumni
def get_content(id):
    if id not in cache:
        cache[id] = requests.get(f"{url}+{id}").json()
    return cache[id]

# Store all the information about the Alumni
def get_content_optimized(id):
    if id not in cache_optimized:
        cache_optimized[id] = requests.get(f"{url}+{id}").json().get("name")
    return cache_optimized[id]

# ...
# Source of Code: <https://code.activestate.com/recipes/577504/>
def total_size(o, handlers={}):
    def dict_handler(d):
        return chain.from_iterable(d.items())

    all_handlers = {dict: dict_handler}
    all_handlers.update(handlers)
    seen = set()
    default_size = getsizeof(0)

    def sizeof(o):
        if id(o) in seen:
            return 0
        seen.add(id(o))
        s = getsizeof(o, default_size)

        for typ, handler in all_handlers.items():
            if isinstance(o, typ):
                s += sum(map(sizeof, handler(o)))
                break
        return s

    return sizeof(o)
# ...

if __name__ == "__main__":
    get_content(1)
    get_content_optimized(1)
    print("Size of Cache: " + str(total_size(cache)) + " Bytes")
    print("Size of Optimized Cache: " + str(total_size(cache_optimized)) + " Bytes")

Here’s the output:

Cached information memory

As we can see, caching unnecessary information increases the memory footprint.

Python caching in the real world

We've explored various approaches to implementing Python caching. Let's now code a simple real-world application that uses this concept.

In this application, users will only need to enter their ID to retrieve all of their associated information. The application will fetch this information from the URL (https://jsonplaceholder.typicode.com/users/), which contains data for multiple users.

typicode.com user data

To prevent redundant server calls and potential performance degradation, we'll implement caching and store user data in a JSON file called users.json.

The application will follow these steps:

  1. On receiving a user request, the application first searches for the requested information within the users.json cache file.
  2. If the data is found in the cache and not expired, it's retrieved and displayed to the user immediately. This eliminates the need to make an external server call.
  3. If the information isn't present in the cache, the application fetches all of the user's information from the specified URL.
  4. The retrieved information is then saved to the users.json cache file with the expiration time for future use and returned to the user.
import json
import requests
from datetime import datetime, timedelta

class UserData:
    def __init__(self, data_url: str, cache_filename: str, ttl_seconds: int):
        self.data_url = data_url
        self.cache_filename = cache_filename
        self.ttl_seconds = ttl_seconds

    def fetch_data(self, user_id: int):
        local_data = self.read_local_data()
        if local_data and isinstance(local_data, dict):
            user_data_list = local_data.get("data", [])
            user_data = next(
                (ud for ud in user_data_list if ud.get("id") == user_id), None)
            if user_data and self.is_data_valid(local_data):
                print("Data found in cache, fetching...")
                return user_data
        print("\\nFetching new JSON data from the server...")
        try:
            response = requests.get(self.data_url)
            response.raise_for_status()
            json_data = response.json()

            expiration_time = datetime.now() + timedelta(seconds=self.ttl_seconds)
            data_with_ttl = {"data": json_data,
                             "expiration_time": expiration_time.isoformat()}

            with open(self.cache_filename, "w") as file:
                json.dump(data_with_ttl, file, indent=2)

            user_data = next(
                (ud for ud in json_data if ud.get("id") == user_id), None)
            if user_data:
                return user_data
        except (requests.RequestException, json.JSONDecodeError) as e:
            print(f"Error fetching data: {e}")
        return "Data not found, try again!"

    def read_local_data(self):
        try:
            with open(self.cache_filename, "r") as file:
                data_with_ttl = json.load(file)
                return data_with_ttl
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def is_data_valid(self, data_with_ttl):
        expiration_time_str = data_with_ttl.get("expiration_time")
        if expiration_time_str:
            expiration_time = datetime.fromisoformat(expiration_time_str)
            return datetime.now() < expiration_time
        return False

if __name__ == "__main__":
    user_id = input("Enter ID: ")
    data_url = "<https://jsonplaceholder.typicode.com/users/>"
    cache_filename = "users.json"
    ttl_seconds = 60

    data_fetcher = UserData(data_url, cache_filename, ttl_seconds)
    fetched_data = data_fetcher.fetch_data(int(user_id))

    print(f"\\nData Fetched for ID ({user_id}): {fetched_data}")

Here’s the output:

Python caching id data

In the scenario described above, searching for an ID for the first time fetches the information from the server and stores it in the cache. This information has an expiration time of 60 seconds, as defined in our code by setting the TTL to 60 seconds. Subsequent searches for other IDs first check the cache. If the data for the requested ID is present and hasn't expired, it's retrieved from the cache (significantly faster than fetching from the server). If the data is expired or doesn't exist in the cache, it's fetched from the server.

Enhancing efficiency with caching in web crawling and scraping

Caching can significantly optimize both crawling and scraping operations by storing previously fetched data. This means that when the same information is needed again, it can be retrieved from the cache rather than by querying the website anew. This not only speeds up the process but also reduces the load on the website’s server and minimizes the risk of being blocked for excessive access.

To effectively demonstrate how caching can be integrated into web crawling tasks, we'll explore examples using two popular Python libraries: Requests for synchronous operations and HTTPX for asynchronous operations. These examples will highlight the implementation of caching mechanisms that can optimize your crawling processes by reducing the need to re-fetch data, thus saving time and decreasing server load.

Separating the crawling and scraping steps

By separating the crawling and scraping steps with Requests and HTTPX, you don't have to recrawl when you change the scraper and rerun it on historical data. So, let's look at how you can move the crawling step into a separate function and add a caching mechanism using the requests-cache library.

We'll use Requests for synchronous crawling and HTTPX for asynchronous crawling.

Requests without caching

pip install requests==2.31.0
import requests


def get_html_content(url: str, timeout: int = 10) -> str:
    response = requests.get(url, timeout=timeout)
    return str(response.text)


def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')
    print(html_content[:1000])


if __name__ == '__main__':
    main()

HTTPX without caching

pip install httpx==0.27.0
import asyncio
import httpx


async def get_html_content(url: str, timeout: int = 10) -> str:
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.get(url)
    return str(response.text)


async def main() -> None:
    html_content = await get_html_content('https://news.ycombinator.com')
    print(html_content[:1000])


if __name__ == '__main__':
    asyncio.run(main())

Requests with caching using requests-cache

pip install requests-cache==1.2.0
import requests
import requests_cache

requests_cache.install_cache('my_requests_cache', expire_after=3600)  # Cache expires after 1 hour


def get_html_content(url: str, timeout: int = 10) -> str:
    response = requests.get(url, timeout=timeout)
    return str(response.text)


def main() -> None:
    html_content = get_html_content('https://news.ycombinator.com')
    print(html_content[:1000])


if __name__ == '__main__':
    main()

Now when you rerun the script, it will use the cached value instead of fetching the website again (as long as the page was downloaded within the cache's expiry window), even if you've made changes to the script.

The above example used the third-party requests-cache library, which stores responses in a SQLite database by default. Now, let's do the same thing with HTTPX using the third-party diskcache library.

HTTPX with caching using diskcache

pip install diskcache==5.6.3
import asyncio

import httpx
from diskcache import Cache

cache = Cache('my_httpx_cache')  # Create a persistent cache on disk


async def get_html_content(url: str, timeout: int = 10) -> str:
    # Check if the URL is already in the cache
    if url in cache:
        print(f'Using cached content for {url}')
        return str(cache[url])

    print(f'Making a new request for {url}')

    # If not in the cache, make a new request and store in the cache
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.get(url)
        html = str(response.text)
        cache[url] = html
        return html


async def main() -> None:
    html_content = await get_html_content('https://news.ycombinator.com')
    print(html_content[:1000])


if __name__ == '__main__':
    asyncio.run(main())

Alternatively, you could use the Hishel library.
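
As a rough sketch (based on Hishel's documented CacheClient; double-check the library's current API before relying on it), HTTP response caching can be as simple as swapping the client class:

import hishel

# CacheClient behaves like an httpx.Client but stores responses in a local cache
client = hishel.CacheClient()

response = client.get('https://news.ycombinator.com')
print(response.text[:1000])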

Finishing up

If you want to speed up your code, caching data can be a powerful tool. Web scraping is one of many use cases where caching can be very helpful. In large-scale projects, caching frequently used data can significantly speed up data extraction and improve performance.

Python offers several effective caching mechanisms:

  • You can create simple caches using Python dictionaries, allowing for O(1) time access to cached values.
  • The functools module provides the @lru_cache decorator, which implements the Least Recently Used (LRU) caching algorithm for functions.
  • For more advanced features and flexibility, consider using external caching services like Memcached or Redis.


Satyam Tripathi
I am a freelance technical writer based in India. I write quality user guides and blog posts for software development startups. I have worked with more than 10 startups across the globe.
