Django Cacheback

Cacheback is an extensible caching library that refreshes stale cache items asynchronously using a Celery or rq task (via django-rq). The key idea is that it’s better to serve a stale item (and repopulate the cache asynchronously) than to block the response in order to populate the cache synchronously.

Using this library, you can rework your views so that all reads are from cache - which can be a significant performance boost.

A corollary of this technique is that cache stampedes can be easily avoided, preventing sudden surges of expensive reads when cached items become stale.

Cacheback provides a decorator for simple usage, a subclassable base class for more fine-grained control and helper classes for working with querysets.

Example

Consider a view for showing a user’s tweets:

from django.shortcuts import render
from myproject.twitter import fetch_tweets

def show_tweets(request, username):
    return render(request, 'tweets.html',
                  {'tweets': fetch_tweets(username)})

This works fine, but the fetch_tweets function involves an HTTP round-trip and is slow.

Performance can be improved by using Django’s low-level cache API:

from django.shortcuts import render
from django.core.cache import cache
from myproject.twitter import fetch_tweets

def show_tweets(request, username):
    return render(request, 'tweets.html',
                  {'tweets': fetch_cached_tweets(username)})

def fetch_cached_tweets(username):
    tweets = cache.get(username)
    if tweets is None:
        tweets = fetch_tweets(username)
        cache.set(username, tweets, 60*15)
    return tweets

Now tweets are cached for 15 minutes after they are first fetched, using the twitter username as a key. This is obviously a performance improvement but the shortcomings of this approach are:

  • For a cache miss, the tweets are fetched synchronously, blocking code execution and leading to a slow response time.
  • This in turn exposes the view to a ‘cache stampede’, where multiple expensive reads run simultaneously when the cached item expires. Under heavy load, this can bring your site down and make you sad.

Now, consider an alternative implementation that uses a Celery task to repopulate the cache asynchronously instead of during the request/response cycle:

import datetime
from django.shortcuts import render
from django.core.cache import cache
from myproject.tasks import update_tweets

def show_tweets(request, username):
    return render(request, 'tweets.html',
                  {'tweets': fetch_cached_tweets(username)})

def fetch_cached_tweets(username):
    item = cache.get(username)
    if item is None:
        # Scenario 1: Cache miss - return empty result set and trigger a refresh
        update_tweets.delay(username, 60*15)
        return []
    tweets, expiry = item
    if expiry < datetime.datetime.now():
        # Scenario 2: Cached item is stale - return it but trigger a refresh
        update_tweets.delay(username, 60*15)
    return tweets

where the myproject.tasks.update_tweets task is implemented as:

import datetime
from celery import task
from django.core.cache import cache
from myproject.twitter import fetch_tweets

@task()
def update_tweets(username, ttl):
    tweets = fetch_tweets(username)
    now = datetime.datetime.now()
    cache.set(username, (tweets, now + datetime.timedelta(seconds=ttl)), 2592000)

Some things to note:

  • Items are stored in the cache as (data, expiry_timestamp) tuples, using memcached’s maximum expiry setting (2592000 seconds, i.e. 30 days). By using this value, we are effectively bypassing memcached’s replacement policy in favour of our own.
  • As the comments indicate, there are two scenarios to consider:
    1. Cache miss. In this case, we don’t have any data (stale or otherwise) to return. In the example above, we trigger an asynchronous refresh and return an empty result set. In other scenarios, it may make sense to perform a synchronous refresh.
    2. Cache hit but with stale data. Here we return the stale data but trigger a Celery task to refresh the cached item.
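
The two scenarios can be sketched in plain Python, with a dict standing in for the cache backend and a list standing in for the task broker (both are hypothetical stand-ins for this illustration, not the real backends):

```python
import datetime

CACHE = {}       # stand-in for Django's cache backend
TASK_QUEUE = []  # stand-in for the Celery broker

def fetch_cached_tweets(username, lifetime=60 * 15):
    item = CACHE.get(username)
    if item is None:
        # Scenario 1: cache miss - trigger an async refresh, return empty
        TASK_QUEUE.append((username, lifetime))
        return []
    tweets, expiry = item
    if expiry < datetime.datetime.now():
        # Scenario 2: stale hit - return stale data, trigger an async refresh
        TASK_QUEUE.append((username, lifetime))
    return tweets

# Scenario 1: nothing cached yet -> empty result, one refresh queued
assert fetch_cached_tweets('alice') == []
assert len(TASK_QUEUE) == 1

# Scenario 2: a stale item is served while a second refresh is queued
stale_expiry = datetime.datetime.now() - datetime.timedelta(minutes=1)
CACHE['alice'] = (['old tweet'], stale_expiry)
assert fetch_cached_tweets('alice') == ['old tweet']
assert len(TASK_QUEUE) == 2
```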

This pattern of re-populating the cache asynchronously works well. Indeed, it is the basis for the cacheback library.

Here’s the same functionality implemented using a django-cacheback decorator:

from django.shortcuts import render
from django.core.cache import cache
from myproject.twitter import fetch_tweets
from cacheback.decorators import cacheback

def show_tweets(request, username):
    return render(request, 'tweets.html',
                  {'tweets': cacheback(60*15, fetch_on_miss=False)(fetch_tweets)(username)})

Here the decorator simply wraps the fetch_tweets function - nothing else is needed. Cacheback ships with a flexible Celery task that can run any function asynchronously.

To be clear, the behaviour of this implementation is as follows:

  • The first request for a particular user’s tweets will be a cache miss. The default behaviour of Cacheback is to fetch the data synchronously in this situation, but by passing fetch_on_miss=False, we indicate that it’s ok to return None in this situation and to trigger an asynchronous refresh.
  • A Celery worker will pick up the job to refresh the cache for this user’s tweets. It will import the fetch_tweets function and execute it with the correct username. The resulting data will be added to the cache with a lifetime of 15 minutes.
  • Any requests for this user’s tweets while Celery is refreshing the cache will also return None. However, Cacheback is aware of cache stampedes and does not trigger any additional jobs to refresh the cached item.
  • Once the cached item is refreshed, any subsequent requests within the next 15 minutes will be served from cache.
  • The first request after 15 minutes has elapsed will serve the (now-stale) cache result but will trigger a Celery task to fetch the user’s tweets and repopulate the cache.

Much of this behaviour can be configured by using a subclass of cacheback.Job. The decorator is only intended for simple use-cases. See the Sample usage and API documentation for more information.

All of the worker-related things above can also be done using rq instead of Celery.

Installation

You need to do three things:

1. Install django-cacheback

To install with Celery support, run:

$ pip install django-cacheback[celery]

If you want to install with RQ support, just use:

$ pip install django-cacheback[rq]

After installing the package and dependencies, add cacheback to your INSTALLED_APPS. If you want to use RQ as your task queue, you need to set CACHEBACK_TASK_QUEUE in your settings to rq.
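
For example, the relevant settings additions might look like the following sketch (the CACHEBACK_TASK_QUEUE line is only needed for RQ, as described above):

```python
# settings.py (sketch)
INSTALLED_APPS = [
    # ...
    'cacheback',
]

CACHEBACK_TASK_QUEUE = 'rq'  # omit this line when using Celery
```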

2. Install a message broker

Celery requires a message broker. Use Celery’s tutorial to help set one up. I recommend RabbitMQ.

For RQ you need to set up a redis-server and configure django-rq. Please look up the django-rq installation guide for more details.
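
As a sketch, a minimal django-rq configuration pointing at a local Redis server might look like this (see the django-rq documentation for the authoritative settings):

```python
# settings.py (sketch): a single default queue on a local Redis instance
RQ_QUEUES = {
    'default': {
        'HOST': 'localhost',
        'PORT': 6379,
        'DB': 0,
    },
}
```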

3. Set up a cache

You also need to ensure you have a cache set up. Most likely, you’ll be using memcache so your settings will include something like:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

Logging

You may also want to configure logging handlers for the ‘cacheback’ named logger. To set up console logging, use something like:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'filters': {
        'require_debug_false': {
            '()': 'django.utils.log.RequireDebugFalse'
        }
    },
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
        }
    },
    'loggers': {
        'cacheback': {
            'handlers': ['console'],
            'level': 'DEBUG',
            'propagate': False,
        },
    }
}

Sample usage

As a decorator

Simply wrap the function whose results you want to cache:

import requests
from cacheback.decorators import cacheback

@cacheback()
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()

The default behaviour of the cacheback decorator is to:

  • Cache items for 10 minutes.
  • When the cache is empty for a given key, the data will be fetched synchronously.

You can parameterise the decorator to cache items for longer and also to not block on a cache miss:

import requests
from cacheback.decorators import cacheback

@cacheback(lifetime=1200, fetch_on_miss=False)
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()

Now:

  • Items will be cached for 20 minutes;
  • For a cache miss, None will be returned and the cache refreshed asynchronously.
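
Since callers now see None on a miss, view code should coalesce it to something iterable. A minimal sketch (tweets_or_empty is a hypothetical helper written for this example, not part of cacheback):

```python
def tweets_or_empty(fetch, username):
    # With fetch_on_miss=False, a cache miss returns None; templates that
    # iterate over the result are happier with an empty list.
    return fetch(username) or []

# With stand-in fetch functions simulating a miss and a hit:
assert tweets_or_empty(lambda username: None, 'alice') == []
assert tweets_or_empty(lambda username: ['hello'], 'alice') == ['hello']
```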

As an instance of cacheback.Job

Subclassing cacheback.Job gives you complete control over the caching behaviour. The only method that must be overridden is fetch, which is responsible for fetching the data to be cached:

import requests
from cacheback.base import Job

class UserTweets(Job):

    def fetch(self, username):
        url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
        return requests.get(url % username).json()

Client code only needs to be aware of the get method which returns the cached data. For example:

from django.shortcuts import render

def tweets(request, username):
    return render(request,
                  'tweets.html',
                  {'tweets': UserTweets().get(username)})

You can control the lifetime and behaviour on cache miss using either class attributes:

import requests
from cacheback.base import Job

class UserTweets(Job):
    lifetime = 60*20
    fetch_on_miss = False

    def fetch(self, username):
        url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
        return requests.get(url % username).json()

or by overriding methods:

import time
import requests
from cacheback.base import Job

class UserTweets(Job):

    def fetch(self, username):
        url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
        return requests.get(url % username).json()

    def expiry(self, username):
        now = time.time()
        if username.startswith('a'):
            return now + 60*20
        return now + 60*10

    def should_missing_item_be_fetched_synchronously(self, username):
        return username.startswith('a')

In the above toy example, the cache behaviour will be different for usernames starting with ‘a’.

Invalidation

If you want to programmatically invalidate a cached item, use the invalidate method on a job instance:

job = UserTweets()
job.invalidate(username)

This will trigger a new asynchronous refresh of the item.

You can also simply remove an item from the cache so that the next request will trigger the refresh:

job.delete(username)

Setting cache values

If you want to update the cache programmatically, use the set method on a job instance (this can be useful, for example, when your program discovers updates through a separate mechanism, or for caching partial or derived data):

tweets_job = UserTweets()

user_tweets = tweets_job.get(username)

new_tweet = PostTweet(username, 'Trying out Cacheback!')

# Naive example, assuming no other process would have updated the tweets
tweets_job.set(username, user_tweets + [new_tweet])

The data to be cached can be specified in a few ways. Firstly it can be the last positional argument, as above. If that is unclear, you can also use the keyword data:

tweets_job.set(username, data=(current_tweets + [new_tweet]))

And if your fetch method already uses a keyword argument called data, you can specify the name of a different parameter as a class variable called set_data_kwarg:

class CustomKwUserTweets(UserTweets):
    set_data_kwarg = 'my_cache_data'

custom_tweets_job = CustomKwUserTweets()

custom_tweets_job.set(username, my_cache_data=(user_tweets + [new_tweet]))

This also works with a decorated function:

@cacheback()
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()

user_tweets = fetch_tweets(username)

new_tweet = PostTweet(username, 'Trying out Cacheback!')

fetch_tweets.job.set(fetch_tweets, username, (user_tweets + [new_tweet]))

or:

fetch_tweets.job.set(fetch_tweets, username, data=(current_tweets + [new_tweet]))

And you can specify the set_data_kwarg in the decorator params as you’d expect:

@cacheback(set_data_kwarg='my_cache_data')
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()

fetch_tweets.job.set(fetch_tweets, username, my_cache_data=(user_tweets + [new_tweet]))

NOTE: If your fetch method, or cacheback-decorated function takes a named parameter of data and you wish to use the set method, you must provide a new value for the set_data_kwarg parameter, and not pass in the data to cache as the last positional argument. Otherwise the value of the data parameter will be used as the data to cache.

Post-processing

The cacheback.Job instance provides a process_result method that can be overridden to modify the result value being returned. You can use this to append information about whether the result is being returned from cache or not.
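
As a sketch of the hook’s shape, here is a pure-Python stand-in (StubJob and AnnotatedTweets are illustrative classes written for this example, not the real cacheback classes) that annotates the result with its cache status:

```python
class StubJob:
    # Cache-status constants, mirroring the integer statuses described in
    # the API section of this document.
    MISS, HIT, STALE = range(3)

    def get(self, *args):
        # The real Job.get consults the cache; this stub always misses and
        # fetches synchronously, then hands off to process_result.
        result = self.fetch(*args)
        return self.process_result(result, call=None,
                                   cache_status=self.MISS, sync_fetch=True)

    def process_result(self, result, call, cache_status, sync_fetch=None):
        return result


class AnnotatedTweets(StubJob):
    def fetch(self, username):
        return ['tweet about %s' % username]

    def process_result(self, result, call, cache_status, sync_fetch=None):
        # Attach whether the data was served from cache.
        return {'data': result, 'from_cache': cache_status == self.HIT}


result = AnnotatedTweets().get('django')
assert result == {'data': ['tweet about django'], 'from_cache': False}
```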

API

The main class is cacheback.base.Job. The methods that are intended to be called from client code are:

class cacheback.base.Job

A cached read job.

This is the core class for the package which is intended to be subclassed to allow the caching behaviour to be customised.

delete(*raw_args, **raw_kwargs)

Remove an item from the cache

get(*raw_args, **raw_kwargs)

Return the data for this function (using the cache if possible).

This method is not intended to be overridden.

invalidate(*raw_args, **raw_kwargs)

Mark a cached item invalid and trigger an asynchronous job to refresh the cache

It has some class properties that can be used to configure simple behaviour, among them lifetime, fetch_on_miss, cache_ttl, refresh_timeout and set_data_kwarg, all of which are described elsewhere in this document.

There are also several methods intended to be overridden and customised:

empty()

Return the appropriate value for a cache MISS (and when we defer the repopulation of the cache)

expiry(*args, **kwargs)

Return the expiry timestamp for this item.

fetch(*args, **kwargs)

Return the data for this job - this is where the expensive work should be done.

key(*args, **kwargs)

Return the cache key to use.

If you’re passing anything but primitive types to the get method, it’s likely that you’ll need to override this method.


process_result(result, call, cache_status, sync_fetch=None)

Transform the fetched data right before returning from .get(...)

Parameters:
  • result – The result to be returned
  • call – A named tuple with properties ‘args’ and ‘kwargs’ that holds the call args and kwargs
  • cache_status – A status integer, accessible as the class constants self.MISS, self.HIT, self.STALE
  • sync_fetch – A boolean indicating whether a synchronous fetch was performed. A value of None indicates that no fetch was required (i.e. the result was a cache hit).

should_missing_item_be_fetched_synchronously(*args, **kwargs)

Return whether to refresh an item synchronously when it is missing from the cache

should_stale_item_be_fetched_synchronously(delta, *args, **kwargs)

Return whether to refresh an item synchronously when it is found in the cache but stale

timeout(*args, **kwargs)

Return the refresh timeout for this item

Queryset jobs

There are two classes for easy caching of ORM reads. These don’t need subclassing but rather take the model class as an __init__ parameter.

class cacheback.jobs.QuerySetFilterJob(model, lifetime=None, fetch_on_miss=None, cache_alias=None, task_options=None)

For ORM reads that use the filter method.

class cacheback.jobs.QuerySetGetJob(model, lifetime=None, fetch_on_miss=None, cache_alias=None, task_options=None)

For ORM reads that use the get method.

Example usage:

from django.contrib.auth import models
from django.shortcuts import render
from cacheback.jobs import QuerySetGetJob, QuerySetFilterJob

def user_detail(request, username):
    user = QuerySetGetJob(models.User).get(username=username)
    return render(request, 'user.html',
                  {'user': user})

def staff(request):
    staff = QuerySetFilterJob(models.User).get(is_staff=True)
    return render(request, 'staff.html',
                  {'users': staff})

These classes are helpful for simple ORM reads but won’t be suitable for more complicated queries where filter is chained together with exclude.

Settings

CACHEBACK_CACHE_ALIAS

This specifies which cache to use from your CACHES setting. It defaults to default.

CACHEBACK_VERIFY_CACHE_WRITE

This verifies the data is correctly written to memcache. If not, then a RuntimeError is raised. Defaults to True.

CACHEBACK_TASK_QUEUE

This defines the task queue to use. Valid options are rq and celery. Make sure that the corresponding task queue is configured too.

Advanced usage

Three thresholds for cache invalidation

It’s possible to employ three threshold times to control cache behaviour:

  1. A time after which the cached item is considered ‘stale’. When a stale item is returned, an async job is triggered to refresh the item but the stale item is returned. This is controlled by the lifetime attribute of the Job class - the default value is 600 seconds (10 minutes).
  2. A time after which the cached item is removed (a cache miss). If you have fetch_on_miss=True, then this will trigger a synchronous data fetch. This is controlled by the cache_ttl attribute of the Job class - the default value is 2592000 seconds, which is the maximum ttl that memcached supports.
  3. A timeout value for the refresh job. If the cached item is not refreshed after this time, then another async refresh job will be triggered. This is controlled by the refresh_timeout attribute of the Job class and defaults to 60 seconds.
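
The interaction of the first two thresholds can be sketched as a pure-Python classification of an item’s age (cache_state is an illustrative helper written for this example, not part of the library):

```python
def cache_state(age_seconds, lifetime=600, cache_ttl=2592000):
    # age_seconds: time since the item was written to the cache
    if age_seconds >= cache_ttl:
        return 'MISS'   # evicted entirely; fetch_on_miss decides sync vs async
    if age_seconds >= lifetime:
        return 'STALE'  # served stale while an async refresh is triggered
    return 'HIT'        # served straight from cache

assert cache_state(30) == 'HIT'
assert cache_state(15 * 60) == 'STALE'
assert cache_state(2592000) == 'MISS'
```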

Contributing

Start by cloning the repo, creating a virtualenv and running:

$ make install

to install the testing dependencies.

Running tests

Use:

$ py.test

or generate coverage report:

$ py.test --cov

or use Tox with:

$ tox

to test all Python/Django combinations.

Sandbox VM

There is a Vagrantfile for setting up a sandbox VM where you can play around with the functionality. Bring up the Vagrant box:

$ vagrant up

This may take a while but will set up an Ubuntu Precise64 VM with RabbitMQ installed. You can then SSH into the machine:

$ vagrant ssh
$ cd /vagrant/sandbox

You can now decide to run the Celery implementation:

$ honcho -f Procfile.celery start

Or you can run the RQ implementation:

$ honcho -f Procfile.rq start

The above commands will start a Django runserver and the selected task worker. The dummy site will be available at http://localhost:8080 on your host machine. There are some sample views in sandbox/dummyapp/views.py that exercise django-cacheback.
