---
blogpost: true
date: Aug 3, 2015
title: Caching
tags: scipy, Python, Programming, dask
description: Humans repeat stuff. Caching helps.
---

_This work is supported by [Continuum Analytics](http://continuum.io)
and the [XDATA Program](http://www.darpa.mil/program/XDATA)
as part of the [Blaze Project](http://blaze.pydata.org)_

**tl;dr: Caching improves performance under repetitive workloads. Traditional
LRU policies don't fit data science well. We propose a new caching policy.**

## Humans Repeat Stuff

Consider the dataset that you've worked on most recently. How many times have
you loaded it from disk into memory? How many times have you repeated almost
the same computations on that data?

Exploratory data science workloads involve repetition of very similar
computations. These computations share structure. By caching frequently used
results (or parts of results) we may be able to considerably speed up
exploratory data analysis.

## Caching in other contexts

The web community loves caching. Database backed web applications almost
always guard their database lookups with a system like `memcached` which
devotes some amount of memory to caching frequent and recent queries.
Because humans visit mostly the same pages over and over again this can reduce
database load by an order of magnitude. Even if humans visit different pages
with different inputs these pages often share many elements.

## Limited caching resources

Given infinite memory we would cache every result that we've ever
computed. This would give us for instant recall on anything that wasn't novel.
Sadly our memory resources are finite and so we evict cached results
that don't seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy
evicts results that have not been requested for a long time. This is cheap and
works well for web and systems applications.

## LRU doesn't fit analytic workloads

Unfortunately LRU doesn't fit analytic workloads well. Analytic workloads have
a large spread computation times and of storage costs. While most web
application database queries take roughly the same amount of time
(100ms-1000ms) and take up roughly the same amount of space to store (1-10kb),
the computation and storage costs of analytic computations can easily vary by
many orders of magnitude (spreads in the millions or billions are common.)

Consider the following two common computations of a large NumPy array:

1. `x.std() # costly to recompute, cheap to store`
2. `x.T # cheap to recompute, costly to store`

In the first case, `x.std()`, this might take a second on a large array
(somewhat expensive) but takes only a few bytes to store. This result is so
cheap to store that we're happy to keep it in our cache for a long time, even
if its infrequently requested.

In the second case, `x.T` this is cheap to compute (just a metadata change in
the array) and executes in microseconds. However the result might take
gigabytes of memory to store. We don't want to keep this in our cache, even if
it's very frequently requested; it takes up all of the space for other
potentially more useful (and smaller) results and we can recompute it trivially
anyway.

So we want to keep cached results that have the following properties:

1. Costly to recompute (in seconds)
2. Cheap to store (in bytes)
3. Frequently used
4. Recently used

## Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score
associated to the entry with the following value

$$ \textrm{score} = \textrm{score} + \frac{\textrm{compute time}}{\textrm{nbytes}} (1 + \epsilon)^{t} $$

Where `compute time` is the time it took to compute the result in the first
place, `nbytes` is the number of bytes that it takes to store the result,
`epsilon` is a small number that determines the halflife of what "recently"
means, and `t` is an auto-incrementing tick time increased at every access.

This has units of inverse bandwidth (s/byte), gives more importance to new
results with a slowly growing exponential growth, and amplifies the score of
frequently requested results in a roughly linear fashion.

We maintain these scores in a heap, keep track of the total number of bytes,
and cull the cache as necessary to keep storage costs beneath a fixed budget.
Updates cost `O(log(k))` for `k` the number of elements in the cache.

## Cachey

I wrote this up into a tiny library called `cachey`. This is experimental
code and subject to wild API changes (including renaming.)

The central object is a `Cache` that includes asks for the following:

1. Number of available bytes to devote to the cache
2. Halflife on importance (the number of access that occur to reduce the
   importance of a cached result by half) (default 1000)
3. A lower limit on costs to consider entry to the cache (default 0)

## Example

So here is the tiny example

```python
>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9)  # 1 GB

>>> c.put('Hello', 'world!', cost=3)

>>> c.get('Hello')
'world!'
```

## More interesting example

The cache includes a `memoize` decorator. Lets memoize `pd.read_csv`.

```python
In [1]: import pandas as pd
In [2]: from cachey import Cache

In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv')  # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096
```

So we create a new function `read_csv` that operates exactly like
`pandas.read_csv` except that it holds on to recent results in `c`, a cache.
This particular CSV file created a dataframe that filled a tenth of our
cache space. The more often we request this CSV file the more its score will
grow and the more likely it is to remain in the cache into the future. If
other memoized functions using this same cache produce more valuable results
(costly to compute, cheap to store) and we run out of space then this result
will be evicted and we'll have to recompute our result if we ask for
`read_csv('accounts.csv')` again.

## Memoize everything

Just memoizing `read_csv` isn't very interesting. The `pd.read_csv`
function operates at a constant data bandwidth of around 100 MB/s. The caching
policies around `cachey` really shine when they get to see _all of our
computations_. For example it could be that we don't want to hold on to the
results of `read_csv` because these take up a lot of space. If we find
ourselves doing the same `groupby` computations then we might prefer to use our
gigabyte of caching space to store these both because

1. groupby computations take a long time to compute
2. groupby results are often very compact in memory

## Cachey and Dask

I'm slowly working on integrating cachey into dask's shared memory scheduler.

Dask is in a good position to apply `cachey` to many computations. It can look
at every task in a task graph and consider the result for inclusion into the
cache. We don't need to explicitly `memoize` every function we want to use,
`dask` can do this for us on the fly.

Additionally dask has a nice view of our computation as a collection of
sub-tasks. Similar computations (like mean and variance) often share
sub-tasks.

## Future Work

Cachey is new and untested but potentially useful now, particularly through the
`memoize` method shown above.
It's a [small and simple codebase](http://github.com/mrocklin/cachey).
I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

1. How do we build a hierarchy of caches to share between memory and disk?
2. How do we cleanly integrate cachey into dask
   (see [PR #502](https://github.com/ContinuumIO/dask/pull/502))
   and how do we make dask amenable to caching
   (see [PR #510](https://github.com/ContinuumIO/dask/pull/510))?