Dask Working Notes - Posted in 2015

Dask is one year old

2015-12-21T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Dask turned one yesterday. We discuss success and failures.

Dask began one year ago yesterday with the following commit (with slight edits here for clarity’s sake).

def istask(x):
    return isinstance(x, tuple) and x and callable(x[0])


def get(d, key):
    v = d[key]
    if istask(v):
        func, args = v[0], v[1:]
        return func(*[get(d, arg) for arg in args])
    else:
        return v

 ... (and around 50 lines of tests)

this is a very inefficient scheduler

Since then dask has matured, expanded to new domains, gathered excellent developers, and spawned other open source projects. I thought it’d be a good time to look back on what worked, what didn’t, and what we should work on in the future.

Most users experience dask through the high-level collections of dask.array/bag/dataframe/imperative. Each of these evolve as projects of their own with different user groups and different levels of maturity.

dask.array

The parallel larger-than-memory array module dask.array has seen the most success of the dask components. It is the oldest, most mature, and most sophisticated subproject. Much of dask.array’s use comes from downstream projects, notably xray which seems to have taken off in climate science. Dask.array also sees a fair amount of use in imaging, genomics, and numerical algorithms research.

People that I don’t know now use dask.array to do scientific research. From my perspective that’s mission accomplished.

There are still tweaks to make to algorithms, particularly as we scale out to distributed systems (see far below).

dask.bag

Dask.bag started out as a weekend project and didn’t evolve much beyond that. Fortunately there wasn’t much to do and this submodule probably has the highest value/effort ratio .

Bag doesn’t get as much attention as its older sibling array though. It’s handy but not as well used and so not as robust.

dask.dataframe

Dataframe is an interesting case, it’s both pretty sophisticated, pretty mature, and yet also probably generates the most user frustration.

Dask.dataframe gains a lot of value by leveraging Pandas both under the hood (one dask DataFrame is many pandas DataFrames) and by copying its API (Pandas users can use dask.dataframe without learning a new API.) However, because dask.dataframe only implements a core subset of Pandas, users end up tripping up on the missing functionality.

This can be decomposed into to issues:

It’s not clear that there exists a core subset of Pandas that would handle most use cases. Users touch many diffuse parts of Pandas in a single workflow. What one user considers core another user considers fringe. It’s not clear how to agree on a sufficient subset to implement.
Once you implement this subset (and we’ve done our best) it’s hard to convey expectations to the user about what is and is not available.

That being said, dask.dataframe is pretty solid. It’s very fast, expressive, and handles common use cases well. It probably generates the most StackOverflow questions. This signals both confusion and active use.

Special thanks here go out to Jeff Reback, for making Pandas release the GIL and to Masaaki Horikoshi (@sinhrks) for greatly improving the maturity of dask.dataframe.

dask.imperative

Also known as dask.do this little backend remains one of the most powerful and one of the least used (outside of myself.) We should rethink the API here and improve learning materials.

General thoughts on collections

Warning: this section is pretty subjective

Big data collections are cool but perhaps less useful than people expect. Parallel applications are often more complex than can be easily described by a big array or a big dataframe. Many real-world parallel computations end up being more particular in their parallelism needs. That’s not to say that the array and dataframe abstractions aren’t central to parallel computing, it’s just that we should not restrict ourselves to them. The world is more complex.

However, it’s reasonable to break this “world is complex” rule within particular domains. NDArrays seem to work well in climate science. Specialized large dataframes like Dato’s SFrame seem to be effective for a particular class of machine learning algorithms. The SQL table is inarguably an effective abstraction in business intelligence. Large collections are useful in specific contexts, but they are perhaps the focus of too much attention. The big dataframe in particular is over-hyped.

Most of the really novel and impressive work I’ve seen with dask has been done either with custom graphs or with the dask.imperative API. I think we should consider APIs that enable users to more easily express custom algorithms.

Avoid Parallelism

When giving talks on parallelism I’ve started to give a brief “avoid parallelism” section. From the problems I see on stack overflow and from general interactions when people run into performance challenges their first solution seems to be to parallelize. This is sub-optimal. It’s often far cheaper to improve storage formats, use better algorithms, or use C/Numba accelerated code than it is to parallelize. Unfortunately storage formats and C aren’t as sexy as big data parallelism, so they’re not in the forefront of people’s minds. We should change this.

I’ll proudly buy a beer for anyone that helps to make storage formats a sexier topic.

Scheduling

Single Machine

The single machine dynamic task scheduler is very very solid. It has roughly two objectives:

Use all the cores of a machine
Choose tasks that allow the release of intermediate results

This is what allows us to quickly execute complex workflows in small space. This scheduler underlies all execution within dask. I’m very happy with it. I would like to find ways to expose it more broadly to other libraries. Suggestions are very welcome here.

We still run into cases where it doesn’t perform optimally (see issue 874), but so far we’ve always been able to enhance the scheduler whenever these cases arise.

Distributed Cluster

Over the last few months we’ve been working on another scheduler for distributed memory computation. It should be a nice extension to the existing dask collections out to “big data” systems. It’s experimental but usable now with documentation at the follow links:

Feedback is welcome. I recommend waiting for a month or two if you prefer clean and reliable software. It will undergo a name-change to something less generic.

Distributed Prototype

2015-10-09T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We demonstrate a prototype distributed computing library and discuss data locality.

Here’s a new prototype library for distributed computing. It could use some critical feedback.

This blogpost uses distributed on a toy example. I won’t talk about the design here, but the docs should be a quick and informative read. I recommend the quickstart in particular.

We’re going to do a simple computation a few different ways on a cluster of four nodes. The computation will be

Make a 1000 random numpy arrays, each of size 1 000 000
Compute the sum of each array
Compute the total sum of the sums

We’ll do this directly with a distributed Pool and again with a dask graph.

Start up a Cluster

I have a cluster of four m3.xlarges on EC2

ssh node1
dcenter

ssh node2
dworkder node1:8787

ssh node3
dworkder node1:8787

ssh node4
dworkder node1:8787

Notes on how I set up my cluster.

Pool

On the client side we spin up a distributed Pool and point it to the center node.

>>> from distributed import Pool
>>> pool = Pool('node1:8787')

Then we create a bunch of random numpy arrays:

>>> import numpy as np
>>> arrays = pool.map(np.random.random, [1000000] * 1000)

Our result is a list of proxy objects that point back to individual numpy arrays on the worker computers. We don’t move data until we need to. (Though we could call .get() on this to collect the numpy array from the worker.)

>>> arrays[0]
RemoteData<center=10.141.199.202:8787, key=3e446310-6...>

Further computations on this data happen on the cluster, on the worker nodes that hold the data already.

>>> sums = pool.map(np.sum, arrays)

This avoids costly data transfer times. Data transfer will happen when necessary though, as when we compute the final sum. This forces communication because all of the intermediate sums must move to one node for the final addition.

>>> total = pool.apply(np.sum, args=(sums,))
>>> total.get()  # finally transfer result to local machine
499853416.82058007

distributed.dask

Now we do the same computation all at once by manually constructing a dask graph (beware, this can get gnarly, friendlier approaches exist below.)

>>> dsk = dict()
>>> for i in range(1000):
...     dsk[('x', i)] = (np.random.random, 1000000)
...     dsk[('sum', i)] = (np.sum, ('x', i))

>>> dsk['total'] = (sum, [('sum', i) for i in range(1000)])

>>> from distributed.dask import get
>>> get('node1', 8787, dsk, 'total')
500004095.00759566

Apparently not everyone finds dask dictionaries to be pleasant to write by hand. You could also use this with dask.imperative or dask.array.

dask.imperative

def get2(dsk, keys):
    """ Make `get` scheduler that hardcodes the IP and Port """
    return get('node1', 8787, dsk, keys)

>>> from dask.imperative import do
>>> arrays = [do(np.random.random)(1000000) for i in range(1000)]
>>> sums = [do(np.sum)(x) for x in arrays]
>>> total = do(np.sum)(sums)

>>> total.compute(get=get2)
499993637.00844824

dask.array

>>> import dask.array as da
>>> x = da.random.random(1000000*1000, chunks=(1000000,))
>>> x.sum().compute(get=get2)
500000250.44921482

The dask approach was smart enough to delete all of the intermediates that it didn’t need. It could have run intelligently on far more data than even our cluster could hold. With the pool we manage data ourselves manually.

>>> from distributed import delete
>>> delete(('node0', 8787), arrays)

Mix and Match

We can also mix these abstractions and put the results from the pool into dask graphs.

>>> arrays = pool.map(np.random.random, [1000000] * 1000)
>>> dsk = {('sum', i): (np.sum, x) for i, x in enumerate(arrays)}
>>> dsk['total'] = (sum, [('sum', i) for i in range(1000)])

Discussion

The Pool and get user interfaces are independent from each other but both use the same underlying network and both build off of the same codebase. With distributed I wanted to build a system that would allow me to experiment easily. I’m mostly happy with the result so far.

One non-trivial theme here is data-locality. We keep intermediate results on the cluster and schedule jobs on computers that already have the relevant data if possible. The workers can communicate with each other if necessary so that any worker can do any job, but we try to arrange jobs so that workers don’t have to communicate if not necessary.

Another non-trivial aspect is that the high level dask.array example works without any tweaking of dask. Dask’s separation of schedulers from collections means that existing dask.array code (or dask.dataframe, dask.bag, dask.imperative code) gets to evolve as we experiment with new fancier schedulers.

Finally, I hope that the cluster setup here feels pretty minimal. You do need some way to run a command on a bunch of machines but most people with clusters have some mechanism to do that, even if its just ssh as I did above. My hope is that distributed lowers the bar for non-trivial cluster computing in Python.

Disclaimer

Everything here is very experimental. The library itself is broken and unstable. It was made in the last few weeks and hasn’t been used on anything serious. Please adjust expectations accordingly and provide critical feedback.

Distributed Documentation

Caching

2015-08-03T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data science well. We propose a new caching policy.

Consider the dataset that you’ve worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?

Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.

Caching in other contexts

The web community loves caching. Database backed web applications almost always guard their database lookups with a system like memcached which devotes some amount of memory to caching frequent and recent queries. Because humans visit mostly the same pages over and over again this can reduce database load by an order of magnitude. Even if humans visit different pages with different inputs these pages often share many elements.

Limited caching resources

Given infinite memory we would cache every result that we’ve ever computed. This would give us for instant recall on anything that wasn’t novel. Sadly our memory resources are finite and so we evict cached results that don’t seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.

LRU doesn’t fit analytic workloads

Unfortunately LRU doesn’t fit analytic workloads well. Analytic workloads have a large spread computation times and of storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common.)

Consider the following two common computations of a large NumPy array:

x.std() # costly to recompute, cheap to store
x.T # cheap to recompute, costly to store

In the first case, x.std(), this might take a second on a large array (somewhat expensive) but takes only a few bytes to store. This result is so cheap to store that we’re happy to keep it in our cache for a long time, even if its infrequently requested.

In the second case, x.T this is cheap to compute (just a metadata change in the array) and executes in microseconds. However the result might take gigabytes of memory to store. We don’t want to keep this in our cache, even if it’s very frequently requested; it takes up all of the space for other potentially more useful (and smaller) results and we can recompute it trivially anyway.

So we want to keep cached results that have the following properties:

Costly to recompute (in seconds)
Cheap to store (in bytes)
Frequently used
Recently used

Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score associated to the entry with the following value

\[ \textrm{score} = \textrm{score} + \frac{\textrm{compute time}}{\textrm{nbytes}} (1 + \epsilon)^{t} \]

Where compute time is the time it took to compute the result in the first place, nbytes is the number of bytes that it takes to store the result, epsilon is a small number that determines the halflife of what “recently” means, and t is an auto-incrementing tick time increased at every access.

This has units of inverse bandwidth (s/byte), gives more importance to new results with a slowly growing exponential growth, and amplifies the score of frequently requested results in a roughly linear fashion.

We maintain these scores in a heap, keep track of the total number of bytes, and cull the cache as necessary to keep storage costs beneath a fixed budget. Updates cost O(log(k)) for k the number of elements in the cache.

Cachey

I wrote this up into a tiny library called cachey. This is experimental code and subject to wild API changes (including renaming.)

The central object is a Cache that includes asks for the following:

Number of available bytes to devote to the cache
Halflife on importance (the number of access that occur to reduce the importance of a cached result by half) (default 1000)
A lower limit on costs to consider entry to the cache (default 0)

Example

So here is the tiny example

>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9)  # 1 GB

>>> c.put('Hello', 'world!', cost=3)

>>> c.get('Hello')
'world!'

More interesting example

The cache includes a memoize decorator. Lets memoize pd.read_csv.

In [1]: import pandas as pd
In [2]: from cachey import Cache

In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv')  # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096

So we create a new function read_csv that operates exactly like pandas.read_csv except that it holds on to recent results in c, a cache. This particular CSV file created a dataframe that filled a tenth of our cache space. The more often we request this CSV file the more its score will grow and the more likely it is to remain in the cache into the future. If other memoized functions using this same cache produce more valuable results (costly to compute, cheap to store) and we run out of space then this result will be evicted and we’ll have to recompute our result if we ask for read_csv('accounts.csv') again.

Memoize everything

Just memoizing read_csv isn’t very interesting. The pd.read_csv function operates at a constant data bandwidth of around 100 MB/s. The caching policies around cachey really shine when they get to see all of our computations. For example it could be that we don’t want to hold on to the results of read_csv because these take up a lot of space. If we find ourselves doing the same groupby computations then we might prefer to use our gigabyte of caching space to store these both because

groupby computations take a long time to compute
groupby results are often very compact in memory

Cachey and Dask

I’m slowly working on integrating cachey into dask’s shared memory scheduler.

Dask is in a good position to apply cachey to many computations. It can look at every task in a task graph and consider the result for inclusion into the cache. We don’t need to explicitly memoize every function we want to use, dask can do this for us on the fly.

Additionally dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.

Future Work

Cachey is new and untested but potentially useful now, particularly through the memoize method shown above. It’s a small and simple codebase. I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

How do we build a hierarchy of caches to share between memory and disk?
How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?

Custom Parallel Workflows

2015-07-23T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability custom of dask graphs.

Parallel databases, Spark, and Dask collections all provide large distributed collections that handle parallel computation for you. You put data into the collection, program with a small set of operations like map or groupby, and the collections handle the parallel processing. This idea has become so popular that there are now a dozen projects promising big and friendly Pandas clones.

This is good. These collections provide usable, high-level interfaces for a large class of common problems.

Custom Workloads

However, many workloads are too complex for these collections. Workloads might be complex either because they come from sophisticated algorithms (as we saw in a recent post on SVD) or because they come from the real world, where problems tend to be messy.

In these cases I tend to see people do two things

Fall back to multiprocessing, MPI or some other explicit form of parallelism
Perform mental gymnastics to fit their problem into Spark using a clever choice of keys. These cases often fail to acheive much speedup.

Direct Dask Graphs

Historically I’ve recommended the manual construction of dask graphs in these cases. Manual construction of dask graphs lets you specify fairly arbitrary workloads that then use the dask schedulers to execute in parallel. The dask docs hold the following example of a simple data processing pipeline:

def load(filename):
    ...
def clean(data):
    ...
def analyze(sequence_of_data):
    ...
def store(result):
    ...

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel

Feedback from users is that this is interesting and powerful but that programming directly in dictionaries is not inutitive, doesn’t integrate well with IDEs, and is prone to error.

Introducing dask.do

To create the same custom parallel workloads using normal-ish Python code we use the dask.do function. This do function turns any normal Python function into a delayed version that adds to a dask graph. The do function lets us rewrite the computation above as follows:

from dask import do

loads = [do(load)('myfile.a.data'),
         do(load)('myfile.b.data'),
         do(load)('myfile.c.data')]

cleaned = [do(clean)(x) for x in loads]

analysis = do(analyze)(cleaned)
result = do(store)(analysis)

The explicit function calls here don’t perform work directly; instead they build up a dask graph which we can then execute in parallel with our choice of scheduler.

from dask.multiprocessing import get
result.compute(get=get)

This interface was suggested by Gael Varoquaux based on his experience with joblib. It was implemented by Jim Crist in PR (#408).

Example: Nested Cross Validation

I sat down with a Machine learning student, Gabriel Krummenacher and worked to parallelize a small code to do nested cross validation. Below is a comparison of a sequential implementation that has been parallelized using dask.do:

You can safely skip reading this code in depth. The take-away is that it’s somewhat involved but that the addition of parallelism is light.

The parallel version runs about four times faster on my notebook. Disclaimer: The sequential version presented here is just a downgraded version of the parallel code, hence why they look so similar. This is available on github.

So the result of our normal imperative-style for-loop code is a fully parallelizable dask graph. We visualize that graph below.

test_score.visualize()

Help

Is this a useful interface? It would be great if people could try this out and generate feedback on dask.do.

For more information on dask.do see the dask imperative documentation.

Write Complex Parallel Algorithms

2015-06-26T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk parallel SVD.

Most parallel workloads today are fairly trivial:

>>> import dask.bag as db
>>> b = db.from_s3('githubarchive-data', '2015-01-01-*.json.gz')
          .map(json.loads)
          .map(lambda d: d['type'] == 'PushEvent')
          .count()

Graphs for these computations look like the following:

This is great; these are simple problems to solve efficiently in parallel. Generally these simple computations occur at the beginning of our analyses.

Sophisticated Algorithms can be Complex

Later in our analyses we want more complex algorithms for statistics , machine learning, etc.. Often this stage fits comfortably in memory, so we don’t worry about parallelism and can use statsmodels or scikit-learn on the gigabyte result we’ve gleaned from terabytes of data.

However, if our reduced result is still large then we need to think about sophisticated parallel algorithms. This is fresh space with lots of exciting academic and software work.

Example: Parallel, Stable, Out-of-Core SVD

I’d like to show off work by Mariano Tepper, who is responsible for dask.array.linalg. In particular he has a couple of wonderful algorithms for the Singular Value Decomposition (SVD) (also strongly related to Principal Components Analysis (PCA).) Really I just want to show off this pretty graph.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

This algorithm computes the exact SVD (up to numerical precision) of a large tall-and-skinny matrix in parallel in many small chunks. This allows it to operate out-of-core (from disk) and use multiple cores in parallel. At the bottom we see the construction of our trivial array of ones, followed by many calls to np.linalg.qr on each of the blocks. Then there is a lot of rearranging of various pieces as they are stacked, multiplied, and undergo more rounds of np.linalg.qr and np.linalg.svd. The resulting arrays are available in many chunks at the top and second-from-top rows.

The dask dict for one of these arrays, s, looks like the following:

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
 ('x', 1, 0): (np.ones, (1000, 1000)),
 ('x', 2, 0): (np.ones, (1000, 1000)),
 ('x', 3, 0): (np.ones, (1000, 1000)),
 ('x', 4, 0): (np.ones, (1000, 1000)),
 ('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
 ('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
 ('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
 ('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
 ('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
 ('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
 ('tsqr_2_R_st1', 0, 0): (operator.getitem,('tsqr_2_QR_st1', 0, 0), 1),
 ('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
 ('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
 ('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
 ('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
 ('tsqr_2_R_st1_stacked', 0, 0): (np.vstack,
                                   [('tsqr_2_R_st1', 0, 0),
                                    ('tsqr_2_R_st1', 1, 0),
                                    ('tsqr_2_R_st1', 2, 0),
                                    ('tsqr_2_R_st1', 3, 0),
                                    ('tsqr_2_R_st1', 4, 0)])),
 ('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
 ('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
 ('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}

So to write complex parallel algorithms we write down dictionaries of tuples of functions.

The dask schedulers take care of executing this graph in parallel using multiple threads. Here is a profile result of a larger computation on a 30000x1000 array:

Low Barrier to Entry

Looking at this graph you may think “Wow, Mariano is awesome” and indeed he is. However, he is more an expert at linear algebra than at Python programming. Dask graphs (just dictionaries) are simple enough that a domain expert was able to look at them say “Yeah, I can do that” and write down the very complex algorithms associated to his domain, leaving the execution of those algorithms up to the dask schedulers.

You can see the source code that generates the above graphs on GitHub.

Randomized Parallel Out-of-Core SVD

A few weeks ago a genomics researcher asked for an approximate/randomized variant to SVD. Mariano had a solution up in a few days.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

I’ll omit the full dict for obvious space reasons.

Final Thoughts

Dask graphs let us express parallel algorithms with very little extra complexity. There are no special objects or frameworks to learn, just dictionaries of tuples of functions. This allows domain experts to write sophisticated algorithms without fancy code getting in their way.

Distributed Scheduling

2015-06-23T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory scheduler.

Dask.distributed is new and is not battle-tested. Use at your own risk and adjust expectations accordingly.

Most dask users use the dask collections, Array, Bag, and DataFrame. These collections are convenient ways to produce dask graphs. A dask graph is a dictionary of tasks. A task is a tuple with a function and arguments.

The graph comprising a dask collection (like a dask.array) is available through its .dask attribute.

>>> import dask.array as da
>>> x = da.arange(15, chunks=(5,))  # 0..14 in three chunks of size five

>>> x.dask  # dask array holds the graph to create the full array
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15)}

Further operations on x create more complex graphs

>>> z = (x + 100).sum()
>>> z.dask
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15),
 ('y', 0): (add, ('x', 0), 100),
 ('y', 1): (add, ('x', 1), 100),
 ('y', 2): (add, ('x', 2), 100),
 ('z', 0): (np.sum, ('y', 0)),
 ('z', 1): (np.sum, ('y', 1)),
 ('z', 2): (np.sum, ('y', 2)),
 ('z',): (sum, [('z', 0), ('z', 1), ('z', 2)])}

Hand-made dask graphs

We can make dask graphs by hand without dask collections. This involves creating a dictionary of tuples of functions.

>>> def add(a, b):
...     return a + b

>>> # x = 1
>>> # y = 2
>>> # z = add(x, y)

>>> dsk = {'x': 1,
...        'y': 2,
...        'z': (add, 'x', 'y')}

We evaluate these graphs with one of the dask schedulers

>>> from dask.threaded import get
>>> get(dsk, 'z')   # Evaluate graph with multiple threads
3

>>> from dask.multiprocessing import get
>>> get(dsk, 'z')   # Evaluate graph with multiple processes
3

We separate the evaluation of the graphs from their construction.

Distributed Scheduling

The separation of graphs from evaluation allows us to create new schedulers. In particular there exists a scheduler that operates on multiple machines in parallel, communicating over ZeroMQ.

This system has a single centralized scheduler, several workers, and potentially several clients.

Clients send graphs to the central scheduler which farms out those tasks to workers and coordinates the execution of the graph. While the scheduler centralizes metadata, the workers themselves handle transfer of intermediate data in a peer-to-peer fashion. Once the graph completes the workers send data to the scheduler which passes it through to the appropriate user/client.

Example

And so now we can execute our dask graphs in parallel across multiple machines.

$ ipython  # On your laptop                 $ ipython  # Remote Process #1:  Scheduler
>>> def add(a, b):                          >>> from dask.distributed import Scheduler
...     return a + b                        >>> s = Scheduler(port_to_workers=4444,
                                            ...               port_to_clients=5555,
>>> dsk = {'x': 1,                          ...               hostname='notebook')
...        'y': 2,
...        'z': (add, 'x', 'y')}            $ ipython  # Remote Process #2:  Worker
                                            >>> from dask.distributed import Worker
>>> from dask.threaded import get           >>> w = Worker('tcp://notebook:4444')
>>> get(dsk, 'z')  # use threads
3                                           $ ipython  # Remote Process #3:  Worker
                                            >>> from dask.distributed import Worker
                                            >>> w = Worker('tcp://notebook:4444')

>>> from dask.distributed import Client
>>> c = Client('tcp://notebook:5555')

>>> c.get(dsk, 'z') # use distributed network
3

Choose Your Scheduler

This graph is small. We didn’t need a distributed network of machines to compute it (a single thread would have been much faster) but this simple example can be easily extended to more important cases, including generic use with the dask collections (Array, Bag, DataFrame). You can control the scheduler with a keyword argument to any compute call.

>>> import dask.array as da
>>> x = da.random.normal(10, 0.1, size=(1000000000,), chunks=(1000000,))

>>> x.mean().compute(get=get)    # use threads
>>> x.mean().compute(get=c.get)  # use distributed network

Alternatively you can set the default scheduler in dask with dask.set_options

>>> import dask
>>> dask.set_options(get=c.get)  # use distributed scheduler by default

Known Limitations

We intentionally made the simplest and dumbest distributed scheduler we could think of. Because dask separates graphs from schedulers we can iterate on this problem many times; building better schedulers after learning what is important. This current scheduler learns from our single-memory system but is the first dask scheduler that has to think about distributed memory. As a result it has the following known limitations:

It does not consider data locality. While linear chains of tasks will execute on the same machine we don’t think much about executing multi-input tasks on nodes where only some of the data is local.
In particular, this scheduler isn’t optimized for data-local file-systems like HDFS. It’s still happy to read data from HDFS, but this results in unnecessary network communication. We’ve found that it’s great when paired with S3.
This scheduler is new and hasn’t yet had its tires kicked. Vocal beta users are most welcome.
We haven’t thought much about deployment. E.g. somehow you need to ssh into a bunch of machines and start up workers, then tear them down when you’re done. Dask.distributed can bootstrap off of an IPython Parallel cluster, and we’ve integrated it into anaconda-cluster but deployment remains a tough problem.

The dask.distributed module is available in the last release but I suggest using the development master branch. There will be another release in early July.

Further Information

Blake Griffith has been playing with dask.distributed and dask.bag together on data from http://githubarchive.org. He plans to write a blogpost to give a better demonstration of the use of dask.distributed on real world problems. Look for that post in the next week or two.

You can read more about the internal design of dask.distributed at the dask docs.

Thanks

Special thanks to Min Regan-Kelley, John Jacobsen, Ben Zaitlen, and Hugo Shi for their advice on building distributed systems.

Also thanks to Blake Griffith for serving as original user/developer and for smoothing over the user experience.

State of Dask

2015-05-19T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We lay out the pieces of Dask, a system for parallel computing

Dask started five months ago as a parallel on-disk array; it has since broadened out. I’ve enjoyed writing about its development tremendously. With the recent 0.5.0 release I decided to take a moment to give an overview of dask’s various pieces, their state, and current development.

Collections, graphs, and schedulers

Dask modules can be separated as follows:

On the left there are collections like arrays, bags, and dataframes. These copy APIs for NumPy, PyToolz, and Pandas respectively and are aimed towards data science users, allowing them to interact with larger datasets. Operations on these dask collections produce task graphs which are recipes to compute the desired result using many smaller computations that each fit in memory. For example if we want to sum a trillion numbers then we might break the numbers into million element chunks, sum those, and then sum the sums. A previously impossible task becomes a million and one easy ones.

On the right there are schedulers. Schedulers execute task graphs in different situations, usually in parallel. Notably there are a few schedulers for a single machine, and a new prototype for a distributed scheduler.

In the center is the directed acyclic graph. This graph serves as glue between collections and schedulers. The dask graph format is simple and doesn’t include any dask classes; it’s just functions, dicts, and tuples and so is easy to build on and low-tech enough to understand immediately. This separation is very useful to dask during development; improvements to one side immediately affect the other and new developers have had surprisingly little trouble. Also developers from a variety of backgrounds have been able to come up to speed in about an hour.

This separation is useful to other projects too. Directed acyclic graphs are popular today in many domains. By exposing dask’s schedulers publicly, other projects can bypass dask collections and go straight for the execution engine.

A flattering quote from a github issue:

dask has been very helpful so far, as it allowed me to skip implementing all of the usual graph operations. Especially doing the asynchronous execution properly would have been a lot of work.

Who uses dask?

Dask developers work closely with a few really amazing users:

Stephan Hoyer at Climate Corp has integrated dask.array into xray a library to manage large volumes of meteorlogical data (and other labeled arrays.)
Scikit image now includes an apply_parallel operation (github PR) that uses dask.array to parallelize image processing routines. (work by Blake Griffith)
Mariano Tepper a postdoc at Duke, uses dask in his research on matrix factorizations. Mariano is also the primary author of the dask.array.linalg module, which includes efficient and stable QR and SVD for tall and skinny matrices. See Mariano’s paper on arXiv.
Finally I personally use dask on daily work related to the XData project. This tends to drive some of the newer features.

A few other groups pop up on github from time to time; I’d love to know more detail about how people use dask.

What works and what doesn’t

Dask is modular. Each of the collections and each of the schedulers are effectively separate projects. These subprojects are at different states of development. Knowing the stability of each subproject can help you to determine how you use and depend on dask.

Dask.array and dask.threaded work well, are stable, and see constant use. They receive relatively minor bug reports which are dealt with swiftly.

Dask.bag and dask.multiprocessing undergo more API churn but are mostly ready for public use with a couple of caveats. Neither dask.dataframe nor

dask.distributed are ready for public use; they undergo significant API churn and have known errors.

Current work

The current state of development as I see it is as follows:

Dask.bag and dask.dataframe are progressing nicely. My personal work depends on these modules, so they see a lot of attention.
- At the moment I focus on grouping and join operations through fast shuffles; I hope to write about this problem soon.
- The Pandas API is large and complex. Reimplementing a subset of it in a blocked way is straightforward but also detailed and time consuming. This would be a great place for community contributions.
Dask.distributed is new. It needs it tires kicked but it’s an exciting development.
- For deployment we’re planning to bootstrap off of IPython parallel which already has decent coverage of many parallel job systems, (see #208 by Blake)
Dask.array development these days focuses on outreach. We’ve found application domains where dask is very useful; we’d like to find more.
The collections (Array, Bag, DataFrame) don’t cover all cases. I would like to start finding uses for the task schedulers in isolation. They serve as a release valve in complex situations.

More information

You can install dask with conda

conda install dask

or with pip

pip install dask
or
pip install dask[array]
or
pip install dask[bag]

You can read more about dask at the docs or github.

Towards Out-of-core DataFrames

2015-03-11T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers. It is on experimental code that is not ready for users.

tl;dr Can we build dask.frame? One approach involves indexes and a lot of shuffling.

Over the last two months we’ve watched the creation of dask, a task scheduling specification, and dask.array a project to implement the out-of-core nd-arrays using blocked algorithms. (blogposts: 1, 2, 3, 4, 5, 6). This worked pretty well. Dask.array is available on the main conda channel and on PyPI and, for the most part, is a pleasant drop-in replacement for a subset of NumPy operations. I’m really happy with it.

conda install dask
or
pip install dask

There is still work to do, in particular I’d like to interact with people who have real-world problems, but for the most part dask.array feels ready.

On to dask frames

Can we do for Pandas what we’ve just done for NumPy?

Question: Can we represent a large DataFrame as a sequence of in-memory DataFrames and perform most Pandas operations using task scheduling?

Answer: I don’t know. Lets try.

Naive Approach

If represent a dask.array as an N-d grid of NumPy ndarrays, then maybe we should represent a dask.frame as a 1-d grid of Pandas DataFrames; they’re kind of like arrays.

	dask.array	Naive dask.frame

This approach supports the following operations:

Elementwise operations df.a + df.b
Row-wise filtering df[df.a > 0]
Reductions df.a.mean()
Some split-apply-combine operations that combine with a standard reduction like df.groupby('a').b.mean(). Essentially anything you can do with df.groupby(...).agg(...)

The reductions and split-apply-combine operations require some cleverness. This is how Blaze works now and how it does the does out-of-core operations in these notebooks: Blaze and CSVs, Blaze and Binary Storage.

However this approach does not support the following operations:

Joins
Split-apply-combine with more complex transform or apply combine steps
Sliding window or resampling operations
Anything involving multiple datasets

Partition on the Index values

Instead of partitioning based on the size of blocks we instead partition on value ranges of the index.

	Partition on block size	Partition on index value

This opens up a few more operations

Joins are possible when both tables share the same index. Because we have information about index values we we know which blocks from one side need to communicate to which blocks from the other.
Split-apply-combine with transform/apply steps are possible when the grouper is the index. In this case we’re guaranteed that each group is in the same block. This opens up general df.gropuby(...).apply(...)
Rolling or resampling operations are easy on the index if we share a small amount of information between blocks as we do in dask.array for ghosting operations.

We note the following theme:

Complex operations are easy if the logic aligns with the index

And so a recipe for many complex operations becomes:

Re-index your data along the proper column
Perform easy computation

Re-indexing out-of-core data

To be explicit imagine we have a large time-series of transactions indexed by time and partitioned by day. The data for every day is in a separate DataFrame.

Block 1
-------
                     credit    name
time
2014-01-01 00:00:00     100     Bob
2014-01-01 01:00:00     200   Edith
2014-01-01 02:00:00    -300   Alice
2014-01-01 03:00:00     400     Bob
2014-01-01 04:00:00    -500  Dennis
...

Block 2
-------
                     credit    name
time
2014-01-02 00:00:00     300    Andy
2014-01-02 01:00:00     200   Edith
...

We want to reindex this data and shuffle all of the entries so that now we partiion on the name of the person. Perhaps all of the A’s are in one block while all of the B’s are in another.

Block 1
-------
                       time  credit
name
Alice   2014-04-30 00:00:00     400
Alice   2014-01-01 00:00:00     100
Andy    2014-11-12 00:00:00    -200
Andy    2014-01-18 00:00:00     400
Andy    2014-02-01 00:00:00    -800
...

Block 2
-------
                       time  credit
name
Bob     2014-02-11 00:00:00     300
Bob     2014-01-05 00:00:00     100
...

Re-indexing and shuffling large data is difficult and expensive. We need to find good values on which to partition our data so that we get regularly sized blocks that fit nicely into memory. We also need to shuffle entries from all of the original blocks to all of the new ones. In principle every old block has something to contribute to every new one.

We can’t just call DataFrame.sort because the entire data might not fit in memory and most of our sorting algorithms assume random access.

We do this in two steps

Find good division values to partition our data. These should partition the data into blocks of roughly equal size.
Shuffle our old blocks into new blocks along the new partitions found in step one.

Find divisions by external sorting

One approach to find new partition values is to pull out the new index from each block, perform an out-of-core sort, and then take regularly spaced values from that array.

Pull out new index column from each block

indexes = [block['new-column-index'] for block in blocks]

Perform out-of-core sort on that column

sorted_index = fancy_out_of_core_sort(indexes)

Take values at regularly spaced intervals, e.g.
```
partition_values = sorted_index[::1000000]
```

We implement this using parallel in-block sorts, followed by a streaming merge process using the heapq module. It works but is slow.

Possible Improvements

This could be accelerated through one of the following options:

A streaming numeric solution that works directly on iterators of NumPy arrays (numtoolz anyone?)
Not sorting at all. We only actually need approximate regularly spaced quantiles. A brief literature search hints that there might be some good solutions.

Shuffle

Now that we know the values on which we want to partition we ask each block to shard itself into appropriate pieces and shove all of those pieces into a spill-to-disk dictionary. Another process then picks up these pieces and calls pd.concat to merge them in to the new blocks.

For the out-of-core dict we’re currently using Chest. Turns out that serializing DataFrames and writing them to disk can be tricky. There are several good methods with about an order of magnitude performance difference between them.

This works but my implementation is slow

Here is an example with snippet of the NYCTaxi data (this is small)

In [1]: import dask.frame as dfr

In [2]: d = dfr.read_csv('/home/mrocklin/data/trip-small.csv', chunksize=10000)

In [3]: d.head(3)   # This is fast
Out[3]:
                          medallion                      hack_license  \
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472

  vendor_id  rate_code store_and_fwd_flag      pickup_datetime  \
0       CMT          1                  N  2013-01-01 15:11:48
1       CMT          1                  N  2013-01-06 00:18:35
2       CMT          1                  N  2013-01-05 18:49:41

      dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  \
0  2013-01-01 15:18:10                4                382            1.0
1  2013-01-06 00:22:54                1                259            1.5
2  2013-01-05 18:54:23                1                282            1.1

   pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0        -73.978165        40.757977         -73.989838         40.751171
1        -74.006683        40.731781         -73.994499         40.750660
2        -74.004707        40.737770         -74.009834         40.726002

In [4]: d2 = d.set_index(d.passenger_count, out_chunksize=10000)   # This takes some time

In [5]: d2.head(3)
Out[5]:
                                        medallion  \
passenger_count
0                3F3AC054811F8B1F095580C50FF16090
1                4C52E48F9E05AA1A8E2F073BB932E9AA
1                FF00E5D4B15B6E896270DDB8E0697BF7

                                     hack_license vendor_id  rate_code  \
passenger_count
0                E00BD74D8ADB81183F9F5295DC619515       VTS          5
1                307D1A2524E526EE08499973A4F832CF       VTS          1
1                0E8CCD187F56B3696422278EBB620EFA       VTS          1

                store_and_fwd_flag      pickup_datetime     dropoff_datetime  \
passenger_count
0                              NaN  2013-01-13 03:25:00  2013-01-13 03:42:00
1                              NaN  2013-01-13 16:12:00  2013-01-13 16:23:00
1                              NaN  2013-01-13 15:05:00  2013-01-13 15:15:00

                 passenger_count  trip_time_in_secs  trip_distance  \
passenger_count
0                              0               1020           5.21
1                              1                660           2.94
1                              1                600           2.18

                 pickup_longitude  pickup_latitude  dropoff_longitude  \
passenger_count
0                      -73.986900        40.743736         -74.029747
1                      -73.976753        40.790123         -73.984802
1                      -73.982719        40.767147         -73.982170

                 dropoff_latitude
passenger_count
0                       40.741348
1                       40.758518
1                       40.746170

In [6]: d2.blockdivs  # our new partition values
Out[6]: (2, 3, 6)

In [7]: d.blockdivs   # our original partition values
Out[7]: (10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000)

Some Problems

First, we have to evaluate the dask as we go. Every set_index operation (and hence many groupbys and joins) forces an evaluation. We can no longer, as in the dask.array case, endlessly compound high-level operations to form more and more complex graphs and then only evaluate at the end. We need to evaluate as we go.
Sorting/shuffling is slow. This is for a few reasons including the serialization of DataFrames and sorting being hard.
How feasible is it to frequently re-index a large amount of data? When do we reach the stage of “just use a database”?
Pandas doesn’t yet release the GIL, so this is all single-core. See post on PyData and the GIL.
My current solution lacks basic functionality. I’ve skipped the easy things to first ensure sure that the hard stuff is doable.

Help

I know less about tables than about arrays. I’m ignorant of the literature and common solutions in this field. If anything here looks suspicious then please speak up. I could really use your help.

Additionally the Pandas API is much more complex than NumPy’s. If any experienced devs out there feel like jumping in and implementing fairly straightforward Pandas features in a blocked way I’d be obliged.

Towards Out-of-core ND-Arrays -- Dask + Toolz = Bag

2015-02-17T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We use dask to build a parallel Python list.

This is the seventh in a sequence of posts constructing an out-of-core nd-array using NumPy, and dask. You can view these posts here:

Today we take a break from ND-Arrays and show how task scheduling can attack other collections like the simple list of Python objects.

Unstructured Data

Often before we have an ndarray or a table/DataFrame we have unstructured data like log files. In this case tuned subsets of the language (e.g. numpy/pandas) aren’t sufficient, we need the full Python language.

My usual approach to the inconveniently large dump of log files is to use Python streaming iterators along with multiprocessing or IPython Parallel on a single large machine. I often write/speak about this workflow when discussing toolz.

This workflow grows complex for most users when you introduce many processes. To resolve this we build our normal tricks into a new dask.Bag collection.

Bag

In the same way that dask.array mimics NumPy operations (e.g. matrix multiply, slicing), dask.bag mimics functional operations like map, filter, reduce found in the standard library as well as many of the streaming functions found in toolz.

Dask array = NumPy + threads
Dask bag = Python/Toolz + processes

Example

Here’s the obligatory wordcount example

>>> from dask.bag import Bag

>>> b = Bag.from_filenames('data/*.txt')

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> dict(b.map(str.split).map(concat).map(stem).frequencies())
{...}

We use all of our cores and stream through memory on each core. We use multiprocessing but could get fancier with some work.

Design

As before, a Bag is just a dict holding tasks, along with a little meta data.

>>> d = {('x', 0): (range, 5),
...      ('x', 1): (range, 5),
...      ('x', 2): (range, 5)}

>>> from dask.bag import Bag
>>> b = Bag(d, 'x', npartitions=3)

In this way we break up one collection

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

into three independent pieces

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]

When we abstractly operate on the large collection…

>>> b2 = b.map(lambda x: x * 10)

… we generate new tasks to operate on each of the components.

>>> b2.dask
{('x', 0): (range, 5),
 ('x', 1): (range, 5),
 ('x', 2): (range, 5)}
 ('bag-1', 0): (map, lambda x: x * 10, ('x', 0)),
 ('bag-1', 1): (map, lambda x: x * 10, ('x', 1)),
 ('bag-1', 2): (map, lambda x: x * 10, ('x', 2))}

And when we ask for concrete results (the call to list) we spin up a scheduler to execute the resulting dependency graph of tasks.

>>> list(b2)
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

More complex operations yield more complex dasks. Beware, dask code is pretty Lispy. Fortunately these dasks are internal; users don’t interact with them.

>>> iseven = lambda x: x % 2 == 0
>>> b3 = b.filter(iseven).count().dask
{'bag-3': (sum, [('bag-2', 1), ('bag-2', 2), ('bag-2', 0)]),
 ('bag-2', 0): (count,
                (filter, iseven, (range, 5))),
 ('bag-2', 1): (count,
                (filter, iseven, (range, 5))),
 ('bag-2', 2): (count,
                (filter, iseven, (range, 5)))}

The current interface for Bag has the following operations:

all             frequencies         min
any             join                product
count           map                 std
filter          map_partitions      sum
fold            max                 topk
foldby          mean                var

Manipulations of bags create task dependency graphs. We eventually execute these graphs in parallel.

Execution

We repurpose the threaded scheduler we used for arrays to support multiprocessing to provide parallelism even on pure Python code. We’re careful to avoid unnecessary data transfer. None of the operations listed above requires significant communication. Notably we don’t have any concept of shuffling or scatter/gather.

We use dill to take care to serialize functions properly and collect/report errors, two issues that plague naive use of multiprocessing in Python.

>>> list(b.map(lambda x: x * 10))  # This works!
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

>>> list(b.map(lambda x: x / 0))   # This errs gracefully!
ZeroDivisionError:  Execption in remote Process

integer division or modulo by zero

Traceback:
    ...

These tricks remove need for user expertise.

Productive Sweet Spot

I think that there is a productive sweet spot in the following configuration

Pure Python functions
Streaming/lazy data
Multiprocessing
A single large machine or a few machines in an informal cluster

This setup is common and it’s capable of handling terabyte scale workflows. In my brief experience people rarely take this route. They use single-threaded in-memory Python until it breaks, and then seek out Big Data Infrastructure like Hadoop/Spark at relatively high productivity overhead.

Your workstation can scale bigger than you think.

Example

Here is about a gigabyte of network flow data, recording which computers made connections to which other computers on the UC-Berkeley campus in 1996.

846890339:661920 846890339:755475 846890340:197141 168.237.7.10:1163 83.153.38.208:80 2 8 4294967295 4294967295 846615753 176 2462 39 GET 21068906053917068819..html HTTP/1.0

846890340:989181 846890341:2147 846890341:2268 13.35.251.117:1269 207.83.232.163:80 10 0 842099997 4294967295 4294967295 64 1 38 GET 20271810743860818265..gif HTTP/1.0

846890341:80714 846890341:90331 846890341:90451 13.35.251.117:1270 207.83.232.163:80 10 0 842099995 4294967295 4294967295 64 1 38 GET 38127854093537985420..gif HTTP/1.0

This is actually relatively clean. Many of the fields are space delimited (not all) and I’ve already compiled and run the decade old C-code needed to decompress it from its original format.

Lets use Bag and regular expressions to parse this.

In [1]: from dask.bag import Bag, into

In [2]: b = Bag.from_filenames('UCB-home-IP*.log')

In [3]: import re

In [4]: pattern = """
   ...: (?P<request_time>\d+:\d+)
   ...: (?P<response_start>\d+:\d+)
   ...: (?P<response_end>\d+:\d+)
   ...: (?P<client_ip>\d+\.\d+\.\d+\.\d+):(?P<client_port>\d+)
   ...: (?P<server_ip>\d+\.\d+\.\d+\.\d+):(?P<server_port>\d+)
   ...: (?P<client_headers>\d+)
   ...: (?P<server_headers>\d+)
   ...: (?P<if_modified_since>\d+)
   ...: (?P<response_header_length>\d+)
   ...: (?P<response_data_length>\d+)
   ...: (?P<request_url_length>\d+)
   ...: (?P<expires>\d+)
   ...: (?P<last_modified>\d+)
   ...: (?P<method>\w+)
   ...: (?P<domain>\d+..)\.(?P<extension>\w*)(?P<rest_of_url>\S*)
   ...: (?P<protocol>.*)""".strip().replace('\n', '\s+')

In [5]: prog = re.compile(pattern)

In [6]: records = b.map(prog.match).map(lambda m: m.groupdict())

This returns instantly. We only compute results when necessary. We trigger computation by calling list.

In [7]: list(records.take(1))
Out[7]:
[{'client_headers': '2',
  'client_ip': '168.237.7.10',
  'client_port': '1163',
  'domain': '21068906053917068819.',
  'expires': '2462',
  'extension': 'html',
  'if_modified_since': '4294967295',
  'last_modified': '39',
  'method': 'GET',
  'protocol': 'HTTP/1.0',
  'request_time': '846890339:661920',
  'request_url_length': '176',
  'response_data_length': '846615753',
  'response_end': '846890340:197141',
  'response_header_length': '4294967295',
  'response_start': '846890339:755475',
  'rest_of_url': '',
  'server_headers': '8',
  'server_ip': '83.153.38.208',
  'server_port': '80'}]

Because bag operates lazily this small result also returns immediately.

To demonstrate depth we find the ten client/server pairs with the most connections.

In [8]: counts = records.pluck(['client_ip', 'server_ip']).frequencies()

In [9]: %time list(counts.topk(10, key=lambda x: x[1]))
CPU times: user 11.2 s, sys: 1.15 s, total: 12.3 s
Wall time: 50.4 s
Out[9]:
[(('247.193.34.56', '243.182.247.102'), 35353),
 (('172.219.28.251', '47.61.128.1'), 22333),
 (('240.97.200.0', '108.146.202.184'), 17492),
 (('229.112.177.58', '47.61.128.1'), 12993),
 (('146.214.34.69', '119.153.78.6'), 12554),
 (('17.32.139.174', '179.135.20.36'), 10166),
 (('97.166.76.88', '65.81.49.125'), 8155),
 (('55.156.159.21', '157.229.248.255'), 7533),
 (('55.156.159.21', '124.77.75.86'), 7506),
 (('55.156.159.21', '97.5.181.76'), 7501)]

Comparison with Spark

First, it is silly and unfair to compare with PySpark running locally. PySpark offers much more in a distributed context.

In [1]: import pyspark

In [2]: sc = pyspark.SparkContext('local')

In [3]: from glob import glob
In [4]: filenames = sorted(glob('UCB-home-*.log'))
In [5]: rdd = sc.parallelize(filenames, numSlices=4)

In [6]: import re
In [7]: pattern = ...
In [8]: prog = re.compile(pattern)

In [9]: lines = rdd.flatMap(lambda fn: list(open(fn)))
In [10]: records = lines.map(lambda line: prog.match(line).groupdict())
In [11]: ips = records.map(lambda rec: (rec['client_ip'], rec['server_ip']))

In [12]: from toolz import topk
In [13]: %time dict(topk(10, ips.countByValue().items(), key=1))
CPU times: user 1.32 s, sys: 52.2 ms, total: 1.37 s
Wall time: 1min 21s
Out[13]:
{('146.214.34.69', '119.153.78.6'): 12554,
 ('17.32.139.174', '179.135.20.36'): 10166,
 ('172.219.28.251', '47.61.128.1'): 22333,
 ('229.112.177.58', '47.61.128.1'): 12993,
 ('240.97.200.0', '108.146.202.184'): 17492,
 ('247.193.34.56', '243.182.247.102'): 35353,
 ('55.156.159.21', '124.77.75.86'): 7506,
 ('55.156.159.21', '157.229.248.255'): 7533,
 ('55.156.159.21', '97.5.181.76'): 7501,
 ('97.166.76.88', '65.81.49.125'): 8155}

So, given a compute-bound mostly embarrassingly parallel task (regexes are comparatively expensive) on a single machine they are comparable.

Reasons you would want to use Spark

You want to use many machines and interact with HDFS
Shuffling operations

Reasons you would want to use dask.bag

Trivial installation
No mucking about with JVM heap sizes or config files
Nice error reporting. Spark error reporting includes the typical giant Java Stack trace that comes with any JVM solution.
Easier/simpler for Python programmers to hack on. The implementation is 350 lines including comments.

Again, this is really just a toy experiment to show that the dask model isn’t just about arrays. I absolutely do not want to throw Dask in the ring with Spark.

Conclusion

However I do want to stress the importance of single-machine parallelism. Dask.bag targets this application well and leverages common hardware in a simple way that is both natural and accessible to most Python programmers.

A skilled developer could extend this to work in a distributed memory context. The logic to create the task dependency graphs is separate from the scheduler.

Special thanks to Erik Welch for finely crafting the dask optimization passes that keep the data flowly smoothly.

Towards Out-of-core ND-Arrays -- Slicing and Stacking

2015-02-13T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr Dask.arrays can slice and stack. This is useful for weather data.

This is the sixth in a sequence of posts constructing an out-of-core nd-array using NumPy, and dask. You can view these posts here:

Now we talk about slicing and stacking. We use meteorological data as an example use case.

Slicing

Dask.array now supports most of the NumPy slicing syntax. In particular it supports the following:

Slicing by integers and slices x[0, :5]
Slicing by a list/np.ndarray of integers x[[1, 2, 4]]
Slicing by a list/np.ndarray of booleans x[[False, True, True, False, True]]

It does not currently support the following:

Slicing one dask.array with another x[x > 0]
Slicing with lists in multiple axes x[[1, 2, 3], [3, 2, 1]]

Stack and Concatenate

We often store large arrays on disk in many different files. We want to stack or concatenate these arrays together into one logical array. Dask solves this problem with the stack and concatenate functions, which stitch many arrays together into a single array, either creating a new dimension with stack or along an existing dimension with concatenate.

Stack

We stack many existing dask arrays into a new array, creating a new dimension as we go.

>>> import dask.array as da
>>> arrays = [from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # A small stack of dask arrays

>>> da.stack(arrays, axis=0).shape
(3, 4, 4)

>>> da.stack(arrays, axis=1).shape
(4, 3, 4)

>>> da.stack(arrays, axis=2).shape
(4, 4, 3)

This creates a new dimension with length equal to the number of slices

Concatenate

We concatenate existing arrays into a new array, extending them along an existing dimension

>>> import dask.array as da
>>> arrays = [from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # small stack of dask arrays

>>> da.concatenate(arrays, axis=0).shape
(12, 4)

>>> da.concatenate(arrays, axis=1).shape
(4, 12)

Case Study with Meteorological Data

To test this new functionality we download meteorological data from the European Centre for Medium-Range Weather Forecasts. In particular we have the temperature for the Earth every six hours for all of 2014 with spatial resolution of a quarter degree. We download this data using this script (please don’t hammer their servers unnecessarily) (Thanks due to Stephan Hoyer for pointing me to this dataset).

As a result, I now have a bunch of netCDF files!

$ ls
2014-01-01.nc3  2014-03-18.nc3  2014-06-02.nc3  2014-08-17.nc3  2014-11-01.nc3
2014-01-02.nc3  2014-03-19.nc3  2014-06-03.nc3  2014-08-18.nc3  2014-11-02.nc3
2014-01-03.nc3  2014-03-20.nc3  2014-06-04.nc3  2014-08-19.nc3  2014-11-03.nc3
2014-01-04.nc3  2014-03-21.nc3  2014-06-05.nc3  2014-08-20.nc3  2014-11-04.nc3
...             ...             ...             ...             ...

>>> import netCDF4
>>> t = netCDF4.Dataset('2014-01-01.nc3').variables['t2m']
>>> t.shape
(4, 721, 1440)

The shape corresponds to four measurements per day (24h / 6h), 720 measurements North/South (180 / 0.25) and 1440 measurements East/West (360/0.25). There are 365 files.

Great! We collect these under one logical dask array, concatenating along the time axis.

>>> from glob import glob
>>> filenames = sorted(glob('2014-*.nc3'))
>>> temps = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]

>>> import dask.array as da
>>> arrays = [da.from_array(t, blockshape=(4, 200, 200)) for t in temps]
>>> x = da.concatenate(arrays, axis=0)

>>> x.shape
(1464, 721, 1440)

Now we can play with x as though it were a NumPy array.

avg = x.mean(axis=0)
diff = x[1000] - avg

If we want to actually compute these results we have a few options

>>> diff.compute()  # compute result, return as array, float, int, whatever is appropriate
>>> np.array(diff)  # compute result and turn into `np.ndarray`
>>> diff.store(anything_that_supports_setitem)  # For out-of-core storage

Alternatively, because many scientific Python libraries call np.array on inputs, we can just feed our da.Array objects directly in to matplotlib (hooray for the __array__ protocol!):

>>> from matplotlib import imshow
>>> imshow(x.mean(axis=0), cmap='bone')
>>> imshow(x[1000] - x.mean(axis=0), cmap='RdBu_r')

I suspect that the temperature scale is in Kelvin. It looks like the random day is taken during Northern Summer. Another fun one, lets look at the difference between the temperatures at 00:00 and at 12:00

>>> imshow(x[::4].mean(axis=0) - x[2::4].mean(axis=0), cmap='RdBu_r')

Even though this looks and feels like NumPy we’re actually operating off of disk using blocked algorithms. We execute these operations using only a small amount of memory. If these operations were computationally intense (they aren’t) then we also would also benefit from multiple cores.

What just happened

To be totally clear the following steps just occurred:

Open up a bunch of netCDF files and located a temperature variable within each file. This is cheap.
For each of those temperature variables create a da.Array object, specifying how we want to block up that variable. This is also cheap.
Make a new da.Array by concatenating all of our da.Arrays for each day. This, like the other steps, is just book-keeping. We haven’t loaded data or computed anything yet.
Write numpy-style code x[::2].mean(axis=0) - x[2::2].mean(axis=0). This creates yet another da.Array with a more complex task graph. It takes a few hundred milliseconds to create this dictionary.
Callimshow on our da.Array object
imshow calls np.array on its input, this starts the multi-core task scheduler
A flurry of chunks fly out of all the netCDF files. These chunks meet various NumPy functions and create new chunks. Well organized magic occurs and an np.ndarray emerges.
Matplotlib makes a pretty graph

Problems that Popped Up

The threaded scheduler is introducing significant overhead in its planning. For this workflow the single-threaded naive scheduler is actually significantly faster. We’ll have to find better solutions to reduce scheduling overhead.

Conclusion

I hope that this shows off how dask.array can be useful when dealing with collections of on-disk arrays. As always I’m very happy to hear how we can make this project more useful for your work. If you have large n-dimensional datasets I’d love to hear about what you do and how dask.array can help. I can be reached either in the comments below or at blaze-dev@continuum.io.

Acknowledgements

First, other projects can already do this. In particular if this seemed useful for your work then you should probably also know about Biggus, produced by the UK Met office, which has been around for much longer than dask.array and is used in production.

Second, this post shows off work from the following people:

Erik Welch (Continuum) wrote optimization passes to clean up dask graphs before execution.
Wesley Emeneker (Continuum) wrote a good deal of the slicing code
Stephan Hoyer (Climate Corp) talked me through the application and pointed me to the data. If you’d like to see dask integrated with xray then you should definitely bug Stephan :)

Towards Out-of-core ND-Arrays -- Spilling to Disk

2015-01-16T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We implement a dictionary that spills to disk when we run out of memory. We connect this to our scheduler.

This is the fifth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:

We now present chest a dict type that spills to disk when we run out of memory. We show how it prevents large computations from flooding memory.

Intermediate Data

If you read the post on scheduling you may recall our goal to minimize intermediate storage during a multi-worker computation. The image on the right shows a trace of our scheduler as it traverses a task dependency graph. We want to compute the entire graph quickly while keeping only a small amount of data in memory at once.

Sometimes we fail and our scheduler stores many large intermediate results. In these cases we want to spill excess intermediate data to disk rather than flooding local memory.

Chest

Chest is a dict-like object that writes data to disk once it runs out of memory.

>>> from chest import Chest
>>> d = Chest(available_memory=1e9)  # Only use a GigaByte

It satisfies the MutableMapping interface so it looks and feels exactly like a dict. Below we show an example using a chest with only enough data to store one Python integer in memory.

>>> d = Chest(available_memory=24)  # enough space for one Python integer

>>> d['one'] = 1
>>> d['two'] = 2
>>> d['three'] = 3
>>> d['three']
3
>>> d.keys()
['one', 'two', 'three']

We keep some data in memory

>>> d.inmem
{'three': 3}

While the rest lives on disk

>>> d.path
'/tmp/tmpb5ouAb.chest'
>>> os.listdir(d.path)
['one', 'two']

By default we store data with pickle but chest supports any protocol with the dump/load interface (pickle, json, cbor, joblib, ….)

A quick point about pickle. Frequent readers of my blog may know of my sadness at how this library serializes functions and the crippling effect on multiprocessing. That sadness does not extend to normal data. Pickle is fine for data if you use the protocol= keyword to pickle.dump correctly . Pickle isn’t a good cross-language solution, but that doesn’t matter in our application of temporarily storing numpy arrays on disk.

Recent tweaks

In using chest alongside dask with any reasonable success I had to make the following improvements to the original implementation:

A basic LRU mechanism to write only infrequently used data
A policy to avoid writing the same data to disk twice if it hasn’t changed
Thread safety

Now we can execute more dask workflows without risk of flooding memory

A = ...
B = ...
expr = A.T.dot(B) - B.mean(axis=0)

cache = Chest(available_memory=1e9)

into('myfile.hdf5::/result', expr, cache=cache)

Now we incur only moderate slowdown when we schedule poorly and run into large quantities of intermediate data.

Conclusion

Chest is only useful when we fail to schedule well. We can still improve scheduling algorithms to avoid keeping data in memory but it’s nice to have chest as a backup for when these algorithms fail. Resilience is reassuring.

Towards Out-of-core ND-Arrays -- Benchmark MatMul

2015-01-14T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We benchmark dask on an out-of-core dot product. We also compare and motivate the use of an optimized BLAS.

Here are the results

Performance (GFLOPS)	In-Memory	On-Disk
Reference BLAS	6	18
OpenBLAS one thread	11	23
OpenBLAS four threads	22	11

Disclaimer: This post is on experimental buggy code. This is not ready for public use.

This is the fourth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:

We now give performance numbers on out-of-core matrix-matrix multiplication.

Matrix-Matrix Multiplication

Dense matrix-matrix multiplication is compute-bound, not I/O bound. We spend most of our time doing arithmetic and relatively little time shuffling data around. As a result we may be able to read large data from disk without performance loss.

When multiplying two \(n\times n\) matrices we read \(n^2\) bytes but perform \(n^3\) computations. There are \(n\) computations to do per byte so, relatively speaking, I/O is cheap.

We normally measure speed for single CPUs in Giga Floating Point Operations Per Second (GFLOPS). Lets look at how my laptop does on single-threaded in-memory matrix-matrix multiplication using NumPy.

>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)  # Matrix-matrix multiplication
10 loops, best of 3: 176 ms per loop

>>> 1000 ** 3 / 0.176 / 1e9  # n cubed computations / seconds / billion
>>> 5.681818181818182

OK, so NumPy’s matrix-matrix multiplication clocks in at 6 GFLOPS more or less. The np.dot function ties in to the GEMM operation in the BLAS library on my machine. Currently my numpy just uses reference BLAS. (you can check this with np.show_config().)

Matrix-Matrix Multiply From Disk

For matrices too large to fit in memory we compute the solution one part at a time, loading blocks from disk when necessary. We parallelize this with multiple threads. Our last post demonstrates how NumPy+Blaze+Dask automates this for us.

We perform a simple numerical experiment, using HDF5 as our on-disk store.

We install stuff

conda install -c blaze blaze
pip install git+https://github.com/ContinuumIO/dask

We set up a fake dataset on disk

>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> A = f.create_dataset(name='A', shape=(200000, 4000), dtype='f8',
...                                chunks=(250, 250), fillvalue=1.0)
>>> B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
...                                chunks=(250, 250), fillvalue=1.0)

We tell Dask+Blaze how to interact with that dataset

>>> from dask.obj import Array
>>> from blaze import Data, into

>>> a = into(Array, 'myfile.hdf5::/A', blockshape=(1000, 1000))  # dask things
>>> b = into(Array, 'myfile.hdf5::/B', blockshape=(1000, 1000))
>>> A = Data(a)  # Blaze things
>>> B = Data(b)

We compute our desired result, storing back onto disk

>>> %time into('myfile.hdf5::/result', A.dot(B))
2min 49s

>>> 200000 * 4000 * 4000 / 169 / 1e9
18.934911242

18.9 GFLOPS, roughly 3 times faster than the in-memory solution. At first glance this is confusing - shouldn’t we be slower coming from disk? Our speedup is due to our use of four cores in parallel. This is good, we don’t experience much slowdown coming from disk.

It’s as if all of our hard drive just became memory.

OpenBLAS

Reference BLAS is slow; it was written long ago. OpenBLAS is a modern implementation. I installed OpenBLAS with my system installer (apt-get) and then reconfigured and rebuilt numpy. OpenBLAS supports many cores. We’ll show timings with one and with four threads.

export OPENBLAS_NUM_THREADS=1
ipython

>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)
10 loops, best of 3: 89.8 ms per loop

>>> 1000 ** 3 / 0.0898 / 1e9  # compute GFLOPS
11.135857461024498

export OPENBLAS_NUM_THREADS=4
ipython

>>> %timeit x.dot(x)
10 loops, best of 3: 46.3 ms per loop

>>> 1000 ** 3 / 0.0463 / 1e9  # compute GFLOPS
21.598272138228943

This is about four times faster than reference. If you’re not already parallelizing in some other way (like with dask) then you should use a modern BLAS like OpenBLAS or MKL.

OpenBLAS + dask

Finally we run on-disk our experiment again, now with OpenBLAS. We do this both with OpenBLAS running with one thread and with many threads.

We’ll skip the code (it’s identical to what’s above) and give a comprehensive table of results below.

Sadly the out-of-core solution doesn’t improve much by using OpenBLAS. Acutally when both OpenBLAS and dask try to parallelize we lose performance.

Results

Performance (GFLOPS)	In-Memory	On-Disk
Reference BLAS	6	18
OpenBLAS one thread	11	23
OpenBLAS four threads	22	11

tl:dr When doing compute intensive work, don’t worry about using disk, just don’t use two mechisms of parallelism at the same time.

Main Take-Aways

We don’t lose much by operating from disk in compute-intensive tasks
Actually we can improve performance when an optimized BLAS isn’t avaiable.
Dask doesn’t benefit much from an optimized BLAS. This is sad and surprising. I expected performance to scale with single-core in-memory performance. Perhaps this is indicative of some other limiting factor
One shouldn’t extrapolate too far with these numbers. They’re only relevant for highly compute-bound operations like matrix-matrix multiply

Also, thanks to Wesley Emeneker for finding where we were leaking memory, making results like these possible.

Towards Out-of-core ND-Arrays -- Multi-core Scheduling

2015-01-06T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We show off a multi-threaded shared-memory task scheduler. We share two techniques for space-constrained computing. We end with pretty GIFs.

Disclaimer: This post is on experimental buggy code. This is not ready for public use.

Disclaimer 2: This post is technical and intended for people who care about task scheduling, not for traditional users.

My last two posts (post 1, post 2) construct an ND-Array library out of a simple task scheduler, NumPy, and Blaze.

In this post we discuss a more sophisticated scheduler. In this post we outline a less elegent but more effective scheduler that uses multiple threads and caching to achieve performance on an interesting class of array operations.

We create scheduling policies to minimize the memory footprint of our computation.

Example

First, we establish value by doing a hard thing well. Given two large arrays stored in HDF5:

import h5py
f = h5py.File('myfile.hdf5')
A = f.create_dataset(name='A', shape=(4000, 2000000), dtype='f8',
                     chunks=(250, 250), fillvalue=1.0)
B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
                     chunks=(250, 250), fillvalue=1.0)
f.close()

We do a transpose and dot product.

from blaze import Data, into
from dask.obj import Array

f = h5py.File('myfile.hdf5')

a = into(Array, f['A'], blockshape=(1000, 1000), name='A')
b = into(Array, f['B'], blockshape=(1000, 1000), name='B')

A = Data(a)
B = Data(b)

expr = A.T.dot(B)

result = into('myfile.hdf5::/C', expr)

This uses all of our cores and can be done with only 100MB or so of ram. This is impressive because neither the inputs, outputs, nor any intermediate stage of the computation can fit in memory.

We failed to achieve this exactly (see note at bottom) but still, in theory, we’re great!

Avoid Intermediates

To keep a small memory footprint we avoid holding on to unnecessary intermediate data. The full computation graph of a smaller problem might look like the following:

Boxes represent data, circles represent functions that run on that data , arrows specify which functions produce/consume which data.

The top row of circles represent the actual blocked dot products (note the many data dependence arrows originating from them). The bottom row of circles represents pulling blocks of data from the the A HDF5 dataset to in-memory numpy arrays. The second row transposes the blocks from the first row and adds more blocks from B.

Naively performed, this graph can be very bad; we replicate our data four times here, once in each of the rows. We pull out all of the chunks, transpose each of them, and then finally do a dot product. If we couldn’t fit the original data in memory then there is no way that this will work.

Function Inlining

We resolve this in two ways. First, we don’t cache intermediate values for fast-running functions (like np.transpose). Instead we inline fast functions into tasks involving expensive functions (like np.dot). We may end up running the same quick function twice, but at least we won’t have to store the result. We trade computation for memory.

The result of the graph above with all access and transpose operations inlined looks like the following:

Now our tasks nest (see below). We run all functions within a nested task as a single operation. (non-LISP-ers avert your eyes):

('x_6', 6, 0): (dotmany, [(np.transpose, (ndslice, 'A', (1000, 1000), 0, 6)),
                          (np.transpose, (ndslice, 'A', (1000, 1000), 1, 6))],
                          [(ndslice, 'B', (1000, 1000), 0, 0),
                           (ndslice, 'B', (1000, 1000), 1, 0)]),

This effectively shoves all of the storage responsibility back down to the HDF5 store. We end up pulling out the same blocks multiple times, but repeated disk access is inevitable on large complex problems.

This is automatic. Dask now includes an inline function that does this for you. You just give it a set of “fast” functions to ignore, e.g.

dsk2 = inline(dsk, [np.transpose, ndslice, add, mul, ...])

Scheduler

Now that we have a nice dask to crunch on, we run those tasks with multiple worker threads. This is the job of a scheduler.

We build and document such a scheduler here. It targets a shared-memory single-process multi-threaded environment. It replaces the elegant 20 line reference solution with a large blob of ugly code filled with locks and mutable state. Still, it manages the computation sanely, performs some critical optimizations, and uses all of our hardware cores (Moore’s law is dead! Long live Moore’s law!)

Many NumPy operations release the GIL and so are highly amenable to multi-threading. NumPy programs do not suffer the single-active-core-per-process limitation of most Python code.

Approach

We follow a fairly standard model. We create a ThreadPool with a fixed number of workers. We analyze the dask to determine “ready to run” tasks. We send a task to each of our worker threads. As they finish they update the run-state, marking jobs as finished, inserting their results into a shared cache, and then marking new jobs as ready based on the newly available data. This update process is fully indexed and handled by the worker threads themselves (with appropriate locks) making the overhead negligible and hopefully scalable to complex workloads.

When a newly available worker selects a new ready task they often have several to choose from. We have a choice. The choice we make here is very powerful. The next section talks about our selection policy:

Select tasks that immediately free data resources on completion

Optimizations

Our policy to prefer tasks that free resources lets us run many computations in a very small space. We now show three expressions, their resulting schedules, and an animation showing the scheduler walk through those schedules. These were taken from live runs.

Example: Embarrassingly parallel computation

On the right we show an animated GIF of the progress of the following embarrassingly parallel computation:

expr = (((A + 1) * 2) ** 3)

Circles represent computations, boxes represent data.

Red means actively taking up resources. Red is bad.

Red circles: tasks currently executing in a thread
Red boxes: data currently residing in the cache occupying precious memory

Blue means finished or released. Blue is good.

Blue circles: finished tasks
Blue boxes: data released from memory because we no longer need it for any task

We want to turn all nodes blue while minimizing the number of red boxes we have at any given time.

The policy to execute tasks that free resources causes “vertical” execution when possible. In this example our approach is optimal because the number of red boxes is kept small throughout the computation. We have one only red box for each of our four threads.

Example: More complex computation with Reductions

We show a more complex expression:

expr = (B - B.mean(axis=0)) + (B.T / B.std())

This extends the class of expressions that we’ve seen so far to reductions and reductions along an axis. The per-chunk reductions start at the bottom and depend only on the chunk from which they originated. These per-chunk results are then concatenated together and re-reduced with the large circles (zoom in to see the text concatenate in there.) The next level takes these (small) results and the original data again (note the long edges back down the bottom data resources) which result in per-chunk subtractions and divisions.

From there on out the work is embarrassing, resembling the computation above. In this case we have relatively little parallelism, so the frontier of red boxes covers the entire image; fortunately the dataset is small.

Example: Fail Case

We show a case where our greedy solution fails miserably:

expr = (A.T.dot(B) - B.mean(axis=0))

The two large functions on the second level from the bottom are the computation of the mean. These are cheap and, once completed, allow each of the blocks of the dot product to terminate quickly and release memory.

Tragically these mean computations execute at the last possible moment, keeping all of the dot product result trapped in the cache. We have almost a full row of red boxes at one point in the computation.

In this case our greedy solution was short sighted; a slightly more global solution would quickly select these large circles to run quickly. Perhaps betweenness centrality would resole this problem.

On-disk caching

We’ll never build a good enough scheduler. We need to be able to fall back to on-disk caching. Actually this isn’t terrible. High performance SSDs get close to 1 GB/second throughput and in the complex cases where data-aware scheduling fails we probably compute slower than that anyway.

I have a little disk-backed dictionary project, chest, for this but it’s immature. In general I’d like to see more projects implement the dict interface with interesting policies.

Trouble with binary data stores

I have a confession, the first computation, the very large dot product, sometimes crashes my machine. While then scheduler manages memory well I have a memory leak somewhere. I suspect that I use HDF5 improperly.

I also tried doing this with bcolz. Sadly nd-chunking is not well supported. email thread here and here.

Expression Scope

Blaze currently generates dasks for the following:

Elementwise operations (like +, *, exp, log, …)
Dimension shuffling like np.transpose
Tensor contraction like np.tensordot
Reductions like np.mean(..., axis=...)
All combinations of the above

We also comply with the NumPy API on these operations.. At the time of writing notable missing elements include the following:

Slicing (though this should be easy to add)
Solve, SVD, or any more complex linear algebra. There are good solutions to this implemented in other linear algebra software (Plasma, Flame, Elemental, …) but I’m not planning to go down that path until lots of people start asking for it.
Anything that NumPy can’t do.

I’d love to hear what’s important to the community. Re-implementing all of NumPy is hard, re-implementing a few choice pieces of NumPy is relatively straightforward. Knowing what those few choices pieces are requires community involvement.

Bigger ideas

My experience building dynamic schedulers is limited and my approach is likely suboptimal. It would be great to see other approaches. None of the logic in this post is specific to Blaze or even to NumPy. To build a scheduler you only need to understand the model of a graph of computations with data dependencies.

If we were ambitious we might consider a distributed scheduler to execute these task graphs across many nodes in a distributed-memory situation (like a cluster). This is a hard problem but it would open up a different class of computational solutions. The Blaze code to generate the dasks would not change ; the graphs that we generate are orthogonal to our choice of scheduler.

Help

I could use help with all of this, either as open source work (links below) or for money. Continuum has funding for employees and ample interesting problems.

Links:

Dask Working Notes - Posted in 2015

Dask is one year old

dask.array

dask.bag

dask.dataframe

dask.imperative

General thoughts on collections

Avoid Parallelism

Scheduling

Single Machine

Distributed Cluster

Distributed Prototype

Start up a Cluster

Pool

distributed.dask

dask.imperative

dask.array

Mix and Match

Discussion

Disclaimer

Caching

Caching in other contexts

Limited caching resources

LRU doesn’t fit analytic workloads

Proposed Caching Policy

Cachey

Example

More interesting example

Memoize everything

Cachey and Dask

Future Work

Custom Parallel Workflows

Custom Workloads

Direct Dask Graphs

Introducing dask.do

Example: Nested Cross Validation

Help

Write Complex Parallel Algorithms

Sophisticated Algorithms can be Complex

Example: Parallel, Stable, Out-of-Core SVD

Low Barrier to Entry

Randomized Parallel Out-of-Core SVD

Final Thoughts

Distributed Scheduling

Hand-made dask graphs

Distributed Scheduling

Example

Choose Your Scheduler

Known Limitations

Further Information

Thanks

State of Dask

Collections, graphs, and schedulers

Who uses dask?

What works and what doesn’t

Current work

More information

Towards Out-of-core DataFrames

On to dask frames

Naive Approach

Partition on the Index values

Re-indexing out-of-core data

Find divisions by external sorting

Possible Improvements

Shuffle

This works but my implementation is slow

Some Problems

Help

Towards Out-of-core ND-Arrays -- Dask + Toolz = Bag

Unstructured Data

Bag

Example

Related Work

Design

Execution

Productive Sweet Spot

Example

Comparison with Spark

Conclusion

Towards Out-of-core ND-Arrays -- Slicing and Stacking