---
blogpost: true
date: Dec 21, 2015
title: Dask is one year old
tags: Programming, dask
---

_This work is supported by [Continuum Analytics](http://continuum.io)
and the [XDATA Program](http://www.darpa.mil/program/XDATA)
as part of the [Blaze Project](http://blaze.pydata.org)_

**tl;dr: Dask turned one yesterday. We discuss successes and failures.**

Dask began one year ago yesterday with the [following
commit](https://github.com/blaze/dask/commit/05488db498c1561d266c7b676b8a89021c03a9e7)
(with slight edits here for clarity's sake).

```python
def istask(x):
    return isinstance(x, tuple) and x and callable(x[0])


def get(d, key):
    v = d[key]
    if istask(v):
        func, args = v[0], v[1:]
        return func(*[get(d, arg) for arg in args])
    else:
        return v

# ... (and around 50 lines of tests)
```

_this is a very inefficient scheduler_
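
To see the scheduler in action, and why it is inefficient, here is a minimal sketch that runs it on a tiny graph. The graph and the helper functions `inc` and `add` are made up for illustration. Note that in this first version every task argument has to be a key in the graph.

```python
def istask(x):
    return isinstance(x, tuple) and x and callable(x[0])


def get(d, key):
    v = d[key]
    if istask(v):
        func, args = v[0], v[1:]
        return func(*[get(d, arg) for arg in args])
    else:
        return v


calls = []  # records every call to inc, to expose repeated work


def inc(x):
    calls.append(x)
    return x + 1


def add(x, y):
    return x + y


# Hypothetical example graph: each key maps to data
# or to a task of the form (function, *argument_keys)
graph = {
    "x": 1,
    "y": (inc, "x"),       # y = inc(x)
    "z": (add, "y", "y"),  # z = y + y, so y is needed twice
}

result = get(graph, "z")
print(result)      # 4
print(len(calls))  # 2: inc ran twice because nothing is cached
```

The recursion recomputes `y` every time it is requested, never runs anything in parallel, and keeps no record of what it has already done.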

Since then dask has matured, expanded to new domains, gathered [excellent
developers](https://raw.githubusercontent.com/blaze/dask/master/AUTHORS.md),
and spawned other open source projects. I thought it'd be a good time to look
back on what worked, what didn't, and what we should work on in the future.

## Collections

Most users experience dask through the high-level collections of
`dask.array/bag/dataframe/imperative`. Each of these evolves as a project of
its own, with a different user group and a different level of maturity.

### dask.array

The parallel larger-than-memory array module dask.array has seen the most
success of the dask components. It is the oldest, most mature, and most
sophisticated subproject. Much of dask.array's use comes from downstream
projects, notably [xray](http://xray.readthedocs.org/en/stable/) which seems to
have taken off in climate science. Dask.array also sees a fair amount of use
in imaging, genomics, and numerical algorithms research.

People that I don't know now use dask.array to do scientific research. From my
perspective that's mission accomplished.

There are still tweaks to make to algorithms, particularly as we scale out to
distributed systems (see far below).

### dask.bag

Dask.bag started out as a weekend project and didn't evolve much beyond that.
Fortunately there wasn't much to do, and this submodule probably has the
highest value/effort ratio.

Bag doesn't get as much attention as its older sibling array though. It's
handy but not as widely used, and so not as robust.

### dask.dataframe

Dataframe is an interesting case: it's pretty sophisticated and pretty
mature, yet it also probably generates the most user frustration.

Dask.dataframe gains a lot of value from Pandas, both by using it under the
hood (one dask DataFrame is many pandas DataFrames) and by copying its API
(Pandas users can use dask.dataframe without learning a new API). However,
because dask.dataframe only implements a core subset of Pandas, users end up
tripping up on the missing functionality.

This can be decomposed into two issues:

1. It's not clear that there exists a core subset of Pandas that would handle most
   use cases. Users touch many diffuse parts of Pandas in a single workflow.
   What one user considers core another user considers fringe. It's not clear
   how to agree on a sufficient subset to implement.
2. Once you implement this subset (and we've done our best) it's hard to
   convey expectations to the user about what is and is not available.

That being said, dask.dataframe is pretty solid. It's very fast, expressive,
and handles common use cases well. It probably generates the most
Stack Overflow questions. This signals both confusion and active use.

Special thanks here go out to Jeff Reback for making Pandas release the GIL,
and to Masaaki Horikoshi (@sinhrks) for greatly improving the maturity of
dask.dataframe.

### dask.imperative

Also known as `dask.do`, this little backend remains one of the most powerful
and one of the least used (outside of myself). We should rethink the API here
and improve learning materials.

### General thoughts on collections

_Warning: this section is pretty subjective_

Big data collections are cool but perhaps less useful than people expect.
Parallel applications are often more complex than can be easily described by a
big array or a big dataframe. Many real-world parallel computations end up
being more particular in their parallelism needs. That's not to say that the
array and dataframe abstractions aren't central to parallel computing; we
just should not restrict ourselves to them. The world is more complex.

However, it's reasonable to break this "world is complex" rule within
particular domains. NDArrays seem to work well in climate science.
Specialized large dataframes like Dato's SFrame seem to be effective for a
particular class of machine learning algorithms. The SQL table is inarguably
an effective abstraction in business intelligence. Large collections are
useful in specific contexts, but they are perhaps the focus of too much
attention. The big dataframe in particular is over-hyped.

Most of the really novel and impressive work I've seen with dask has been done
either with custom graphs or with the dask.imperative API. I think we should
consider APIs that enable users to more easily express custom algorithms.
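
As a rough illustration of what such an API could look like, the sketch below wraps ordinary functions so that calling them records tasks in a graph instead of running them. This is not the `dask.do` API: the `lazy` decorator, the `Lazy` class, and the example functions are all hypothetical names invented for this sketch, and the executor is a simple memoized recursion.

```python
import itertools

_counter = itertools.count()  # source of unique key suffixes


class Lazy:
    """A key plus the graph needed to compute it (hypothetical helper)."""

    def __init__(self, key, graph):
        self.key = key
        self.graph = graph

    def compute(self):
        cache = {}  # each task runs at most once

        def run(key):
            if key not in cache:
                task = self.graph[key]
                func, arg_keys = task[0], task[1:]
                cache[key] = func(*[run(k) for k in arg_keys])
            return cache[key]

        return run(self.key)


def lazy(func):
    """Make func build a task graph when called instead of executing."""

    def wrapper(*args):
        graph = {}
        arg_keys = []
        for arg in args:
            if isinstance(arg, Lazy):
                graph.update(arg.graph)  # merge the upstream graph
                arg_keys.append(arg.key)
            else:
                # Store a plain value as a zero-argument task
                key = "literal-%d" % next(_counter)
                graph[key] = ((lambda value=arg: value),)
                arg_keys.append(key)
        key = "%s-%d" % (func.__name__, next(_counter))
        graph[key] = (func,) + tuple(arg_keys)
        return Lazy(key, graph)

    return wrapper


@lazy
def load(n):
    return list(range(n))


@lazy
def clean(data):
    return [x for x in data if x % 2 == 0]


@lazy
def total(data):
    return sum(data)


@lazy
def ratio(a, b):
    return a / b


raw = load(10)
evens = clean(raw)
answer = ratio(total(evens), total(raw))  # nothing has run yet
value = answer.compute()                  # 20 / 45
```

The computation here is neither a big array nor a big dataframe. It is just a handful of function calls with shared inputs, which is the kind of shape that custom graphs handle naturally.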

## Avoid Parallelism

When giving talks on parallelism I've started to give a brief "avoid
parallelism" section. From the problems I see on Stack Overflow and from
general interactions, when people run into performance challenges their first
solution seems to be to parallelize. This is sub-optimal. It's often far
cheaper to improve storage formats, use better algorithms, or use C/Numba
accelerated code than it is to parallelize. Unfortunately storage formats
and C aren't as sexy as big data parallelism, so they're not in the forefront
of people's minds. We should change this.

I'll proudly buy a beer for anyone that helps to make storage formats a sexier
topic.

## Scheduling

### Single Machine

The single machine dynamic task scheduler is very very solid. It has roughly
two objectives:

1. Use all the cores of a machine
2. Choose tasks that allow the release of intermediate results

This is what allows us to quickly execute complex workflows in small space.
This scheduler underlies all execution within dask. I'm very happy with it. I
would like to find ways to expose it more broadly to other libraries.
Suggestions are very welcome here.

We still run into cases where it doesn't perform optimally
(see [issue 874](https://github.com/blaze/dask/issues/874)),
but so far we've always been able to enhance the scheduler whenever these cases
arise.
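
The second objective above, releasing intermediate results, can be sketched with simple reference counting. This is a toy under stated assumptions, not dask's scheduler: it runs sequentially (so it ignores the first objective), it walks the graph depth-first, and the function `run_and_release` and the example graph are hypothetical.

```python
def run_and_release(graph, targets):
    """Run tasks in dependency order, dropping results nobody still needs.

    Every value in graph is a task of the form (function, *dependency_keys).
    Returns the target results and the peak number of results held at once.
    """
    # Count how many task arguments refer to each key
    waiting = {key: 0 for key in graph}
    for task in graph.values():
        for dep in task[1:]:
            waiting[dep] += 1

    results = {}
    peak = 0

    def execute(key):
        nonlocal peak
        if key in results:
            return
        func, deps = graph[key][0], graph[key][1:]
        for dep in deps:
            execute(dep)
        results[key] = func(*[results[dep] for dep in deps])
        peak = max(peak, len(results))
        # Release each input once its last consumer has finished
        for dep in deps:
            waiting[dep] -= 1
            if waiting[dep] == 0 and dep not in targets:
                del results[dep]

    for target in targets:
        execute(target)
    return {t: results[t] for t in targets}, peak


def one():
    return 1


def inc(x):
    return x + 1


# A chain of four tasks: a -> b -> c -> d
chain = {
    "a": (one,),
    "b": (inc, "a"),
    "c": (inc, "b"),
    "d": (inc, "c"),
}

out, peak = run_and_release(chain, ["d"])
print(out)   # {'d': 4}
print(peak)  # 2: at most one input and one output are held at a time
```

Without the release step all four results would stay in memory until the end. On a chain that difference is trivial, but on a graph with thousands of large chunks it determines whether a computation fits in memory.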

### Distributed Cluster

Over the last few months we've been working on another scheduler for
distributed memory computation. It should be a nice extension to the existing
dask collections out to "big data" systems. It's experimental but usable now
with documentation at the following links:

- <http://distributed.readthedocs.org/en/latest/>
- <https://github.com/blaze/distributed>
- <http://matthewrocklin.com/distributed-by-example/>

Feedback is welcome. I recommend waiting for a month or two if you prefer
clean and reliable software. It will undergo a name-change to something less
generic.
