---
blogpost: true
date: Feb 20, 2017
title: Dask Development Log
tags: Programming, Python, scipy
---

_This work is supported by [Continuum Analytics](http://continuum.io),
the [XDATA Program](http://www.darpa.mil/program/XDATA),
and the Data Driven Discovery Initiative from the [Moore
Foundation](https://www.moore.org/)_

To increase transparency I'm blogging weekly(ish) about the work done on Dask
and related projects during the previous week. This log covers work done
between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This
blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

1.  Profiling experiments with Dask-GLM
2.  Subsequent graph optimizations, both non-linear fusion and avoiding
    repeatedly creating new graphs
3.  TensorFlow and Keras experiments
4.  XGBoost experiments
5.  Dask tutorial refactor
6.  Google Cloud Storage support
7.  Cleanup of Dask + SKLearn project

### Dask-GLM and iterative algorithms

Dask-GLM is currently just a bunch of solvers like Newton, Gradient Descent,
BFGS, Proximal Gradient Descent, and ADMM. These are useful in solving
problems like logistic regression, but also several others. The mathematical
side of this work is mostly done by [Chris White](https://github.com/moody-marlin/)
and [Hussain Sultan](https://github.com/hussainsultan) at Capital One.

We've also been using this project to see how Dask can scale out machine
learning algorithms. To this end we ran a few benchmarks here:
https://github.com/dask/dask-glm/issues/26 . These just generate and solve
some random problems, but at larger scales.

What we found is that some algorithms, like ADMM, perform beautifully, while
for others, like gradient descent, scheduler overhead can become a substantial
bottleneck at scale. This is mostly just because the actual in-memory
NumPy operations are so fast; any sluggishness on Dask's part becomes very
apparent. Here is a profile of gradient descent:

<iframe src="https://cdn.rawgit.com/mrocklin/e7bcb979e147102bf9ac428ed9074000/raw/d38234f3e9072816bea98d032f1e4f9e618242c3/task-stream-glm-gradient-descent.html"
        width="800" height="400"></iframe>

Notice all the white space. This is Dask figuring out what to do during
different iterations. We're now working to bring this down to make all of the
colored parts of this graph squeeze together better. This will result in
general overhead improvements throughout the project.

### Graph Optimizations - Aggressive Fusion

We're approaching this in two ways:

1.  More aggressively fuse tasks together so that there are fewer blocks for
    the scheduler to think about
2.  Avoid repeated work when generating very similar graphs

In the first case, Dask already does standard task fusion. For example, if you
have the following chain of tasks:

```python
x = f(w)
y = g(x)
z = h(y)
```

Dask (along with every other compiler-like project since the 1980s) already
turns this into the following:

```python
z = h(g(f(w)))
```
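As a minimal sketch of what linear fusion means on a dictionary-based task graph (this is not Dask's actual implementation; `fuse_linear`, `get`, and the toy functions `inc`, `double`, and `neg` are hypothetical names used only for illustration), a task is inlined into its single dependent whenever that dependent has no other dependencies:

```python
from collections import defaultdict


def inc(a):
    return a + 1


def double(a):
    return a * 2


def neg(a):
    return -a


def is_task(obj):
    """A task is a tuple whose first element is callable: (func, arg, ...)."""
    return isinstance(obj, tuple) and len(obj) > 0 and callable(obj[0])


def dependencies(dsk, task):
    """Keys of `dsk` referenced anywhere in `task`, nested tasks included."""
    if is_task(task):
        return [key for arg in task[1:] for key in dependencies(dsk, arg)]
    if isinstance(task, str) and task in dsk:
        return [task]
    return []


def substitute(task, key, replacement):
    """Replace every reference to `key` inside `task` with `replacement`."""
    if is_task(task):
        return (task[0],) + tuple(substitute(a, key, replacement) for a in task[1:])
    return replacement if isinstance(task, str) and task == key else task


def get(dsk, key):
    """Naive recursive evaluation of one key (no caching, no parallelism)."""
    def evaluate(task):
        if is_task(task):
            return task[0](*[evaluate(arg) for arg in task[1:]])
        if isinstance(task, str) and task in dsk:
            return evaluate(dsk[task])
        return task
    return evaluate(dsk[key])


def fuse_linear(dsk, keep):
    """Inline a task into its only dependent when that dependent depends on nothing else."""
    dsk = dict(dsk)
    changed = True
    while changed:
        changed = False
        # Rebuild the reverse edges after every fusion; simple rather than fast.
        dependents = defaultdict(list)
        for key, task in dsk.items():
            for dep in dependencies(dsk, task):
                dependents[dep].append(key)
        for key, users in dependents.items():
            if key in keep or len(users) != 1 or not is_task(dsk[key]):
                continue
            child = users[0]
            if dependencies(dsk, dsk[child]) != [key]:
                continue  # the child has other inputs, so the chain is not linear
            dsk[child] = substitute(dsk[child], key, dsk.pop(key))
            changed = True
            break
    return dsk


graph = {"w": 1, "x": (inc, "w"), "y": (double, "x"), "z": (neg, "y")}
fused = fuse_linear(graph, keep={"z"})
```

Here `fused` holds only the input `w` and one task for `z`, so a scheduler would track one task instead of three while computing the same value.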

What's tricky with a lot of these mathematical or optimization algorithms
though is that they are mostly, but not entirely linear. Consider the
following example:

```python
y = exp(x) - 1/x
```

Visualized as a node-link diagram, this graph looks like a diamond:

```
         o  exp(x) - 1/x
        / \
exp(x) o   o   1/x
        \ /
         o  x
```

Graphs like this generally don't get fused together because we _could_ compute
both `exp(x)` and `1/x` in parallel. However when we're bound by scheduling
overhead and when we have plenty of parallel work to do, we'd prefer to fuse
these into a single task, even though we lose some potential parallelism.
There is a tradeoff here and we'd like to be able to exchange some parallelism
(of which we have a lot) for less overhead.
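The tradeoff can be sketched on a toy dictionary graph. This is an illustration under assumptions, not the approach taken in Dask itself: `fuse_into`, `reciprocal`, and the key names are hypothetical, and the sketch only inlines one level of inputs into the top of the diamond.

```python
import math
from collections import defaultdict
from operator import sub


def reciprocal(a):
    return 1 / a


def is_task(obj):
    """A task is a tuple whose first element is callable: (func, arg, ...)."""
    return isinstance(obj, tuple) and len(obj) > 0 and callable(obj[0])


def dependencies(dsk, task):
    """Keys of `dsk` referenced anywhere in `task`, nested tasks included."""
    if is_task(task):
        return [key for arg in task[1:] for key in dependencies(dsk, arg)]
    if isinstance(task, str) and task in dsk:
        return [task]
    return []


def substitute(task, key, replacement):
    """Replace every reference to `key` inside `task` with `replacement`."""
    if is_task(task):
        return (task[0],) + tuple(substitute(a, key, replacement) for a in task[1:])
    return replacement if isinstance(task, str) and task == key else task


def get(dsk, key):
    """Naive recursive evaluation of one key."""
    def evaluate(task):
        if is_task(task):
            return task[0](*[evaluate(arg) for arg in task[1:]])
        if isinstance(task, str) and task in dsk:
            return evaluate(dsk[task])
        return task
    return evaluate(dsk[key])


def fuse_into(dsk, top):
    """Inline into `top` every input task that is used by `top` and nothing else."""
    dsk = dict(dsk)
    dependents = defaultdict(set)
    for key, task in dsk.items():
        for dep in dependencies(dsk, task):
            dependents[dep].add(key)
    for dep in set(dependencies(dsk, dsk[top])):
        if is_task(dsk[dep]) and dependents[dep] == {top}:
            dsk[top] = substitute(dsk[top], dep, dsk.pop(dep))
    return dsk


def count_tasks(dsk):
    return sum(1 for value in dsk.values() if is_task(value))


# The diamond from the diagram above: x feeds exp(x) and 1/x, which feed y.
diamond = {
    "x": 2.0,
    "exp": (math.exp, "x"),
    "inv": (reciprocal, "x"),
    "y": (sub, "exp", "inv"),
}
fused = fuse_into(diamond, "y")
```

Before fusion the scheduler sees three tasks, two of which could run in parallel; afterwards it sees one task that computes `exp(x)` and `1/x` back to back. With many such diamonds, one per chunk, there is still plenty of parallelism left across chunks.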

This work is in [dask/dask #1979](https://github.com/dask/dask/pull/1979) by [Erik
Welch](https://github.com/eriknw) (Erik has written and maintained most of
Dask's graph optimizations).

### Graph Optimizations - Structural Sharing

Additionally, we no longer make copies of graphs in dask.array. Every
collection like a dask.array or dask.dataframe holds onto a Python dictionary
containing all of the tasks that are needed to construct that collection. When we
perform an operation on a dask.array we get a new dask.array with a new
dictionary pointing to a new graph. The new graph generally has all of the
tasks of the old graph, plus a few more. As a result, we used to make frequent
copies of the underlying task graph.

```python
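import dask.array as da
x = da.ones(10, chunks=5)  # any dask.array works here
y = x + 1
assert set(y.dask).issuperset(x.dask)  # y's graph contains every key of x's graph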
```

Normally this doesn't matter (copying graphs is usually cheap) but it can
become very expensive for large arrays when you're doing many mathematical
operations.

Now we keep dask graphs in a custom mapping (dict-like object) that shares
subgraphs with other arrays. As a result, we rarely make unnecessary copies
and some algorithms incur far less overhead. Work done in
[dask/dask #1985](https://github.com/dask/dask/pull/1985).
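A rough sketch of the idea, assuming a read-only mapping that layers new tasks on top of existing dictionaries without copying them (`SharedGraph`, `extend`, and the key names are hypothetical and not the class used in the PR):

```python
from collections.abc import Mapping


def inc(a):
    return a + 1


class SharedGraph(Mapping):
    """Read-only view over several task dictionaries, none of which is copied."""

    def __init__(self, *layers):
        self.layers = layers  # newest layer first

    def __getitem__(self, key):
        for layer in self.layers:
            if key in layer:
                return layer[key]
        raise KeyError(key)

    def __iter__(self):
        seen = set()
        for layer in self.layers:
            for key in layer:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return sum(1 for _ in self)

    def extend(self, new_tasks):
        """Return a new graph that reuses the existing layers by reference."""
        return SharedGraph(new_tasks, *self.layers)


x_tasks = {"x-0": 1, "x-1": 2}
x_graph = SharedGraph(x_tasks)
y_graph = x_graph.extend({"y-0": (inc, "x-0"), "y-1": (inc, "x-1")})
```

Creating `y_graph` costs time proportional to the number of new tasks rather than the size of the whole graph, because the dictionary behind `x_graph` is referenced, not copied. The standard library's `collections.ChainMap` offers similar layering.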

### TensorFlow and Keras experiments

Two weeks ago I gave a talk with [Stan Seibert](https://github.com/seibert)
(Numba developer) on Deep Learning (Stan's bit) and Dask (my bit). As part of
that talk I decided to launch TensorFlow from Dask and feed TensorFlow from a
distributed Dask array. See [this
blogpost](/2017/02/11/dask-tensorflow) for
more information.

That experiment was nice in that it showed how easy it is to deploy and
interact with other distributed services from Dask. However, from a deep
learning perspective it was immature. Fortunately, it succeeded in attracting
the attention of other potential developers (the true goal of all blogposts)
and now [Brett Naul](https://github.com/bnaul) is using Dask to manage his GPU
workloads with Keras. Brett [contributed
code](https://github.com/dask/distributed/pull/878) to help Dask move around
Keras models. He seems to particularly value [Dask's ability to manage
resources](http://distributed.readthedocs.io/en/latest/resources.html) to help
him fully saturate the GPUs on his workstation.

### XGBoost experiments

After deploying TensorFlow we asked what it would take to do the same for
XGBoost, another very popular (though very different) machine learning library.
The conversation for that is here: [dmlc/xgboost
#2032](https://github.com/dmlc/xgboost/issues/2032) with prototype code here
[mrocklin/dask-xgboost](https://github.com/mrocklin/dask-xgboost). As with
TensorFlow, the integration is relatively straightforward (if perhaps a bit
simpler in this case). The challenge for me is that I have little concrete
experience with the applications that these libraries were designed to solve.
Feedback and collaboration from open source developers who use these libraries
in production is welcome.

### Dask tutorial refactor

The [dask/dask-tutorial](https://github.com/dask/dask-tutorial) project on
GitHub was originally written for PyData Seattle in July 2015 (roughly 19 months
ago). Dask has evolved substantially since then but this is still our only
educational material. Fortunately [Martin
Durant](http://martindurant.github.io/) is doing a [pretty
serious rewrite](https://github.com/dask/dask-tutorial/pull/29), both
correcting parts that no longer reflect the modern API and
adding new material around distributed computing and debugging.

### Google Cloud Storage

Dask developers (mostly Martin) maintain libraries to help Python users connect
to distributed file systems like HDFS (with
[hdfs3](http://hdfs3.readthedocs.io/en/latest/)), S3 (with
[s3fs](http://s3fs.readthedocs.io/en/latest/)), and Azure Data Lake (with
[adlfs](https://github.com/Azure/azure-data-lake-store-python)), which
subsequently become usable from Dask. Martin has been working on support for
Google Cloud Storage with [gcsfs](https://github.com/martindurant/gcsfs),
another small project that uses the same API.

### Cleanup of Dask+SKLearn project

Last year Jim Crist published
[three](http://jcrist.github.io/dask-sklearn-part-1.html)
[great](http://jcrist.github.io/dask-sklearn-part-2.html)
[blogposts](http://jcrist.github.io/dask-sklearn-part-3.html) about using Dask
with SKLearn. The result was a small library
[dask-learn](https://github.com/dask/dask-learn) that had a variety of
features, some incredibly useful, like a cluster-ready Pipeline and
GridSearchCV, others less so. Because of the experimental nature of this work
we had labeled the library "not ready for use", which drew some curious
responses from potential users.

Jim is now busy dusting off the project, removing less-useful parts and
generally reducing scope to strictly model-parallel algorithms.
