---
blogpost: true
date: Jan 18, 2017
title: Dask Development Log
tags: Programming, Python, scipy
---

_This work is supported by [Continuum Analytics](http://continuum.io)
the [XDATA Program](http://www.darpa.mil/program/XDATA)
and the Data Driven Discovery Initiative from the [Moore
Foundation](https://www.moore.org/)_

To increase transparency I'm blogging weekly about the work done on Dask and
related projects during the previous week. This log covers work done between
2017-01-01 and 2016-01-17. Nothing here is ready for production. This
blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

1.  Stability enhancements for the distributed scheduler and micro-release
2.  NASA Grant writing
3.  Dask-EC2 script
4.  Dataframe categorical flexibility (work in progress)
5.  Communication refactor (work in progress)

### Stability enhancements and micro-release

We've released dask.distributed version 1.15.1, which includes important
bugfixes after the recent 1.15.0 release. There were a number of small issues
that coordinated to remove tasks erroneously. This was generally OK
because the Dask scheduler was able to heal the missing pieces (using the
same machinery that makes Dask resilience) and so we didn't notice the flaw
until the system was deployed in some of the more serious Dask deployments in
the wild.
PR [dask/distributed #804](https://github.com/dask/distributed/pull/804)
contains a full writeup in case anyone is interested. The writeup ends with
the following line:

_This was a nice exercise in how coupling mostly-working components can easily
yield a faulty system._

This also adds other fixes, like a compatibility issue with the new Bokeh
0.12.4 release and others.

### NASA Grant Writing

I've been writing a proposal to NASA to help fund distributed Dask+XArray work
for atmospheric and oceanographic science at the 100TB scale. Many thanks to
our scientific collaborators who are offering support here.

### Dask-EC2 startup

The [Dask-EC2 project](https://github.com/dask/dask-ec2) deploys Anaconda, a
Dask cluster, and Jupyter notebooks on Amazon's Elastic Compute Cloud (EC2)
with a small command line interface:

```
pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
            --keypair /path/to/ssh-key \
            --type m4.2xlarge
            --count 8
```

This project can be either _very useful_ for people just getting started
and for Dask developers when we run benchmarks, or it can be _horribly broken_
if AWS or Dask interfaces change and we don't keep this project maintained.
Thanks to a great effort from [Ben Zaitlen](http://github.com/quasiben/)
`dask-ec2 is again in the _very useful_ state, where I'm hoping it will stay
for some time.

If you've always wanted to try Dask on a real cluster and if you already have
AWS credentials then this is probably the easiest way.

This already seems to be paying dividends. There have been a few unrelated
pull requests from new developers this week.

### Dataframe Categorical Flexibility

Categoricals can [significantly improve
performance](/2015/06/18/Categoricals) on
text-based data. Currently Dask's dataframes support categoricals, but they
expect to know all of the categories up-front. This is easy if this set is
small, like the `["Healthy", "Sick"]` categories that might arise in medical
research, but requires a full dataset read if the categories are not known
ahead of time, like the names of all of the patients.

[Jim Crist](http://jcrist.github.io/) is changing this so that Dask can
operates on categorical columns with unknown categories at [dask/dask
#1877](https://github.com/dask/dask/pull/1877). The constituent pandas
dataframes all have possibly different categories that are merged as necessary.
This distinction may seem small, but it limits performance in a surprising
number of real-world use cases.

### Communication Refactor

Since the recent worker refactor and optimizations it has become clear that
inter-worker communication has become a dominant bottleneck in some intensive
applications. [Antoine Pitrou](http://github.com/pitrou) is currently
[refactoring Dask's network communication layer](https://github.com/dask/distributed/pull/810),
making room for more communication options in the future. This is an ambitious
project. I for one am very happy to have someone like Antoine looking into
this.
