---
blogpost: true
date: Dec 6, 2017
title: Dask Development Log
tags: Programming, Python, scipy, dask
---

_This work is supported by [Anaconda Inc](http://anaconda.com) and the
Data Driven Discovery Initiative from the [Moore
Foundation](https://www.moore.org/)_

To increase transparency I'm trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.

Current development in Dask and Dask-related projects includes the following
efforts:

1.  A possible change to our community communication model
2.  A rewrite of the distributed scheduler
3.  Kubernetes and Helm Charts for Dask
4.  Adaptive deployment fixes
5.  Continued support for NumPy and Pandas' growth
6.  Spectral Clustering in Dask-ML

### Community Communication

Dask community communication generally happens in GitHub issues for bug and
feature tracking, the Stack Overflow #dask tag for user questions, and an
infrequently used Gitter chat.

Separately, Dask developers who work for Anaconda Inc (there are about five of
us part-time) use an internal company chat and a closed weekly video meeting.
We're now trying to migrate away from closed systems when possible.

Details about future directions are in [dask/dask
#2945](https://github.com/dask/dask/issues/2945). Thoughts and comments on
that issue would be welcome.

### Scheduler Rewrite

When you start building clusters with 1000 workers, the distributed scheduler
can become a bottleneck on some workloads. After working with the PyPy and
Cython development teams, we've decided to rewrite parts of the scheduler to
make it more amenable to acceleration by those technologies. Note that no
actual acceleration has occurred yet, just a refactor of internal state.

Previously the distributed scheduler was built around a large set of Python
dictionaries, sets, and lists that indexed into each other heavily. This was
done both to keep the code low-tech and for performance reasons (Python core
data structures are fast). However, compiler technologies like PyPy and Cython
can optimize Python object access down to C speeds, so we're experimenting
with switching from these core data structures to plain Python objects to see
how much this helps.
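To make the contrast concrete, here is a minimal sketch of the two styles. It is only an illustration: the names (`TaskState`, `dependencies`, `dependents`, `task_state`, and the helper functions) are assumptions made up for this example, not the scheduler's actual internals.

```python
# Hypothetical illustration only; these names are not taken from dask/distributed.

# Style 1: parallel dictionaries keyed by task name that index into each other.
dependencies = {"x": set(), "y": {"x"}}
dependents = {"x": {"y"}, "y": set()}
task_state = {"x": "memory", "y": "waiting"}


def ready_tasks_dicts():
    """Return waiting tasks whose dependencies are all in memory."""
    return sorted(
        key
        for key, state in task_state.items()
        if state == "waiting"
        and all(task_state[dep] == "memory" for dep in dependencies[key])
    )


# Style 2: one object per task; relationships are direct object references.
class TaskState:
    # __slots__ gives fixed attributes, the kind of layout a compiler can optimize.
    __slots__ = ("key", "state", "dependencies", "dependents")

    def __init__(self, key, state="waiting"):
        self.key = key
        self.state = state
        self.dependencies = set()
        self.dependents = set()

    def add_dependency(self, other):
        """Record the relationship on both objects."""
        self.dependencies.add(other)
        other.dependents.add(self)


def ready_tasks_objects(tasks):
    """Same query as above, but following attributes instead of dict lookups."""
    return sorted(
        ts.key
        for ts in tasks
        if ts.state == "waiting"
        and all(dep.state == "memory" for dep in ts.dependencies)
    )


x = TaskState("x", state="memory")
y = TaskState("y")
y.add_dependency(x)
```

Both versions answer the same question. In the second, each lookup is an attribute access on a fixed-layout object rather than a hash lookup by key, which is the access pattern we hope PyPy and Cython can speed up.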

This change will be invisible operationally (the full test suite remains
virtually unchanged), but will be a significant change to the scheduler's
internal state. We're keeping around a compatibility layer, but people who
were building their own diagnostics around the internal state should check out
the new changes.

Ongoing work by Antoine Pitrou in [dask/distributed #1594](https://github.com/dask/distributed/pull/1594).

### Kubernetes and Helm Charts for Dask

In service of the [Pangeo](https://pangeo-data.github.io) project to enable
scalable data analysis of atmospheric and oceanographic data, we've been
improving the tooling around launching Dask on cloud infrastructure,
particularly leveraging Kubernetes.

To that end we're making some flexible Docker containers and Helm Charts for
Dask, and hope to combine them with JupyterHub in the coming weeks.

This work was done by me in the following repositories. Feedback would be very
welcome. I am learning Helm on the job here.

- [https://github.com/dask/dask-docker](https://github.com/dask/dask-docker)
- [https://github.com/dask/helm-chart](https://github.com/dask/helm-chart)

If you use Helm on Kubernetes then you might want to try the following:

```
helm repo add dask https://dask.github.io/helm-chart
helm repo update
helm install dask/dask
```

This installs a full Dask cluster and a Jupyter server. The Docker containers
contain entry points that allow their environments to be updated with custom
packages easily.

This work extends the earlier package,
[dask-kubernetes](https://github.com/dask/dask-kubernetes), but is slightly
more modular for use alongside other systems.

### Adaptive deployment fixes

Adaptive deployments, where a cluster manager scales a Dask cluster up or down
based on current workloads, recently got a makeover, including a number of bug
fixes around odd or infrequent behavior.
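For readers unfamiliar with the idea, the following is a rough sketch of the kind of decision an adaptive policy makes. It is not how `distributed` implements it: the function names, the `tasks_per_worker` heuristic, and the default limits are all assumptions invented for this example.

```python
import math


def target_workers(pending_tasks, tasks_per_worker=10, minimum=0, maximum=100):
    """Hypothetical heuristic: one worker per `tasks_per_worker` pending tasks,
    clamped to the range [minimum, maximum]."""
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    return max(minimum, min(maximum, wanted))


def scaling_action(current_workers, pending_tasks, **limits):
    """Return an (action, count) pair telling a cluster manager what to do."""
    target = target_workers(pending_tasks, **limits)
    if target > current_workers:
        return ("scale_up", target - current_workers)
    if target < current_workers:
        return ("scale_down", current_workers - target)
    return ("hold", 0)


# Example decisions under the assumed heuristic.
busy = scaling_action(current_workers=2, pending_tasks=95)
idle = scaling_action(current_workers=10, pending_tasks=0)
```

A real policy also has to handle the awkward cases that the fixes above address, such as workers that still hold data when they are asked to retire.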

Work done by [Russ Bubley](https://github.com/rbubley) here:

- [dask/distributed #1607](https://github.com/dask/distributed/pull/1607)
- [dask/distributed #1608](https://github.com/dask/distributed/pull/1608)
- [dask/distributed #1609](https://github.com/dask/distributed/pull/1609)

### Keeping up with NumPy and Pandas

NumPy 1.14 is due for release soon. Dask.array had to update how it handled
structured dtypes in [dask/dask #2964](https://github.com/dask/dask/pull/2964)
(work by Tom Augspurger).

Dask.dataframe is gaining the ability to merge/join simultaneously on columns
and indices, following a similar feature released in Pandas 0.22. Work done by
Jon Mease in [dask/dask #2960](https://github.com/dask/dask/pull/2960).

### Spectral Clustering in Dask-ML

Dask-ML recently added an approximate and scalable Spectral Clustering
algorithm in [dask/dask-ml #91](https://github.com/dask/dask-ml/pull/91)
([gallery example](http://dask-ml.readthedocs.io/en/latest/auto_examples/plot_spectral_clustering.html)).
