---
blogpost: true
date: Dec 18, 2016
title: Dask Development Log
tags: Programming, Python, scipy
---

_This work is supported by [Continuum Analytics](http://continuum.io)
the [XDATA Program](http://www.darpa.mil/program/XDATA)
and the Data Driven Discovery Initiative from the [Moore
Foundation](https://www.moore.org/)_

To increase transparency I'm blogging weekly about the work done on Dask and
related projects during the previous week. This log covers work done between
2016-12-11 and 2016-12-18. Nothing here is ready for production. This
blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

1.  Benchmarking new scheduler and worker on larger systems
2.  Kubernetes and Google Container Engine
3.  Fastparquet on S3

### Rewriting Load Balancing

In the last two weeks we rewrote a significant fraction of the worker and
scheduler. This enables future growth, but also resulted in a loss of our load
balancing and work stealing algorithms (the old one no longer made sense in the
context of the new system.) Careful dynamic load balancing is essential to
running atypical workloads (which are surprisingly typical among Dask users) so
rebuilding this has been all-consuming this week for me personally.

Briefly, Dask initially assigns tasks to workers taking into account the
expected runtime of the task, the size and location of the data that the task
needs, the duration of other tasks on every worker, and where each piece of data
sits on all of the workers. Because the number of tasks can grow into the
millions and the number of workers can grow into the thousands, Dask needs to
figure out a near-optimal placement in near-constant time, which is hard.
Furthermore, after the system runs for a while, uncertainties in our estimates
build, and we need to rebalance work from saturated workers to idle workers
relatively frequently. Load balancing intelligently and responsively is
essential to a satisfying user experience.

We have a decently strong test suite around these behaviors, but it's hard to
be comprehensive on performance-based metrics like this, so there has also been
a lot of benchmarking against real systems to identify new failure modes.
We're doing what we can to create isolated tests for every failure mode that we
find to make future rewrites retain good behavior.

Generally working on the Dask distributed scheduler has taught me the
brittleness of unit tests. As we have repeatedly rewritten internals while
maintaining the same external API our testing strategy has evolved considerably
away from fine-grained unit tests to a mixture of behavioral integration tests
and a very strict runtime validation system.

Rebuilding the load balancing algorithms has been high priority for me
personally because these performance issues inhibit current power-users from
using the development version on their problems as effectively as with the
latest release. I'm looking forward to seeing load-balancing humming nicely
again so that users can return to git-master and so that I can return to
handling a broader base of issues. (Sorry to everyone I've been ignoring the
last couple of weeks).

### Test deployments on Google Container Engine

I've personally started switching over my development cluster from Amazon's EC2
to Google's Container Engine. Here are some pro's and con's from my particular
perspective. Many of these probably have more to do with how I use each
particular tool rather than intrinsic limitations of the service itself.

In Google's Favor

1.  Native and immediate support for Kubernetes and Docker, the combination of
    which allows me to more quickly and dynamically create and scale clusters
    for different experiments.
2.  Dynamic scaling from a single node to a hundred nodes and back ten minutes
    later allows me to more easily run a much larger range of scales.
3.  I like being charged by the minute rather than by the hour, especially
    given the ability to dynamically scale up
4.  Authentication and billing feel simpler

In Amazon's Favor

1.  I already have tools to launch Dask on EC2
2.  All of my data is on Amazon's S3
3.  I have nice data acquisition tools,
    [s3fs](http://s3fs.readthedocs.io/en/latest/), for S3 based on boto3.
    Google doesn't seem to have a nice Python 3 library for accessing Google
    Cloud Storage :(

I'm working from Olivier Grisel's repository
[docker-distributed](https://github.com/ogrisel/docker-distributed) although
updating to newer versions and trying to use as few modifications from naive
deployment as possible. My current branch is
[here](https://github.com/mrocklin/docker-distributed/tree/update). I hope to
have something more stable for next week.

### Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on
Friday. I was surprised that everything seemed to work out of the box. Martin
Durant, who built both fastparquet and s3fs has done some nice work to make
sure that all of the pieces play nicely together. We ran into some performance
issues pulling bytes from S3 itself. I expect that there will be some tweaking
over the next few weeks.
