---
blogpost: true
date: Feb 12, 2018
title: Dask Release 0.17.0
tags: Programming, Python, scipy, dask
---

_This work is supported by [Anaconda Inc.](http://anaconda.com)
and the Data Driven Discovery Initiative from the [Moore
Foundation](https://www.moore.org/)._

I'm pleased to announce the release of Dask version 0.17.0. This a significant
major release with new features, breaking changes, and stability improvements.
This blogpost outlines notable changes since the 0.16.0 release on November
21st.

You can conda install Dask:

    conda install dask -c conda-forge

or pip install from PyPI:

    pip install dask[complete] --upgrade

Full changelogs are available here:

- [dask/dask](https://github.com/dask/dask/blob/master/docs/source/changelog.rst)
- [dask/distributed](https://github.com/dask/distributed/blob/master/docs/source/changelog.rst)

Some notable changes follow.

### Deprecations

- Removed `dask.dataframe.rolling_*` methods, which were previously deprecated both in dask.dataframe and in pandas. These are replaced with the `rolling.*` namespace
- We've generally stopped maintenance of the `dask-ec2` project to launch dask clusters on Amazon's EC2 using Salt. We generally recommend kubernetes instead both for Amazon's EC2, and for Google and Azure as well

  [dask.pydata.org/en/latest/setup/kubernetes.html](http://dask.pydata.org/en/latest/setup/kubernetes.html)

- Internal state of the distributed scheduler has changed significantly. This may affect advanced users who were inspecting this state for debugging or diagnostics.

### Task Ordering

As Dask encounters more complex problems from more domains
we continually run into problems where its current heuristics do not perform optimally.
This release includes a rewrite of our static task prioritization heuristics.
This will improve Dask's ability to traverse complex computations
in a way that keeps memory use low.

To aid debugging we also integrated these heuristics into the GraphViz-style plots
that come from the visualize method.

```python
x = da.random.random(...)
...
x.visualize(color='order', cmap='RdBu')
```

<a href="https://user-images.githubusercontent.com/306380/35012109-86df75fa-fad6-11e7-9fa8-a43a697a4a17.png">
  <img src="https://user-images.githubusercontent.com/306380/35012109-86df75fa-fad6-11e7-9fa8-a43a697a4a17.png"
     width="80%"
     align="center"></a>

- [dask/dask #3066](https://github.com/dask/dask/pull/3066)
- [dask/dask #3057](https://github.com/dask/dask/pull/3057)

### Nested Joblib

Dask supports parallelizing Scikit-Learn
by extending Scikit-Learn's underlying library for parallelism,
[Joblib](http://tomaugspurger.github.io/distributed-joblib.html).
This allows Dask to distribute _some_ SKLearn algorithms across a cluster
just by wrapping them with a context manager.

This relationship has been strengthened,
and particular attention has been focused
when nesting one parallel computation within another,
such as occurs when you train a parallel estimator, like `RandomForest`,
within another parallel computation, like `GridSearchCV`.
Previously this would result in spawning too many threads/processes
and generally oversubscribing hardware.

Due to recent combined development within both Joblib and Dask,
these sorts of situations can now be resolved efficiently by handing them off to Dask,
providing speedups even in single-machine cases:

```python
from sklearn.externals import joblib
import distributed.joblib  # register the dask joblib backend

from dask.distributed import Client
client = Client()

est = ParallelEstimator()
gs = GridSearchCV(est)

with joblib.parallel_backend('dask'):
    gs.fit()
```

See Tom Augspurger's recent post with more details about this work:

- [http://tomaugspurger.github.io/distributed-joblib.html](http://tomaugspurger.github.io/distributed-joblib.html)
- [joblib/joblib #595](https://github.com/joblib/joblib/pull/595)
- [dask/distributed #1705](https://github.com/dask/distributed/pull/1705)
- [joblib/joblib #613](https://github.com/joblib/joblib/pull/613)

Thanks to [Tom Augspurger](https://github.com/TomAugspurger),
[Jim Crist](https://github.com/jcrist), and
[Olivier Grisel](https://github.com/ogrisel) who did most of this work.

### Scheduler Internal Refactor

The distributed scheduler has been significantly refactored to change it from a forest of dictionaries:

```python
priority = {'a': 1, 'b': 2, 'c': 3}
dependencies = {'a': {'b'}, 'b': {'c'}, 'c': []}
nbytes = {'a': 1000, 'b': 1000, 'c': 28}
```

To a bunch of objects:

```python
tasks = {'a': Task('a', priority=1, nbytes=1000, dependencies=...),
         'b': Task('b': priority=2, nbytes=1000, dependencies=...),
         'c': Task('c': priority=3, nbytes=28, dependencies=[])}
```

(there is _much_ more state than what is listed above,
but hopefully the examples above are clear.)

There were a few motivations for this:

1.  We wanted to try out Cython and PyPy, for which objects like this might be more effective than dictionaries.
2.  We believe that this is probably a bit easier for developers new to the schedulers to understand. The proliferation of state dictionaries was not highly discoverable.

Goal one ended up not working out.
We have not yet been able to make the scheduler significantly faster under Cython or PyPy with this new layout. There is even a slight memory increase with these changes.
However we have been happy with the results in code readability, and we hope that others find this useful as well.

Thanks to [Antoine Pitrou](https://github.com/pitrou),
who did most of the work here.

### User Priorities

You can now submit tasks with different priorities.

```python
x = client.submit(f, 1, priority=10)   # Higher priority preferred
y = client.submit(f, 1, priority=-10)  # Lower priority happens later
```

To be clear, Dask has always had priorities, they just weren't easily user-settable.
Higher priorities are given precedence. The default priority for all tasks is zero.
You can also submit priorities for collections (like arrays and dataframes)

```python
df = df.persist(priority=5)  # give this computation higher priority.
```

- [dask/distributed #1651](https://github.com/dask/distributed/pull/1651)

## Related projects

Several related projects are also undergoing releases:

- Tornado is updating to version 5.0 (there is a beta out now).
  This is a major change that will put Tornado on the Asyncio event loop in Python 3.
  It also includes many performance enhancements for high-bandwidth networks.
- Bokeh 0.12.14 was just released.

  Note that you will need to update Dask to work with this version of Bokeh

- [Daskernetes](http://daskernetes.readthedocs.io/en/latest/), a new project for launching Dask on Kubernetes clusters

## Acknowledgements

The following people contributed to the dask/dask repository since the 0.16.0
release on November 14th:

- Albert DeFusco
- Apostolos Vlachopoulos
- castalheiro
- James Bourbeau
- Jon Mease
- Ian Hopkinson
- Jakub Nowacki
- Jim Crist
- John A Kirkham
- Joseph Lin
- Keisuke Fujii
- Martijn Arts
- Martin Durant
- Matthew Rocklin
- Markus Gonser
- Nir
- Rich Signell
- Roman Yurchak
- S. Andrew Sheppard
- sephib
- Stephan Hoyer
- Tom Augspurger
- Uwe L. Korn
- Wei Ji
- Xander Johnson

The following people contributed to the dask/distributed repository since the
1.20.0 release on November 14th:

- Alexander Ford
- Antoine Pitrou
- Brett Naul
- Brian Broll
- Bruce Merry
- Cornelius Riemenschneider
- Daniel Li
- Jim Crist
- Kelvin Yang
- Matthew Rocklin
- Min RK
- rqx
- Russ Bubley
- Scott Sievert
- Tom Augspurger
- Xander Johnson
