---
blogpost: true
date: Sep 17, 2018
title: Dask Development Log
tags: Programming, Python, scipy, dask
---

_This work is supported by [Anaconda Inc](http://anaconda.com)_

To increase transparency I'm trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.

Since the last update in the [0.19.0 release blogpost](/2018/09/05/dask-0.19.0) two weeks ago we've seen activity in the following areas:

1.  Update Dask examples to use JupyterLab on Binder
2.  Render Dask examples into static HTML pages for easier viewing
3.  Consolidate and unify disparate documentation
4.  Retire the [hdfs3 library](https://hdfs3.readthedocs.io/en/latest/) in favor of the solution in [Apache Arrow](https://arrow.apache.org/docs/python/filesystems.html).
5.  Continue work on hyper-parameter selection for incrementally trained models
6.  Publish two small bugfix releases
7.  Blogpost from the Pangeo community about combining Binder with Dask
8.  Skein/Yarn Update

## 1: Update Dask Examples to use JupyterLab extension

The new [dask-labextension](https://github.com/dask/dask-labextension) embeds
Dask's dashboard plots into a JupyterLab session so that you can get easy
access to information about your computations from Jupyter directly. This was
released a few weeks ago as part of the previous release post.

However since then we've hooked this up to our live examples system that lets
users try out Dask on a small cloud instance using
[mybinder.org](https://mybinder.org). If you want to try out Dask and
JupyterLab together then head here:

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=lab)

Thanks to [Ian Rose](https://github.com/ian-r-rose) for managing this.

## 2: Render Dask Examples as static documentation

Using the [nbsphinx](https://nbsphinx.readthedocs.io/en/0.3.5/) Sphinx
extension to automatically run and render Jupyter Notebooks we've turned our
live examples repository into static documentation for easy viewing.

These examples are currently available at
[https://dask.org/dask-examples/](https://dask.org/dask-examples/) but will
soon be available at [examples.dask.org](https://dask.org/dask-examples/) and
from the navbar at all dask pages.

Thanks to [Tom Augspurger](https://tomaugspurger.github.io/) for putting this
together.

## 3: Consolidate documentation under a single org and style

Dask documentation is currently spread out in many small hosted sites, each
associated to a particular subpackage like dask-ml, dask-kubernetes,
dask-distributed, etc.. This eases development (developers are encouraged to
modify documentation as they modify code) but results in a fragmented
experience because users don't know how to discover and efficiently explore our
full documentation.

To resolve this we're doing two things:

1.  Moving all sites under the dask.org domain

    Anaconda Inc, the company that employs several of the Dask developers
    (myself included) recently donated the domain [dask.org](http://dask.org)
    to NumFOCUS. We've been slowly moving over all of our independent sites to
    use that location for our documentation.

2.  Develop a uniform Sphinx theme [dask-sphinx-theme](http://github.com/dask/dask-sphinx-theme)

    This has both uniform styling and also includes a navbar that gets
    automatically shared between the projects. The navbar makes it easy to
    discover and explore content and is something that we can keep up-to-date
    in a single repository.

You can see how this works by going to any of the Dask sites, like
[docs.dask.org](http://docs.dask.org/en/latest/docs.html).

Thanks to [Tom Augspurger](https://tomaugspurger.github.io/) for managing this
work and [Andy Terrel](http://andy.terrel.us/) for patiently handling things on
the NumFOCUS side and domain name side.

## 4: Retire the hdfs3 library

For years the Dask community has maintained the
[hdfs3](https://hdfs3.readthedocs.io/en/latest/) library that allows for native
access to the Hadoop file system from Python. This used Pivotal's libhdfs3
library written in C++ and was, for a long while the only performant way to
maturely manipulate HDFS from Python.

Since then though PyArrow has developed efficient bindings to the standard
libhdfs library and exposed it through their Pythonic [file system
interface](https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs),
which is fortunately Dask-compatible.

We've been telling people to use the Arrow solution for a while now and thought
we'd now do so officially
(see [dask/hdfs3 #170](https://github.com/dask/hdfs3/pull/170)). As of the
last bugfix release Dask will use Arrow by default and, while the `hdfs3`
library is still available, Dask maintainers probably won't spend much time on
it in the future.

Thanks to [Martin Durant](https://hdfs3.readthedocs.io/en/latest/) for building
and maintaining HDFS3 over all this time.

## 5: Hyper-parameter selection for incrementally trained models

In Dask-ML we continue to work on hyper-parameter selection for models that
implement the `partial_fit` API. We've built algorithms and infrastructure to
handle this well, and are currently fine tuning API, parameter names, etc..

If you have any interest in this process, come on over to [dask/dask-ml #356](https://github.com/dask/dask-ml/pull/356).

Thanks to [Tom Augspurger](https://tomaugspurger.github.io/) and [Scott
Sievert](https://stsievert.com/) for this work.

## 6: Two small bugfix releases

We've been trying to increase the frequency of bugfix releases while things are
stable. Since our last writing there have been two minor bugfix releases. You
can read more about them here:

- [dask/dask](https://github.com/dask/dask/blob/master/docs/source/changelog.rst)
- [dask/distributed](https://github.com/dask/distributed/blob/master/docs/source/changelog.rst)

## 7: Binder + Dask

The Pangeo community has done work to integrate Binder with Dask and has
written about the process here: [Pangeo meets Binder](https://medium.com/pangeo/pangeo-meets-binder-2ea923feb34f)

Thanks to [Joe Hamman](http://joehamman.com/) for this work and the blogpost.

## 8: Skein/Yarn Update

The Dask-Yarn connection to deploy Dask on Hadoop clusters uses a library
[Skein](https://jcrist.github.io/skein/) to easily manage Yarn jobs from
Python.

Skein has seen a lot of activity over the last few weeks, including the
following:

1.  A Web UI for the project. See [jcrist/skein #68](https://github.com/jcrist/skein/pull/68)
2.  A Tensorflow on Yarn project from Criteo that uses Skein. See
    [github.com/criteo/tf-yarn](https://github.com/criteo/tf-yarn)

This work is mostly managed by [Jim Crist](http://jcrist.github.io/) and other
Skein contributors.