---
blogpost: true
date: Jul 17, 2018
title: Dask Development Log, Scipy 2018
tags: Programming, Python, scipy, dask
---

_This work is supported by [Anaconda Inc](http://anaconda.com)_

To increase transparency I'm trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.

Last week many Dask developers gathered for the annual SciPy 2018 conference.
As a result, very little work was completed, but many projects were started or
discussed. To reflect this change in activity this blogpost will highlight
possible changes and opportunities for readers to further engage in
development.

## Dask on HPC Machines

The [dask-jobqueue](https://dask-jobqueue.readthedocs.io/) project was a hit at
the conference. Dask-jobqueue helps people launch Dask on traditional job
schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly
found on high performance computers. These schedulers are _very common_ among
scientific, research, and high performance machine learning groups but are
often a bit hard to use with anything other than MPI.

This project came up in the [Pangeo talk](https://youtu.be/2rgD5AJsAbE),
lightning talks, and the Dask Birds of a Feather session.

During sprints a number of people came up and we went through the process of
configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This
process usually takes around fifteen minutes and will likely be the subject of
a future blogpost. We published known-good configurations for these clusters
in our [configuration documentation](http://dask-jobqueue.readthedocs.io/en/latest/configurations.html).
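
For readers who have not seen dask-jobqueue before, a session looks roughly like the sketch below. The queue name and resource numbers are placeholders made up for illustration, and `start_pbs_cluster` is a hypothetical helper, not part of the library. The known-good configurations linked above are a better starting point for a real machine.

```python
# Hypothetical resource numbers; adjust them to your machine's nodes and queues.
WORKER_SPEC = {
    "queue": "regular",      # assumed queue name
    "cores": 36,             # cores requested per batch job
    "memory": "100GB",       # memory requested per batch job
    "walltime": "01:00:00",  # wall-clock limit per batch job
}


def start_pbs_cluster(n_workers=10, **overrides):
    """Start Dask workers through PBS and return a connected client."""
    # Imported lazily so that this file still loads on a laptop
    # that has neither dask-jobqueue nor a job scheduler installed.
    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    spec = {**WORKER_SPEC, **overrides}
    cluster = PBSCluster(**spec)
    cluster.scale(n_workers)  # submits batch jobs until n_workers are requested
    return Client(cluster)
```

On a login node, `client = start_pbs_cluster(20, queue="premium")` would then give you a normal Dask client once the scheduler grants the jobs. The other scheduler classes, such as `SLURMCluster` and `SGECluster`, follow the same pattern.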

Additionally, there is a [JupyterHub
issue](https://github.com/jupyterhub/batchspawner/issues/101) to improve
documentation on best practices to deploy JupyterHub on these machines. The
community has done this well a few times now, and it might be time to write up
something for everyone else.

### Get involved

If you have access to a supercomputer then please try things out. There is a
30-minute YouTube screencast in the
[dask-jobqueue](https://dask-jobqueue.readthedocs.io/) documentation that should
help you get started.

If you are an administrator on a supercomputer you might consider helping to
build a configuration file and place it in `/etc/dask` for your users. You
might also want to get involved in the [JupyterHub on
HPC](https://github.com/jupyterhub/batchspawner/issues/101)
conversation.

## Dask / Scikit-learn talk

Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the
current state of the new Dask-ML project.

<iframe width="560" height="315" src="https://www.youtube.com/embed/ccfsbuqsjgI"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

## MyBinder and Bokeh Servers

Not a Dask change, but Min Ragan-Kelley showed how to run services other than
Jupyter through [mybinder.org](https://mybinder.org/). As an example,
here is a repository that deploys a Bokeh server application with a single
click.

- [Github repository](https://github.com/minrk/binder-bokeh-server)
- [Binder link](https://mybinder.org/v2/gh/minrk/binder-bokeh-server/master?urlpath=%2Fproxy%2F5006%2Fbokeh-app)

I think that by composing with Binder, Min effectively just created a
free-to-use hosted Bokeh server service. Presumably this same model could be
adapted to other applications just as easily.

## Dask and Automated Machine Learning with TPOT

Dask and TPOT developers are discussing parallelizing the
automatic-machine-learning tool [TPOT](http://epistasislab.github.io/tpot/).

TPOT uses genetic algorithms to search over a space of scikit-learn style
pipelines to automatically find a decently performing pipeline and model. This
involves a fair amount of computation which Dask can help to parallelize out to
multiple machines.

- [Issue: EpistasisLab/tpot #304](https://github.com/EpistasisLab/tpot/issues/304)
- [PR: EpistasisLab/tpot #730](https://github.com/EpistasisLab/tpot/pull/730)
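
Assuming a TPOT release that includes the work from the pull request above and keeps the classic `TPOTClassifier` interface, turning this on should be a one-keyword change. This is a minimal sketch: `use_dask=True` is the switch TPOT documents for this, while the small search settings and the `fit_tpot_with_dask` helper are placeholders of my own.

```python
# Small, hypothetical search settings so that a trial run finishes quickly.
TPOT_SETTINGS = {
    "generations": 2,
    "population_size": 10,
    "cv": 3,
    "n_jobs": -1,
    "random_state": 0,
    "use_dask": True,  # ask TPOT to evaluate pipelines through Dask
}


def fit_tpot_with_dask(X, y, scheduler_address=None):
    """Fit a TPOT classifier while a Dask client is active."""
    # Imported lazily so that this file loads without TPOT installed.
    from dask.distributed import Client
    from tpot import TPOTClassifier

    # With no address this starts a local cluster; pass the address of a
    # running scheduler to spread the work over several machines.
    client = Client(scheduler_address)
    try:
        tpot = TPOTClassifier(**TPOT_SETTINGS)
        tpot.fit(X, y)
    finally:
        client.close()
    return tpot
```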

<iframe width="560" height="315" src="https://www.youtube.com/embed/QrJlj0VCHys"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

### Get involved

Trivial things work now, but to make this efficient we'll need to dive in a bit
more deeply. Extending that pull request to dive within pipelines would be a
good task if anyone wants to get involved. This would help to share
intermediate results between pipelines.

## Dask and Scikit-Optimize

Among various features, [Scikit-optimize](https://scikit-optimize.github.io/)
offers a [BayesSearchCV](https://scikit-optimize.github.io/#skopt.BayesSearchCV)
object that is like Scikit-Learn's `GridSearchCV` and `RandomizedSearchCV`, but is a
bit smarter about how to choose new parameters to test given previous results.
Hyper-parameter optimization is low-hanging fruit for Dask-ML workloads today,
so we investigated how the project might help here.

So far we're just experimenting using Scikit-Learn/Dask integration through
joblib to see what opportunities there are. Discussion among Dask and
Scikit-Optimize developers is happening here:

- [Issue: dask/dask-ml #300](https://github.com/dask/dask-ml/issues/300)
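
The joblib route looks roughly like the sketch below. It rests on a few assumptions: a recent joblib that ships the `dask` backend, a local `Client`, and Scikit-Learn's `RandomizedSearchCV` standing in for `BayesSearchCV`. I would expect the same context manager to wrap a `BayesSearchCV.fit` call, but the sketch does not test that. The parameter space and the `random_search_on_dask` helper are made up for illustration.

```python
# Hypothetical search space for a random forest.
PARAM_SPACE = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5, None],
}


def random_search_on_dask(X, y, n_iter=5):
    """Run a randomized hyper-parameter search with joblib's dask backend."""
    # Imported lazily so that this file loads without the ML stack installed.
    import joblib
    from dask.distributed import Client
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    client = Client()  # the dask backend needs an active client
    try:
        search = RandomizedSearchCV(
            RandomForestClassifier(), PARAM_SPACE, n_iter=n_iter, cv=3, n_jobs=-1
        )
        # Inside this block joblib hands its tasks to the Dask workers.
        with joblib.parallel_backend("dask"):
            search.fit(X, y)
    finally:
        client.close()
    return search.best_params_
```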

## Centralize PyData/Scipy tutorials on Binder

We're putting a bunch of the PyData/Scipy tutorials on Binder, and hope to
embed snippets of YouTube videos into the notebooks themselves.

This effort lives here:

- [pydata-tutorials.readthedocs.io](https://pydata-tutorials.readthedocs.io)

### Motivation

The PyData and SciPy community delivers tutorials as part of most conferences.
This activity generates both educational Jupyter notebooks and explanatory
videos that teach people how to use the ecosystem.

However, this content isn't very discoverable _after_ the conference. People
can search on YouTube for their topic of choice and hopefully find a link to
the notebooks to download locally, but this is a somewhat noisy process. It's
not clear which tutorial to choose and it's difficult to match up the video
with the notebooks during exercises.
We're probably not getting as much value out of these resources as we could be.

To help increase access we're going to try a few things:

1. Produce a centralized website with links to recent tutorials delivered for
   each topic
2. Ensure that those notebooks run easily on Binder
3. Embed sections of the talk on YouTube within each notebook so that the
   explanation of the section is tied to the exercises

### Get involved

This only really works long-term under a community maintenance model. So far
we've only done a few hours of work and there is still plenty to do in the
following tasks:

1. Find good tutorials for inclusion
2. Ensure that they work well on [mybinder.org](https://mybinder.org/)
   - are self-contained and don't rely on external scripts to run
   - have an environment.yml or requirements.txt
   - don't require a lot of resources
3. Find video for the tutorial
4. Submit a pull request to the tutorial repository that embeds a link to the
   YouTube talk in the top cell of each notebook, starting at the proper time
   for that notebook
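
As a small aid for step 2, here is a sketch of a check for the environment files named above. It only looks at the top level of a repository and at a `binder/` folder, which is where Binder looks for them. `binder_env_files` is a hypothetical helper, and it says nothing about whether the notebooks actually run.

```python
from pathlib import Path

# Only the two file names mentioned in the checklist are checked here.
ENV_FILES = ("environment.yml", "requirements.txt")


def binder_env_files(repo_dir):
    """Return environment files found at the top level or under binder/."""
    repo = Path(repo_dir)
    candidates = [repo / name for name in ENV_FILES]
    candidates += [repo / "binder" / name for name in ENV_FILES]
    return [p.relative_to(repo).as_posix() for p in candidates if p.is_file()]
```

An empty list from `binder_env_files("path/to/tutorial")` suggests the tutorial needs an environment file before it is submitted.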

## Dask, Actors, and Ray

I really enjoyed the [talk on Ray](https://youtu.be/D_oz7E4v-U0), another
distributed task scheduler for Python. I suspect that Dask will steal ideas
for [actors for stateful operation](https://github.com/dask/distributed/issues/2109).
I hope that Ray takes on ideas for using standard Python interfaces so that
more of the community can adopt it more quickly. I encourage people to check
out the talk and give Ray a try. It's pretty slick.

<iframe width="560" height="315" src="https://www.youtube.com/embed/D_oz7E4v-U0"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

## Planning conversations for Dask-ML

Dask and Scikit-learn developers had the opportunity to sit down again and
raise a number of issues to help plan near-term development. This focused
mostly on building important case studies to motivate future development,
and identifying algorithms and other projects to target for near-term
integration.

### Case Studies

- [What is the purpose of a case study: dask/dask-ml #302](https://github.com/dask/dask-ml/issues/302)
- [Case study: Sparse Criteo Dataset: dask/dask-ml #295](https://github.com/dask/dask-ml/issues/295)
- [Case study: Large scale text classification: dask/dask-ml #296](https://github.com/dask/dask-ml/issues/296)
- [Case study: Transfer learning from pre-trained model: dask/dask-ml #297](https://github.com/dask/dask-ml/issues/297)

### Algorithms

- [Gradient boosted trees with Numba: dask/dask-ml #299](https://github.com/dask/dask-ml/issues/299)
- [Parallelize Scikit-Optimize for hyperparameter optimization: dask/dask-ml #300](https://github.com/dask/dask-ml/issues/300)

### Get involved

We could use help in building out case studies to drive future development in
the project. There are also several algorithmic places to get involved.
Dask-ML is a young and fast-moving project with many opportunities for new
developers to get involved.

## Dask and UMAP for low-dimensional embeddings

Leland McInnes gave a great talk [Uniform Manifold Approximation and
Projection for Dimensionality Reduction](https://youtu.be/nq6iPZVUxZU) in which
he lays out a well-founded algorithm for dimensionality reduction, similar to
PCA or t-SNE, but with some nice properties. He worked with some Dask
developers, and together we identified some challenges due to dask array
slicing with random-ish slices.

A proposal to fix this problem lives here, if anyone wants a fun problem to work on:

- [dask/dask #3409 (comment)](https://github.com/dask/dask/issues/3409#issuecomment-405254656)
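
To give a feel for why random-ish slices are awkward, here is a toy model in plain Python. It is not how dask plans a slice. It only counts how often neighbouring positions in an index come from different chunks of the source array, assuming fixed-size chunks. The `chunk_switches` function and the sizes are made up for illustration.

```python
import random


def chunk_switches(index, chunk_size):
    """Count how often consecutive index entries fall in different chunks."""
    chunks = [i // chunk_size for i in index]
    return sum(a != b for a, b in zip(chunks, chunks[1:]))


n, chunk_size = 10_000, 1_000  # ten source chunks of 1,000 elements each

sorted_index = list(range(n))
random_index = random.Random(0).sample(range(n), n)  # a fixed permutation

print(chunk_switches(sorted_index, chunk_size))  # 9, one per chunk boundary
print(chunk_switches(random_index, chunk_size))  # thousands of switches
```

A sorted index walks through each source chunk once, while a shuffled one hops between chunks at almost every step, so a naive plan either produces many tiny output pieces or has to gather from every input chunk.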

<iframe width="560" height="315" src="https://www.youtube.com/embed/nq6iPZVUxZU"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

## Dask stories

We soft-launched [Dask Stories](http://dask-stories.readthedocs.io/en/latest/),
a webpage and project to collect and share user stories about how people use
Dask in practice. We're also delivering a separate blogpost about this today.

See blogpost: [Who uses Dask?](/2018/07/16/dask-stories)

If you use Dask and want to share your story we would absolutely welcome your
experience. Having people like yourself share how they use Dask is incredibly
important for the project.
