---
blogpost: true
date: Jul 8, 2018
title: Dask Development Log
tags: Programming, Python, scipy, dask
---

_This work is supported by [Anaconda Inc](http://anaconda.com)_

To increase transparency I'm trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.

Current efforts for June 2018 in Dask and Dask-related projects include
the following:

1.  Yarn Deployment
2.  More examples for machine learning
3.  Incremental machine learning
4.  HPC Deployment configuration

### Yarn deployment

Dask developers often get asked _How do I deploy Dask on my Hadoop/Spark/Hive
cluster?_. We haven't had a very good answer until recently.

Most Hadoop/Spark/Hive clusters are actually _Yarn_ clusters. Yarn is the most
common cluster manager used by many clusters that are typically used to run
Hadoop/Spark/Hive jobs including any cluster purchased from a vendor like
Cloudera or Hortonworks. If your application can run on Yarn then it can be a
first class citizen here.

Unfortunately Yarn has really only been accessible through a Java API, and so
has been difficult for Dask to interact with. That's changing now with a few
projects, including:

- [dask-yarn](https://dask-yarn.readthedocs.io): an easy way to launch Dask on
  Yarn clusters
- [skein](https://jcrist.github.io/skein/): an easy way to launch generic
  services on Yarn clusters (this is primarily what backs dask-yarn)
- [conda-pack](https://conda.github.io/conda-pack/): an easy way to bundle
  together a conda package into a redeployable environment, such as is useful
  when launching Python applications on Yarn

This work is all being done by [Jim Crist](http://jcrist.github.io/) who is, I
believe, currently writing up a blogpost about the topic at large. Dask-yarn
was soft-released last week though, so people should give it a try and report
feedback on the [dask-yarn issue tracker](https://github.com/dask/dask-yarn).
If you ever wanted direct help on your cluster, now is the right time because
Jim is working on this actively and is not yet drowned in user requests so
generally has a fair bit of time to investigate particular cases.

```python
from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)
```

### More examples for machine learning

Dask maintains a Binder of simple examples that show off various ways to use
the project. This allows people to click a link on the web and quickly be
taken to a Jupyter notebook running on the cloud. It's a fun way to quickly
experience and learn about a new project.

Previously we had a single example for arrays, dataframes, delayed, machine
learning, etc.

Now [Scott Sievert](https://stsievert.com/) is expanding the examples within
the machine learning section. He has submitted the following two so far:

1.  [Incremental training with Scikit-Learn and large datasets](https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fmachine-learning%2Fincremental.ipynb)
2.  [Dask and XGBoost](https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fmachine-learning%2Fxgboost.ipynb)

I believe he's planning on more. If you use
[dask-ml](http://dask-ml.readthedocs.io/en/latest/) and have recommendations or
want to help, you might want to engage in the [dask-ml issue
tracker](https://github.com/dask/dask-ml/issues/new) or [dask-examples issue
tracker](https://github.com/dask/dask-examples/issues/new).

### Incremental training

The incremental training mentioned as an example above is also new-ish. This
is a Scikit-Learn style meta-estimator that wraps around other estimators that
support the `partial_fit` method. It enables training on large datasets in an
incremental or batchwise fashion.

#### Before

```python
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)

import pandas as pd

for filename in filenames:
    df = pd.read_csv(filename)
    X, y = ...

    sgd.partial_fit(X, y)
```

#### After

```python
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

sgd = SGDClassifier(...)
inc = Incremental(sgd)

import dask.dataframe as dd

df = dd.read_csv(filenames)
X, y = ...
inc.fit(X, y)
```

#### Analysis

From a parallel computing perspective this is a very simple and un-sexy way of
doing things. However my understanding is that it's also quite pragmatic. In
a distributed context we leave a lot of possible computation on the table (the
solution is inherently sequential) but it's fun to see the model jump around
the cluster as it absorbs various chunks of data and then moves on.

<img src="https://user-images.githubusercontent.com/1320475/42237033-2bddf11e-7eec-11e8-88c5-5f0ebd2fb4df.png"
     width="70%"
     alt="Incremental training with Dask-ML">

There's ongoing work on how best to combine this with other work like pipelines
and hyper-parameter searches to fill in the extra computation.

This work was primarily done by [Tom Augspurger](https://tomaugspurger.github.io/)
with help from [Scott Sievert](https://stsievert.com/)

### Dask User Stories

Dask developers are often asked "Who uses Dask?". This is a hard question to
answer because, even though we're inundated with thousands of requests for
help from various companies and research groups, it's never fully clear who
minds having their information shared with others.

We're now trying to crowdsource this information in a more explicit way by
having users tell their own stories. Hopefully this helps other users in their
field understand how Dask can help and when it might (or might not) be useful
to them.

We originally collected this information in a [Google
Form](https://goo.gl/forms/JEebEFTOPrWa3P4h1) but have since then moved it to a
[Github repository](https://github.com/mrocklin/dask-stories). Eventually
we'll publish this as a [proper web
site](https://github.com/mrocklin/dask-stories/issues/7) and include it in our
documentation.

If you use Dask and want to share your story this is a great way to contribute
to the project. Arguably Dask needs more help with spreading the word than it
does with technical solutions.

### HPC Deployments

The [Dask Jobqueue](http://dask-jobqueue.readthedocs.io/en/latest/) package for
deploying Dask on traditional HPC machines is nearing another release. We've
changed around a lot of the parameters and configuration options in order to
improve the onboarding experience for new users. It has been going very
smoothly in recent engagements with new groups, but will mean a breaking
change for existing users of the sub-project.