<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posted in 2019</title>
  <updated>2026-03-05T15:05:21.627599+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/2019/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/11/01/deployment-updates/</id>
    <title>Dask Deployment Updates</title>
    <updated>2019-11-01T00:00:00+00:00</updated>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 7)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;Over the last six months many Dask developers have worked on making Dask easier
to deploy in a wide variety of situations. This post summarizes those
efforts, and provides links to ongoing work.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 13)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-we-mean-by-deployment"&gt;
&lt;h1&gt;What we mean by Deployment&lt;/h1&gt;
&lt;p&gt;To run Dask on a cluster, you need to set up a scheduler on one
machine:&lt;/p&gt;
&lt;div class="highlight-console notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;dask-scheduler
&lt;span class="go"&gt;Scheduler running at tcp://192.168.0.1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And then start Dask workers on many other machines:&lt;/p&gt;
&lt;div class="highlight-console notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;dask-worker&lt;span class="w"&gt; &lt;/span&gt;tcp://192.168.0.1
&lt;span class="go"&gt;Waiting to connect to:       tcp://scheduler:8786&lt;/span&gt;

&lt;span class="gp"&gt;$ &lt;/span&gt;dask-worker&lt;span class="w"&gt; &lt;/span&gt;tcp://192.168.0.1
&lt;span class="go"&gt;Waiting to connect to:       tcp://scheduler:8786&lt;/span&gt;

&lt;span class="gp"&gt;$ &lt;/span&gt;dask-worker&lt;span class="w"&gt; &lt;/span&gt;tcp://192.168.0.1
&lt;span class="go"&gt;Waiting to connect to:       tcp://scheduler:8786&lt;/span&gt;

&lt;span class="gp"&gt;$ &lt;/span&gt;dask-worker&lt;span class="w"&gt; &lt;/span&gt;tcp://192.168.0.1
&lt;span class="go"&gt;Waiting to connect to:       tcp://scheduler:8786&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
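With the scheduler and workers running, any machine on the network can connect a client to the scheduler's address. A minimal sketch, using the example address from above (8786 is the default scheduler port); this assumes the scheduler is actually reachable, so it won't run as-is:

```python
from dask.distributed import Client

# Connect to the manually started scheduler; 8786 is the default port.
client = Client("tcp://192.168.0.1:8786")

# Work submitted through the client now runs on the connected workers.
future = client.submit(sum, [1, 2, 3])
print(future.result())
```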
&lt;p&gt;For informal clusters people might do this manually, by logging into each
machine and running these commands themselves. However, it’s much more common
to use a cluster resource manager such as Kubernetes, Yarn (Hadoop/Spark),
HPC batch schedulers (SGE, PBS, SLURM, LSF …), a cloud service, or a custom system.&lt;/p&gt;
&lt;p&gt;As Dask is used by more institutions, and used more broadly within those
institutions, making deployment smooth and natural becomes increasingly
important. This is so important, in fact, that a few different groups have
made seven separate efforts to improve deployment in one regard or
another.&lt;/p&gt;
&lt;p&gt;We’ll briefly summarize and link to this work below, and then we’ll finish up
by talking about some internal design that helped to make this work more
consistent.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 54)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-ssh"&gt;
&lt;h1&gt;Dask-SSH&lt;/h1&gt;
&lt;p&gt;According to our user survey, the most common deployment mechanism was still
SSH. Dask has long had a &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/ssh.html#command-line"&gt;command line dask-ssh
tool&lt;/a&gt; to make it
easier to deploy with SSH. We recently updated this tool to also
include a Python interface, which provides more control.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;host1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;host2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;host3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;host4&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;connect_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;known_hosts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;worker_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nthreads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;scheduler_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;port&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;dashboard_address&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;:8797&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This isn’t what we recommend for large institutions, but it can be helpful for
more informal groups who are just getting started.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 76)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-jobqueue-and-dask-kubernetes-rewrite"&gt;
&lt;h1&gt;Dask-Jobqueue and Dask-Kubernetes Rewrite&lt;/h1&gt;
&lt;p&gt;We’ve rewritten both Dask-Kubernetes and Dask-Jobqueue, which targets the
SLURM/PBS/LSF/SGE cluster managers typically found in HPC centers. These now
share a common codebase with Dask-SSH, and so are much more consistent and,
hopefully, less buggy.&lt;/p&gt;
&lt;p&gt;Ideally users shouldn’t notice much difference with existing workloads,
but new features like asynchronous operation, integration with the Dask
JupyterLab extension, and so on are more consistently available. Also, we’ve
been able to unify development and reduce our maintenance burden considerably.&lt;/p&gt;
&lt;p&gt;The new version of Dask Jobqueue where these changes take place is 0.7.0, and
the work was done in &lt;a class="reference external" href="https://github.com/dask/dask-jobqueue/pull/307"&gt;dask/dask-jobqueue #307&lt;/a&gt;.
The new version of Dask Kubernetes is 0.10.0 and the work was done in
&lt;a class="reference external" href="https://github.com/dask/dask-kubernetes/pull/162"&gt;dask/dask-kubernetes #162&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 92)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-cloudprovider"&gt;
&lt;h1&gt;Dask-CloudProvider&lt;/h1&gt;
&lt;p&gt;For cloud deployments we generally recommend using a hosted Kubernetes or Yarn
service, and then using Dask-Kubernetes or Dask-Yarn on top of these.&lt;/p&gt;
&lt;p&gt;However, some institutions have made decisions or commitments to use
certain vendor-specific technologies, and for them it’s more convenient to use
APIs native to their particular cloud. The new package &lt;a class="reference external" href="https://cloudprovider.dask.org"&gt;Dask
Cloudprovider&lt;/a&gt; handles this today for Amazon’s
ECS API, which has been around for a long while and is widely
adopted.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cloudprovider&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ECSCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ECSCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:ecs:&amp;lt;region&amp;gt;:&amp;lt;acctid&amp;gt;:cluster/&amp;lt;clustername&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cloudprovider&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FargateCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FargateCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 112)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-gateway"&gt;
&lt;h1&gt;Dask-Gateway&lt;/h1&gt;
&lt;p&gt;In some cases users may not have access to the cluster manager. For example,
an institution may not give all of its data science users access to the Yarn
or Kubernetes cluster. In this case the &lt;a class="reference external" href="https://gateway.dask.org"&gt;Dask-Gateway&lt;/a&gt;
project may be useful.
It can launch and manage Dask jobs,
and provide a proxy connection to these jobs if necessary.
It is typically deployed with elevated permissions but managed directly by IT,
giving them a point of greater control.&lt;/p&gt;
&lt;img src="https://gateway.dask.org/_images/architecture.svg" width="50%"&gt;
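From the user's side, connecting through a gateway looks roughly like the following sketch. The gateway URL is a hypothetical placeholder, and this requires a running Dask-Gateway server, so it won't run as-is; see the Dask-Gateway documentation for details:

```python
from dask_gateway import Gateway

# The address of the institution's gateway server (placeholder URL).
gateway = Gateway("http://gateway.example.com")

# The gateway launches and manages the cluster on the user's behalf,
# so the user never needs direct access to Kubernetes, Yarn, or the
# batch system.
cluster = gateway.new_cluster()
cluster.scale(4)
client = cluster.get_client()
```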
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 125)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="gpus-and-dask-cuda"&gt;
&lt;h1&gt;GPUs and Dask-CUDA&lt;/h1&gt;
&lt;p&gt;While using Dask for multi-GPU deployments, the &lt;a class="reference external" href="https://rapids.ai"&gt;NVIDIA
RAPIDS&lt;/a&gt; team has needed the ability to specify increasingly
complex Dask worker setups. They recommend the following deployment
strategy:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;One Dask-worker per GPU on a machine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Specify the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;/code&gt; environment variable to pin that worker
to that GPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your machine has multiple network interfaces then choose the network interface that has the best connection to that GPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your machine has multiple CPUs then set thread affinities to use the closest CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;… and more&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For this reason we wanted to specify these configurations in code, like the
following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;specification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;worker-0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nthreads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;0,1,2,3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;worker-1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nthreads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;1,2,3,0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;worker-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nthreads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;2,3,0,1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;worker-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nthreads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;3,0,1,2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And then use the new SpecCluster class to deploy these workers:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SpecCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;specification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We’ve used this technique in the
&lt;a class="reference external" href="https://github.com/rapidsai/dask-cuda"&gt;Dask-CUDA&lt;/a&gt; project to provide
convenient functions for deployment on multi-GPU systems.&lt;/p&gt;
&lt;p&gt;This class was generic enough that it ended up forming the base of the SSH,
Jobqueue, and Kubernetes solutions as well.&lt;/p&gt;
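For the common case of one machine with several GPUs, Dask-CUDA wraps this specification machinery in a convenience class. A sketch, assuming a machine with NVIDIA GPUs and the dask-cuda package installed (so it won't run as-is on a machine without them):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Starts one worker per visible GPU, with CUDA_VISIBLE_DEVICES
# pinned appropriately for each worker.
cluster = LocalCUDACluster()
client = Client(cluster)
```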
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 176)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="standards-and-conventions"&gt;
&lt;h1&gt;Standards and Conventions&lt;/h1&gt;
&lt;p&gt;The solutions above are built by different teams that work at different companies.
This is great because those teams have hands-on experience with these
cluster managers in the wild, but it has historically made it challenging to
standardize the user experience. This is particularly problematic when we build
other tools, like IPython widgets or the Dask JupyterLab extension, which want
to interoperate with all of the Dask deployment solutions.&lt;/p&gt;
&lt;p&gt;The recent rewrites of Dask-SSH, Dask-Jobqueue, and Dask-Kubernetes, along with the new
Dask-Cloudprovider and Dask-CUDA libraries, place them
all under the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.distributed.SpecCluster&lt;/span&gt;&lt;/code&gt; superclass, so we can expect a high degree of
uniformity from them. Additionally, all of these classes now match the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.distributed.Cluster&lt;/span&gt;&lt;/code&gt; interface, which standardizes things like
adaptivity, IPython widgets, logs, and some basic reporting.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Cluster&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SpecCluster&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JobQueue (PBS/SLURM/LSF/SGE/Torque/Condor/Moab/OAR)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSH&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudProvider (ECS)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CUDA (LocalCUDACluster, DGX)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LocalCluster&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yarn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gateway&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
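Because each of these classes implements the same Cluster interface, code written against one cluster manager tends to work against the others. A sketch using LocalCluster, the easiest subclass to try on a single machine (the constructor arguments here are just one reasonable configuration):

```python
from dask.distributed import LocalCluster

# LocalCluster is the simplest Cluster subclass to try locally.
cluster = LocalCluster(n_workers=1, processes=False, dashboard_address=None)

# These methods belong to the shared Cluster interface, so the same
# calls work on the Kubernetes, SSH, JobQueue, and other subclasses.
cluster.scale(2)                     # ask for two workers
cluster.adapt(minimum=0, maximum=4)  # or scale automatically with load

addr = cluster.scheduler_address
print(addr)
cluster.close()
```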
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/11/01/deployment-updates.md&lt;/span&gt;, line 203)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;There is still plenty to do. Here are some of the themes we’ve seen among
current development:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Move the Scheduler off to a separate job/pod/container in the network,
which is often helpful for complex networking situations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve proxying of the dashboard in these situations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optionally separate the lifecycle of the cluster from that of the
Python process that requested it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write up best practices for composing GPU support with all of the cluster managers&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/11/01/deployment-updates/"/>
    <summary>Over the last six months many Dask developers have worked on making Dask easier to deploy in a wide variety of situations. This post summarizes those efforts, and provides links to ongoing work.</summary>
    <published>2019-11-01T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/10/08/df-groupby/</id>
    <title>DataFrame Groupby Aggregations</title>
    <updated>2019-10-08T00:00:00+00:00</updated>
    <author>
      <name>Benjamin Zaitlen &amp; James Bourbeau</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/10/08/df-groupby.md&lt;/span&gt;, line 10)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="groupby-aggregations-with-dask"&gt;

&lt;p&gt;In this post we’ll dive into how Dask computes groupby aggregations. These are commonly used operations for ETL and analysis in which we split data into groups, apply a function to each group independently, and then combine the results back together. In the PyData/R world this is often referred to as the split-apply-combine strategy (first coined by &lt;a class="reference external" href="https://www.jstatsoft.org/article/view/v040i01"&gt;Hadley Wickham&lt;/a&gt;) and is used widely throughout the &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html"&gt;Pandas ecosystem&lt;/a&gt;.&lt;/p&gt;
&lt;div align="center"&gt;
  &lt;a href="/images/split-apply-combine.png"&gt;
    &lt;img src="/images/split-apply-combine.png" width="80%" align="center"&gt;
  &lt;/a&gt;
  &lt;p align="center"&gt;&lt;i&gt;Image courtesy of swcarpentry.github.io&lt;/i&gt;&lt;/p&gt;
&lt;/div&gt;
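As a quick refresher, here is split-apply-combine in plain Pandas: rows are split into groups by a key, a function (here, a sum) is applied to each group independently, and the per-group results are combined into one output. The tiny DataFrame below is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                   "value": [1, 2, 3, 4, 5]})

# Split on "key", apply sum to each group, combine into one Series.
result = df.groupby("key")["value"].sum()
print(result.to_dict())  # {'a': 9, 'b': 6}
```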
&lt;p&gt;Dask leverages this idea using a similarly catchy name: apply-concat-apply or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; for short. Here we’ll explore the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; strategy in both simple and complex operations.&lt;/p&gt;
&lt;p&gt;First, recall that a Dask DataFrame is a &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#internal-design"&gt;collection&lt;/a&gt; of DataFrame objects (i.e., each &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#partitions"&gt;partition&lt;/a&gt; of a Dask DataFrame is a Pandas DataFrame). For example, let’s say we have the following Pandas DataFrame:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                       &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
&lt;span class="go"&gt;     a   b   c&lt;/span&gt;
&lt;span class="go"&gt;0    1   1   2&lt;/span&gt;
&lt;span class="go"&gt;1    1   3   4&lt;/span&gt;
&lt;span class="go"&gt;2    2  10   5&lt;/span&gt;
&lt;span class="go"&gt;3    3   3   2&lt;/span&gt;
&lt;span class="go"&gt;4    3   2   3&lt;/span&gt;
&lt;span class="go"&gt;5    1   1   5&lt;/span&gt;
&lt;span class="go"&gt;6    1   3   2&lt;/span&gt;
&lt;span class="go"&gt;7    2  10   3&lt;/span&gt;
&lt;span class="go"&gt;8    3   3   9&lt;/span&gt;
&lt;span class="go"&gt;9    3   3   2&lt;/span&gt;
&lt;span class="go"&gt;10  99  12  44&lt;/span&gt;
&lt;span class="go"&gt;11  10   0  33&lt;/span&gt;
&lt;span class="go"&gt;12   1   9   2&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To create a Dask DataFrame with three partitions from this data, we could partition &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; between the indices (0, 4), (5, 9), and (10, 12). We can perform this partitioning with Dask by using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_pandas&lt;/span&gt;&lt;/code&gt; function with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;npartitions=3&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 3 partitions are simply 3 individual Pandas DataFrames:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;   a   b  c&lt;/span&gt;
&lt;span class="go"&gt;0  1   1  2&lt;/span&gt;
&lt;span class="go"&gt;1  1   3  4&lt;/span&gt;
&lt;span class="go"&gt;2  2  10  5&lt;/span&gt;
&lt;span class="go"&gt;3  3   3  2&lt;/span&gt;
&lt;span class="go"&gt;4  3   2  3&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="apply-concat-apply"&gt;
&lt;h1&gt;Apply-concat-apply&lt;/h1&gt;
&lt;p&gt;When Dask applies a function or algorithm (e.g. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;, etc.) to a Dask DataFrame, it does so by applying that operation to all the constituent partitions independently, collecting (or concatenating) the outputs into intermediary results, and then applying the operation again to the intermediary results to produce a final result. Dask reuses this same apply-concat-apply methodology for many of its internal DataFrame calculations.&lt;/p&gt;
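The pattern can be sketched in plain pandas (a simplified illustration only, not Dask's internal implementation; the `aca` helper name here is made up for this sketch):

```python
import pandas as pd

def aca(partitions, chunk, aggregate):
    """Toy apply-concat-apply: apply `chunk` to each partition,
    concatenate the intermediate results, then `aggregate` them."""
    intermediates = [chunk(p) for p in partitions]   # apply
    combined = pd.concat(intermediates)              # concat
    return aggregate(combined)                       # apply again

df = pd.DataFrame({"a": [1, 1, 2, 3, 3, 1], "c": [2, 4, 5, 2, 3, 5]})
parts = [df[:3], df[3:]]
result = aca(
    parts,
    chunk=lambda p: p.groupby("a").c.sum(),
    aggregate=lambda s: s.groupby(level=0).sum(),
)
# result matches df.groupby("a").c.sum()
```

The same three-step shape reappears in every reduction discussed below; only the `chunk` and `aggregate` functions change.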
&lt;p&gt;Let’s break down how Dask computes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf.groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; by going through each step in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; process. We’ll begin by splitting our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; Pandas DataFrame into three partitions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="apply"&gt;
&lt;h2&gt;Apply&lt;/h2&gt;
&lt;p&gt;Next we perform the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; operation on each of our three partitions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;These operations each produce a Series with a &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html"&gt;MultiIndex&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr1
a  b
1  1     2
   3     4
2  10    5
3  2     3
   3     2
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr2
a  b
1  1      5
   3      2
2  10     3
3  3     11
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr3
a   b
1   9      2
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="the-concat"&gt;
&lt;h2&gt;The Concat&lt;/h2&gt;
&lt;p&gt;After the first &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apply&lt;/span&gt;&lt;/code&gt;, the next step is to concatenate the intermediate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr1&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr2&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr3&lt;/span&gt;&lt;/code&gt; results. This is fairly straightforward to do using the Pandas &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;concat&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr_concat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sr1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr_concat&lt;/span&gt;
&lt;span class="go"&gt;a   b&lt;/span&gt;
&lt;span class="go"&gt;1   1      2&lt;/span&gt;
&lt;span class="go"&gt;    3      4&lt;/span&gt;
&lt;span class="go"&gt;2   10     5&lt;/span&gt;
&lt;span class="go"&gt;3   2      3&lt;/span&gt;
&lt;span class="go"&gt;    3      2&lt;/span&gt;
&lt;span class="go"&gt;1   1      5&lt;/span&gt;
&lt;span class="go"&gt;    3      2&lt;/span&gt;
&lt;span class="go"&gt;2   10     3&lt;/span&gt;
&lt;span class="go"&gt;3   3     11&lt;/span&gt;
&lt;span class="go"&gt;1   9      2&lt;/span&gt;
&lt;span class="go"&gt;10  0     33&lt;/span&gt;
&lt;span class="go"&gt;99  12    44&lt;/span&gt;
&lt;span class="go"&gt;Name: c, dtype: int64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="apply-redux"&gt;
&lt;h2&gt;Apply Redux&lt;/h2&gt;
&lt;p&gt;Our final step is to apply the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; operation again on the concatenated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr_concat&lt;/span&gt;&lt;/code&gt; Series. However, we no longer have columns &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;a&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;b&lt;/span&gt;&lt;/code&gt;, so how should we proceed?&lt;/p&gt;
&lt;p&gt;Zooming out a bit, our goal is to add together the values that share the same index. For example, there are two rows with the index &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(1,&lt;/span&gt; &lt;span class="pre"&gt;1)&lt;/span&gt;&lt;/code&gt;, with corresponding values 2 and 5. So how can we group by indices with the same value? A MultiIndex uses &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex"&gt;levels&lt;/a&gt; to define what the value is at a given index. Dask &lt;a class="reference external" href="https://github.com/dask/dask/blob/973c6e1b2e38c2d9d6e8c75fb9b4ab7a0d07e6a7/dask/dataframe/groupby.py#L69-L75"&gt;determines&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dask/dask/blob/973c6e1b2e38c2d9d6e8c75fb9b4ab7a0d07e6a7/dask/dataframe/groupby.py#L1065"&gt;uses these levels&lt;/a&gt; in the final apply step of the apply-concat-apply calculation. In our case the level is &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;[0,&lt;/span&gt; &lt;span class="pre"&gt;1]&lt;/span&gt;&lt;/code&gt;; that is, we want both the index at the 0th level and the index at the 1st level, and grouping by both effectively groups the same indices together:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sr_concat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; total
a   b
1   1      7
    3      6
    9      2
2   10     8
3   2      3
    3     13
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; ddf.groupby(['a', 'b']).c.sum().compute()
a   b
1   1      7
    3      6
2   10     8
3   2      3
    3     13
1   9      2
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; df.groupby(['a', 'b']).c.sum()
a   b
1   1      7
    3      6
    9      2
2   10     8
3   2      3
    3     13
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Additionally, we can easily examine the steps of this apply-concat-apply calculation by &lt;a class="reference external" href="https://docs.dask.org/en/latest/graphviz.html"&gt;visualizing the task graph&lt;/a&gt; for the computation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/sum.svg"&gt;
  &lt;img src="/images/sum.svg" width="80%"&gt;
&lt;/a&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; is a rather straightforward calculation. What about something a bit more complex, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/mean.svg"&gt;
  &lt;img src="/images/mean.svg" width="80%"&gt;
&lt;/a&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Mean&lt;/span&gt;&lt;/code&gt; is a good example of an operation which doesn’t directly fit the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; model – concatenating &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; values and taking the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; again will yield incorrect results. As with any style of computation (vectorization, MapReduce, etc.), we sometimes need to fit the calculation creatively to the model. In the case of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; we can often break the calculation down into constituent parts. For &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;, these are &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;count&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[ \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}\]&lt;/div&gt;
&lt;p&gt;From the task graph above, we can see that there are two independent tasks for each partition: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-count-chunk&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-sum-chunk&lt;/span&gt;&lt;/code&gt;. The results are then aggregated into two final nodes, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-count-agg&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-sum-agg&lt;/span&gt;&lt;/code&gt;, and finally the mean is calculated as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total&lt;/span&gt; &lt;span class="pre"&gt;sum&lt;/span&gt; &lt;span class="pre"&gt;/&lt;/span&gt; &lt;span class="pre"&gt;total&lt;/span&gt; &lt;span class="pre"&gt;count&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
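The sum-and-count decomposition can be reproduced by hand in pandas (a simplified sketch of the idea, not Dask's actual task graph or code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 1, 2, 2],
                   "c": [2.0, 4.0, 5.0, 6.0, 3.0, 1.0]})
parts = [df[:3], df[3:]]

# chunk step: per-partition sums and counts
sums = pd.concat([p.groupby("a").c.sum() for p in parts])
counts = pd.concat([p.groupby("a").c.count() for p in parts])

# aggregate step: combine the partial results, then divide once at the end
total_sum = sums.groupby(level=0).sum()
total_count = counts.groupby(level=0).sum()
mean = total_sum / total_count
# mean matches df.groupby("a").c.mean()
```

Because sums and counts each combine correctly across partitions, dividing only at the very end gives the exact mean, which is what concatenating per-partition means would fail to do.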
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/10/08/df-groupby/"/>
    <category term="dask" label="dask"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-10-08T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/09/30/dask-hyperparam-opt/</id>
    <title>Better and faster hyperparameter optimization with Dask</title>
    <updated>2019-09-30T00:00:00+00:00</updated>
    <author>
      <name>&lt;a href="http://stsievert.com"&gt;Scott Sievert&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;&lt;a class="reference external" href="https://stsievert.com"&gt;Scott Sievert&lt;/a&gt; wrote this post. The original post lives at
&lt;a class="reference external" href="https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/"&gt;https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/&lt;/a&gt; with better
styling. This work is supported by Anaconda, Inc.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt;’s machine learning package, &lt;a class="reference external" href="https://ml.dask.org/"&gt;Dask-ML&lt;/a&gt;, now implements Hyperband, an
advanced “hyperparameter optimization” algorithm that performs rather well.
This post will&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;describe “hyperparameter optimization”, a common problem in machine learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;describe Hyperband’s benefits and why it works&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;show how to use Hyperband via example alongside performance comparisons&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, I’ll walk through a practical example and highlight key portions
of the paper “&lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;Better and faster hyperparameter optimization with Dask&lt;/a&gt;”, which is also
summarized in a &lt;a class="reference external" href="https://www.youtube.com/watch?v=x67K9FiPFBQ"&gt;~25 minute SciPy 2019 talk&lt;/a&gt;.&lt;/p&gt;
&lt;!--More--&gt;
&lt;section id="problem"&gt;

&lt;p&gt;Machine learning requires data, an untrained model and “hyperparameters”: parameters that are chosen before training begins
and that help the model fit the data. The user needs to specify values
for these hyperparameters in order to use the model. A good example is
adapting ridge regression or LASSO to the amount of noise in the
data with the regularization parameter.&lt;a class="footnote-reference brackets" href="#alpha" id="id1" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;1&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Model performance strongly depends on the hyperparameters provided. A fairly complex example is the
visualization tool &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html"&gt;t-SNE&lt;/a&gt;. This tool requires (at least) three
hyperparameters, and performance depends radically on them. In fact, the first section in “&lt;a class="reference external" href="https://distill.pub/2016/misread-tsne/"&gt;How to Use t-SNE
Effectively&lt;/a&gt;” is titled “Those hyperparameters really matter”.&lt;/p&gt;
&lt;p&gt;Finding good values for these hyperparameters is critical and has an entire
Scikit-learn documentation page, “&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/grid_search.html"&gt;Tuning the hyperparameters of an
estimator&lt;/a&gt;.” Briefly, finding decent values of hyperparameters
is difficult and requires guessing or searching.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How can these hyperparameters be found quickly and efficiently with an
advanced task scheduler like Dask?&lt;/strong&gt; Parallelism will pose some challenges, but
the Dask architecture enables some advanced algorithms.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this post presumes knowledge of Dask basics. This material is covered in
Dask’s documentation on &lt;a class="reference external" href="https://docs.dask.org/en/latest/why.html"&gt;Why Dask?&lt;/a&gt;, a ~15 minute &lt;a class="reference external" href="https://www.youtube.com/watch?v=ods97a5Pzw0"&gt;video introduction to
Dask&lt;/a&gt;, a &lt;a class="reference external" href="https://www.youtube.com/watch?v=tQBovBvSDvA"&gt;video introduction to Dask-ML&lt;/a&gt; and &lt;a class="reference external" href="https://stsievert.com/blog/2016/09/09/dask-cluster/"&gt;a
blog post I wrote&lt;/a&gt; on my first use of Dask.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="contributions"&gt;
&lt;h1&gt;Contributions&lt;/h1&gt;
&lt;p&gt;Dask-ML can quickly find high-performing hyperparameters. I will back this
claim with intuition and experimental evidence.&lt;/p&gt;
&lt;p&gt;Specifically, this is because
Dask-ML now
implements an algorithm introduced by Li et. al. in “&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband: A novel
bandit-based approach to hyperparameter optimization&lt;/a&gt;”.
The pairing of Dask and Hyperband enables some exciting new performance opportunities,
especially because Hyperband has a simple implementation and Dask is an
advanced task scheduler.&lt;a class="footnote-reference brackets" href="#first" id="id2" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;2&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let’s go
through the basics of Hyperband then illustrate its use and performance with
an example.
This will highlight some key points of &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;the corresponding paper&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="hyperband-basics"&gt;
&lt;h1&gt;Hyperband basics&lt;/h1&gt;
&lt;p&gt;The motivation for Hyperband is to find high-performing hyperparameters with minimal
training. Given this goal, it makes sense to spend more time training high-performing
models – why spend more time training a model if it has performed poorly in the past?&lt;/p&gt;
&lt;p&gt;One method to spend more time on high-performing models is to initialize many
models, start training all of them, and then stop training low-performing models
before training is finished. That’s what Hyperband does. At the most basic
level, Hyperband is a (principled) early-stopping scheme for
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"&gt;RandomizedSearchCV&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Deciding when to stop the training of models depends on how strongly
the training data affect the score. There are two extremes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;when only the training data matter&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;i.e., when the hyperparameters don’t influence the score at all&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;when only the hyperparameters matter&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;i.e., when the training data don’t influence the score at all&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Hyperband balances these two extremes by sweeping over how frequently
models are stopped. This sweep allows a mathematical proof that Hyperband
will find the best model possible with minimal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;
calls&lt;a class="footnote-reference brackets" href="#qual" id="id3" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;3&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Hyperband has significant parallelism because it has two “embarrassingly
parallel” for-loops – Dask can exploit this. Hyperband has been implemented
in Dask, specifically in Dask’s machine learning library Dask-ML.&lt;/p&gt;
&lt;p&gt;How well does it perform? Let’s illustrate via example. Some setup is required
before the performance comparison in &lt;em&gt;&lt;a class="reference internal" href="#performance"&gt;&lt;span class="xref myst"&gt;Performance&lt;/span&gt;&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="example"&gt;
&lt;h1&gt;Example&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Note: want to try &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; out yourself? Dask has &lt;a class="reference external" href="https://examples.dask.org/machine-learning/hyperparam-opt.html"&gt;an example use&lt;/a&gt;.
It can even be run in-browser!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’ll illustrate with a synthetic example. Let’s build a dataset with 4 classes:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;experiment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_circles&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_circles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/dataset.png"
style="max-width: 100%;"
width="200px" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this content is pulled from
&lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;, or makes slight modifications.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Let’s build a fully connected neural net with 24 neurons for classification:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.neural_network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MLPClassifier&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLPClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Building the neural net with PyTorch is also possible&lt;a class="footnote-reference brackets" href="#skorch" id="id4" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;4&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; (and what I used in development).&lt;/p&gt;
&lt;p&gt;This neural net’s behavior is dictated by 6 hyperparameters. Only one controls
the architecture of the optimal model (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hidden_layer_sizes&lt;/span&gt;&lt;/code&gt;, the number of
neurons in each layer). The rest control the search for the best model of that
architecture. Details on the hyperparameters are in the
&lt;em&gt;&lt;a class="reference internal" href="#appendix"&gt;&lt;span class="xref myst"&gt;Appendix&lt;/span&gt;&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# details in appendix&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;dict_keys([&amp;#39;hidden_layer_sizes&amp;#39;, &amp;#39;alpha&amp;#39;, &amp;#39;batch_size&amp;#39;, &amp;#39;learning_rate&amp;#39;&lt;/span&gt;
&lt;span class="go"&gt;           &amp;#39;learning_rate_init&amp;#39;, &amp;#39;power_t&amp;#39;, &amp;#39;momentum&amp;#39;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hidden_layer_sizes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# always 24 neurons&lt;/span&gt;
&lt;span class="go"&gt;[(24, ), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I chose these hyperparameters to have a complex search space that mimics the
searches performed for most neural networks. These searches typically involve
hyperparameters like “dropout”, “learning rate”, “momentum” and “weight
decay”.&lt;a class="footnote-reference brackets" href="#user-facing" id="id5" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;5&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;
End users don’t care about hyperparameters like these; they don’t change the
model architecture, they only affect finding the best model of a particular architecture.&lt;/p&gt;
&lt;p&gt;How can high performing hyperparameter values be found quickly?&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/30/dask-hyperparam-opt.md&lt;/span&gt;, line 205)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="finding-the-best-parameters"&gt;
&lt;h1&gt;Finding the best parameters&lt;/h1&gt;
&lt;p&gt;First, let’s look at the parameters required for Dask-ML’s implementation
of Hyperband (which is in the class &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;).&lt;/p&gt;
&lt;section id="hyperband-parameters-rule-of-thumb"&gt;
&lt;h2&gt;Hyperband parameters: rule-of-thumb&lt;/h2&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; has two inputs:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;max_iter&lt;/span&gt;&lt;/code&gt;, which determines how many times to call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the chunk size of the Dask array, which determines how much data each
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call receives.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These fall out pretty naturally once it’s known how long to train the best
model and very approximately how many parameters to sample:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;n_examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 50 passes through dataset for best model&lt;/span&gt;
&lt;span class="n"&gt;n_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;299&lt;/span&gt;  &lt;span class="c1"&gt;# sample about 300 parameters&lt;/span&gt;

&lt;span class="c1"&gt;# inputs to hyperband&lt;/span&gt;
&lt;span class="n"&gt;max_iter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;
&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_examples&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The inputs to this rule-of-thumb are exactly what the user cares about:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;a measure of how complex the search space is (via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_params&lt;/span&gt;&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;how long to train the best model (via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notably, there’s no tradeoff between &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_params&lt;/span&gt;&lt;/code&gt; like with
Scikit-learn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt; because &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; is only for &lt;em&gt;some&lt;/em&gt;
models, not for &lt;em&gt;all&lt;/em&gt; models. There’s more details on this
rule-of-thumb in the “Notes” section of the &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html#dask_ml.model_selection.HyperbandSearchCV"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;
docs&lt;/a&gt;.&lt;/p&gt;
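As a concrete sketch of this rule-of-thumb, assume a hypothetical training set of 60,000 examples (this dataset size is illustrative, not the one from this post's experiment):

```python
# Hedged sketch of the rule-of-thumb above; the dataset size here is
# hypothetical, not the one used in this post's experiment.
n_train = 60_000                       # stand-in for len(X_train)
n_examples = 50 * n_train              # 50 passes through the data for the best model
n_params = 299                         # sample roughly 300 hyperparameter values

max_iter = n_params                    # input to HyperbandSearchCV
chunk_size = n_examples // n_params    # Dask array chunk size

print(max_iter, chunk_size)            # 299 10033
```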
&lt;p&gt;With these inputs a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; object can easily be created.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="finding-the-best-performing-hyperparameters"&gt;
&lt;h2&gt;Finding the best performing hyperparameters&lt;/h2&gt;
&lt;p&gt;The Hyperband model selection algorithm is implemented in the class
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Let’s create an instance of that class:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggressiveness&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness&lt;/span&gt;&lt;/code&gt; defaults to 3. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness=4&lt;/span&gt;&lt;/code&gt; is chosen because this is an
&lt;em&gt;initial&lt;/em&gt; search; I know nothing about this search space, so the search
should be more aggressive in culling off bad models.&lt;/p&gt;
&lt;p&gt;Hyperband hides some details from the user (which enables the mathematical
guarantees), specifically the details on the amount of training and
the number of models created. These details are available in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;metadata&lt;/span&gt;&lt;/code&gt;
attribute:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;n_models&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;378&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;partial_fit_calls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;5721&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now that we have some idea on how long the computation will take, let’s ask it
to find the best set of hyperparameters:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The dashboard will be active during this time&lt;a class="footnote-reference brackets" href="#dashboard" id="id6" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;6&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;
&lt;video width="600" style="max-width: 100%;" autoplay loop controls &gt;
  &lt;source src="/images/2019-hyperband/dashboard-compress.mp4" type="video/mp4" &gt;
  Your browser does not support the video tag.
&lt;/video&gt;
&lt;/p&gt;
&lt;p&gt;How well do these hyperparameters perform?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;span class="go"&gt;0.9019221418447483&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; mirrors Scikit-learn’s API for &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"&gt;RandomizedSearchCV&lt;/a&gt;, so it
has access to all the expected attributes and methods:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;span class="go"&gt;{&amp;quot;batch_size&amp;quot;: 64, &amp;quot;hidden_layer_sizes&amp;quot;: [6, 6, 6, 6], ...}&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;0.8989070100111217&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_model_&lt;/span&gt;
&lt;span class="go"&gt;MLPClassifier(...)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Details on the attributes and methods are in the &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html"&gt;HyperbandSearchCV
documentation&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/30/dask-hyperparam-opt.md&lt;/span&gt;, line 322)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="performance"&gt;
&lt;h1&gt;Performance&lt;/h1&gt;
&lt;!--
Plot 1: how well does it do?
Plot 2: how does this scale?
Plot 3: what opportunities does Dask enable?
--&gt;
&lt;p&gt;I ran this 200 times on my personal laptop with 4 cores.
Let’s look at the distribution of final validation scores:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/final-acc.svg"
style="max-width: 100%;"
 width="400px"/&gt;&lt;/p&gt;
&lt;p&gt;The “passive” comparison is really &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt; configured so it takes
an equal amount of work as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Let’s see how this does over
time:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/val-acc.svg"
style="max-width: 100%;"
 width="400px"/&gt;&lt;/p&gt;
&lt;p&gt;This graph shows the mean score over the 200 runs with the solid line, and the
shaded region represents the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Interquartile_range"&gt;interquartile range&lt;/a&gt;. The dotted green
line indicates the data required to train 4 models to completion.
“Passes through the dataset” is a good proxy
for “time to solution” because there are only 4 workers.&lt;/p&gt;
&lt;p&gt;This graph shows that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; will find parameters at least 3 times
quicker than &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;section id="dask-opportunities"&gt;
&lt;h2&gt;Dask opportunities&lt;/h2&gt;
&lt;p&gt;What opportunities does combining Hyperband and Dask create?
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; has a lot of internal parallelism and Dask is an advanced task
scheduler.&lt;/p&gt;
&lt;p&gt;The most obvious opportunity involves job prioritization. Hyperband fits many
models in parallel, and Dask might not have that many
workers available. This means some jobs have to wait for other jobs
to finish. Of course, Dask can prioritize jobs&lt;a class="footnote-reference brackets" href="#prior" id="id7" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;7&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and choose which models
to fit first.&lt;/p&gt;
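As a toy illustration of the “high scores” scheme (pure Python, not Dask’s scheduler or Dask-ML’s internal code), a priority queue keyed on each model’s most recent score pops the best-scoring model first; the model names and scores below are made up:

```python
import heapq

# Toy illustration of the "high scores" prioritization scheme; the model
# names and scores are invented. Dask's real scheduler handles this through
# its priority mechanism, not a user-visible heap.
recent_scores = {"model-a": 0.72, "model-b": 0.90, "model-c": 0.55}

# heapq is a min-heap, so push negated scores to pop the highest score first.
queue = [(-score, name) for name, score in recent_scores.items()]
heapq.heapify(queue)

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order)  # ['model-b', 'model-a', 'model-c']
```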
&lt;p&gt;Let’s assign the priority for fitting a certain model to be the model’s most
recent score. How does this prioritization scheme influence the score? Let’s
compare the prioritization schemes in
a single run of the 200 above:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/priority.svg"
style="max-width: 100%;"
     width="400px" /&gt;&lt;/p&gt;
&lt;p&gt;These two lines are the same in every way except for
the prioritization scheme.
This graph compares the “high scores” prioritization scheme with Dask’s
default prioritization scheme (“fifo”).&lt;/p&gt;
&lt;p&gt;This graph is certainly helped by the fact that it is run with only 4 workers.
Job priority does not matter if every job can be run right away (there’s
nothing to assign priority to!).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="amenability-to-parallelism"&gt;
&lt;h2&gt;Amenability to parallelism&lt;/h2&gt;
&lt;p&gt;How does Hyperband scale with the number of workers?&lt;/p&gt;
&lt;p&gt;I ran another separate experiment to measure this. This experiment is described more in the &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;corresponding
paper&lt;/a&gt;, but the relevant difference is that a &lt;a class="reference external" href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; neural network is used
through &lt;a class="reference external" href="https://skorch.readthedocs.io/en/stable/"&gt;skorch&lt;/a&gt; instead of Scikit-learn’s MLPClassifier.&lt;/p&gt;
&lt;p&gt;I ran the &lt;em&gt;same&lt;/em&gt; experiment with a different number of Dask
workers.&lt;a class="footnote-reference brackets" href="#same" id="id8" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;8&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Here’s how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; scales:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/image-denoising/scaling-patience.svg" width="400px"
style="max-width: 100%;"
/&gt;&lt;/p&gt;
&lt;p&gt;Training one model to completion requires 243 seconds (which is marked by the
white line). This is a comparison with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt;, which stops training models
if their scores aren’t increasing enough. Functionally, this is very useful
because the user might accidentally specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; to be too large.&lt;/p&gt;
&lt;p&gt;It looks like the speedups start to saturate somewhere
between 16 and 24 workers, at least for this example.
Of course, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt; doesn’t work as well for a large number of
workers.&lt;a class="footnote-reference brackets" href="#scale-worker" id="id9" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;9&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/30/dask-hyperparam-opt.md&lt;/span&gt;, line 421)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future work&lt;/h1&gt;
&lt;p&gt;There are some ongoing pull requests to improve &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. The most
significant of these involves tweaking some Hyperband internals so &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;
works better with initial or very exploratory searches (&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/532"&gt;dask/dask-ml #532&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The biggest improvement I see is treating &lt;em&gt;dataset size&lt;/em&gt; as the scarce resource
that needs to be preserved instead of &lt;em&gt;training time&lt;/em&gt;. This would allow
Hyperband to work with any model, instead of only models that implement
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Serialization is an important part of the distributed Hyperband implementation
in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Scikit-learn and PyTorch can easily handle this because
they support the Pickle protocol&lt;a class="footnote-reference brackets" href="#pickle-post" id="id10" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;10&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;, but
Keras/Tensorflow/MXNet present challenges. The use of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; could
be increased by resolving this issue.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/30/dask-hyperparam-opt.md&lt;/span&gt;, line 444)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="appendix"&gt;
&lt;h1&gt;Appendix&lt;/h1&gt;
&lt;p&gt;I chose to tune 7 hyperparameters:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hidden_layer_sizes&lt;/span&gt;&lt;/code&gt;, which controls the activation function used at each
neuron&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;alpha&lt;/span&gt;&lt;/code&gt;, which controls the amount of regularization&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More hyperparameters control finding the best neural network:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;batch_size&lt;/span&gt;&lt;/code&gt;, which controls the number of examples the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;optimizer&lt;/span&gt;&lt;/code&gt; uses to
approximate the gradient&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;learning_rate&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;learning_rate_init&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;power_t&lt;/span&gt;&lt;/code&gt;, which control some basic
hyperparameters for the SGD optimizer I’ll be using&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;momentum&lt;/span&gt;&lt;/code&gt;, a more advanced hyperparameter for SGD with Nesterov’s momentum.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
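A search space along these lines might be written as follows. This is a hedged sketch: the candidate values are invented for illustration and are not the exact ones sampled in this post’s experiment.

```python
# Illustrative search space for MLPClassifier; the candidate values are
# made up for this sketch, not the exact ones from the post's experiment.
params = {
    "hidden_layer_sizes": [(24,), (12, 12), (6, 6, 6, 6)],  # always 24 neurons
    "alpha": [1e-6, 1e-5, 1e-4, 1e-3],                      # regularization strength
    "batch_size": [32, 64, 128],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "learning_rate_init": [1e-4, 1e-3, 1e-2],
    "power_t": [0.1, 0.25, 0.5],
    "momentum": [0.0, 0.9, 0.99],
}

assert len(params) == 7
assert all(sum(sizes) == 24 for sizes in params["hidden_layer_sizes"])
```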
&lt;/section&gt;
&lt;hr class="footnotes docutils" /&gt;
&lt;aside class="footnote-list brackets"&gt;
&lt;aside class="footnote brackets" id="alpha" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id1"&gt;1&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Which amounts to choosing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;alpha&lt;/span&gt;&lt;/code&gt; in Scikit-learn’s &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html"&gt;Ridge&lt;/a&gt; or &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html"&gt;LASSO&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="first" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id2"&gt;2&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;To the best of my knowledge, this is the first implementation of Hyperband with an advanced task scheduler&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="qual" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id3"&gt;3&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;More accurately, Hyperband will find close to the best model possible with &lt;span class="math notranslate nohighlight"&gt;\(N\)&lt;/span&gt; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; calls in expected score with high probability, where “close” means “within log terms of the upper bound on score”. For details, see Corollary 1 of the &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;corresponding paper&lt;/a&gt; or Theorem 5 of &lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband’s paper&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="skorch" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id4"&gt;4&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;through the Scikit-learn API wrapper &lt;a class="reference external" href="https://skorch.readthedocs.io/en/stable/"&gt;skorch&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="user-facing" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id5"&gt;5&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;There’s less tuning for adaptive step size methods like &lt;a class="reference external" href="https://arxiv.org/abs/1412.6980"&gt;Adam&lt;/a&gt; or &lt;a class="reference external" href="http://jmlr.org/papers/v12/duchi11a.html"&gt;Adagrad&lt;/a&gt;, but they might under-perform on the test data (see “&lt;a class="reference external" href="https://arxiv.org/abs/1705.08292"&gt;The Marginal Value of Adaptive Gradient Methods for Machine Learning&lt;/a&gt;”)&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="dashboard" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id6"&gt;6&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;But it probably won’t be this fast: the video is sped up by a factor of 3.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="prior" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id7"&gt;7&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Dask’s documentation on &lt;a class="reference external" href="https://distributed.dask.org/en/latest/priority.html"&gt;Prioritizing Work&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="same" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id8"&gt;8&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Everything is the same between different runs: the hyperparameters sampled, the model’s internal random state, the data passed for fitting. Only the number of workers varies.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="scale-worker" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id9"&gt;9&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;There’s no time benefit to stopping jobs early if there are infinite workers; there’s never a queue of jobs waiting to be run&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="pickle-post" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id10"&gt;10&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;“&lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle"&gt;Pickle isn’t slow, it’s a protocol&lt;/a&gt;” by Matthew Rocklin&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="regularization" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;11&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Performance comparison: Scikit-learn’s visualization of tuning a Support Vector Machine’s (SVM) regularization parameter: &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html"&gt;Scaling the regularization parameter for SVMs&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="new" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;12&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;It’s been around since 2016… and some call that “old news.”&lt;/p&gt;
&lt;/aside&gt;
&lt;/aside&gt;
</content>
    <link href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt/"/>
    <summary>Scott Sievert wrote this post. The original post lives at
https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/ with better
styling. This work is supported by Anaconda, Inc.</summary>
    <category term="dask-ml" label="dask-ml"/>
    <category term="machine-learning" label="machine-learning"/>
    <published>2019-09-30T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/09/13/jupyter-on-dask/</id>
    <title>Co-locating a Jupyter Server and Dask Scheduler</title>
    <updated>2019-09-13T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;If you want, you can have Dask set up a Jupyter notebook server for you,
co-located with the Dask scheduler. There are many ways to do this, but this
blog post lists two.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 13)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="first-why-would-you-do-this"&gt;

&lt;p&gt;Sometimes people inside of large institutions have complex deployment pains.
It takes them a while to stand up a process running on a machine in their
cluster, with all of the appropriate networking ports open and such.
In that situation, it can sometimes be nice to do this just once, say for Dask,
rather than twice, say for Dask and for Jupyter.&lt;/p&gt;
&lt;p&gt;Probably in these cases people should invest in a long term solution like
&lt;a class="reference external" href="https://jupyter.org/hub"&gt;JupyterHub&lt;/a&gt;,
or one of its enterprise variants,
but this blogpost gives a couple of hacks in the meantime.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 26)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hack-1-create-a-jupyter-server-from-a-python-function-call"&gt;
&lt;h1&gt;Hack 1: Create a Jupyter server from a Python function call&lt;/h1&gt;
&lt;p&gt;If your Dask scheduler is already running, connect to it with a Client and run
a Python function that starts up a Jupyter server.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;scheduler-address:8786&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;start_juptyer_server&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;notebook.notebookapp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;  &lt;span class="c1"&gt;# add command line args here if you want&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on_scheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_jupyter_server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you have a complex networking setup (maybe you’re on the cloud or HPC and
had to open up a port explicitly) then you might want to install
&lt;a class="reference external" href="https://jupyter-server-proxy.readthedocs.io/en/latest/"&gt;jupyter-server-proxy&lt;/a&gt;
(which Dask also uses by default if installed), and then go to
&lt;a class="reference external" href="https://example.com"&gt;http://scheduler-address:8787/proxy/8888&lt;/a&gt; . The Dask dashboard can route your
connection to Jupyter (Jupyter is also kind enough to do the same for Dask if
it is the main service).&lt;/p&gt;
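&lt;p&gt;To make that proxy path concrete, here is a small sketch of the URL being constructed. The host name is a placeholder, and the port numbers are just the Dask dashboard and Jupyter defaults:&lt;/p&gt;

```python
# Sketch of the URL pattern jupyter-server-proxy serves: the Dask dashboard's
# port, then /proxy/, then the port of the service being proxied.
scheduler_host = "scheduler-address"  # placeholder; use your scheduler's host
dashboard_port = 8787                 # Dask dashboard default
jupyter_port = 8888                   # Jupyter default

url = f"http://{scheduler_host}:{dashboard_port}/proxy/{jupyter_port}"
print(url)  # http://scheduler-address:8787/proxy/8888
```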
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 52)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hack-2-preload-script"&gt;
&lt;h1&gt;Hack 2: Preload script&lt;/h1&gt;
&lt;p&gt;This is also a great opportunity to learn about the various ways of &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/custom-startup.html"&gt;adding
custom startup and teardown&lt;/a&gt;.
One such way is a preload script like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# jupyter-preload.py&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;notebook.notebookapp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;dask_setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;dask-scheduler&lt;span class="w"&gt; &lt;/span&gt;--preload&lt;span class="w"&gt; &lt;/span&gt;jupyter-preload.py
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That script will run at an appropriate time during scheduler startup. You can
also put this into your configuration:&lt;/p&gt;
&lt;div class="highlight-yaml notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;distributed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;preload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/path/to/jupyter-preload.py&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 80)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="really-though-you-should-use-something-else"&gt;
&lt;h1&gt;Really though, you should use something else&lt;/h1&gt;
&lt;p&gt;This is mostly a hack. If you’re at an institution then you should ask for
something like &lt;a class="reference external" href="https://jupyter.org/hub"&gt;JupyterHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You might also want to run this in a separate subprocess, so that Jupyter
and the Dask scheduler don’t collide with each other. This shouldn’t be
much of a problem (they’re both pretty lightweight), but isolating them
probably makes sense.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 90)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="thanks-nick"&gt;
&lt;h1&gt;Thanks Nick!&lt;/h1&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/bollwyvl"&gt;Nick Bollweg&lt;/a&gt;, who answered a &lt;a class="reference external" href="https://github.com/jupyter/notebook/issues/4873"&gt;question on this topic&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/09/13/jupyter-on-dask/"/>
    <summary>If you want, you can have Dask set up a Jupyter notebook server for you,
co-located with the Dask scheduler. There are many ways to do this, but this
blog post lists two.</summary>
    <category term="HPC" label="HPC"/>
    <published>2019-09-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/08/28/dask-on-summit/</id>
    <title>Dask on HPC: a case study</title>
    <updated>2019-08-28T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;Dask is deployed on traditional HPC machines with increasing frequency.
In the past week I’ve personally helped four different groups get set up.
This is a surprisingly individual process,
because every HPC machine has its own idiosyncrasies.
Each machine uses a job scheduler like SLURM/PBS/SGE/LSF/…, a network file
system, and fast interconnect, but each of those sub-systems has slightly
different policies on a machine-by-machine basis, which is where things get tricky.&lt;/p&gt;
&lt;p&gt;Typically we can solve these problems in about 30 minutes if we have both:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Someone familiar with the machine, like a power-user or an IT administrator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Someone familiar with setting up Dask&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These systems span a wide range of scales. This week, at the two ends of
that range, I’ve seen both:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;A small in-house 24-node SLURM cluster for research work inside of a
bio-imaging lab&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Summit, the world’s most powerful supercomputer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post I’m going to share a few notes of what I went through in dealing
with Summit, which was particularly troublesome. Hopefully this gives a sense
for the kinds of situations that arise. These tips likely don’t apply to your
particular system, but hopefully they give a flavor of what can go wrong,
and the processes by which we track things down.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 35)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="power-architecture"&gt;

&lt;p&gt;First, Summit is an IBM PowerPC machine, meaning that packages compiled on
normal Intel chips won’t work. Fortunately, Anaconda maintains a download of
their distribution that works well with the Power architecture, so that gave me
a good starting point.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.anaconda.com/distribution/#linux"&gt;https://www.anaconda.com/distribution/#linux&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Packages do seem to be a few months older than in the normal distribution, but
I can live with that.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 47)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="install-dask-jobqueue-and-configure-basic-information"&gt;
&lt;h1&gt;Install Dask-Jobqueue and configure basic information&lt;/h1&gt;
&lt;p&gt;We need to tell Dask how many cores and how much memory is on each machine.
This process is fairly straightforward: it is well documented at
&lt;a class="reference external" href="https://jobqueue.dask.org"&gt;jobqueue.dask.org&lt;/a&gt; with an informative screencast,
and the error messages themselves point you toward the right keywords.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;specify&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="n"&gt;like&lt;/span&gt; &lt;span class="err"&gt;``&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="err"&gt;``&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
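&lt;p&gt;The cores= keyword takes an integer and memory= takes a human-readable string, which Dask parses along the lines of dask.utils.parse_bytes. A toy stdlib sketch of that parsing, just to show the convention (here GB means 10**9 bytes, matching Dask's reading of "GB"):&lt;/p&gt;

```python
import re

def parse_memory(text):
    """Toy sketch of parsing a memory string like "600 GB" into bytes.
    Dask's real parser is dask.utils.parse_bytes; this illustration only
    handles decimal units (GB = 10**9 bytes)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?)B?\s*", text, re.IGNORECASE)
    number, unit = match.groups()
    factor = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}[unit.upper()]
    return int(float(number) * factor)

parse_memory("600 GB")   # 600_000_000_000 bytes
```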
&lt;p&gt;I’m going to skip this section for now because, generally, novice users are
able to handle this. For more information, consider watching this YouTube
video (30m).&lt;/p&gt;
&lt;iframe width="560" height="315"
        src="https://www.youtube.com/embed/FXsgmwpRExM?rel=0"
        frameborder="0" allow="autoplay; encrypted-media"
        allowfullscreen&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 69)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="invalid-operations-in-the-job-script"&gt;
&lt;h1&gt;Invalid operations in the job script&lt;/h1&gt;
&lt;p&gt;So we make a cluster object with all of our information, we call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.scale&lt;/span&gt;&lt;/code&gt; and
we get some error message from the job scheduler.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;600 GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GEN119&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;00:30&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ask for three nodes&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Command:
bsub /tmp/tmp4874eufw.sh
stdout:

Typical usage:
  bsub [LSF arguments] jobscript
  bsub [LSF arguments] -Is $SHELL
  bsub -h[elp] [options]
  bsub -V

NOTES:
 * All jobs must specify a walltime (-W) and project id (-P)
 * Standard jobs must specify a node count (-nnodes) or -ln_slots. These jobs cannot specify a resource string (-R).
 * Expert mode jobs (-csm y) must specify a resource string and cannot specify -nnodes or -ln_slots.

stderr:
ERROR: Resource strings (-R) are not supported in easy mode. Please resubmit without a resource string.
ERROR: -n is no longer supported. Please request nodes with -nnodes.
ERROR: No nodes requested. Please request nodes with -nnodes.
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask-Jobqueue tried to generate a sensible job script from the inputs that you
provided, but the resource manager that you’re using may have additional
policies that are unique to that cluster. We debug this by looking at the
generated script, and comparing against scripts that are known to work on the
HPC machine.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_script&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="ch"&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class="c1"&gt;#BSUB -J dask-worker&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -P GEN119&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -n 128&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -R &amp;quot;span[hosts=1]&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -M 600000&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -W 00:30&lt;/span&gt;
&lt;span class="nv"&gt;JOB_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LSB_JOBID&lt;/span&gt;&lt;span class="p"&gt;%.*&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;

/ccs/home/mrocklin/anaconda/bin/python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;distributed.cli.dask_worker&lt;span class="w"&gt; &lt;/span&gt;tcp://scheduler:8786&lt;span class="w"&gt; &lt;/span&gt;--nthreads&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--nprocs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--memory-limit&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;75&lt;/span&gt;.00GB&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;--nanny&lt;span class="w"&gt; &lt;/span&gt;--death-timeout&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;60&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--interface&lt;span class="w"&gt; &lt;/span&gt;ib0
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
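&lt;p&gt;As an aside, the --nprocs 8 --nthreads 16 --memory-limit 75.00GB flags above follow from cores=128 and memory="600 GB": the job's cores are split into a few worker processes, each with an equal share of the threads and memory. A rough sketch of that arithmetic (the power-of-two square-root heuristic here is an assumption for illustration, not Dask-Jobqueue's exact rule):&lt;/p&gt;

```python
import math

def split_resources(cores, memory_gb):
    # Split one job's cores into worker processes and threads, keeping
    # processes * threads == cores. The square-root heuristic below is
    # illustrative only; Dask-Jobqueue's actual rule differs.
    processes = 2 ** int(math.log2(math.sqrt(cores)))
    threads = cores // processes
    memory_limit_gb = memory_gb / processes   # each process gets an equal share
    return processes, threads, memory_limit_gb

split_resources(128, 600)   # (8, 16, 75.0), matching the flags above
```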
&lt;p&gt;After comparing notes with existing scripts that we know to work on Summit,
we modify keywords to add and remove certain lines in the header.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;500 GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GEN119&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;00:30&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-nnodes 1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;          &lt;span class="c1"&gt;# &amp;lt;--- new!&lt;/span&gt;
    &lt;span class="n"&gt;header_skip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-n &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;--- new!&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And when we call scale this seems to make LSF happy. It no longer dumps out
large error messages.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# things seem to pass&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 153)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="workers-don-t-connect-to-the-scheduler"&gt;
&lt;h1&gt;Workers don’t connect to the Scheduler&lt;/h1&gt;
&lt;p&gt;So things seem fine from LSF’s perspective, but when we connect up a client to
our cluster we don’t see anything arriving.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;Client: scheduler=&amp;#39;tcp://10.41.0.34:41107&amp;#39; processes=0 cores=0&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Two things to check. First, have the jobs actually made it through the queue?
Typically we use a resource manager command like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;qstat&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;squeue&lt;/span&gt;&lt;/code&gt;, or
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bjobs&lt;/span&gt;&lt;/code&gt; for this. Maybe our jobs are trapped in the queue?&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ bash
JOBID   USER       STAT   SLOTS    QUEUE       START_TIME    FINISH_TIME   JOB_NAME
600785  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
600786  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
600784  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Nope, it looks like they’re in a running state. Now we go and look at their
logs. It can sometimes be tricky to track down the log files from your jobs,
but your IT administrator should know where they are. Often they’re where you
ran your job from, and have the Job ID in the filename.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ cat dask-worker.600784.err
distributed.worker - INFO -       Start worker at: tcp://128.219.134.81:44053
distributed.worker - INFO -          Listening to: tcp://128.219.134.81:44053
distributed.worker - INFO -          dashboard at:       128.219.134.81:34583
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         16
distributed.worker - INFO -                Memory:                   75.00 GB
distributed.worker - INFO -       Local Directory: /autofs/nccs-svm1_home1/mrocklin/worker-ybnhk4ib
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
...
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So the worker processes have started, but they’re having difficulty connecting
to the scheduler. When we ask an IT administrator, they identify the address
here as being on the wrong network interface:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="mf"&gt;128.219.134.74&lt;/span&gt;  &lt;span class="o"&gt;&amp;lt;---&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;accessible&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So we run &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ifconfig&lt;/span&gt;&lt;/code&gt;, and find the infiniband network interface, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ib0&lt;/span&gt;&lt;/code&gt;, which
is more broadly accessible.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;500 GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GEN119&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;00:30&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-nnodes 1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;header_skip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-n &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;# &amp;lt;--- new!&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
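&lt;p&gt;If you would rather discover candidate interface names from Python than from ifconfig, the standard library can list them (on Linux):&lt;/p&gt;

```python
import socket

# List this machine's network interface names (Linux); on an HPC node you
# would look for an InfiniBand interface such as ib0 and pass it to
# LSFCluster(..., interface="ib0").
interfaces = [name for index, name in socket.if_nameindex()]
print(interfaces)
```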
&lt;p&gt;We try this out and still, no luck :(&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 227)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="interactive-nodes"&gt;
&lt;h1&gt;Interactive nodes&lt;/h1&gt;
&lt;p&gt;The expert user then says “Oh, our login nodes are pretty locked down, let’s try
this from an interactive compute node. Things tend to work better there.” We
run some arcane bash command (I’ve never seen two of these that look alike so
I’m going to omit it here), and things magically start working. Hooray!&lt;/p&gt;
&lt;p&gt;We run a tiny Dask computation just to prove that we can do some work.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;11&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Actually, it turns out that we were eventually able to get things running from
the login nodes on Summit using a slightly different &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bsub&lt;/span&gt;&lt;/code&gt; command in LSF, but
I’m going to omit details here because we’re fixing this in Dask and it’s
unlikely to affect future users (I hope?). Locked-down login nodes remain a
common cause of failed connections across a variety of systems, affecting
something like 30% of the systems that I interact with.&lt;/p&gt;
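As a quick sanity check on a new system, a small socket probe can tell you whether a node is allowed to open outbound TCP connections to the scheduler at all. This is a stdlib-only sketch, not part of the original post; the host and port in the comment are illustrative:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# Probe the scheduler port (8786 is Dask's default) from a login node:
# can_connect("192.168.0.1", 8786)
```

If this returns `False` from the login node but `True` from a compute node, you are in the locked-down situation described above.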
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 249)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="ssh-tunneling"&gt;
&lt;h1&gt;SSH Tunneling&lt;/h1&gt;
&lt;p&gt;It’s important to get the dashboard up and running so that you can see what’s
going on. Typically we do this with SSH tunneling. Most HPC people know how
to do this and it’s covered in the YouTube screencast above, so I’m going to
skip it here.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 256)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="jupyter-lab"&gt;
&lt;h1&gt;Jupyter Lab&lt;/h1&gt;
&lt;p&gt;Many interactive Dask users on HPC today are moving towards using JupyterLab.
This choice gives them a notebook, terminals, file browser, and Dask’s
dashboard all in a single web tab. This greatly reduces the number of times
they have to SSH in, and, with the magic of web proxies, means that they only
need to tunnel once.&lt;/p&gt;
&lt;p&gt;I conda installed JupyterLab and a proxy library, and then tried to
&lt;a class="reference external" href="https://github.com/dask/dask-labextension#installation"&gt;set up the Dask JupyterLab extension&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;jupyterlab&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;  &lt;span class="c1"&gt;# to route dashboard through Jupyter&amp;#39;s port&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Next, we’re going to install the
&lt;a class="reference external" href="https://github.com/dask/dask-labextension"&gt;Dask Labextension&lt;/a&gt; into JupyterLab
in order to get the Dask Dashboard directly into our Jupyter session.
For that, we need &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nodejs&lt;/span&gt;&lt;/code&gt; in order to install things into JupyterLab.
I thought that this was going to be a pain, given the Power architecture, but
amazingly, this also seems to be in Anaconda’s default Power channel.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;mrocklin@login2.summit $ conda install nodejs  # Thanks conda packaging devs!
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then I install Dask-Labextension, which is both a Python and a JavaScript
package:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask_labextension&lt;/span&gt;
&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;labextension&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;labextension&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then I set up a password for my Jupyter sessions&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;notebook&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And run JupyterLab in a network-friendly way&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;mrocklin@login2.summit $ jupyter lab --no-browser --ip=&amp;quot;login2&amp;quot;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And set up a single SSH tunnel from my home machine to the login node&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Be sure to match the login node&amp;#39;s hostname and the Jupyter port below

mrocklin@my-laptop $ ssh -L 8888:login2:8888 summit.olcf.ornl.gov
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I can now connect to Jupyter from my laptop by navigating to
&lt;a class="reference external" href="http://localhost:8888"&gt;http://localhost:8888&lt;/a&gt;, run the cluster commands above in a notebook, and
things work great. Additionally, thanks to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;jupyter-server-proxy&lt;/span&gt;&lt;/code&gt;, Dask’s
dashboard is also available at &lt;a class="reference external" href="http://localhost:8888/proxy/####/status"&gt;http://localhost:8888/proxy/####/status&lt;/a&gt;, where
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;####&lt;/span&gt;&lt;/code&gt; is the port currently hosting Dask’s dashboard. You can find
this by looking at &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cluster.dashboard_link&lt;/span&gt;&lt;/code&gt;. It defaults to 8787, but if
you’ve started several Dask schedulers on the system recently that port may
already be taken, in which case Dask falls back to a random port.&lt;/p&gt;
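The port is easy to pull out of the dashboard link programmatically. A small stdlib-only sketch (the URL shown is illustrative; the helper name is not part of Dask):

```python
from urllib.parse import urlparse

def dashboard_port(dashboard_link: str, default: int = 8787) -> int:
    """Pull the port out of a dashboard URL such as 'http://10.41.0.34:8787/status'."""
    port = urlparse(dashboard_link).port
    return port if port is not None else default

print(dashboard_port("http://10.41.0.34:8787/status"))  # prints 8787
```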
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 320)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="configuration-files"&gt;
&lt;h1&gt;Configuration files&lt;/h1&gt;
&lt;p&gt;I don’t want to keep typing all of these commands, so now I put things into a
single configuration file, and plop that file into &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;~/.config/dask/summit.yaml&lt;/span&gt;&lt;/code&gt;
(any filename that ends in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.yaml&lt;/span&gt;&lt;/code&gt; will do).&lt;/p&gt;
&lt;div class="highlight-yaml notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;jobqueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;lsf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;128&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;processes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;8&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;500 GB&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;job-extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-nnodes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ib0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;header-skip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-n&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;

&lt;span class="nt"&gt;labextension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;dask_jobqueue&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;LSFCluster&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;your-project-id&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 349)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="slow-worker-startup"&gt;
&lt;h1&gt;Slow worker startup&lt;/h1&gt;
&lt;p&gt;Now that things are easier to use I find myself using the system more, and some
other problems arise.&lt;/p&gt;
&lt;p&gt;I notice that it takes a long time to start up a worker. It seems to hang
intermittently during startup, so I add a few lines to
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed/__init__.py&lt;/span&gt;&lt;/code&gt; to print out the state of the main Python thread
every second, to see where this is happening:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;threading&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;

&lt;span class="n"&gt;main_thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_ident&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_frames&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;main_thread&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;thraed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
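The same watchdog can be written with only the standard library, using `traceback` in place of Dask's internal `profile` module. This is a sketch rather than the code from the post; either version periodically prints the main thread's stack so you can see where startup is stuck:

```python
import sys
import threading
import traceback

def watch_main_thread(interval: float = 1.0) -> threading.Event:
    """Print the main thread's stack every `interval` seconds until the
    returned Event is set. Useful for spotting where startup hangs."""
    main_ident = threading.main_thread().ident
    stop = threading.Event()

    def dump() -> None:
        # Event.wait doubles as the sleep and the exit condition
        while not stop.wait(interval):
            frame = sys._current_frames().get(main_ident)
            if frame is not None:
                traceback.print_stack(frame)  # writes to sys.stderr

    threading.Thread(target=dump, daemon=True).start()
    return stop
```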
&lt;p&gt;This prints out a traceback that brings us to this code in Dask:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_locking_enabled&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;DIR_LOCK_EXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Locking &lt;/span&gt;&lt;span class="si"&gt;%r&lt;/span&gt;&lt;span class="s2"&gt;...&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Avoid a race condition before locking the file&lt;/span&gt;
        &lt;span class="c1"&gt;# by taking the global lock&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_global_lock&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;locket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;It looks like Dask is trying to use a file-based lock.
Unfortunately some NFS systems don’t like file-based locks, or handle them very
slowly. In the case of Summit, the home directory is actually mounted
read-only from the compute nodes, so a file-based lock will simply fail.
Looking up the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_locking_enabled&lt;/span&gt;&lt;/code&gt; function we see that it checks a
configuration value.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is_locking_enabled&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;distributed.worker.use-file-locking&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
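For intuition, `dask.config.get` resolves a dotted key against the merged configuration dictionaries, roughly like the stdlib-only sketch below (illustrative only; Dask's real implementation also handles defaults, environment variables, and key normalization):

```python
def config_get(config: dict, key: str, default=None):
    """Resolve a dotted key like 'distributed.worker.use-file-locking'
    against a nested configuration dictionary."""
    node = config
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

cfg = {"distributed": {"worker": {"use-file-locking": False}}}
print(config_get(cfg, "distributed.worker.use-file-locking"))  # prints False
```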
&lt;p&gt;So we add that to our config file. At the same time I switch from the
forkserver to the spawn multiprocessing method (I thought that this might
help; it didn’t, but the change is relatively harmless).&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;spawn&lt;/span&gt;
    &lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;locking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

&lt;span class="n"&gt;jobqueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;lsf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
    &lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="n"&gt;GB&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-nnodes 1&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ib0&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-n &amp;quot;&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;

&lt;span class="n"&gt;labextension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;dask_jobqueue&amp;#39;&lt;/span&gt;
     &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;LSFCluster&amp;#39;&lt;/span&gt;
     &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
     &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 435)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This post outlines many issues that I ran into when getting Dask to run on
one specific HPC system. These problems aren’t universal, so you may not run
into them, but they’re also not super-rare. Mostly my objective in writing
this up is to give people a sense of the sorts of problems that arise when
Dask and an HPC system interact.&lt;/p&gt;
&lt;p&gt;None of the problems above are that serious. They’ve all happened before and
they all have solutions that can be written down in a configuration file.
Finding the problem, though, can be challenging, and often requires the
combined expertise of individuals that are experienced with Dask and with that
particular HPC system.&lt;/p&gt;
&lt;p&gt;There are a few configuration files posted at
&lt;a class="reference external" href="https://jobqueue.dask.org/en/latest/configurations.html"&gt;jobqueue.dask.org/en/latest/configurations.html&lt;/a&gt;, which may be informative. The &lt;a class="reference external" href="https://github.com/dask/dask-jobqueue/issues"&gt;Dask Jobqueue issue tracker&lt;/a&gt; is also a fairly friendly place, full of both IT professionals and Dask experts.&lt;/p&gt;
&lt;p&gt;Also, as a reminder, you don’t need to have an HPC machine in order to use
Dask. Dask is conveniently deployable on Cloud, Hadoop, and local
systems. See the &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup.html"&gt;Dask setup
documentation&lt;/a&gt; for more
information.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 458)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work-gpus"&gt;
&lt;h1&gt;Future work: GPUs&lt;/h1&gt;
&lt;p&gt;Summit is fast because it has a ton of GPUs. I’m going to work on that next,
but that will probably cover enough content to fill up a whole other blogpost :)&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 463)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="branches"&gt;
&lt;h1&gt;Branches&lt;/h1&gt;
&lt;p&gt;For anyone playing along at home (or on Summit), I’m operating from the
following development branches:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/distributed&amp;#64;master"&gt;dask/distributed&amp;#64;master&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/mrocklin/dask-jobqueue&amp;#64;spec-rewrite"&gt;mrocklin/dask-jobqueue&amp;#64;spec-rewrite&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hopefully, within a month of writing this article, everything will be
in a nicely released state.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/08/28/dask-on-summit/"/>
    <summary>Dask is deployed on traditional HPC machines with increasing frequency.
In the past week I’ve personally helped four different groups get set up.
This is a surprisingly individual process,
because every HPC machine has its own idiosyncrasies.
Each machine uses a job scheduler like SLURM/PBS/SGE/LSF/…, a network file
system, and fast interconnect, but each of those sub-systems have slightly
different policies on a machine-by-machine basis, which is where things get tricky.</summary>
    <category term="HPC" label="HPC"/>
    <published>2019-08-28T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/08/09/image-itk/</id>
    <title>Dask and ITK for large scale image analysis</title>
    <updated>2019-08-09T00:00:00+00:00</updated>
    <author>
      <name>Matthew McCormick</name>
    </author>
<content type="html">
&lt;section id="executive-summary"&gt;

&lt;p&gt;This post explores using the &lt;a class="reference external" href="https://www.itk.org"&gt;ITK&lt;/a&gt; suite of image processing utilities in parallel with Dask Array.&lt;/p&gt;
&lt;p&gt;We cover …&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A simple but common example of applying deconvolution across a stack of 3d images&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tips on how to make these two libraries work well together&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Challenges that we ran into and opportunities for future improvements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 19)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="a-worked-example"&gt;
&lt;h1&gt;A Worked Example&lt;/h1&gt;
&lt;p&gt;Let’s start with a full example applying Richardson Lucy deconvolution to a
stack of light sheet microscopy data. This is the same data that we showed how
to load in our &lt;a class="reference external" href="https://blog.dask.org/2019/06/20/load-image-data"&gt;last blogpost on image loading&lt;/a&gt;.
You can &lt;a class="reference external" href="https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ"&gt;access the data as TIFF files from Google Drive here&lt;/a&gt;, and access the &lt;a class="reference external" href="https://drive.google.com/drive/folders/13udO-h9epItG5MNWBp0VxBkKCllYBLQF"&gt;corresponding point spread function images here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Load our data from last time¶&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;imgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AOLLSMData_m4_raw.zarr/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 188.74 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (3, 199, 201, 1024, 768) &lt;/td&gt; &lt;td&gt; (1, 1, 201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 598 Tasks &lt;/td&gt;&lt;td&gt; 597 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="404" height="206" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="45" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="9" x2="45" y2="9" /&gt;
  &lt;line x1="0" y1="18" x2="45" y2="18" /&gt;
  &lt;line x1="0" y1="27" x2="45" y2="27" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="27" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="27" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="27" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="27" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="27" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="27" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="27" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="27" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="27" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="27" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="27" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="27" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="27" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="27" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="27" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="27" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="27" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="27" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="27" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="27" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="27" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="27" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="27" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="27" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="27" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="27" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="27" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="27" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="27" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="27" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="27" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="27" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="27" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="27" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="27" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="27" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="27" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="27" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="27" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="27" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="27" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="27" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="27" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="27" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="27" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="27" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="27" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="27" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="27" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="27" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="27" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="27" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="27" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="27" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="27" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="27" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="27" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="27" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="27" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="27" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="27" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="27" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="27" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="27" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="27" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="27" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="27" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="27" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="27" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="27" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="27" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="27" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="27" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="27" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="27" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="27" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="27" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="27" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="27" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="27" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="27" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="27" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="27" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="27" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="27" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="27" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="27" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="27" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="27" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="27" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="27" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="27" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="27" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="27" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="27" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="27" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="27" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="27" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="27" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="27" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="27" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="27" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="27" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="27" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="27" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="27" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="27" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="27" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="27" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="27" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="27" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="27" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="27" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="27" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="27" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="27" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="27" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="27" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="27" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="27" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="27" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="27" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="27" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="27" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="27" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="27" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="27" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="27" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="27" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="27" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="27" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="27" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="27" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="27" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="27" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="27" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="27" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="27" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="27" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="27" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="27" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="27" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="27" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="27" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="27" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="27" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="27" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="27" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="27" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="27" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="27" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="27" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="27" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="27" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="27" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="27" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="27" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="27" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="27" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="27" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="27" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="27" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="27" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="27" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="27" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="27" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="27" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="27" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="27" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="27" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="27" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="27" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="27" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="27" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="27" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="27" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="27" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="27" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="27" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="27" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="27" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="27" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="27" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="27" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="27" /&gt;
  &lt;line x1="42" y1="0" x2="42" y2="27" /&gt;
  &lt;line x1="42" y1="0" x2="42" y2="27" /&gt;
  &lt;line x1="42" y1="0" x2="42" y2="27" /&gt;
  &lt;line x1="42" y1="0" x2="42" y2="27" /&gt;
  &lt;line x1="43" y1="0" x2="43" y2="27" /&gt;
  &lt;line x1="43" y1="0" x2="43" y2="27" /&gt;
  &lt;line x1="43" y1="0" x2="43" y2="27" /&gt;
  &lt;line x1="43" y1="0" x2="43" y2="27" /&gt;
  &lt;line x1="44" y1="0" x2="44" y2="27" /&gt;
  &lt;line x1="44" y1="0" x2="44" y2="27" /&gt;
  &lt;line x1="44" y1="0" x2="44" y2="27" /&gt;
  &lt;line x1="44" y1="0" x2="44" y2="27" /&gt;
  &lt;line x1="44" y1="0" x2="44" y2="27" /&gt;
  &lt;line x1="45" y1="0" x2="45" y2="27" /&gt;
  &lt;line x1="45" y1="0" x2="45" y2="27" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 45.378219,0.000000 45.378219,27.530335 0.000000,27.530335" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="22.689110" y="47.530335" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;199&lt;/text&gt;
&lt;text x="65.378219" y="13.765167" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,65.378219,13.765167)"&gt;3&lt;/text&gt;&lt;/p&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="115" y1="0" x2="141" y2="26" style="stroke-width:2" /&gt;
  &lt;line x1="115" y1="130" x2="141" y2="156" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="115" y1="0" x2="115" y2="130" style="stroke-width:2" /&gt;
  &lt;line x1="141" y1="26" x2="141" y2="156" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="115.000000,0.000000 141.720328,26.720328 141.720328,156.720328 115.000000,130.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="115" y1="0" x2="212" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="141" y1="26" x2="239" y2="26" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="115" y1="0" x2="141" y2="26" style="stroke-width:2" /&gt;
  &lt;line x1="212" y1="0" x2="239" y2="26" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="115.000000,0.000000 212.500000,0.000000 239.220328,26.720328 141.720328,26.720328" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="141" y1="26" x2="239" y2="26" style="stroke-width:2" /&gt;
  &lt;line x1="141" y1="156" x2="239" y2="156" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="141" y1="26" x2="141" y2="156" style="stroke-width:2" /&gt;
  &lt;line x1="239" y1="26" x2="239" y2="156" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="141.720328,26.720328 239.220328,26.720328 239.220328,156.720328 141.720328,156.720328" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="190.470328" y="176.720328" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="259.220328" y="91.720328" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,259.220328,91.720328)"&gt;1024&lt;/text&gt;
&lt;text x="118.360164" y="163.360164" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,118.360164,163.360164)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;This dataset has shape &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(3,&lt;/span&gt; &lt;span class="pre"&gt;199,&lt;/span&gt; &lt;span class="pre"&gt;201,&lt;/span&gt; &lt;span class="pre"&gt;1024,&lt;/span&gt; &lt;span class="pre"&gt;768)&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;3 fluorescence color channels,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;199 time points,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;201 z-slices,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1024 pixels in the y dimension, and&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;768 pixels in the x dimension.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
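&lt;p&gt;As a quick sanity check on the Bytes row of the table above (not part of the original pipeline), the array size follows directly from the shape and dtype, since &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;uint16&lt;/span&gt;&lt;/code&gt; stores two bytes per element:&lt;/p&gt;

```python
import numpy as np

# Sanity check of the "Bytes" row above:
# uint16 is 2 bytes per element
shape = (3, 199, 201, 1024, 768)
nbytes = int(np.prod(shape)) * np.dtype("uint16").itemsize
# 188738961408 bytes, i.e. roughly 188.74 GB
```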
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Load our Point Spread Function (PSF)&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array.image&lt;/span&gt;
&lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AOLLSMData/m4/psfs_z0p1/*.tif&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 2.48 MB &lt;/td&gt; &lt;td&gt; 827.39 kB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (3, 1, 101, 64, 64) &lt;/td&gt; &lt;td&gt; (1, 1, 101, 64, 64) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 6 Tasks &lt;/td&gt;&lt;td&gt; 3 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="402" height="208" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="27" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="11" x2="27" y2="11" /&gt;
  &lt;line x1="0" y1="22" x2="27" y2="22" /&gt;
  &lt;line x1="0" y1="33" x2="27" y2="33" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="33" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="33" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 27.530335,0.000000 27.530335,33.941765 0.000000,33.941765" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="13.765167" y="53.941765" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;1&lt;/text&gt;
&lt;text x="47.530335" y="16.970882" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,47.530335,16.970882)"&gt;3&lt;/text&gt;&lt;/p&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="97" y1="0" x2="173" y2="76" style="stroke-width:2" /&gt;
  &lt;line x1="97" y1="82" x2="173" y2="158" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="97" y1="0" x2="97" y2="82" style="stroke-width:2" /&gt;
  &lt;line x1="173" y1="76" x2="173" y2="158" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="97.000000,0.000000 173.470588,76.470588 173.470588,158.846826 97.000000,82.376238" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="97" y1="0" x2="179" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="173" y1="76" x2="255" y2="76" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="97" y1="0" x2="173" y2="76" style="stroke-width:2" /&gt;
  &lt;line x1="179" y1="0" x2="255" y2="76" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="97.000000,0.000000 179.376238,0.000000 255.846826,76.470588 173.470588,76.470588" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="173" y1="76" x2="255" y2="76" style="stroke-width:2" /&gt;
  &lt;line x1="173" y1="158" x2="255" y2="158" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="173" y1="76" x2="173" y2="158" style="stroke-width:2" /&gt;
  &lt;line x1="255" y1="76" x2="255" y2="158" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="173.470588,76.470588 255.846826,76.470588 255.846826,158.846826 173.470588,158.846826" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="214.658707" y="178.846826" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;64&lt;/text&gt;
&lt;text x="275.846826" y="117.658707" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,275.846826,117.658707)"&gt;64&lt;/text&gt;
&lt;text x="125.235294" y="140.611532" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,125.235294,140.611532)"&gt;101&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Convert data to float32 for computation¶&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;imgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Note: the psf needs to be sampled with a voxel spacing&lt;/span&gt;
&lt;span class="c1"&gt;# consistent with the image&amp;#39;s sampling&lt;/span&gt;
&lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
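&lt;p&gt;Note that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;astype&lt;/span&gt;&lt;/code&gt; on a Dask array is lazy: it records the cast for each chunk without loading any data. A minimal sketch on toy data (not the original images):&lt;/p&gt;

```python
import numpy as np
import dask.array as da

# A minimal sketch on toy data: astype records the cast per chunk
# without computing anything until we ask for a result
x = da.ones((4, 4), chunks=(2, 2), dtype="uint16")
y = x.astype(np.float32)          # still lazy
total = float(y.sum().compute())  # computation happens here
```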
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Apply Richardson-Lucy Deconvolution¶&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Apply deconvolution to a single chunk of data &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;itk&lt;/span&gt;

    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# remove leading two length-one dimensions&lt;/span&gt;
    &lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# remove leading two length-one dimensions&lt;/span&gt;

    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_view_from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Convert to ITK object&lt;/span&gt;
    &lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_view_from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert to ITK object&lt;/span&gt;

    &lt;span class="n"&gt;deconvolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy_deconvolution_image_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;kernel_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;number_of_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array_from_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deconvolved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert back to Numpy array&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Add back the leading length-one dimensions&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Create a local cluster of dask worker processes&lt;/span&gt;
&lt;span class="c1"&gt;# (this could also point to a distributed cluster if you have it)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threads_per_process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# now dask operations use this cluster by default&lt;/span&gt;

&lt;span class="c1"&gt;# Trigger computation and store&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AOLLSMData_m4_raw.zarr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;deconvolved&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So in the example above we …&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Load data both from Zarr and TIFF files into multi-chunked Dask arrays&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Construct a function to apply an ITK routine onto each chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apply that function across the dask array with the &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.core.map_blocks"&gt;dask.array.map_blocks&lt;/a&gt; function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store the result back into Zarr format&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;From the perspective of an imaging scientist,
the new piece of technology here is the
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.core.map_blocks"&gt;dask.array.map_blocks&lt;/a&gt; function.
Given a Dask array composed of many NumPy arrays and a function, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; applies that function across each block in parallel, returning a Dask array as a result.
It’s a great tool whenever you want to apply an operation across many blocks in a simple fashion.
Because Dask arrays are just made out of Numpy arrays, it’s an easy way to
compose Dask with the rest of the scientific Python ecosystem.&lt;/p&gt;
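&lt;p&gt;For intuition, here is a tiny self-contained example of the same pattern on toy data, with a doubling function standing in for the deconvolution step:&lt;/p&gt;

```python
import numpy as np
import dask.array as da

# A toy stand-in for the deconvolution step: map_blocks applies a
# NumPy-in, NumPy-out function to every chunk in parallel
def double(block):
    return block * 2

x = da.ones((4, 4), chunks=(2, 2))           # four 2x2 NumPy blocks
y = da.map_blocks(double, x, dtype=x.dtype)  # lazy, chunk-wise
result = y.compute()                         # each block doubled
```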
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 459)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="building-the-right-function"&gt;
&lt;h1&gt;Building the right function&lt;/h1&gt;
&lt;p&gt;However, in this case there are a few challenges to constructing the right Numpy
-&amp;gt; Numpy function, due to idiosyncrasies in both ITK and Dask Array. Let’s
look at our function again:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Apply deconvolution to a single chunk of data &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;itk&lt;/span&gt;

    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# remove leading two length-one dimensions&lt;/span&gt;
    &lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# remove leading two length-one dimensions&lt;/span&gt;

    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_view_from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Convert to ITK object&lt;/span&gt;
    &lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_view_from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert to ITK object&lt;/span&gt;

    &lt;span class="n"&gt;deconvolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy_deconvolution_image_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;kernel_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;number_of_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array_from_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deconvolved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert back to Numpy array&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Add back the leading length-one dimensions&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is longer than we would like.
Instead, we would have preferred to just use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;itk&lt;/span&gt;&lt;/code&gt; function directly,
without all of the steps before and after.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;deconvolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy_deconvolution_image_filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What were the extra steps in our function and why were they necessary?&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Convert to and from ITK Image objects&lt;/strong&gt;: ITK functions don’t consume and
produce Numpy arrays; they consume and produce their own &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Image&lt;/span&gt;&lt;/code&gt; data
structure. There are convenient functions to convert back and forth,
so handling this is straightforward, but it does need to be handled each
time. See &lt;a class="reference external" href="https://github.com/InsightSoftwareConsortium/ITK/issues/1136"&gt;ITK #1136&lt;/a&gt; for a
feature request that would remove the need for this step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unpack and pack singleton dimensions&lt;/strong&gt;: Our Dask arrays have shapes like
the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;Array&lt;/span&gt; &lt;span class="n"&gt;Shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;199&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Chunk&lt;/span&gt; &lt;span class="n"&gt;Shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; function gets NumPy arrays of the chunk size,
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(1,&lt;/span&gt; &lt;span class="pre"&gt;1,&lt;/span&gt; &lt;span class="pre"&gt;201,&lt;/span&gt; &lt;span class="pre"&gt;1024,&lt;/span&gt; &lt;span class="pre"&gt;768)&lt;/span&gt;&lt;/code&gt;.
However, our ITK functions are meant to work on 3d arrays, not 5d arrays,
so we need to remove those first two dimensions.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# remove leading two length-one dimensions&lt;/span&gt;
&lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# remove leading two length-one dimensions&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And then when we’re done, Dask expects to get back 5d arrays like what it
provided, so we add these singleton dimensions back in&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Add back the leading length-one dimensions&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Again, this is straightforward for users who are accustomed to NumPy
slicing syntax, but does need to be done each time.
This adds some friction to our development process,
and is another step that can confuse users.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
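&lt;p&gt;The unpacking and repacking is plain NumPy slicing and is easy to check in isolation. A minimal sketch with a dummy chunk (the shapes here are illustrative, not from the dataset above):&lt;/p&gt;

```python
import numpy as np

# A dummy chunk shaped like what map_blocks hands our function:
# two leading length-one axes, then a small 3d image
chunk = np.zeros((1, 1, 8, 16, 16), dtype=np.float32)

core = chunk[0, 0, ...]           # drop the two leading length-one axes -> 3d
restored = core[None, None, ...]  # add them back so the output matches the input
```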
&lt;p&gt;But if you’re comfortable working around issues like these,
then ITK and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; make a powerful combination
for parallelizing ITK operations across a cluster.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 541)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="defining-a-dask-cluster"&gt;
&lt;h1&gt;Defining a Dask Cluster&lt;/h1&gt;
&lt;p&gt;Above we used &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.distributed.LocalCluster&lt;/span&gt;&lt;/code&gt; to set up 20 single-threaded
workers on our local workstation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threads_per_process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# now dask operations use this cluster by default&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you had a distributed resource, this is where you would connect it.
You would swap out &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt; with one of
&lt;a class="reference external" href="https://docs.dask.org/en/latest/setup.html"&gt;Dask’s other deployment options&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, we found that we needed to use many single-threaded processes rather than
one multi-threaded process because ITK functions seem to still hold onto the
GIL. This is fine; we just need to be aware of it so that we set up our Dask
workers appropriately with one thread per process for maximum efficiency.
See &lt;a class="reference external" href="https://github.com/InsightSoftwareConsortium/ITK/issues/1134"&gt;ITK #1134&lt;/a&gt;
for an active Github issue on this topic.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 563)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="serialization"&gt;
&lt;h1&gt;Serialization&lt;/h1&gt;
&lt;p&gt;We had some difficulty when using the ITK library across multiple processes,
because the library itself didn’t serialize well. (If you don’t understand
what that means, don’t worry). We solved a bit of this in
&lt;a class="reference external" href="https://github.com/InsightSoftwareConsortium/ITK/pull/1090"&gt;ITK #1090&lt;/a&gt;,
but some issues still remain.&lt;/p&gt;
&lt;p&gt;We got around this by including the import in the function rather than outside
of it.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;itk&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;--- we work around serialization issues by importing within the function&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That way each task imports itk individually, and we sidestep this issue.&lt;/p&gt;
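&lt;p&gt;The pattern itself is simple to sketch. Below, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;math&lt;/span&gt;&lt;/code&gt; stands in for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;itk&lt;/span&gt;&lt;/code&gt;; the point is that the import happens inside the function body, so the function that Dask ships to workers never captures the (potentially unserializable) module object:&lt;/p&gt;

```python
def process_chunk(x):
    # Import inside the function rather than at module scope: the task only
    # carries the function's code, and each worker performs the import
    # locally when the task actually runs.
    import math  # stand-in for `import itk`
    return math.sqrt(x)

result = process_chunk(4.0)
```

&lt;p&gt;Dask serializes functions with cloudpickle rather than plain pickle, but the principle is the same: keep hard-to-serialize objects out of the function’s captured state.&lt;/p&gt;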
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 581)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="trying-scikit-image"&gt;
&lt;h1&gt;Trying Scikit-Image&lt;/h1&gt;
&lt;p&gt;We also tried out the Richardson Lucy deconvolution operation in
&lt;a class="reference external" href="https://scikit-image.org/"&gt;Scikit-Image&lt;/a&gt;. Scikit-Image is known for being
more Scipy/Numpy native, but not always as fast as ITK. Our experience
confirmed this perception.&lt;/p&gt;
&lt;p&gt;First, we were glad to see that the scikit-image function worked with
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; immediately without any packing/unpacking, dimensionality, or
serialization issues:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;skimage.restoration&lt;/span&gt;

&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;restoration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# just works&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So all of that converting to and from image objects or removing and adding
singleton dimensions isn’t necessary here.&lt;/p&gt;
&lt;p&gt;In terms of performance we were also happy to see that Scikit-Image released
the GIL, so we were able to get very high reported CPU utilization when using a
small number of multi-threaded processes. However, even though CPU utilization
was high, our parallel performance was poor enough that we stuck with the ITK
solution, warts and all. More information about this is available in
Github issue &lt;a class="reference external" href="https://github.com/scikit-image/scikit-image/issues/4083"&gt;scikit-image #4083&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: sequentially on a single chunk, ITK ran in around 2 minutes while
scikit-image ran in 3 minutes. It was only once we started parallelizing that
things became slow.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Regardless, our goal in this experiment was to see how well ITK and Dask
array played together. It was nice to see what smooth integration looks like,
if only to motivate future development in ITK+Dask relations.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 616)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="numba-gufuncs"&gt;
&lt;h1&gt;Numba GUFuncs&lt;/h1&gt;
&lt;p&gt;An alternative to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.map_blocks&lt;/span&gt;&lt;/code&gt; is the Generalized Universal Function (gufunc).
These are functions that have many magical properties, one of which is that
they operate equally well on both NumPy and Dask arrays. If libraries like
ITK or Scikit-Image make their functions into gufuncs then they work without
users having to do anything special.&lt;/p&gt;
&lt;p&gt;The easiest way to implement gufuncs today is with Numba. I did this on our
wrapped &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;richardson_lucy&lt;/span&gt;&lt;/code&gt; function, just to show how it could work, in case
other libraries want to take this on in the future.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float32[:,:,:], float32[:,:,:], float32[:,:,:]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# we have to specify types&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;(i,j,k),(a,b,c)-&amp;gt;(i,j,k)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                          &lt;span class="c1"&gt;# and dimensionality explicitly&lt;/span&gt;
    &lt;span class="n"&gt;forceobj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# &amp;lt;---- no dimension unpacking!&lt;/span&gt;
    &lt;span class="n"&gt;iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_view_from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascontiguousarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_view_from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascontiguousarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;deconvolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;richardson_lucy_deconvolution_image_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;number_of_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array_from_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deconvolved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now this function works natively on either NumPy or Dask arrays&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;richardson_lucy_deconvolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- no map_blocks call!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that we’ve lost both the dimension unpacking and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; call.
Our function now knows enough information about how it can broadcast that Dask
can do the parallelization without being told what to do explicitly.&lt;/p&gt;
&lt;p&gt;This adds some burden onto library maintainers,
but makes the user experience much smoother.&lt;/p&gt;
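&lt;p&gt;The core idea can also be sketched without Numba: NumPy’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;np.vectorize&lt;/span&gt;&lt;/code&gt; accepts a gufunc-style &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;signature=&lt;/span&gt;&lt;/code&gt; declaring the 3d core dimensions, and then loops over any leading dimensions automatically. The filter body here is a do-nothing placeholder, not real deconvolution, and the shapes are made up for illustration:&lt;/p&gt;

```python
import numpy as np

def _filter(img, psf):
    # placeholder body: a real version would deconvolve img with psf
    return img.astype(np.float32)

# Declare the 3d core dimensions; leading dimensions are looped over
filter_gufunc = np.vectorize(_filter, signature="(i,j,k),(a,b,c)->(i,j,k)")

imgs = np.ones((2, 3, 4, 5, 6), dtype=np.float32)  # two leading loop dims
psf = np.ones((4, 5, 6), dtype=np.float32)

out = filter_gufunc(imgs, psf)  # no manual unpacking of the leading dims
```

&lt;p&gt;Unlike Numba, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;np.vectorize&lt;/span&gt;&lt;/code&gt; loops in Python, so this illustrates the contract rather than the performance; for Dask arrays specifically, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.apply_gufunc&lt;/span&gt;&lt;/code&gt; applies signature-annotated functions chunk-wise.&lt;/p&gt;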
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 658)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="gpu-acceleration"&gt;
&lt;h1&gt;GPU Acceleration&lt;/h1&gt;
&lt;p&gt;When doing some user research on image processing and Dask, almost everyone we
interviewed said that they wanted faster deconvolution. This seemed to be a
major pain point. Now we know why. It’s both very common, and &lt;em&gt;very slow&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Running deconvolution on a single chunk of this size takes around 2-4 minutes,
and we have hundreds of chunks in a single dataset. Multi-core parallelism can
help a bit here, but this problem may also be ripe for GPU acceleration.
Similar operations typically have 100x speedups on GPUs. This might be a more
pragmatic solution than scaling out to large distributed clusters.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/09/image-itk.md&lt;/span&gt;, line 670)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-s-next"&gt;
&lt;h1&gt;What’s next?&lt;/h1&gt;
&lt;p&gt;This experiment both …&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gives us an example&lt;/strong&gt; that other imaging scientists
can copy and modify to be effective with Dask and ITK together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Highlights areas of improvement&lt;/strong&gt; where developers from the different
libraries can work to remove some of these rough interactions spots in the
future.&lt;/p&gt;
&lt;p&gt;It’s worth noting that Dask has done this with lots of libraries within the
Scipy ecosystem, including Pandas, Scikit-Image, Scikit-Learn, and others.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’re also going to continue with our imaging experiment, while these technical
issues get worked out in the background. Next up, segmentation!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/08/09/image-itk/"/>
    <category term="imaging" label="imaging"/>
    <published>2019-08-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/08/05/user-survey/</id>
    <title>2019 Dask User Survey</title>
    <updated>2019-08-05T00:00:00+00:00</updated>
    <author>
      <name>Tom Augspurger</name>
    </author>
    <content type="html">&lt;style type="text/css"&gt;
table td {
    background: none;
}

table tr.even td {
    background: none;
}

table {
    text-shadow: none;
}

&lt;/style&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/05/user-survey.md&lt;/span&gt;, line 25)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="dask-user-survey-results"&gt;

&lt;p&gt;This notebook presents the results of the 2019 Dask User Survey,
which ran earlier this summer. Thanks to everyone who took the time to fill out the survey!
These results help us better understand the Dask community and will guide future development efforts.&lt;/p&gt;
&lt;p&gt;The raw data, as well as the start of an analysis, can be found in this binder:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fsurveys%2F2019.ipynb"&gt;&lt;img alt="Binder" src="https://mybinder.org/badge_logo.svg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let us know if you find anything in the data.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/05/user-survey.md&lt;/span&gt;, line 37)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="highlights"&gt;
&lt;h1&gt;Highlights&lt;/h1&gt;
&lt;p&gt;We had 259 responses to the survey. Overall, we found that the survey respondents really care about improved documentation, ease of use (including ease of deployment), and scaling. While Dask brings together many different communities (big arrays versus big dataframes, traditional HPC users versus cloud-native resource managers), there was general agreement about what is most important for Dask.&lt;/p&gt;
&lt;p&gt;Now we’ll go through some individual questions, highlighting particularly interesting results.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/05/user-survey.md&lt;/span&gt;, line 43)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-do-you-use-dask"&gt;
&lt;h1&gt;How do you use Dask?&lt;/h1&gt;
&lt;p&gt;For learning resources, almost every respondent uses the documentation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_4_0.svg" /&gt;&lt;/p&gt;
&lt;p&gt;Most respondents use Dask at least occasionally. Fortunately we had a decent number of respondents who are just looking into Dask, yet still spent the time to take the survey.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_6_0.svg" /&gt;&lt;/p&gt;
&lt;p&gt;I’m curious about how learning resource usage changes as users become more experienced. We might expect those just looking into Dask to start with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;examples.dask.org&lt;/span&gt;&lt;/code&gt;, where they can try out Dask without installing anything.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_8_0.svg" /&gt;&lt;/p&gt;
&lt;p&gt;Overall, documentation is still the leader across user groups.&lt;/p&gt;
&lt;p&gt;The usage of the &lt;a class="reference external" href="https://github.com/dask/dask-tutorial"&gt;Dask tutorial&lt;/a&gt; and the &lt;a class="reference internal" href="#examples.dask.org"&gt;&lt;span class="xref myst"&gt;dask examples&lt;/span&gt;&lt;/a&gt; are relatively consistent across groups. The primary difference between regular and new users is that regular users are more likely to engage on GitHub.&lt;/p&gt;
&lt;p&gt;From StackOverflow questions and GitHub issues, we have a vague idea about which parts of the library are used.
The survey shows that (for our respondents at least) DataFrame and Delayed are the most commonly used APIs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_10_0.svg" /&gt;&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;About 65.49% of our respondests are using Dask on a Cluster.
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But the majority of respondents &lt;em&gt;also&lt;/em&gt; use Dask on their laptop.
This highlights the importance of Dask scaling down, either for
prototyping with a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt;, or for out-of-core analysis
using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt; or one of the single-machine schedulers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_13_0.svg" /&gt;&lt;/p&gt;
&lt;p&gt;Most respondents use Dask interactively, at least some of the time.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_15_0.svg" /&gt;&lt;/p&gt;
&lt;p&gt;Most respondents thought that more documentation and examples would be the most valuable improvements to the project. This is especially pronounced among new users. But even among those using Dask every day, more people thought that “More examples” is more valuable than “New features” or “Performance improvements”.&lt;/p&gt;
&lt;style  type="text/css" &gt;
    #T_820ef326_b488_11e9_ad41_186590cd1c87row0_col0 {
            background-color:  #3b92c1;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row0_col1 {
            background-color:  #b4c4df;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row0_col2 {
            background-color:  #dad9ea;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row0_col3 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row0_col4 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row1_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row1_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row1_col2 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row1_col3 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row1_col4 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row2_col0 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row2_col1 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row2_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row2_col3 {
            background-color:  #1b7eb7;
            color:  #000000;
        }    #T_820ef326_b488_11e9_ad41_186590cd1c87row2_col4 {
            background-color:  #589ec8;
            color:  #000000;
        }&lt;/style&gt;&lt;table id="T_820ef326_b488_11e9_ad41_186590cd1c87" &gt;&lt;caption&gt;Normalized by row. Darker means that a higher proportion of users with that usage frequency prefer that priority.&lt;/caption&gt;&lt;thead&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Which would help you most right now?&lt;/th&gt;        &lt;th class="col_heading level0 col0" &gt;Bug fixes&lt;/th&gt;        &lt;th class="col_heading level0 col1" &gt;More documentation&lt;/th&gt;        &lt;th class="col_heading level0 col2" &gt;More examples in my field&lt;/th&gt;        &lt;th class="col_heading level0 col3" &gt;New features&lt;/th&gt;        &lt;th class="col_heading level0 col4" &gt;Performance improvements&lt;/th&gt;    &lt;/tr&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;How often do you use Dask?&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;    &lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;            &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87level0_row0&amp;quot; class=&amp;quot;row_heading level0 row0&amp;quot; &amp;gt;Every day&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row0_col0&amp;quot; class=&amp;quot;data row0 col0&amp;quot; &amp;gt;9&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row0_col1&amp;quot; class=&amp;quot;data row0 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row0_col2&amp;quot; class=&amp;quot;data row0 col2&amp;quot; &amp;gt;25&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row0_col3&amp;quot; class=&amp;quot;data row0 col3&amp;quot; &amp;gt;22&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row0_col4&amp;quot; class=&amp;quot;data row0 col4&amp;quot; &amp;gt;23&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87level0_row1&amp;quot; class=&amp;quot;row_heading level0 row1&amp;quot; &amp;gt;Just looking for now&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row1_col0&amp;quot; class=&amp;quot;data row1 col0&amp;quot; &amp;gt;1&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row1_col1&amp;quot; class=&amp;quot;data row1 col1&amp;quot; &amp;gt;3&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row1_col2&amp;quot; class=&amp;quot;data row1 col2&amp;quot; &amp;gt;18&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row1_col3&amp;quot; class=&amp;quot;data row1 col3&amp;quot; &amp;gt;9&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row1_col4&amp;quot; class=&amp;quot;data row1 col4&amp;quot; &amp;gt;5&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87level0_row2&amp;quot; class=&amp;quot;row_heading level0 row2&amp;quot; &amp;gt;Occasionally&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row2_col0&amp;quot; class=&amp;quot;data row2 col0&amp;quot; &amp;gt;14&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row2_col1&amp;quot; class=&amp;quot;data row2 col1&amp;quot; &amp;gt;27&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row2_col2&amp;quot; class=&amp;quot;data row2 col2&amp;quot; &amp;gt;52&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row2_col3&amp;quot; class=&amp;quot;data row2 col3&amp;quot; &amp;gt;18&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_820ef326_b488_11e9_ad41_186590cd1c87row2_col4&amp;quot; class=&amp;quot;data row2 col4&amp;quot; &amp;gt;15&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
&amp;lt;/tbody&amp;gt;&amp;lt;/table&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Perhaps users of certain Dask APIs feel differently from the group as a whole? We perform a similar analysis grouped by API use, rather than frequency of use.&lt;/p&gt;
&lt;style  type="text/css" &gt;
    #T_821479f4_b488_11e9_ad41_186590cd1c87row0_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row0_col1 {
            background-color:  #cacee5;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row0_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row0_col3 {
            background-color:  #f1ebf4;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row0_col4 {
            background-color:  #c4cbe3;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row1_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row1_col1 {
            background-color:  #3b92c1;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row1_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row1_col3 {
            background-color:  #62a2cb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row1_col4 {
            background-color:  #bdc8e1;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row2_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row2_col1 {
            background-color:  #c2cbe2;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row2_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row2_col3 {
            background-color:  #94b6d7;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row2_col4 {
            background-color:  #e0dded;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row3_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row3_col1 {
            background-color:  #e6e2ef;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row3_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row3_col3 {
            background-color:  #ced0e6;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row3_col4 {
            background-color:  #c5cce3;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row4_col0 {
            background-color:  #dedcec;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row4_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row4_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row4_col3 {
            background-color:  #1c7fb8;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row4_col4 {
            background-color:  #73a9cf;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row5_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row5_col1 {
            background-color:  #b4c4df;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row5_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row5_col3 {
            background-color:  #b4c4df;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row5_col4 {
            background-color:  #eee9f3;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row6_col0 {
            background-color:  #faf2f8;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row6_col1 {
            background-color:  #e7e3f0;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row6_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row6_col3 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_821479f4_b488_11e9_ad41_186590cd1c87row6_col4 {
            background-color:  #f4eef6;
            color:  #000000;
        }&lt;/style&gt;&lt;table id="T_821479f4_b488_11e9_ad41_186590cd1c87" &gt;&lt;caption&gt;Normalized by row. Darker means that a higher proporiton of users of that API prefer that priority.&lt;/caption&gt;&lt;thead&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Which would help you most right now?&lt;/th&gt;        &lt;th class="col_heading level0 col0" &gt;Bug fixes&lt;/th&gt;        &lt;th class="col_heading level0 col1" &gt;More documentation&lt;/th&gt;        &lt;th class="col_heading level0 col2" &gt;More examples in my field&lt;/th&gt;        &lt;th class="col_heading level0 col3" &gt;New features&lt;/th&gt;        &lt;th class="col_heading level0 col4" &gt;Performance improvements&lt;/th&gt;    &lt;/tr&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Dask APIs&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;    &lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;            &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row0&amp;quot; class=&amp;quot;row_heading level0 row0&amp;quot; &amp;gt;Array&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row0_col0&amp;quot; class=&amp;quot;data row0 col0&amp;quot; &amp;gt;10&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row0_col1&amp;quot; class=&amp;quot;data row0 col1&amp;quot; &amp;gt;24&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row0_col2&amp;quot; class=&amp;quot;data row0 col2&amp;quot; &amp;gt;62&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row0_col3&amp;quot; class=&amp;quot;data row0 col3&amp;quot; &amp;gt;15&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row0_col4&amp;quot; class=&amp;quot;data row0 col4&amp;quot; &amp;gt;25&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row1&amp;quot; class=&amp;quot;row_heading level0 row1&amp;quot; &amp;gt;Bag&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row1_col0&amp;quot; class=&amp;quot;data row1 col0&amp;quot; &amp;gt;3&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row1_col1&amp;quot; class=&amp;quot;data row1 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row1_col2&amp;quot; class=&amp;quot;data row1 col2&amp;quot; &amp;gt;16&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row1_col3&amp;quot; class=&amp;quot;data row1 col3&amp;quot; &amp;gt;10&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row1_col4&amp;quot; class=&amp;quot;data row1 col4&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row2&amp;quot; class=&amp;quot;row_heading level0 row2&amp;quot; &amp;gt;DataFrame&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row2_col0&amp;quot; class=&amp;quot;data row2 col0&amp;quot; &amp;gt;16&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row2_col1&amp;quot; class=&amp;quot;data row2 col1&amp;quot; &amp;gt;32&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row2_col2&amp;quot; class=&amp;quot;data row2 col2&amp;quot; &amp;gt;71&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row2_col3&amp;quot; class=&amp;quot;data row2 col3&amp;quot; &amp;gt;39&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row2_col4&amp;quot; class=&amp;quot;data row2 col4&amp;quot; &amp;gt;26&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row3&amp;quot; class=&amp;quot;row_heading level0 row3&amp;quot; &amp;gt;Delayed&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row3_col0&amp;quot; class=&amp;quot;data row3 col0&amp;quot; &amp;gt;16&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row3_col1&amp;quot; class=&amp;quot;data row3 col1&amp;quot; &amp;gt;22&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row3_col2&amp;quot; class=&amp;quot;data row3 col2&amp;quot; &amp;gt;55&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row3_col3&amp;quot; class=&amp;quot;data row3 col3&amp;quot; &amp;gt;26&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row3_col4&amp;quot; class=&amp;quot;data row3 col4&amp;quot; &amp;gt;27&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row4&amp;quot; class=&amp;quot;row_heading level0 row4&amp;quot; &amp;gt;Futures&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row4_col0&amp;quot; class=&amp;quot;data row4 col0&amp;quot; &amp;gt;12&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row4_col1&amp;quot; class=&amp;quot;data row4 col1&amp;quot; &amp;gt;9&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row4_col2&amp;quot; class=&amp;quot;data row4 col2&amp;quot; &amp;gt;25&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row4_col3&amp;quot; class=&amp;quot;data row4 col3&amp;quot; &amp;gt;20&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row4_col4&amp;quot; class=&amp;quot;data row4 col4&amp;quot; &amp;gt;17&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row5&amp;quot; class=&amp;quot;row_heading level0 row5&amp;quot; &amp;gt;ML&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row5_col0&amp;quot; class=&amp;quot;data row5 col0&amp;quot; &amp;gt;5&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row5_col1&amp;quot; class=&amp;quot;data row5 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row5_col2&amp;quot; class=&amp;quot;data row5 col2&amp;quot; &amp;gt;23&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row5_col3&amp;quot; class=&amp;quot;data row5 col3&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row5_col4&amp;quot; class=&amp;quot;data row5 col4&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87level0_row6&amp;quot; class=&amp;quot;row_heading level0 row6&amp;quot; &amp;gt;Xarray&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row6_col0&amp;quot; class=&amp;quot;data row6 col0&amp;quot; &amp;gt;8&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row6_col1&amp;quot; class=&amp;quot;data row6 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row6_col2&amp;quot; class=&amp;quot;data row6 col2&amp;quot; &amp;gt;34&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row6_col3&amp;quot; class=&amp;quot;data row6 col3&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_821479f4_b488_11e9_ad41_186590cd1c87row6_col4&amp;quot; class=&amp;quot;data row6 col4&amp;quot; &amp;gt;9&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
&amp;lt;/tbody&amp;gt;&amp;lt;/table&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Nothing really stands out. The “futures” users (whom we expect to be relatively advanced) may prioritize features and performance over documentation. But everyone agrees that more examples are the highest priority.&lt;/p&gt;
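The row-normalized tables above come from cross-tabulating two survey columns. A minimal sketch of that computation, using invented responses (the column names mirror the survey questions, but the data here is made up):

```python
import pandas as pd

# Invented responses; each row is one (hypothetical) survey respondent.
df = pd.DataFrame({
    "How often do you use Dask?": [
        "Every day", "Occasionally", "Every day",
        "Occasionally", "Just looking for now",
    ],
    "Which would help you most right now?": [
        "More examples in my field", "More documentation",
        "New features", "More examples in my field",
        "More examples in my field",
    ],
})

# Count respondents for each (frequency, priority) pair.
counts = pd.crosstab(
    df["How often do you use Dask?"],
    df["Which would help you most right now?"],
)

# Normalize so each row sums to 1: darker cells in the tables above
# correspond to larger shares within a row.
normalized = counts.div(counts.sum(axis=1), axis=0)

# With matplotlib installed, pandas can shade cells much like the tables here:
# normalized.style.background_gradient(cmap="PuBu", axis=1)
```

Normalizing by row (rather than by column) is what lets us compare a small group like daily users against a much larger group like occasional users.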
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/05/user-survey.md&lt;/span&gt;, line 325)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="common-feature-requests"&gt;
&lt;h1&gt;Common Feature Requests&lt;/h1&gt;
&lt;p&gt;For specific features, we made a list of things that we (as developers) thought might be important.&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_22_0.svg" /&gt;&lt;/p&gt;
&lt;p&gt;The clearest standout is how many people thought “Better NumPy/Pandas support” was “most critical”. In hindsight, it’d be good to have a followup fill-in field to understand what each respondent meant by that. The parsimonious interpretation is “cover more of the NumPy / pandas API”.&lt;/p&gt;
&lt;p&gt;“Ease of deployment” had a high proportion of “critical to me”. Again in hindsight, I notice a bit of ambiguity. Does this mean people want Dask to be easier to deploy? Or does this mean that Dask, which they currently find easy to deploy, is critically important? Regardless, we can prioritize simplicity in deployment.&lt;/p&gt;
&lt;p&gt;Relatively few respondents care about things like “Managing many users”, though we expect that this would be relatively popular among system administrators, who are a smaller population.&lt;/p&gt;
&lt;p&gt;And of course, we have people pushing Dask to its limits for whom “Improving scaling” is critically important.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/05/user-survey.md&lt;/span&gt;, line 339)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-other-systems-do-you-use"&gt;
&lt;h1&gt;What other systems do you use?&lt;/h1&gt;
&lt;p&gt;A relatively high proportion of respondents use Python 3 (97% compared to 84% in the most recent &lt;a class="reference external" href="https://www.jetbrains.com/research/python-developers-survey-2018/"&gt;Python Developers Survey&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;3    97.29%
2     2.71%
Name: Python 2 or 3?, dtype: object
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
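The percentage breakdown above is a normalized value count. A minimal sketch with made-up answers (the real survey split was 97.29% / 2.71%):

```python
import pandas as pd

# Made-up answers to "Python 2 or 3?"; the real survey split was 97.29% / 2.71%.
answers = pd.Series(["3", "3", "3", "2", "3"], name="Python 2 or 3?")

# Share of each answer as a percentage, formatted like the output above.
shares = answers.value_counts(normalize=True).mul(100)
formatted = shares.map("{:.2f}%".format)
```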
&lt;p&gt;We were a bit surprised to see that SSH is the most popular “cluster resource manager”.&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;SSH                                                       98
Kubernetes                                                73
HPC resource manager (SLURM, PBS, SGE, LSF or similar)    61
My workplace has a custom solution for this               23
I don&amp;#39;t know, someone else does this for me               16
Hadoop / Yarn / EMR                                       14
Name: If you use a cluster, how do you launch Dask? , dtype: int64
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;How does the choice of cluster resource manager compare with API usage?&lt;/p&gt;
&lt;style  type="text/css" &gt;
    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col0 {
            background-color:  #056faf;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col3 {
            background-color:  #034e7b;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col4 {
            background-color:  #2685bb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col5 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col6 {
            background-color:  #f2ecf5;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col0 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col1 {
            background-color:  #f7f0f7;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col2 {
            background-color:  #0771b1;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col3 {
            background-color:  #0771b1;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col4 {
            background-color:  #c5cce3;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col5 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col6 {
            background-color:  #79abd0;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col0 {
            background-color:  #8bb2d4;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col1 {
            background-color:  #b4c4df;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col3 {
            background-color:  #589ec8;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col4 {
            background-color:  #eee9f3;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col5 {
            background-color:  #8bb2d4;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col6 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col0 {
            background-color:  #4c99c5;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col3 {
            background-color:  #056dac;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col4 {
            background-color:  #73a9cf;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col5 {
            background-color:  #d9d8ea;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col6 {
            background-color:  #f3edf5;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col0 {
            background-color:  #056ba9;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col3 {
            background-color:  #1379b5;
            color:  #f1f1f1;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col4 {
            background-color:  #dfddec;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col5 {
            background-color:  #e8e4f0;
            color:  #000000;
        }    #T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col6 {
            background-color:  #f9f2f8;
            color:  #000000;
        }&lt;/style&gt;&lt;table id="T_8326d0f8_b488_11e9_ad41_186590cd1c87" &gt;&lt;thead&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Dask APIs&lt;/th&gt;        &lt;th class="col_heading level0 col0" &gt;Array&lt;/th&gt;        &lt;th class="col_heading level0 col1" &gt;Bag&lt;/th&gt;        &lt;th class="col_heading level0 col2" &gt;DataFrame&lt;/th&gt;        &lt;th class="col_heading level0 col3" &gt;Delayed&lt;/th&gt;        &lt;th class="col_heading level0 col4" &gt;Futures&lt;/th&gt;        &lt;th class="col_heading level0 col5" &gt;ML&lt;/th&gt;        &lt;th class="col_heading level0 col6" &gt;Xarray&lt;/th&gt;    &lt;/tr&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;If you use a cluster, how do you launch Dask? &lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;    &lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;            &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87level0_row0&amp;quot; class=&amp;quot;row_heading level0 row0&amp;quot; &amp;gt;Custom&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col0&amp;quot; class=&amp;quot;data row0 col0&amp;quot; &amp;gt;15&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col1&amp;quot; class=&amp;quot;data row0 col1&amp;quot; &amp;gt;6&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col2&amp;quot; class=&amp;quot;data row0 col2&amp;quot; &amp;gt;18&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col3&amp;quot; class=&amp;quot;data row0 col3&amp;quot; &amp;gt;17&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col4&amp;quot; class=&amp;quot;data row0 col4&amp;quot; &amp;gt;14&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col5&amp;quot; class=&amp;quot;data row0 col5&amp;quot; &amp;gt;6&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row0_col6&amp;quot; class=&amp;quot;data row0 col6&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87level0_row1&amp;quot; class=&amp;quot;row_heading level0 row1&amp;quot; &amp;gt;HPC&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col0&amp;quot; class=&amp;quot;data row1 col0&amp;quot; &amp;gt;50&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col1&amp;quot; class=&amp;quot;data row1 col1&amp;quot; &amp;gt;13&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col2&amp;quot; class=&amp;quot;data row1 col2&amp;quot; &amp;gt;40&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col3&amp;quot; class=&amp;quot;data row1 col3&amp;quot; &amp;gt;40&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col4&amp;quot; class=&amp;quot;data row1 col4&amp;quot; &amp;gt;22&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col5&amp;quot; class=&amp;quot;data row1 col5&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row1_col6&amp;quot; class=&amp;quot;data row1 col6&amp;quot; &amp;gt;30&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87level0_row2&amp;quot; class=&amp;quot;row_heading level0 row2&amp;quot; &amp;gt;Hadoop / Yarn / EMR&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col0&amp;quot; class=&amp;quot;data row2 col0&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col1&amp;quot; class=&amp;quot;data row2 col1&amp;quot; &amp;gt;6&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col2&amp;quot; class=&amp;quot;data row2 col2&amp;quot; &amp;gt;12&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col3&amp;quot; class=&amp;quot;data row2 col3&amp;quot; &amp;gt;8&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col4&amp;quot; class=&amp;quot;data row2 col4&amp;quot; &amp;gt;4&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col5&amp;quot; class=&amp;quot;data row2 col5&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row2_col6&amp;quot; class=&amp;quot;data row2 col6&amp;quot; &amp;gt;3&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87level0_row3&amp;quot; class=&amp;quot;row_heading level0 row3&amp;quot; &amp;gt;Kubernetes&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col0&amp;quot; class=&amp;quot;data row3 col0&amp;quot; &amp;gt;40&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col1&amp;quot; class=&amp;quot;data row3 col1&amp;quot; &amp;gt;18&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col2&amp;quot; class=&amp;quot;data row3 col2&amp;quot; &amp;gt;56&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col3&amp;quot; class=&amp;quot;data row3 col3&amp;quot; &amp;gt;47&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col4&amp;quot; class=&amp;quot;data row3 col4&amp;quot; &amp;gt;37&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col5&amp;quot; class=&amp;quot;data row3 col5&amp;quot; &amp;gt;26&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row3_col6&amp;quot; class=&amp;quot;data row3 col6&amp;quot; &amp;gt;21&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87level0_row4&amp;quot; class=&amp;quot;row_heading level0 row4&amp;quot; &amp;gt;SSH&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col0&amp;quot; class=&amp;quot;data row4 col0&amp;quot; &amp;gt;61&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col1&amp;quot; class=&amp;quot;data row4 col1&amp;quot; &amp;gt;23&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col2&amp;quot; class=&amp;quot;data row4 col2&amp;quot; &amp;gt;72&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col3&amp;quot; class=&amp;quot;data row4 col3&amp;quot; &amp;gt;58&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col4&amp;quot; class=&amp;quot;data row4 col4&amp;quot; &amp;gt;32&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col5&amp;quot; class=&amp;quot;data row4 col5&amp;quot; &amp;gt;30&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_8326d0f8_b488_11e9_ad41_186590cd1c87row4_col6&amp;quot; class=&amp;quot;data row4 col6&amp;quot; &amp;gt;25&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
&amp;lt;/tbody&amp;gt;&amp;lt;/table&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;HPC users are relatively heavy users of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.array&lt;/span&gt;&lt;/code&gt; and xarray.&lt;/p&gt;
&lt;p&gt;Somewhat surprisingly, Dask’s heaviest users find Dask stable enough. Perhaps they’ve pushed past the bugs and found workarounds (percentages are normalized by row).&lt;/p&gt;
&lt;p&gt;&lt;img alt="svg" src="https://blog.dask.org/_images/analyze_32_0.svg" /&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/05/user-survey.md&lt;/span&gt;, line 525)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="takeaways"&gt;
&lt;h1&gt;Takeaways&lt;/h1&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We should prioritize improving and expanding our documentation and examples. This may be
accomplished by Dask maintainers seeking examples from the community. Many of the examples
on &lt;a class="reference external" href="https://examples.dask.org"&gt;https://examples.dask.org&lt;/a&gt; were developed by domain specialists who use Dask.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improved scaling to larger problems is important, but we shouldn’t
sacrifice the single-machine use case to get there.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Both interactive and batch workflows are important.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask’s various sub-communities are more similar than they are different.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Thanks again to all the respondents. We look forward to repeating this process to identify trends over time.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/08/05/user-survey/"/>
    <summary>Results and takeaways from the 2019 Dask user survey.</summary>
    <category term="UserSurvey" label="User Survey"/>
    <published>2019-08-05T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/08/02/dask-2.2/</id>
    <title>Dask Release 2.2.0</title>
    <updated>2019-08-02T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;I’m pleased to announce the release of Dask version 2.2.
This is a significant release with bug fixes and new features.
The last blogged release was 2.0 on 2019-06-22.
This blogpost outlines notable changes since the last post.&lt;/p&gt;
&lt;p&gt;You can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install dask[complete] --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/02/dask-2.2.md&lt;/span&gt;, line 26)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="notable-changes"&gt;

&lt;p&gt;As always, there are too many changes to list here;
instead we’ll highlight a few that readers may find interesting,
or that break old behavior.
In particular we discuss the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Parquet rewrite&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nicer HTML output for Clients and Logs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyper-parameter selection with Hyperband in Dask-ML&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Move bytes I/O handling out of Dask to FSSpec&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;async/await everywhere, and cleaner setup for developers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new SSH deployment solution&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/02/dask-2.2.md&lt;/span&gt;, line 40)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="parquet-rewrite"&gt;
&lt;h1&gt;1 - Parquet Rewrite&lt;/h1&gt;
&lt;p&gt;Today Dask DataFrame can read and write Parquet data using either
&lt;a class="reference external" href="https://fastparquet.readthedocs.io"&gt;fastparquet&lt;/a&gt; or
&lt;a class="reference external" href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/path/to/mydata.parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arrow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# or&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/path/to/mydata.parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supporting both libraries within Dask has been helpful for
users, but introduced some maintenance burden, especially given that each
library co-evolved with Dask DataFrame over the years. The contract between
Dask DataFrame and these libraries was convoluted, making it difficult to
evolve swiftly.&lt;/p&gt;
&lt;p&gt;To address this we’ve formalized what Dask expects of Parquet readers and writers
into an explicit Parquet Engine contract. This keeps maintenance costs
low, enables independent development for each project, and allows new
engines to emerge.&lt;/p&gt;
&lt;p&gt;A GPU-accelerated Parquet reader is already available in a PR on the &lt;a class="reference external" href="https://github.com/rapidsai/cudf/pull/2368"&gt;RAPIDS
cuDF&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;As a result, we’ve also been able to fix a number of long-standing bugs, and
improve the functionality with both engines.&lt;/p&gt;
&lt;p&gt;Some fun quotes from &lt;a class="reference external" href="https://github.com/birdsarah"&gt;Sarah Bird&lt;/a&gt; during development&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;I am currently testing this. So far so good. I can load my dataset in a few seconds with 1800 partitions. Game changing!&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;I am now successfully working on a dataset with 74,000 partitions and no metadata.
Opening dataset and df.head() takes 7 - 30s. (Presumably depending on whether s3fs cache is cold or not). THIS IS HUGE! This was literally impossible before.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;The API remains the same, but functionality should be smoother.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/rjzamora"&gt;Rick Zamora&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/martindurant"&gt;Martin
Durant&lt;/a&gt; for doing most of the work here and to
&lt;a class="reference external" href="https://github.com/birdsarah"&gt;Sarah Bird&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/wesm"&gt;Wes
McKinney&lt;/a&gt;, and &lt;a class="reference external" href="https://github.com/mmccarty"&gt;Mike
McCarty&lt;/a&gt; for providing guidance and review.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/02/dask-2.2.md&lt;/span&gt;, line 88)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="nicer-html-output-for-clients-and-logs"&gt;
&lt;h1&gt;2 - Nicer HTML output for Clients and Logs&lt;/h1&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table style="border: 2px solid white;"&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 0px solid white"&gt;
&lt;h3 style="text-align: left;"&gt;Client&lt;/h3&gt;
&lt;ul style="text-align: left; list-style: none; margin: 0; padding: 0;"&gt;
  &lt;li&gt;&lt;b&gt;Scheduler: &lt;/b&gt;tcp://127.0.0.1:60275&lt;/li&gt;
  &lt;li&gt;&lt;b&gt;Dashboard: &lt;/b&gt;&lt;a href='http://127.0.0.1:8787/status' target='_blank'&gt;http://127.0.0.1:8787/status&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 0px solid white"&gt;
&lt;h3 style="text-align: left;"&gt;Cluster&lt;/h3&gt;
&lt;ul style="text-align: left; list-style:none; margin: 0; padding: 0;"&gt;
  &lt;li&gt;&lt;b&gt;Workers: &lt;/b&gt;4&lt;/li&gt;
  &lt;li&gt;&lt;b&gt;Cores: &lt;/b&gt;12&lt;/li&gt;
  &lt;li&gt;&lt;b&gt;Memory: &lt;/b&gt;17.18 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div markdown="0"&gt;
&lt;details&gt;
&lt;summary&gt;Scheduler&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:60275
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.scheduler - INFO - Register tcp://127.0.0.1:60281
distributed.scheduler - INFO - Register tcp://127.0.0.1:60282
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:60281
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:60282
distributed.scheduler - INFO - Register tcp://127.0.0.1:60285
distributed.scheduler - INFO - Register tcp://127.0.0.1:60286
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:60285
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:60286
distributed.scheduler - INFO - Receive client connection: Client-6b6ba1d0-b3bd-11e9-9bd0-acde48001122&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;tcp://127.0.0.1:60281&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:60281
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:60281
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          3
distributed.worker - INFO -                Memory:                    4.29 GB
distributed.worker - INFO -       Local Directory: /Users/mrocklin/workspace/dask/dask-worker-space/worker-c4_44fym
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;tcp://127.0.0.1:60282&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:60282
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:60282
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          3
distributed.worker - INFO -                Memory:                    4.29 GB
distributed.worker - INFO -       Local Directory: /Users/mrocklin/workspace/dask/dask-worker-space/worker-quu4taje
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;tcp://127.0.0.1:60285&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:60285
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:60285
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          3
distributed.worker - INFO -                Memory:                    4.29 GB
distributed.worker - INFO -       Local Directory: /Users/mrocklin/workspace/dask/dask-worker-space/worker-ll4cozug
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;details&gt;&lt;summary&gt;tcp://127.0.0.1:60286&lt;/summary&gt;&lt;pre&gt;&lt;code&gt;distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:60286
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:60286
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          3
distributed.worker - INFO -                Memory:                    4.29 GB
distributed.worker - INFO -       Local Directory: /Users/mrocklin/workspace/dask/dask-worker-space/worker-lpbkkzj6
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:60275
distributed.worker - INFO - -------------------------------------------------&lt;/code&gt;&lt;/pre&gt;&lt;/details&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note: this looks better under any browser other than IE and Edge&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/jacobtomlinson"&gt;Jacob Tomlinson&lt;/a&gt; for this work.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/02/dask-2.2.md&lt;/span&gt;, line 191)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hyperparameter-selection-with-hyperband"&gt;
&lt;h1&gt;3 - Hyperparameter selection with HyperBand&lt;/h1&gt;
&lt;p&gt;Dask-ML 1.0 has been released with a new &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; meta-estimator for
hyper-parameter optimization. It can be used as an alternative to
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;, finding comparable hyper-parameters in less
time by quickly abandoning unpromising candidates.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.datasets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;param_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;              &lt;span class="s1"&gt;&amp;#39;loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hinge&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;log&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;modified_huber&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;squared_hinge&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;              &lt;span class="s1"&gt;&amp;#39;average&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;span class="go"&gt;{&amp;#39;loss&amp;#39;: &amp;#39;log&amp;#39;, &amp;#39;average&amp;#39;: False, &amp;#39;alpha&amp;#39;: 0.0080502}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="http://github.com/stsievert"&gt;Scott Sievert&lt;/a&gt;.
You can see Scott talk about this topic in greater depth by watching his
&lt;a class="reference external" href="https://youtu.be/x67K9FiPFBQ"&gt;SciPy 2019 talk&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/02/dask-2.2.md&lt;/span&gt;, line 220)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="move-bytes-i-o-handling-out-of-dask-to-fsspec"&gt;
&lt;h1&gt;4 - Move bytes I/O handling out of Dask to FSSpec&lt;/h1&gt;
&lt;p&gt;We’ve spun Dask’s internal code for reading and writing raw data on different
storage systems out into a separate project, &lt;a class="reference external" href="https://fsspec.readthedocs.io"&gt;fsspec&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here is a small example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;fsspec&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;fsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;https://github.com/dask/dask/edit/master/README.rst&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;fsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3://bucket/myfile.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;fsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hdfs:///path/to/myfile.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;fsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;gcs://bucket/myfile.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask’s I/O infrastructure to read and write bytes from systems like
HDFS, S3, GCS, Azure, and other remote storage systems is arguably the most
uniform and comprehensive in Python today. Through tooling like
&lt;a class="reference external" href="https://s3fs.readthedocs.io"&gt;s3fs&lt;/a&gt;, &lt;a class="reference external" href="https://gcsfs.readthedocs.io"&gt;gcsfs&lt;/a&gt;,
and &lt;del&gt;hdfs3&lt;/del&gt; &lt;a class="reference external" href="https://arrow.apache.org/docs/python/filesystems.html"&gt;pyarrow.hdfs&lt;/a&gt;,
it’s easy to read and write data in a Pythonic way to a
variety of remote storage systems.&lt;/p&gt;
&lt;p&gt;Early on we decided that these filesystem interfaces should live outside the
mainline Dask codebase, which is why they are independent projects.
This choice allowed other libraries, like Pandas and Zarr, to benefit
from this work without a strict dependency on Dask.
However, there was still code within Dask that helped to unify them.&lt;/p&gt;
We’ve moved this code out to an external project,
&lt;a class="reference external" href="https://filesystem-spec.readthedocs.io/en/latest"&gt;fsspec&lt;/a&gt; which includes all
of the centralization code that Dask used to provide, as well as a formal
specification for what a remote data system should look like in order to be
compatible. This also helps to unify efforts with other projects like Arrow.&lt;/p&gt;
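The essence of such a specification is a registry that maps URL schemes to filesystem implementations, so a single open call works across backends. A stdlib-only sketch of that dispatch idea; the names here (FILESYSTEMS, open_url, MemoryFS) are hypothetical, not fsspec's real API.

```python
import io
from urllib.parse import urlparse

class MemoryFile(io.StringIO):
    """An in-memory file that persists its contents to a store on close."""

    def __init__(self, store, path):
        super().__init__()
        self._store, self._path = store, path

    def close(self):
        self._store[self._path] = self.getvalue()  # persist before closing
        super().close()

class MemoryFS:
    store = {}

    def open(self, path, mode="r"):
        if "w" in mode:
            return MemoryFile(self.store, path)
        return io.StringIO(self.store[path])

# The registry: scheme name -> filesystem implementation.
FILESYSTEMS = {"memory": MemoryFS}

def open_url(url, mode="r"):
    parsed = urlparse(url)
    fs = FILESYSTEMS[parsed.scheme]()  # dispatch on the URL scheme
    return fs.open(parsed.path, mode)

with open_url("memory:///data.csv", "w") as f:
    f.write("a,b\n1,2\n")
with open_url("memory:///data.csv") as f:
    assert f.read() == "a,b\n1,2\n"
```

Adding a new backend is then just registering another class that implements the same open interface, which is the contract fsspec formalizes for real storage systems.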
&lt;p&gt;Special thanks to &lt;a class="reference external" href="https://github.com/martindurant"&gt;Martin Durant&lt;/a&gt; for
shepherding Dask’s I/O infrastructure over the years, and for doing the more
immediate work of splitting out &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can read more about FSSpec and its transition out of Dask
&lt;a class="reference external" href="https://blog.dask.org/2019/07/23/extracting-fsspec-from-dask"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/02/dask-2.2.md&lt;/span&gt;, line 269)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="async-await-everywhere-and-cleaner-setup-for-developers"&gt;
&lt;h1&gt;5 - Async/Await everywhere, and cleaner setup for developers&lt;/h1&gt;
&lt;p&gt;In Dask 2.0 we dropped Python 2 support and now support only Python 3.5 and
above.
This allows us to adopt async and await syntax for concurrent execution rather
than an older coroutine based approach with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;yield&lt;/span&gt;&lt;/code&gt;. The differences here
started out as largely aesthetic, but triggered a number of substantive
improvements as we walked through the codebase cleaning things up. Starting
and stopping internal Scheduler, Worker, Nanny, and Client objects is now far
more uniform, reducing the presence of subtle bugs.&lt;/p&gt;
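The machinery underneath this uniformity is the standard async context-manager protocol: objects that start in __aenter__ and stop in __aexit__ compose cleanly. A stdlib-only sketch of that pattern, where Server is a hypothetical stand-in rather than any real Dask class:

```python
import asyncio

class Server:
    """Toy object with an async start/stop lifecycle."""

    def __init__(self, name):
        self.name = name
        self.running = False

    async def start(self):
        await asyncio.sleep(0)  # stand-in for real startup I/O
        self.running = True

    async def stop(self):
        await asyncio.sleep(0)  # stand-in for real shutdown I/O
        self.running = False

    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, *exc):
        await self.stop()

async def main():
    # Nested lifecycles compose: both servers start on entry and are
    # stopped in reverse order on exit, even if the body raises.
    async with Server("scheduler") as s, Server("worker") as w:
        assert s.running and w.running
    assert not s.running and not w.running

asyncio.run(main())
```

Writing Scheduler, Worker, Nanny, and Client against one lifecycle protocol is what lets them be started, nested, and torn down the same way everywhere.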
&lt;p&gt;This is discussed in more detail in the &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/python-advanced.html"&gt;Python API setup
documentation&lt;/a&gt; and
is encapsulated in this code example from those docs:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asynchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;
                &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;As a result of this and other internal cleanup, intermittent test failures in
our CI have disappeared, and developer mood is high :)&lt;/p&gt;
&lt;/section&gt;
&lt;section id="a-new-sshcluster"&gt;
&lt;h1&gt;6 - A new SSHCluster&lt;/h1&gt;
&lt;p&gt;We’ve added a second SSH cluster deployment solution. It looks like this:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.deploy.ssh2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;  &lt;span class="c1"&gt;# this will move in future releases&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hosts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;host1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;host2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;host3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;host4&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# hosts=[&amp;quot;localhost&amp;quot;] * 4  # if you want to try this out locally,&lt;/span&gt;
    &lt;span class="n"&gt;worker_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nthreads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;scheduler_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="n"&gt;connect_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;known_hosts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note that this object is experimental, and subject to change without notice&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We worked on this for two reasons:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Our user survey showed that a surprising number of people were deploying
Dask with SSH. Anecdotally they seem to be just SSHing into machines and
then using Dask’s normal &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/cli.html"&gt;Dask Command Line
Interface&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We wanted a solution that was easier than this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve been trying to unify the code in the various deployment solutions
(like Kubernetes, SLURM, Yarn/Hadoop) to a central codebase, and having a
simple SSHCluster as a test case has proven valuable for testing and
experimentation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Also note that Dask already has a more mature
&lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/ssh.html"&gt;dask-ssh&lt;/a&gt; solution today&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We expect that unification of deployment will be a central theme for the next
few months of development.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;There have been two releases since the last time we had a release blogpost.
The following people contributed to the following repositories since the 2.0
release on June 30th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Brett Naul&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Saxton&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;David Brochart&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Davis Bennett&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elliott Sales de Andrade&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GALI PREM SAGAR&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthias Bussonnier&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Natalya Rapstine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nick Becker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Peter Andreas Entschev&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ralf Gommers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Richard (Rick) Zamora&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sarah Bird&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sean McKenna&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Willi Rath&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Xavier Holt&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;andrethrill&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;asmith26&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;msbrown47&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tshatrov&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Christian Hudon&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gabriel Sailer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jacob Tomlinson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pierre Glaser&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Russ Bubley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tjb900&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-jobqueue"&gt;dask/dask-jobqueue&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Guillaume Eynard-Bontemps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leo Singer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stuart Berg&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-examples"&gt;dask/dask-examples&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Chris White&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ian Rose&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-mpi"&gt;dask/dask-mpi&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kevin Paul&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-kubernetes"&gt;dask/dask-kubernetes&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml"&gt;dask/dask-ml&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Roman Yurchak&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-yarn"&gt;dask/dask-yarn&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Al Johri&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/08/02/dask-2.2/"/>
    <summary>I’m pleased to announce the release of Dask version 2.2.
This is a significant release with bug fixes and new features.
The last blogged release was 2.0 on 2019-06-22.
This blogpost outlines notable changes since the last post.</summary>
    <category term="release" label="release"/>
    <published>2019-08-02T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/07/23/extracting-fsspec-from-dask/</id>
    <title>Extracting fsspec from Dask</title>
    <updated>2019-07-23T00:00:00+00:00</updated>
    <author>
      <name>Martin Durant</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/07/23/extracting-fsspec-from-dask.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="tl-dr"&gt;

&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt;, the new base for file system operations in Dask, Intake, s3fs, gcsfs and others,
is now available as a stand-alone interface and central place to develop new backends
and file operations. Although it was developed as part of Dask, you no longer need Dask
to use this functionality.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Over the past few years, Dask’s IO capability has grown gradually and organically, to
include a number of file-formats, and the ability to access data seamlessly on various
remote/cloud data systems. This has been achieved through a number of sister packages
for viewing cloud resources as file systems, and dedicated code in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.bytes&lt;/span&gt;&lt;/code&gt;.
Some of the storage backends, particularly &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;s3fs&lt;/span&gt;&lt;/code&gt;, became immediately useful outside of
Dask too, and were picked up as optional dependencies by &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pandas&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;xarray&lt;/span&gt;&lt;/code&gt; and others.&lt;/p&gt;
&lt;p&gt;For the sake of consolidating the behaviours of the
various backends, providing a single reference specification for any new backends,
and to make this set of file system operations available even without Dask, I
created &lt;a class="reference external" href="https://filesystem-spec.readthedocs.io/en/latest/"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;.
This last week, Dask changed to use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt; directly for its
IO needs, and I would like to describe in detail here the benefits of this change.&lt;/p&gt;
&lt;p&gt;Although this was done initially to ease the maintenance burden, the important takeaway
is that we want to make file system operations easily available to the whole PyData ecosystem,
with or without Dask.&lt;/p&gt;
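To make that concrete, here is a minimal sketch of using fsspec on its own, without Dask (assuming fsspec is installed). The in-memory backend keeps the example self-contained; the same file-like API applies to any other backend.

```python
import fsspec

# Instantiate one of fsspec's built-in backends; "memory" keeps
# everything in-process, so this sketch needs no external services.
fs = fsspec.filesystem("memory")

# The same file-like API applies to any backend (s3, gcs, ftp, ssh, ...).
with fs.open("/demo/hello.txt", "wb") as f:
    f.write(b"hello fsspec")

data = fs.cat("/demo/hello.txt")
print(data)  # prints b'hello fsspec'
```

Swapping `"memory"` for `"s3"` or `"gcs"` (with the relevant sister package installed) leaves the rest of the code unchanged.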
&lt;/section&gt;
&lt;section id="history"&gt;
&lt;h1&gt;History&lt;/h1&gt;
&lt;p&gt;The first file system I wrote was &lt;a class="reference external" href="https://github.com/dask/hdfs3"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hdfs3&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;, a thin wrapper
around the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;libhdfs3&lt;/span&gt;&lt;/code&gt; C library. At the time, Dask had acquired the ability to run on a
distributed cluster, and HDFS was the most popular storage solution for these (in the
commercial world, at least), so a solution was required. The python API closely matched
the C one, which in turn followed the Java API and posix standards. Fortunately, python already
has a &lt;a class="reference external" href="https://docs.python.org/3/library/io.html#i-o-base-classes"&gt;file-like standard&lt;/a&gt;, so
providing objects that implemented that was enough to make remote bytes available to many
packages.&lt;/p&gt;
&lt;p&gt;Pretty soon, it became apparent that cloud resources would be at least as important as in-cluster
file systems, and so followed &lt;a class="reference external" href="https://github.com/dask/s3fs"&gt;s3fs&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/Azure/azure-data-lake-store-python"&gt;adlfs&lt;/a&gt;, and &lt;a class="reference external" href="https://github.com/dask/gcsfs"&gt;gcsfs&lt;/a&gt;.
Each followed the same pattern, but with some specific code for the given interface, and
improvements based on the experience of the previous interfaces. During this time, Dask’s
needs also evolved, due to more complex file formats such as parquet. Code to interface to
the different backends and adapt their methods ended up in the Dask repository.&lt;/p&gt;
&lt;p&gt;In the meantime, other file system interfaces arrived, particularly
&lt;a class="reference external" href="https://arrow.apache.org/docs/python/filesystems.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow&lt;/span&gt;&lt;/code&gt;’s&lt;/a&gt;, which had its own HDFS
implementation and direct parquet reading. But we would like all of the tools in
the ecosystem to work together well, so that Dask can read parquet using either
engine from any of the storage backends.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="code-duplication"&gt;
&lt;h1&gt;Code duplication&lt;/h1&gt;
&lt;p&gt;Copying an interface, adapting it and releasing it, as I did with each iteration of the file system,
is certainly a quick way to get a job done. However, when you then want to change the behaviour, or
add new functionality, it turns out you need to repeat the work in each place
(violating the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself"&gt;DRY&lt;/a&gt; principle) or have
the interfaces diverge slowly. Good examples of this were &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;glob&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;walk&lt;/span&gt;&lt;/code&gt;: the former supported different
options across backends, and the latter returned different things (a list from some backends, a
dir/files iterator from others).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LocalFileSystem&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/home/path/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;iterator of tuples&amp;gt;&lt;/span&gt;


&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3fs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;S3FileSystme&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bucket/path&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;[list of filenames]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We found that, for Dask’s needs, we needed to build small wrapper
classes to ensure compatible APIs to all backends, as well as a class for operating on the local
file system with the same interface, and finally a registry for all of these with various helper
functions. Very little of this was specific to Dask, with only a couple of
functions concerning themselves with building graphs and deferred execution. It did, however,
raise the important issue that file systems should be serializable and that there should
be a way to specify a file to be opened, which is also serializable (and ideally supports
transparent text and compression).&lt;/p&gt;
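That serializable "open a file" specification became fsspec's `OpenFile`. A minimal sketch, again using the in-memory backend so it runs anywhere fsspec is installed:

```python
import pickle

import fsspec

# An OpenFile is a serializable recipe for opening a file; nothing is
# actually opened until it is used as a context manager. Text mode and
# compression are handled transparently.
of = fsspec.open("memory://data.txt.gz", mode="wt", compression="gzip")
with of as f:
    f.write("some text")

# The recipe itself survives pickling, e.g. for shipping to workers.
of2 = pickle.loads(pickle.dumps(of))

with fsspec.open("memory://data.txt.gz", mode="rt", compression="gzip") as f:
    text = f.read()
print(text)  # prints: some text
```

The file is gzip-compressed on disk (well, in memory here), yet the code reads and writes plain text.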
&lt;/section&gt;
&lt;section id="new-file-systems"&gt;
&lt;h1&gt;New file systems&lt;/h1&gt;
&lt;p&gt;I already mentioned the effort to make a local file system class which met the same interface as
the other ones which already existed. But there are more options that Dask users (and others)
might want, such as ssh, ftp, http, in-memory, and so on. Following requests from users to handle these options,
we started to write more file system interfaces, which all lived within &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.bytes&lt;/span&gt;&lt;/code&gt;; but it was unclear
whether they should only support very minimal functionality, just enough to get something done from
Dask, or a full set of file operations.&lt;/p&gt;
&lt;p&gt;The in-memory file system, in particular, existed in an extremely long-lived PR - it’s not
clear how useful such a thing is to Dask, when each worker has its own memory, and so sees
a different state of the “file system”.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="consolidation"&gt;
&lt;h1&gt;Consolidation&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/intake/filesystem_spec"&gt;file system Spec&lt;/a&gt;, later &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt;, was born out of a desire
to codify and consolidate the behaviours of the storage backends, reduce duplication, and provide the
same functionality to all backends. In the process, it became much easier to write new implementation
classes: see the &lt;a class="reference external" href="https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations"&gt;built-in implementations&lt;/a&gt;,
which include interesting and highly experimental options such as the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;CachingFileSystem&lt;/span&gt;&lt;/code&gt;, which
makes a local copy of every remote read, for faster access the second time around. More
mainstream implementations also took shape, such as FTP, SSH, Memory and webHDFS
(the latter being the best bet for accessing HDFS from outside the cluster, given all the
problems building and authenticating with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hdfs3&lt;/span&gt;&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Furthermore, the new repository gave the opportunity to implement new features, which would then have
further-reaching applicability than if they had been done in just selected repositories. Examples include
FUSE mounting, dictionary-style key-value views on file systems
(such as used by &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/"&gt;zarr&lt;/a&gt;), and transactional writing of
files. All file systems are serializable and pyarrow-compliant.&lt;/p&gt;
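The key-value view is exposed through `fsspec.get_mapper`, which presents a location on any file system as a mutable mapping of bytes. A quick sketch with the in-memory backend:

```python
import fsspec

# get_mapper presents any file system location as a mutable mapping
# from string keys to bytes values (the interface zarr consumes).
m = fsspec.get_mapper("memory://store")
m["alpha"] = b"1"
m["beta/gamma"] = b"2"

print(sorted(m))   # keys are paths relative to the mapper's root
print(m["alpha"])
```

Because the mapping works identically over s3, gcs, or a local directory, libraries like zarr can store chunked arrays on any backend without backend-specific code.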
&lt;/section&gt;
&lt;section id="usefulness"&gt;
&lt;h1&gt;Usefulness&lt;/h1&gt;
&lt;p&gt;Eventually it dawned on me that the operations offered by the file system classes are very useful
for people not using Dask too. Indeed, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;s3fs&lt;/span&gt;&lt;/code&gt;, for example, sees plenty of use stand-alone, or in
conjunction with packages such as fastparquet and pandas, which can accept file-system objects in
their methods.&lt;/p&gt;
&lt;p&gt;So it seemed to make sense to have a particular repo to write out the spec that a Dask-compliant
file system should adhere to, and I found that I could factor out a lot of common behaviour from
the existing implementations, provide functionality that had existed in only some to all, and
generally improve every implementation along the way.&lt;/p&gt;
&lt;p&gt;However, it was when considering &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt; in conjunction with &lt;a class="reference external" href="https://github.com/intake/intake/pull/381"&gt;Intake&lt;/a&gt;
that I realised how generally useful a stand-alone file system package can be: the PR
implemented a generalised file selector that can browse files in any file system that we
have available, even being able, for instance, to view a remote zip-file on S3 as a
browseable file system. Note that, similar to the general thrust of this blog, the
file selector itself need not live in the Intake repo and will eventually become either
its own thing, or an optional feature of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fsspec&lt;/span&gt;&lt;/code&gt;. You shouldn’t need Intake either just
to get generalised file system operations.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;This work is not quite on the level of “protocol standards” such as the well-known Python buffer
protocol, but I think it is a useful step in making data in various storage services available
to people, since you can operate on each with the same API, expect the same behaviour, and
create real python file-like objects to pass to other functions. Having a single central repo
like this offers an obvious place to discuss and amend the spec, and build extra functionality
onto it.&lt;/p&gt;
&lt;p&gt;Many improvements remain to be done, such as support for globstrings in more functions, or
a single file system which can dispatch to the various backends depending on the form of the
URL provided; but there is now an obvious place for all of this to happen.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/07/23/extracting-fsspec-from-dask/"/>
    <summary>fsspec, the new base for file system operations in Dask, Intake, s3fs, gcsfs and others, is now available as a stand-alone interface and central place to develop new backends and file operations.</summary>
    <category term="IO" label="IO"/>
    <published>2019-07-23T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/22/dask-2.0/</id>
    <title>Dask Release 2.0</title>
    <updated>2019-06-22T00:00:00+00:00</updated>
    <author>
      <name>the Dask Maintainers</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;Please take the &lt;a class="reference external" href="https://t.co/OGrIjTLC2G"&gt;Dask User Survey for 2019&lt;/a&gt;.&lt;/em&gt;
&lt;em&gt;Your response helps to prioritize future work.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We are pleased to announce the release of Dask version 2.0.
This is a major release with bug fixes and new features.&lt;/p&gt;
&lt;p&gt;Most major version changes of software signal many new and exciting features.
That is not the case with this release.
Instead, we’re bumping the major version number because
we’ve broken a few APIs to improve maintainability,
and because we decided to drop support for Python 2.&lt;/p&gt;
&lt;p&gt;This blogpost outlines these changes.&lt;/p&gt;
&lt;section id="install"&gt;

&lt;p&gt;As always, you can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install &amp;quot;dask[complete]&amp;quot; --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="drop-support-for-python-2"&gt;
&lt;h1&gt;Drop support for Python 2&lt;/h1&gt;
&lt;p&gt;Python 2 reaches end of life in 2020, just six months away. Most major PyData
projects are dropping Python 2 support around now. See the &lt;a class="reference external" href="https://python3statement.org/"&gt;Python 3
Statement&lt;/a&gt; for more details about some of your
favorite projects.&lt;/p&gt;
&lt;p&gt;Python 2 users can continue to use older versions of Dask, which are in
widespread use today. Institutions looking for long term support of Dask in
Python 2 may wish to reach out to for-profit consulting companies, like
&lt;a class="reference external" href="https://www.quansight.com/"&gt;Quansight&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dropping Python 2 will allow maintainers to spend more of their time fixing
bugs and developing new features. It will also allow the project to adopt more
modern development practices going forward.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/22/dask-2.0.md&lt;/span&gt;, line 57)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="small-breaking-changes"&gt;
&lt;h1&gt;Small breaking changes&lt;/h1&gt;
&lt;p&gt;Below is a brief description of most of the breaking changes:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The distributed.bokeh module has moved to distributed.dashboard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Various &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ncores&lt;/span&gt;&lt;/code&gt; keywords have been moved to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nthreads&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client.map/gather/scatter no longer accept iterators and Python queue
objects. Users can handle this themselves with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;submit&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;as_completed&lt;/span&gt;&lt;/code&gt; or
can use the &lt;a class="reference external" href="https://github.com/python-streamz/streamz"&gt;Streamz&lt;/a&gt; library.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The worker &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;/main&lt;/span&gt;&lt;/code&gt; route has moved to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;/status&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cluster.workers is now a dictionary mapping worker name to worker, rather
than a list as it was before&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
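For example, code that previously passed an iterator to Client.map can be rewritten with submit and as_completed. A minimal sketch, assuming an in-process cluster for illustration (real deployments would connect to a running scheduler):

```python
from dask.distributed import Client, as_completed

def double(x):
    return 2 * x

# Start a small in-process cluster for demonstration purposes
client = Client(processes=False, n_workers=1, threads_per_worker=1,
                dashboard_address=None)

# Instead of Client.map over an iterator, submit tasks one by one
futures = [client.submit(double, i) for i in range(4)]

# as_completed yields futures as they finish, in completion order
results = [f.result() for f in as_completed(futures)]
print(sorted(results))  # [0, 2, 4, 6]

client.close()
```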
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/22/dask-2.0.md&lt;/span&gt;, line 70)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="some-larger-fun-changes"&gt;
&lt;h1&gt;Some larger fun changes&lt;/h1&gt;
&lt;p&gt;We didn’t only break things. We also added some new things :)&lt;/p&gt;
&lt;section id="array-metadata"&gt;
&lt;h2&gt;Array metadata&lt;/h2&gt;
&lt;p&gt;Previously Dask Arrays were defined by their shape, chunk shape, and datatype
(float, int, and so on).&lt;/p&gt;
&lt;p&gt;Now, Dask Arrays also know the type of their chunks. Historically this was
almost always a NumPy array, so there was little need to store it, but now that
Dask Arrays are frequently used with sparse array chunks and GPU
array chunks, we maintain this information in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt; attribute.
This is already how Dask DataFrames work, so it should be familiar to advanced
users of that module.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;
&lt;span class="go"&gt;array([], shape=(0, 0), dtype=float64)&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sparse&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COO&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;COO: shape=(0, 0), dtype=float64, nnz=0, fill_value=0.0&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
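When Dask cannot infer the chunk type itself, it can be supplied explicitly via the meta= keyword accepted by functions such as map_blocks. A minimal sketch with plain NumPy chunks:

```python
import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

# Pass meta= so Dask records the chunk type directly instead of
# inferring it by calling the function on an empty sample array
y = x.map_blocks(lambda block: 2 * block,
                 meta=np.empty((0, 0), dtype=np.float64))

print(type(y._meta))      # <class 'numpy.ndarray'>
print(y.sum().compute())  # 32.0
```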
&lt;p&gt;This work was largely done by &lt;a class="reference external" href="https://github.com/pentschev"&gt;Peter Entschev&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="array-html-output"&gt;
&lt;h2&gt;Array HTML output&lt;/h2&gt;
&lt;p&gt;Dask arrays now print themselves nicely in Jupyter notebooks, showing a table
of information about their size and chunk size, and also a visual diagram of
their chunk structure.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 80.00 GB &lt;/td&gt; &lt;td&gt; 125.00 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (10000, 1000, 1000) &lt;/td&gt; &lt;td&gt; (250, 250, 250) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 640 Tasks &lt;/td&gt;&lt;td&gt; 640 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; float64 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="241" height="231" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="127" y2="117" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="16" x2="127" y2="133" /&gt;
  &lt;line x1="10" y1="32" x2="127" y2="149" /&gt;
  &lt;line x1="10" y1="48" x2="127" y2="165" /&gt;
  &lt;line x1="10" y1="64" x2="127" y2="181" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="64" style="stroke-width:2" /&gt;
  &lt;line x1="12" y1="2" x2="12" y2="67" /&gt;
  &lt;line x1="15" y1="5" x2="15" y2="70" /&gt;
  &lt;line x1="18" y1="8" x2="18" y2="73" /&gt;
  &lt;line x1="21" y1="11" x2="21" y2="76" /&gt;
  &lt;line x1="24" y1="14" x2="24" y2="79" /&gt;
  &lt;line x1="27" y1="17" x2="27" y2="81" /&gt;
  &lt;line x1="30" y1="20" x2="30" y2="84" /&gt;
  &lt;line x1="33" y1="23" x2="33" y2="87" /&gt;
  &lt;line x1="36" y1="26" x2="36" y2="90" /&gt;
  &lt;line x1="39" y1="29" x2="39" y2="93" /&gt;
  &lt;line x1="42" y1="32" x2="42" y2="96" /&gt;
  &lt;line x1="45" y1="35" x2="45" y2="99" /&gt;
  &lt;line x1="48" y1="38" x2="48" y2="102" /&gt;
  &lt;line x1="51" y1="41" x2="51" y2="105" /&gt;
  &lt;line x1="54" y1="44" x2="54" y2="108" /&gt;
  &lt;line x1="57" y1="47" x2="57" y2="111" /&gt;
  &lt;line x1="60" y1="50" x2="60" y2="114" /&gt;
  &lt;line x1="62" y1="52" x2="62" y2="117" /&gt;
  &lt;line x1="65" y1="55" x2="65" y2="120" /&gt;
  &lt;line x1="68" y1="58" x2="68" y2="123" /&gt;
  &lt;line x1="71" y1="61" x2="71" y2="126" /&gt;
  &lt;line x1="74" y1="64" x2="74" y2="129" /&gt;
  &lt;line x1="77" y1="67" x2="77" y2="131" /&gt;
  &lt;line x1="80" y1="70" x2="80" y2="134" /&gt;
  &lt;line x1="83" y1="73" x2="83" y2="137" /&gt;
  &lt;line x1="86" y1="76" x2="86" y2="140" /&gt;
  &lt;line x1="89" y1="79" x2="89" y2="143" /&gt;
  &lt;line x1="92" y1="82" x2="92" y2="146" /&gt;
  &lt;line x1="95" y1="85" x2="95" y2="149" /&gt;
  &lt;line x1="98" y1="88" x2="98" y2="152" /&gt;
  &lt;line x1="101" y1="91" x2="101" y2="155" /&gt;
  &lt;line x1="104" y1="94" x2="104" y2="158" /&gt;
  &lt;line x1="107" y1="97" x2="107" y2="161" /&gt;
  &lt;line x1="110" y1="100" x2="110" y2="164" /&gt;
  &lt;line x1="112" y1="102" x2="112" y2="167" /&gt;
  &lt;line x1="115" y1="105" x2="115" y2="170" /&gt;
  &lt;line x1="118" y1="108" x2="118" y2="173" /&gt;
  &lt;line x1="121" y1="111" x2="121" y2="176" /&gt;
  &lt;line x1="124" y1="114" x2="124" y2="179" /&gt;
  &lt;line x1="127" y1="117" x2="127" y2="181" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 127.647059,117.647059 127.647059,181.975164 10.000000,64.328105" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="74" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="12" y1="2" x2="77" y2="2" /&gt;
  &lt;line x1="15" y1="5" x2="80" y2="5" /&gt;
  &lt;line x1="18" y1="8" x2="83" y2="8" /&gt;
  &lt;line x1="21" y1="11" x2="86" y2="11" /&gt;
  &lt;line x1="24" y1="14" x2="89" y2="14" /&gt;
  &lt;line x1="27" y1="17" x2="91" y2="17" /&gt;
  &lt;line x1="30" y1="20" x2="94" y2="20" /&gt;
  &lt;line x1="33" y1="23" x2="97" y2="23" /&gt;
  &lt;line x1="36" y1="26" x2="100" y2="26" /&gt;
  &lt;line x1="39" y1="29" x2="103" y2="29" /&gt;
  &lt;line x1="42" y1="32" x2="106" y2="32" /&gt;
  &lt;line x1="45" y1="35" x2="109" y2="35" /&gt;
  &lt;line x1="48" y1="38" x2="112" y2="38" /&gt;
  &lt;line x1="51" y1="41" x2="115" y2="41" /&gt;
  &lt;line x1="54" y1="44" x2="118" y2="44" /&gt;
  &lt;line x1="57" y1="47" x2="121" y2="47" /&gt;
  &lt;line x1="60" y1="50" x2="124" y2="50" /&gt;
  &lt;line x1="62" y1="52" x2="127" y2="52" /&gt;
  &lt;line x1="65" y1="55" x2="130" y2="55" /&gt;
  &lt;line x1="68" y1="58" x2="133" y2="58" /&gt;
  &lt;line x1="71" y1="61" x2="136" y2="61" /&gt;
  &lt;line x1="74" y1="64" x2="139" y2="64" /&gt;
  &lt;line x1="77" y1="67" x2="141" y2="67" /&gt;
  &lt;line x1="80" y1="70" x2="144" y2="70" /&gt;
  &lt;line x1="83" y1="73" x2="147" y2="73" /&gt;
  &lt;line x1="86" y1="76" x2="150" y2="76" /&gt;
  &lt;line x1="89" y1="79" x2="153" y2="79" /&gt;
  &lt;line x1="92" y1="82" x2="156" y2="82" /&gt;
  &lt;line x1="95" y1="85" x2="159" y2="85" /&gt;
  &lt;line x1="98" y1="88" x2="162" y2="88" /&gt;
  &lt;line x1="101" y1="91" x2="165" y2="91" /&gt;
  &lt;line x1="104" y1="94" x2="168" y2="94" /&gt;
  &lt;line x1="107" y1="97" x2="171" y2="97" /&gt;
  &lt;line x1="110" y1="100" x2="174" y2="100" /&gt;
  &lt;line x1="112" y1="102" x2="177" y2="102" /&gt;
  &lt;line x1="115" y1="105" x2="180" y2="105" /&gt;
  &lt;line x1="118" y1="108" x2="183" y2="108" /&gt;
  &lt;line x1="121" y1="111" x2="186" y2="111" /&gt;
  &lt;line x1="124" y1="114" x2="189" y2="114" /&gt;
  &lt;line x1="127" y1="117" x2="191" y2="117" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="127" y2="117" style="stroke-width:2" /&gt;
  &lt;line x1="26" y1="0" x2="143" y2="117" /&gt;
  &lt;line x1="42" y1="0" x2="159" y2="117" /&gt;
  &lt;line x1="58" y1="0" x2="175" y2="117" /&gt;
  &lt;line x1="74" y1="0" x2="191" y2="117" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 74.328105,0.000000 191.975164,117.647059 127.647059,117.647059" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="127" y1="117" x2="191" y2="117" style="stroke-width:2" /&gt;
  &lt;line x1="127" y1="133" x2="191" y2="133" /&gt;
  &lt;line x1="127" y1="149" x2="191" y2="149" /&gt;
  &lt;line x1="127" y1="165" x2="191" y2="165" /&gt;
  &lt;line x1="127" y1="181" x2="191" y2="181" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="127" y1="117" x2="127" y2="181" style="stroke-width:2" /&gt;
  &lt;line x1="143" y1="117" x2="143" y2="181" /&gt;
  &lt;line x1="159" y1="117" x2="159" y2="181" /&gt;
  &lt;line x1="175" y1="117" x2="175" y2="181" /&gt;
  &lt;line x1="191" y1="117" x2="191" y2="181" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="127.647059,117.647059 191.975164,117.647059 191.975164,181.975164 127.647059,181.975164" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="159.811111" y="201.975164" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;1000&lt;/text&gt;
&lt;text x="211.975164" y="149.811111" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,211.975164,149.811111)"&gt;1000&lt;/text&gt;
&lt;text x="58.823529" y="143.151634" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,58.823529,143.151634)"&gt;10000&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="proxy-worker-dashboards-from-the-scheduler-dashboard"&gt;
&lt;h2&gt;Proxy Worker dashboards from the Scheduler dashboard&lt;/h2&gt;
&lt;p&gt;If you’ve used Dask.distributed then you’re probably familiar with Dask’s
scheduler dashboard, which shows the state of computations on the cluster with
a real-time interactive &lt;a class="reference external" href="https://bokeh.org"&gt;Bokeh&lt;/a&gt; dashboard. However, you may
not be aware that Dask workers also have their own dashboards, which show a
completely separate set of plots for the state of each individual worker.&lt;/p&gt;
&lt;p&gt;Historically these worker dashboards haven’t been commonly used because they
are hard to connect to: users often don’t know their addresses, or network
rules block direct web connections. Fortunately, the scheduler dashboard is
now able to proxy a connection from the user to the worker dashboard.&lt;/p&gt;
&lt;p&gt;You can access this by clicking on the “Info” tab and then selecting the
“dashboard” link next to any of the workers. You will also need to install
&lt;a class="reference external" href="https://github.com/jupyterhub/jupyter-server-proxy"&gt;jupyter-server-proxy&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install jupyter-server-proxy
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/quasiben"&gt;Ben Zaitlen&lt;/a&gt; for this fun addition.
We hope that making these plots more visible will encourage people to invest
more in developing them.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="black-everywhere"&gt;
&lt;h2&gt;Black everywhere&lt;/h2&gt;
&lt;p&gt;We now use the &lt;a class="reference external" href="https://black.readthedocs.io/"&gt;Black&lt;/a&gt; code formatter throughout
most Dask repositories. These repositories include pre-commit hooks, which we
recommend installing when developing on the project.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;checkout&lt;/span&gt; &lt;span class="n"&gt;master&lt;/span&gt;
&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;pull&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="n"&gt;master&lt;/span&gt;

&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pre&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;commit&lt;/span&gt;
&lt;span class="n"&gt;pre&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Git will then call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;black&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;flake8&lt;/span&gt;&lt;/code&gt; whenever you attempt to commit code.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/22/dask-2.0.md&lt;/span&gt;, line 300)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dask-gateway"&gt;
&lt;h1&gt;Dask Gateway&lt;/h1&gt;
&lt;p&gt;We would also like to inform readers about the relatively new &lt;a class="reference external" href="https://github.com/jcrist/dask-gateway"&gt;Dask
Gateway&lt;/a&gt; project, which enables
institutions and IT departments to manage many Dask clusters for a variety of users.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://jcrist.github.io/dask-gateway/_images/architecture.svg"
     width="70%"
     alt="Dask Gateway"&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/22/dask-2.0.md&lt;/span&gt;, line 310)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;There have been several releases since the last time we had a release blogpost.
The following people contributed to the following repositories since the 1.1.0
release on January 23rd:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;(Rick) Richard J Zamora&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Abhinav Ralhan&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adam Beberg&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alistair Miles&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Álvaro Abella Bascarán&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aploium&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bart Broere&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benjamin Zaitlen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bouwe Andela&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brett Naul&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brian Chu&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bruce Merry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Christian Hudon&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cody Johnson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dan O’Donovan&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Saxton&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Severo&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Danilo Horta&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dimplexion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elliott Sales de Andrade&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Endre Mark Borza&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Genevieve Buckley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;George Sakkis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guillaume Lemaitre&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HSR05&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hameer Abbasi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Henrique Ribeiro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Henry Pinkard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hugo&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ian Bolliger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ian Rose&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Isaiah Norton&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Janne Vuorela&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Corbett&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jorge Pessoa&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Julia Signell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JulianWgs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Justin Poehnelt&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Justin Waugh&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ksenia Bobrova&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lijo Jose&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marco Neumann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mark Bell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Michael Eaton&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Michał Jastrzębski&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nathan Matare&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nick Becker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Paweł Kordek&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Peter Andreas Entschev&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Philipp Rudiger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Philipp S. Sommer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Roma Sokolov&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ross Petchler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shyam Saladi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Søren Fuglede Jørgensen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thomas Zilio&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yu Feng&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;aaronfowles&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;amerkel2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;asmith26&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;btw08&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;gregrf&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mbarkhau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mcsoini&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;severo&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tpanza&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Adam Beberg&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benjamin Zaitlen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brett Jurman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brett Randall&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brian Chu&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caleb&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chris White&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Farrell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elliott Sales de Andrade&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;George Sakkis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;K.-Michael Aye&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Magnus Nord&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manuel Garrido&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marco Neumann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mathieu Dugré&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matt Nicolls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Michael Delgado&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Michael Spiegel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Muammar El Khatib&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nikos Tsaousis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Olivier Grisel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Peter Andreas Entschev&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sam Grayson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Torsten Wörtwein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;amerkel2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;condoratberlin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;deepthirajagopalan7&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;jukent&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;plbertrand&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml"&gt;dask/dask-ml&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Alejandro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Florian Rohrer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Julien Jerphanion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nathan Henrie&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Paul Vecchio&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ryan McCormick&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saadullah Amin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sriharsha Atyam&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-jobqueue"&gt;dask/dask-jobqueue&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Andrea Zonca&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guillaume Eynard-Bontemps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kyle Husmann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Levi Naden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matyas Selmeci&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ocaisa&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-kubernetes"&gt;dask/dask-kubernetes&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Brian Phillips&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jacob Tomlinson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Hamman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joseph Hamman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yuvi Panda&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;adam&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-examples"&gt;dask/dask-examples&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Christoph Deil&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Genevieve Buckley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ian Rose&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthias Bussonnier&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Robert Sare&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Willi Rath&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-labextension"&gt;dask/dask-labextension&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Daniel Bast&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ian Rose&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yuvi Panda&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/22/dask-2.0/"/>
    <summary>Please take the Dask User Survey for 2019.
Your response helps to prioritize future work.</summary>
    <category term="release" label="release"/>
    <published>2019-06-22T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/20/load-image-data/</id>
    <title>Load Large Image Data with Dask Array</title>
    <updated>2019-06-20T00:00:00+00:00</updated>
    <author>
      <name>John Kirkham</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;

&lt;p&gt;This post explores simple workflows to load large stacks of image data with Dask array.&lt;/p&gt;
&lt;p&gt;In particular, we start with a &lt;a class="reference external" href="https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ"&gt;directory full of TIFF
files&lt;/a&gt;
of images like the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ $ ls raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;and show how to stitch these together into large lazy arrays
using the &lt;a class="reference external" href="https://image.dask.org/en/latest/"&gt;dask-image&lt;/a&gt; library:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;raw/*.tif&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or by writing your own Dask delayed image reader function.&lt;/p&gt;
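Such a reader can be sketched with dask.delayed and da.from_delayed, provided each image's shape and dtype are known up front. Here read_one is a hypothetical stand-in for a real reader such as skimage.io.imread, and the paths are made-up examples:

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def read_one(path):
    # Hypothetical reader: a real implementation would open the TIFF at
    # `path`; here we return an empty frame of the right shape and dtype.
    return np.zeros((1024, 768), dtype=np.uint16)

paths = ["raw/stack0000.tif", "raw/stack0001.tif", "raw/stack0002.tif"]

# Each delayed call becomes one lazy chunk; shape and dtype must be declared
frames = [da.from_delayed(read_one(p), shape=(1024, 768), dtype=np.uint16)
          for p in paths]
stack = da.stack(frames, axis=0)
print(stack.shape)  # (3, 1024, 768)
```

Nothing is read until you call compute(), so the stack stays lazy no matter how many files it covers.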
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (2010, 1024, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="176" height="181" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="61" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="61" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="17" y2="68" /&gt;
  &lt;line x1="24" y1="14" x2="24" y2="75" /&gt;
  &lt;line x1="31" y1="21" x2="31" y2="82" /&gt;
  &lt;line x1="38" y1="28" x2="38" y2="89" /&gt;
  &lt;line x1="45" y1="35" x2="45" y2="96" /&gt;
  &lt;line x1="52" y1="42" x2="52" y2="103" /&gt;
  &lt;line x1="59" y1="49" x2="59" y2="110" /&gt;
  &lt;line x1="66" y1="56" x2="66" y2="117" /&gt;
  &lt;line x1="73" y1="63" x2="73" y2="124" /&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 80.588235,70.588235 80.588235,131.722564 10.000000,61.134328" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="55" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="62" y2="7" /&gt;
  &lt;line x1="24" y1="14" x2="69" y2="14" /&gt;
  &lt;line x1="31" y1="21" x2="77" y2="21" /&gt;
  &lt;line x1="38" y1="28" x2="84" y2="28" /&gt;
  &lt;line x1="45" y1="35" x2="91" y2="35" /&gt;
  &lt;line x1="52" y1="42" x2="98" y2="42" /&gt;
  &lt;line x1="59" y1="49" x2="105" y2="49" /&gt;
  &lt;line x1="66" y1="56" x2="112" y2="56" /&gt;
  &lt;line x1="73" y1="63" x2="119" y2="63" /&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="55" y1="0" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 55.850746,0.000000 126.438982,70.588235 80.588235,70.588235" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="80" y1="131" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;line x1="126" y1="70" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="80.588235,70.588235 126.438982,70.588235 126.438982,131.722564 80.588235,131.722564" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="103.513608" y="151.722564" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="146.438982" y="101.155399" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,146.438982,101.155399)"&gt;1024&lt;/text&gt;
&lt;text x="35.294118" y="116.428446" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,35.294118,116.428446)"&gt;2010&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Eventually we’ll be able to perform complex calculations on this Dask array.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-index-1.jpg"
     width="45%"
     alt="Light Microscopy data rendered with NVidia IndeX"&gt;
&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-index-2.jpg"
     width="45%"
     alt="Light Microscopy data rendered with NVidia IndeX"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: we’re not going to produce rendered images like the above in this
post. These were created with &lt;a class="reference external" href="https://developer.nvidia.com/index"&gt;NVidia
IndeX&lt;/a&gt;, a completely separate tool chain
from what is being discussed here. This post covers the first step of image
loading.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 128)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="series-overview"&gt;
&lt;h1&gt;Series Overview&lt;/h1&gt;
&lt;p&gt;A common practice in fields that acquire large amounts of imaging data is to write
out smaller acquisitions as many small files. These files can tile a larger
space, sub-sample a longer time period, and may contain multiple channels.
The acquisition techniques themselves are often state of the art, constantly
pushing the envelope in terms of how large a field of view can be acquired, at
what resolution, and at what quality.&lt;/p&gt;
&lt;p&gt;Once acquired, this data presents a number of challenges. Algorithms are often
designed and tested on very small pieces of the data and then need to be scaled
up to work on the full dataset. It may not be clear at the outset what will
actually work, so exploration still plays a very big part in the whole
process.&lt;/p&gt;
&lt;p&gt;Historically this analytical process has involved a lot of custom code. Often
it is stitched together from a series of scripts, possibly in
several different languages, that write various intermediate results to disk.
Thanks to advances in modern tooling, these processes can be significantly
improved. In this series of blog posts, we will outline ways for image
scientists to leverage different tools to move towards a high-level, friendly,
cohesive, interactive analytical pipeline.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 151)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="post-overview"&gt;
&lt;h1&gt;Post Overview&lt;/h1&gt;
&lt;p&gt;This post in particular focuses on loading and managing large stacks of image
data in parallel from Python.&lt;/p&gt;
&lt;p&gt;Loading large image data can be a complex and often unique problem. Different
groups may choose to store this across many files on disk, a commodity or
custom database solution, or they may opt to store it in the cloud. Not all
datasets within the same group may be treated the same for a variety of
reasons. In short, this means loading data is a hard and expensive problem.&lt;/p&gt;
&lt;p&gt;Despite data being stored in many different ways, often groups want to reapply
the same analytical pipeline to these datasets. However if the data pipeline is
tightly coupled to a particular way of loading the data for later analytical
steps, it may be very difficult if not impossible to reuse an existing
pipeline. In other words, there is friction between the loading and analysis
steps, which frustrates efforts to make things reusable.&lt;/p&gt;
&lt;p&gt;Having a modular and general way to load data makes it easy to present data
stored differently in a standard way. Further, having a standard way to present
data to analytical pipelines allows that part of the pipeline to focus on what
it does best: analysis! In general, this should decouple these two components in
a way that improves the experience of users involved in all parts of the
pipeline.&lt;/p&gt;
&lt;p&gt;We will use
&lt;a class="reference external" href="https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ"&gt;image data&lt;/a&gt;
generously provided by
&lt;a class="reference external" href="https://scholar.google.com/citations?user=nxwNAEgAAAAJ&amp;amp;amp;hl=en"&gt;Gokul Upadhyayula&lt;/a&gt;
at the
&lt;a class="reference external" href="http://microscopy.berkeley.edu/"&gt;Advanced Bioimaging Center&lt;/a&gt;
at UC Berkeley and discussed in
&lt;a class="reference external" href="https://science.sciencemag.org/content/360/6386/eaaq1392"&gt;this paper&lt;/a&gt;
(&lt;a class="reference external" href="https://www.biorxiv.org/content/10.1101/243352v2"&gt;preprint&lt;/a&gt;),
though the workloads presented here should work for any kind of imaging data,
or array data generally.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 188)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="load-image-data-with-dask"&gt;
&lt;h1&gt;Load image data with Dask&lt;/h1&gt;
&lt;p&gt;Let’s start again with our image data from the top of the post:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ $ ls /path/to/files/raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="load-a-single-sample-image-with-scikit-image"&gt;
&lt;h2&gt;Load a single sample image with Scikit-Image&lt;/h2&gt;
&lt;p&gt;To load a single image, we use &lt;a class="reference external" href="https://imageio.readthedocs.io/"&gt;imageio&lt;/a&gt;; below we display slices with &lt;a class="reference external" href="https://scikit-image.org/"&gt;Scikit-Image&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;glob&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/path/to/files/raw/*.tif&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;597&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;imageio&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imageio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;span class="go"&gt;(201, 1024, 768)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each filename corresponds to some 3d chunk of a larger image. We can look at a
few 2d slices of this single 3d chunk to get some context.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;skimage.io&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-sample-1.png"
     width="60%"&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-sample-2.png"
     width="60%"&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-sample-3.png"
     width="60%"&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="investigate-filename-structure"&gt;
&lt;h2&gt;Investigate Filename Structure&lt;/h2&gt;
&lt;p&gt;These are slices from only one chunk of a much larger aggregate image.
Our interest here is in combining the pieces into a large image stack.
It is common to see a naming structure in the filenames. Each
filename may encode a channel, time step, and spatial location, with each
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;lt;i&amp;gt;&lt;/span&gt;&lt;/code&gt; placeholder standing for some numeric value (possibly with units). Individual filenames may
contain more or less information and may notate it differently than shown here.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;mydata_ch&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;t_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;x_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;y_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tif&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
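&lt;p&gt;&lt;em&gt;As a sketch (the regular expression below matches the generic template above,
not the real filenames from this dataset), such fields can be pulled out of a
filename like so:&lt;/em&gt;&lt;/p&gt;

```python
import re

# Hypothetical pattern for the generic template above:
#   mydata_ch<i>_<j>t_<k>x_<l>y_<m>z.tif
pattern = re.compile(
    r"mydata_ch(?P<channel>\d+)"
    r"_(?P<t>\d+)t_(?P<x>\d+)x_(?P<y>\d+)y_(?P<z>\d+)z\.tif"
)

def parse_filename(fn):
    """Return a dict of integer indices parsed from one filename."""
    match = pattern.match(fn)
    if match is None:
        raise ValueError("unrecognized filename: " + fn)
    return {key: int(value) for key, value in match.groupdict().items()}

parse_filename("mydata_ch1_0003t_000x_001y_002z.tif")
# {'channel': 1, 't': 3, 'x': 0, 'y': 1, 'z': 2}
```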
&lt;p&gt;In principle with NumPy we might allocate a giant array and then iteratively
load images and place them into the giant array.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;full_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imageio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_location_from_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# We need to write this function&lt;/span&gt;
    &lt;span class="n"&gt;full_array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However, if our data is large then we can’t load it all at once into a single
NumPy array in memory, and instead we need to be a bit more clever to
handle it efficiently. One approach is to use &lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt;,
which handles larger-than-memory workloads easily.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="lazily-load-images-with-dask-array"&gt;
&lt;h2&gt;Lazily load images with Dask Array&lt;/h2&gt;
&lt;p&gt;Now we learn how to lazily load and stitch together image data with Dask array.
We’ll start with simple examples first and then move onto the full example with
this more complex dataset afterwards.&lt;/p&gt;
&lt;p&gt;We can delay the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;imageio.imread&lt;/span&gt;&lt;/code&gt; calls with &lt;a class="reference external" href="https://docs.dask.org/en/latest/delayed.html"&gt;Dask
Delayed&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;lazy_arrays&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imageio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;lazy_arrays&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note: here we’re assuming that all of the images have the same shape and dtype
as the sample file that we loaded above. This is not always the case. See the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask_image&lt;/span&gt;&lt;/code&gt; note below in the Future Work section for an alternative.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We haven’t yet stitched these together. We have hundreds of single-chunk Dask
arrays, each of which lazily loads a single 3d chunk of data from disk. Let’s look at a single array.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 316.15 MB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (201, 1024, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 2 Tasks &lt;/td&gt;&lt;td&gt; 1 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="174" height="194" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="34" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="120" x2="34" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="34" y1="24" x2="34" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 34.664918,24.664918 34.664918,144.664918 10.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="100" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="34" y1="24" x2="124" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="34" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="100" y1="0" x2="124" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 100.000000,0.000000 124.664918,24.664918 34.664918,24.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="34" y1="24" x2="124" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="34" y1="144" x2="124" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="34" y1="24" x2="34" y2="144" style="stroke-width:2" /&gt;
  &lt;line x1="124" y1="24" x2="124" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="34.664918,24.664918 124.664918,24.664918 124.664918,144.664918 34.664918,144.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="79.664918" y="164.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="144.664918" y="84.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,144.664918,84.664918)"&gt;1024&lt;/text&gt;
&lt;text x="12.332459" y="152.332459" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,12.332459,152.332459)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;This is a lazy 3-dimensional Dask array of a &lt;em&gt;single&lt;/em&gt; 300MB chunk of data.
That chunk is created by loading in a particular TIFF file. Normally Dask
arrays are composed of &lt;em&gt;many&lt;/em&gt; chunks. We can concatenate many of these
single-chunked Dask arrays into a multi-chunked Dask array with functions like
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.concatenate"&gt;da.concatenate&lt;/a&gt;
and
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.stack"&gt;da.stack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here we concatenate the first ten Dask arrays along a few axes, to get an
easier-to-understand picture of how this looks. Take a look at how the
shape changes as we change the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;axis=&lt;/span&gt;&lt;/code&gt; parameter, both in the table on the left
and in the image on the right.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (2010, 1024, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="176" height="181" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="61" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="61" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="17" y2="68" /&gt;
  &lt;line x1="24" y1="14" x2="24" y2="75" /&gt;
  &lt;line x1="31" y1="21" x2="31" y2="82" /&gt;
  &lt;line x1="38" y1="28" x2="38" y2="89" /&gt;
  &lt;line x1="45" y1="35" x2="45" y2="96" /&gt;
  &lt;line x1="52" y1="42" x2="52" y2="103" /&gt;
  &lt;line x1="59" y1="49" x2="59" y2="110" /&gt;
  &lt;line x1="66" y1="56" x2="66" y2="117" /&gt;
  &lt;line x1="73" y1="63" x2="73" y2="124" /&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 80.588235,70.588235 80.588235,131.722564 10.000000,61.134328" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="55" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="62" y2="7" /&gt;
  &lt;line x1="24" y1="14" x2="69" y2="14" /&gt;
  &lt;line x1="31" y1="21" x2="77" y2="21" /&gt;
  &lt;line x1="38" y1="28" x2="84" y2="28" /&gt;
  &lt;line x1="45" y1="35" x2="91" y2="35" /&gt;
  &lt;line x1="52" y1="42" x2="98" y2="42" /&gt;
  &lt;line x1="59" y1="49" x2="105" y2="49" /&gt;
  &lt;line x1="66" y1="56" x2="112" y2="56" /&gt;
  &lt;line x1="73" y1="63" x2="119" y2="63" /&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="55" y1="0" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 55.850746,0.000000 126.438982,70.588235 80.588235,70.588235" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="80" y1="131" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;line x1="126" y1="70" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="80.588235,70.588235 126.438982,70.588235 126.438982,131.722564 80.588235,131.722564" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="103.513608" y="151.722564" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="146.438982" y="101.155399" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,146.438982,101.155399)"&gt;1024&lt;/text&gt;
&lt;text x="35.294118" y="116.428446" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,35.294118,116.428446)"&gt;2010&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (201, 10240, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="113" height="187" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="12" x2="27" y2="29" /&gt;
  &lt;line x1="10" y1="24" x2="27" y2="41" /&gt;
  &lt;line x1="10" y1="36" x2="27" y2="53" /&gt;
  &lt;line x1="10" y1="48" x2="27" y2="65" /&gt;
  &lt;line x1="10" y1="60" x2="27" y2="77" /&gt;
  &lt;line x1="10" y1="72" x2="27" y2="89" /&gt;
  &lt;line x1="10" y1="84" x2="27" y2="101" /&gt;
  &lt;line x1="10" y1="96" x2="27" y2="113" /&gt;
  &lt;line x1="10" y1="108" x2="27" y2="125" /&gt;
  &lt;line x1="10" y1="120" x2="27" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="27" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 27.014952,17.014952 27.014952,137.014952 10.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="46" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="63" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="46" y1="0" x2="63" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 46.948234,0.000000 63.963186,17.014952 27.014952,17.014952" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="27" y1="17" x2="63" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="29" x2="63" y2="29" /&gt;
  &lt;line x1="27" y1="41" x2="63" y2="41" /&gt;
  &lt;line x1="27" y1="53" x2="63" y2="53" /&gt;
  &lt;line x1="27" y1="65" x2="63" y2="65" /&gt;
  &lt;line x1="27" y1="77" x2="63" y2="77" /&gt;
  &lt;line x1="27" y1="89" x2="63" y2="89" /&gt;
  &lt;line x1="27" y1="101" x2="63" y2="101" /&gt;
  &lt;line x1="27" y1="113" x2="63" y2="113" /&gt;
  &lt;line x1="27" y1="125" x2="63" y2="125" /&gt;
  &lt;line x1="27" y1="137" x2="63" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="27" y1="17" x2="27" y2="137" style="stroke-width:2" /&gt;
  &lt;line x1="63" y1="17" x2="63" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="27.014952,17.014952 63.963186,17.014952 63.963186,137.014952 27.014952,137.014952" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="45.489069" y="157.014952" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="83.963186" y="77.014952" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,83.963186,77.014952)"&gt;10240&lt;/text&gt;
&lt;text x="8.507476" y="148.507476" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,8.507476,148.507476)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (201, 1024, 7680) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="197" height="108" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="40" x2="27" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="40" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="27" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 27.988258,17.988258 27.988258,58.112379 10.000000,40.124121" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="130" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="147" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="22" y1="0" x2="39" y2="17" /&gt;
  &lt;line x1="34" y1="0" x2="51" y2="17" /&gt;
  &lt;line x1="46" y1="0" x2="63" y2="17" /&gt;
  &lt;line x1="58" y1="0" x2="75" y2="17" /&gt;
  &lt;line x1="70" y1="0" x2="87" y2="17" /&gt;
  &lt;line x1="82" y1="0" x2="99" y2="17" /&gt;
  &lt;line x1="94" y1="0" x2="111" y2="17" /&gt;
  &lt;line x1="106" y1="0" x2="123" y2="17" /&gt;
  &lt;line x1="118" y1="0" x2="135" y2="17" /&gt;
  &lt;line x1="130" y1="0" x2="147" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 130.000000,0.000000 147.988258,17.988258 27.988258,17.988258" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="27" y1="17" x2="147" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="58" x2="147" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="27" y1="17" x2="27" y2="58" style="stroke-width:2" /&gt;
  &lt;line x1="39" y1="17" x2="39" y2="58" /&gt;
  &lt;line x1="51" y1="17" x2="51" y2="58" /&gt;
  &lt;line x1="63" y1="17" x2="63" y2="58" /&gt;
  &lt;line x1="75" y1="17" x2="75" y2="58" /&gt;
  &lt;line x1="87" y1="17" x2="87" y2="58" /&gt;
  &lt;line x1="99" y1="17" x2="99" y2="58" /&gt;
  &lt;line x1="111" y1="17" x2="111" y2="58" /&gt;
  &lt;line x1="123" y1="17" x2="123" y2="58" /&gt;
  &lt;line x1="135" y1="17" x2="135" y2="58" /&gt;
  &lt;line x1="147" y1="17" x2="147" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="27.988258,17.988258 147.988258,17.988258 147.988258,58.112379 27.988258,58.112379" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="87.988258" y="78.112379" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;7680&lt;/text&gt;
&lt;text x="167.988258" y="38.050318" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,167.988258,38.050318)"&gt;1024&lt;/text&gt;
&lt;text x="8.994129" y="69.118250" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,8.994129,69.118250)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Or, if we wanted to add a new dimension, we would use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.stack&lt;/span&gt;&lt;/code&gt;. In this
case we’ve run out of easily visible dimensions, so pay more attention to the
shape listed in the table on the left than to the picture on the right.
Notice that we’ve stacked these 3d images into a 4d image.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (10, 201, 1024, 768) &lt;/td&gt; &lt;td&gt; (1, 201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="354" height="194" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="25" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="25" x2="25" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" style="stroke-width:2" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 25.412617,0.000000 25.412617,25.412617 0.000000,25.412617" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="12.706308" y="45.412617" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;10&lt;/text&gt;
&lt;text x="45.412617" y="12.706308" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,45.412617,12.706308)"&gt;1&lt;/text&gt;&lt;/p&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="95" y1="0" x2="119" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="95" y1="120" x2="119" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="95" y1="0" x2="95" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="119" y1="24" x2="119" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="95.000000,0.000000 119.664918,24.664918 119.664918,144.664918 95.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="95" y1="0" x2="185" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="119" y1="24" x2="209" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="95" y1="0" x2="119" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="185" y1="0" x2="209" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="95.000000,0.000000 185.000000,0.000000 209.664918,24.664918 119.664918,24.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="119" y1="24" x2="209" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="119" y1="144" x2="209" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="119" y1="24" x2="119" y2="144" style="stroke-width:2" /&gt;
  &lt;line x1="209" y1="24" x2="209" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="119.664918,24.664918 209.664918,24.664918 209.664918,144.664918 119.664918,144.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="164.664918" y="164.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="229.664918" y="84.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,229.664918,84.664918)"&gt;1024&lt;/text&gt;
&lt;text x="97.332459" y="152.332459" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,97.332459,152.332459)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;These are the common cases, where you have a single axis along which
you want to stitch images together.&lt;/p&gt;
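As a minimal sketch of the shape difference between these two operations (using small in-memory NumPy-backed Dask arrays in place of the lazy TIFF-backed arrays above):

```python
import dask.array as da

# Ten single-chunk arrays standing in for the lazy image arrays above
arrays = [da.ones((201, 1024, 768), chunks=(201, 1024, 768), dtype="uint16")
          for _ in range(10)]

# concatenate joins along an existing axis ...
a = da.concatenate(arrays, axis=1)
print(a.shape)   # (201, 10240, 768)

# ... while stack creates a new leading axis
b = da.stack(arrays)
print(b.shape)   # (10, 201, 1024, 768)
```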
&lt;/section&gt;
&lt;section id="full-example"&gt;
&lt;h2&gt;Full example&lt;/h2&gt;
&lt;p&gt;This works fine for combining along a single axis. However, if we need to
combine across multiple axes, we need to perform multiple concatenate steps.
Fortunately there is a simpler option, &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.block"&gt;da.block&lt;/a&gt;, which can
concatenate along multiple axes at once if you give it a nested list of dask
arrays.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;laxy_array_00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lazy_array_01&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lazy_array_10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lazy_array_11&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
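To see how the nesting maps onto axes, here is a small self-contained sketch (the `lazy_array_*` names above are placeholders, so this uses tiny NumPy-backed arrays instead):

```python
import dask.array as da

# Four single-chunk blocks arranged in a 2x2 grid
block = lambda: da.ones((201, 1024, 768), chunks=-1, dtype="uint16")

a = da.block([[block(), block()],
              [block(), block()]])

# The innermost lists concatenate along the last axis and outer lists along
# the next axis out, so a 2x2 grid doubles the last two dimensions
print(a.shape)       # (201, 2048, 1536)
print(a.numblocks)   # (1, 2, 2)
```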
&lt;p&gt;We now do the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Parse each filename to learn where it should live in the larger array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;See how many files are in each of our relevant dimensions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allocate a NumPy object-dtype array of the appropriate size, where each
element of this array will hold a single-chunk Dask array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go through our filenames and insert the proper Dask array into the right
position&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.block&lt;/span&gt;&lt;/code&gt; on the result&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This code is a bit complex, but it shows what this looks like in a real-world
setting:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Get various dimensions&lt;/span&gt;

&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;splitext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="n"&gt;remap_comps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))),&lt;/span&gt;
    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Create an empty object array to organize each chunk that loads a TIFF&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remap_comps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_ch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_stack&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Stitch together the many blocks into a single array&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 188.74 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (3, 199, 201, 1024, 768) &lt;/td&gt; &lt;td&gt; (1, 1, 201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 2985 Tasks &lt;/td&gt;&lt;td&gt; 597 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="386" height="194" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="41" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="8" x2="41" y2="8" /&gt;
  &lt;line x1="0" y1="16" x2="41" y2="16" /&gt;
  &lt;line x1="0" y1="25" x2="41" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 41.887587,0.000000 41.887587,25.412617 0.000000,25.412617" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="20.943793" y="45.412617" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;199&lt;/text&gt;
&lt;text x="61.887587" y="12.706308" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,61.887587,12.706308)"&gt;3&lt;/text&gt;&lt;/p&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="111" y1="0" x2="135" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="111" y1="120" x2="135" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="111" y1="0" x2="111" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="135" y1="24" x2="135" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="111.000000,0.000000 135.664918,24.664918 135.664918,144.664918 111.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="111" y1="0" x2="201" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="135" y1="24" x2="225" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="111" y1="0" x2="135" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="201" y1="0" x2="225" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="111.000000,0.000000 201.000000,0.000000 225.664918,24.664918 135.664918,24.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="135" y1="24" x2="225" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="135" y1="144" x2="225" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="135" y1="24" x2="135" y2="144" style="stroke-width:2" /&gt;
  &lt;line x1="225" y1="24" x2="225" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="135.664918,24.664918 225.664918,24.664918 225.664918,144.664918 135.664918,144.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="180.664918" y="164.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="245.664918" y="84.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,245.664918,84.664918)"&gt;1024&lt;/text&gt;
&lt;text x="113.332459" y="152.332459" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,113.332459,152.332459)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;That’s a 180 GB logical array, composed of around 600 chunks, each of size 300
MB. We can now do normal NumPy-like computations on this array using &lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask
Array&lt;/a&gt;, but we’ll save that for a
future post.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="c1"&gt;# array computations would work fine, and would run in low memory&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="c1"&gt;# but we&amp;#39;ll save actual computation for future posts&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 1056)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="save-data"&gt;
&lt;h1&gt;Save Data&lt;/h1&gt;
&lt;p&gt;To simplify data loading in the future, we store this in a large chunked
array format like &lt;a class="reference external" href="https://zarr.readthedocs.io/"&gt;Zarr&lt;/a&gt; using the &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.Array.to_zarr"&gt;to_zarr&lt;/a&gt;
method.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mydata.zarr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We may add additional information about the image data as &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/tutorial.html#user-attributes"&gt;attributes&lt;/a&gt;. This
makes things both simpler for future users (they can read the full dataset with
a single line using &lt;a class="reference external" href="http://docs.dask.org/en/latest/array-api.html#dask.array.from_zarr"&gt;da.from_zarr&lt;/a&gt;) and
faster, because Zarr is an &lt;em&gt;analysis-ready format&lt;/em&gt; that is efficiently
encoded for computation.&lt;/p&gt;
&lt;p&gt;Zarr uses the &lt;a class="reference external" href="http://blosc.org/"&gt;Blosc&lt;/a&gt; library for compression by default.
For scientific imaging data, we can optionally pass compression options that
provide a good tradeoff between compression ratio and speed.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mydata.zarr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;zstd&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clevel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BITSHUFFLE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 1082)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;The workload above is generic and straightforward. It works well in simple
cases and also extends well to more complex cases, provided you’re willing to
write some for-loops and parsing code around your custom logic. It works on a
single small-scale laptop as well as on a large HPC or Cloud cluster. If you have
a function that turns a filename into a NumPy array, you can generate a large
lazy Dask array using that function, &lt;a class="reference external" href="https://docs.dask.org/en/latest/delayed.html"&gt;Dask
Delayed&lt;/a&gt;, and &lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask
Array&lt;/a&gt;.&lt;/p&gt;
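&lt;p&gt;That pattern can be sketched in a few lines (a minimal illustration, assuming Dask is installed; &lt;code&gt;read&lt;/code&gt; here is a hypothetical stand-in for any function that turns a filename into a NumPy array, such as &lt;code&gt;skimage.io.imread&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical reader: any function mapping a filename to a
# NumPy array works here (e.g. skimage.io.imread for TIFFs)
def read(filename):
    return np.ones((4, 5), dtype=np.uint8)

filenames = ["raw/0.tif", "raw/1.tif", "raw/2.tif"]

# Wrap each call so it becomes a lazy task instead of running now
lazy_values = [dask.delayed(read)(fn) for fn in filenames]

# Tell Dask the shape and dtype that each task will produce
arrays = [da.from_delayed(v, shape=(4, 5), dtype=np.uint8)
          for v in lazy_values]

# Stack the per-file arrays into one logical 3-d Dask array;
# no file is actually read until we call .compute()
a = da.stack(arrays, axis=0)
```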
&lt;section id="dask-image"&gt;
&lt;h2&gt;Dask Image&lt;/h2&gt;
&lt;p&gt;However, we can make things a bit easier for users if we specialize a bit. For
example the &lt;a class="reference external" href="https://image.dask.org/en/latest/"&gt;Dask Image&lt;/a&gt; library has a
parallel image reader function, which automates much of our work above in the
simple case.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;raw/*.tif&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Similarly libraries like &lt;a class="reference external" href="https://xarray.pydata.org/en/stable/"&gt;Xarray&lt;/a&gt; have
readers for other file formats, like GeoTIFF.&lt;/p&gt;
&lt;p&gt;As practitioners in a domain do more and more work like what we did above, they
tend to capture common patterns in domain-specific libraries, which increases the
accessibility and user base of these tools.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="gpus"&gt;
&lt;h2&gt;GPUs&lt;/h2&gt;
&lt;p&gt;If we have special hardware lying around, like a few GPUs, we can move the data
over to it and perform computations with a library like CuPy, which mimics
NumPy very closely. We then benefit from the same operations listed above, but
with the added performance of GPUs behind them.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cp&lt;/span&gt;
&lt;span class="n"&gt;a_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="computation"&gt;
&lt;h2&gt;Computation&lt;/h2&gt;
&lt;p&gt;Finally, in future blogposts we plan to talk about how to compute on our large
Dask arrays using common image-processing workloads like overlapping stencil
functions, segmentation and deconvolution, and integrating with other libraries
like ITK.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/20/load-image-data/"/>
    <category term="dask-image" label="dask-image"/>
    <category term="python" label="python"/>
    <category term="scikit-image" label="scikit-image"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-20T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/19/python-gpus-status-update/</id>
    <title>Python and GPUs: A Status Update</title>
    <updated>2019-06-19T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This blogpost was delivered in talk form at the recent &lt;a class="reference external" href="https://pasc19.pasc-conference.org/"&gt;PASC
2019&lt;/a&gt; conference.
&lt;a class="reference external" href="https://docs.google.com/presentation/d/e/2PACX-1vSajAH6FzgQH4OwOJD5y-t9mjF9tTKEeljguEsfcjavp18pL4LkpABy4lW2uMykIUvP2dC-1AmhCq6l/pub?start=false&amp;amp;amp;loop=false&amp;amp;amp;delayms=60000"&gt;Slides for that talk are
here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 14)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;
&lt;h1&gt;Executive Summary&lt;/h1&gt;
&lt;p&gt;We’re improving the state of scalable GPU computing in Python.&lt;/p&gt;
&lt;p&gt;This post lays out the current status, and describes future work.
It also summarizes and links to several other blogposts from recent months that drill down into different topics for the interested reader.&lt;/p&gt;
&lt;p&gt;Briefly, we cover the following categories:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Python libraries written in CUDA like CuPy and RAPIDS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python-CUDA compilers, specifically Numba&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scaling these libraries out with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network communication with UCX&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Packaging with Conda&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 29)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performance-of-gpu-accelerated-python-libraries"&gt;
&lt;h1&gt;Performance of GPU accelerated Python Libraries&lt;/h1&gt;
&lt;p&gt;Probably the easiest way for a Python programmer to get access to GPU
performance is to use a GPU-accelerated Python library. These provide a set of
common operations that are well tuned and integrate well together.&lt;/p&gt;
&lt;p&gt;Many users know libraries for deep learning like PyTorch and TensorFlow, but
there are several others for more general purpose computing. These tend to copy
the APIs of popular Python projects:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU: &lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU (again): &lt;a class="reference external" href="https://github.com/google/jax"&gt;Jax&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pandas on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cudf/nightly/"&gt;RAPIDS cuDF&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cuml/nightly/"&gt;RAPIDS cuML&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These libraries build GPU accelerated variants of popular Python
libraries like NumPy, Pandas, and Scikit-Learn. In order to better understand
the relative performance differences,
&lt;a class="reference external" href="https://github.com/pentschev"&gt;Peter Entschev&lt;/a&gt; recently put together a
&lt;a class="reference external" href="https://github.com/pentschev/pybench"&gt;benchmark suite&lt;/a&gt; to help with comparisons.
He has produced the following image showing the relative speedup between GPU
and CPU:&lt;/p&gt;
&lt;style&gt;
.vega-actions a {
    margin-right: 12px;
    color: #757575;
    font-weight: normal;
    font-size: 13px;
}
.error {
    color: red;
}
&lt;/style&gt;
&lt;script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega@5"&gt;&lt;/script&gt;
&lt;script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-lite@3.3.0"&gt;&lt;/script&gt;
&lt;script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-embed@4"&gt;&lt;/script&gt;
&lt;div id="vis"&gt;&lt;/div&gt;
&lt;p&gt;There are lots of interesting results there.
Peter goes into more depth on this in &lt;a class="reference external" href="https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks"&gt;his blogpost&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;More broadly though, we see that there is variability in performance.
Our mental model for what is fast and slow on the CPU doesn’t necessarily
carry over to the GPU. Fortunately though, due to consistent APIs, users that are
familiar with Python can easily experiment with GPU acceleration without
learning CUDA.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 78)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="numba-compiling-python-to-cuda"&gt;
&lt;h1&gt;Numba: Compiling Python to CUDA&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;See also this &lt;a class="reference external" href="https://blog.dask.org/2019/04/09/numba-stencil"&gt;recent blogpost about Numba
stencils&lt;/a&gt; and the attached &lt;a class="reference external" href="https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c"&gt;GPU
notebook&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The built-in operations in GPU libraries like CuPy and RAPIDS cover most common
operations. However, in real-world settings we often find messy situations
that require writing a little bit of custom code. Switching down to C/C++/CUDA
in these cases can be challenging, especially for users that are primarily
Python developers. This is where Numba can come in.&lt;/p&gt;
&lt;p&gt;Python has this same problem on the CPU as well. Users often couldn’t be
bothered to learn C/C++ to write fast custom code. To address this there are
tools like Cython or Numba, which let Python programmers write fast numeric
code without learning much beyond the Python language.&lt;/p&gt;
&lt;p&gt;For example, Numba accelerates the for-loop style code below about 500x on the
CPU, from slow Python speeds up to fast C/Fortran speeds.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;  &lt;span class="c1"&gt;# We added these two lines for a 500x speedup&lt;/span&gt;

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;    &lt;span class="c1"&gt;# We added these two lines for a 500x speedup&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The ability to drop down to low-level performant code without context switching
out of Python is useful, particularly if you don’t already know C/C++ or
have a compiler chain set up for you (which is the case for most Python users
today).&lt;/p&gt;
&lt;p&gt;This benefit is even more pronounced on the GPU. While many Python programmers
know a little bit of C, very few of them know CUDA. Even if they did, they
would probably have difficulty in setting up the compiler tools and development
environment.&lt;/p&gt;
&lt;p&gt;Enter &lt;a class="reference external" href="https://numba.pydata.org/numba-doc/dev/cuda/index.html"&gt;numba.cuda.jit&lt;/a&gt;,
Numba’s backend for CUDA. Numba.cuda.jit allows Python users to author,
compile, and run CUDA code, written in Python, interactively without leaving a
Python session. Here is an image of writing a stencil computation that
smooths a 2d image, all from within a Jupyter Notebook:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/numba.cuda.jit.png"
     width="100%"
     alt="Numba.cuda.jit in a Jupyter Notebook"&gt;&lt;/p&gt;
&lt;p&gt;Here is a simplified comparison of Numba CPU/GPU code to compare programming
style.
The GPU code gets a 200x speed improvement over a single CPU core.&lt;/p&gt;
&lt;section id="cpu-600-ms"&gt;
&lt;h2&gt;CPU – 600 ms&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or if we use the fancy numba.stencil decorator …&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
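&lt;p&gt;For reference, the same 3x3 smoothing can also be written with plain NumPy slicing. This is a vectorized sketch (assuming only NumPy) of the computation that the compiled versions here are performing:&lt;/p&gt;

```python
import numpy as np

def smooth_numpy(x):
    # Sum the 3x3 neighborhood of every interior point, then divide by 9,
    # matching the stencil versions above (border cells are left as zero)
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = (
        x[:-2, :-2]  + x[:-2, 1:-1]  + x[:-2, 2:]  +
        x[1:-1, :-2] + x[1:-1, 1:-1] + x[1:-1, 2:] +
        x[2:, :-2]   + x[2:, 1:-1]   + x[2:, 2:]
    ) // 9
    return out

x = np.arange(25, dtype=np.int64).reshape(5, 5)
y = smooth_numpy(x)
```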
&lt;/section&gt;
&lt;section id="gpu-3-ms"&gt;
&lt;h2&gt;GPU – 3 ms&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                     &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                     &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Numba.cuda.jit has been out in the wild for years.
It’s accessible, mature, and fun to play with.
If you have a machine with a GPU in it and some curiosity
then we strongly recommend that you try it out.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;
&lt;span class="c1"&gt;# or&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba.cuda&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;/section&gt;
&lt;/section&gt;
&lt;section id="scaling-with-dask"&gt;
&lt;h1&gt;Scaling with Dask&lt;/h1&gt;
&lt;p&gt;As mentioned in previous blogposts
(
&lt;a class="reference external" href="https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps"&gt;1&lt;/a&gt;,
&lt;a class="reference external" href="https://blog.dask.org/2019/01/13/dask-cudf-first-steps"&gt;2&lt;/a&gt;,
&lt;a class="reference external" href="https://blog.dask.org/2019/03/04/building-gpu-groupbys"&gt;3&lt;/a&gt;,
&lt;a class="reference external" href="https://blog.dask.org/2019/03/18/dask-nep18"&gt;4&lt;/a&gt;
)
we’ve been generalizing &lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt; to operate not just with
Numpy arrays and Pandas dataframes, but with anything that looks enough like
Numpy (like &lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt; or
&lt;a class="reference external" href="https://sparse.pydata.org/en/latest/"&gt;Sparse&lt;/a&gt; or
&lt;a class="reference external" href="https://github.com/google/jax"&gt;Jax&lt;/a&gt;) or enough like Pandas (like &lt;a class="reference external" href="https://docs.rapids.ai/api/cudf/nightly/"&gt;RAPIDS
cuDF&lt;/a&gt;)
to scale those libraries out too. This is working out well. Here is a brief
video showing Dask Array computing an SVD in parallel, and what happens
when we swap out the Numpy library for CuPy.&lt;/p&gt;
&lt;iframe width="560"
        height="315"
        src="https://www.youtube.com/embed/QyyxpzNPuIE?start=1046"
        frameborder="0"
        allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
        allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;We see about a 10x speed improvement on the computation. Most
importantly, we were able to switch between a CPU implementation and a GPU
implementation with a small one-line change, while continuing to use the
sophisticated algorithms in Dask Array, like its parallel SVD
implementation.&lt;/p&gt;
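&lt;p&gt;That one-line change amounts to swapping which array module backs the data. Here is a minimal in-memory sketch of the idea (assuming only NumPy is installed; on a GPU machine you would swap in CuPy, and Dask Array applies the same trick per chunk):&lt;/p&gt;

```python
import numpy as np

xp = np  # on a GPU machine: `import cupy; xp = cupy`

rng = np.random.default_rng(0)
x = xp.asarray(rng.standard_normal((500, 50)))

# Everything below is identical for either backend
u, s, vt = xp.linalg.svd(x, full_matrices=False)
err = float(abs(u @ (s[:, None] * vt) - x).max())
```

&lt;p&gt;With NumPy this runs on the CPU; pointing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;xp&lt;/span&gt;&lt;/code&gt; at CuPy runs the same lines on the GPU.&lt;/p&gt;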
&lt;p&gt;We also saw a relative slowdown in communication. In general almost all
non-trivial Dask + GPU work today is becoming communication-bound. We’ve
gotten fast enough at computation that the relative importance of communication
has grown significantly. We’re working to resolve this with our next topic,
UCX.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="communication-with-ucx"&gt;
&lt;h1&gt;Communication with UCX&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;See &lt;a class="reference external" href="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9679/s9679-ucx-python-a-flexible-communication-library-for-python-applications.mp4"&gt;this talk&lt;/a&gt; by &lt;a class="reference external" href="https://github.com/Akshay-Venkatesh"&gt;Akshay
Venkatesh&lt;/a&gt; or view &lt;a class="reference external" href="https://www.slideshare.net/MatthewRocklin/ucxpython-a-flexible-communication-library-for-python-applications"&gt;the
slides&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Also see &lt;a class="reference external" href="https://blog.dask.org/2019/06/09/ucx-dgx"&gt;this recent blogpost about UCX and
Dask&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We’ve been integrating the &lt;a class="reference external" href="https://openucx.org"&gt;OpenUCX&lt;/a&gt; library into Python
with &lt;a class="reference external" href="https://github.com/rapidsai/ucx-py"&gt;UCX-Py&lt;/a&gt;. UCX provides uniform access
to transports like TCP, InfiniBand, shared memory, and NVLink. UCX-Py is the
first time that access to many of these transports has been easily accessible
from the Python language.&lt;/p&gt;
&lt;p&gt;Using UCX and Dask together we’re able to get significant speedups. Here is a
trace of the SVD computation from earlier, both before and after adding UCX:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_lcc_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;strong&gt;After UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_dgx_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;p&gt;There is still a great deal to do here though (the blogpost linked above has
several items in the Future Work section).&lt;/p&gt;
&lt;p&gt;People can try out UCX and UCX-Py with highly experimental conda packages:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;jakirkham&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="n"&gt;cudatoolkit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;9.2&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="o"&gt;=*=&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.7&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
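&lt;p&gt;Once installed, UCX can be selected as the communication protocol when launching the scheduler and workers. The session below is illustrative only; the exact flags and addresses depend on your version and network:&lt;/p&gt;

```console
$ dask-scheduler --protocol ucx
Scheduler running at ucx://10.0.0.1:8786

$ dask-worker ucx://10.0.0.1:8786
```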
&lt;p&gt;We hope that this work will also benefit non-GPU users on HPC systems with
InfiniBand, or even users on consumer hardware, thanks to the easy access to
shared memory communication.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="packaging"&gt;
&lt;h1&gt;Packaging&lt;/h1&gt;
&lt;p&gt;In an &lt;a class="reference external" href="https://matthewrocklin.com/blog/work/2018/12/17/gpu-python-challenges"&gt;earlier blogpost&lt;/a&gt;
we discussed the challenges around installing the wrong versions of CUDA
enabled packages that don’t match the CUDA driver installed on the system.
Fortunately due to recent work from &lt;a class="reference external" href="https://github.com/seibert"&gt;Stan Seibert&lt;/a&gt;
and &lt;a class="reference external" href="https://github.com/msarahan"&gt;Michael Sarahan&lt;/a&gt; at Anaconda, Conda 4.7 now
has a special &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cuda&lt;/span&gt;&lt;/code&gt; meta-package that is set to the version of the installed
driver. This should make it much easier for users in the future to install the
correct package.&lt;/p&gt;
&lt;p&gt;Conda 4.7 was just released, and comes with many new features beyond the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cuda&lt;/span&gt;&lt;/code&gt; meta-package. You can read more about it &lt;a class="reference external" href="https://www.anaconda.com/how-we-made-conda-faster-4-7/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;There is still plenty of work to do in the packaging space today.
Everyone who builds conda packages does it their own way,
resulting in headache and heterogeneity.
This is largely due to not having centralized infrastructure
to build and test CUDA enabled packages,
like we have in &lt;a class="reference external" href="https://conda-forge.org"&gt;Conda Forge&lt;/a&gt;.
Fortunately, the Conda Forge community is working together with Anaconda and
NVIDIA to help resolve this, though that will likely take some time.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="summary"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;This post gave an update on the status of some of the efforts behind GPU
computing in Python. It also provided a variety of links for further reading.
We include them below if you would like to learn more:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.google.com/presentation/d/e/2PACX-1vSajAH6FzgQH4OwOJD5y-t9mjF9tTKEeljguEsfcjavp18pL4LkpABy4lW2uMykIUvP2dC-1AmhCq6l/pub?start=false&amp;amp;amp;loop=false&amp;amp;amp;delayms=60000"&gt;Slides&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU: &lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU (again): &lt;a class="reference external" href="https://github.com/google/jax"&gt;Jax&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pandas on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cudf/nightly/"&gt;RAPIDS cuDF&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cuml/nightly/"&gt;RAPIDS cuML&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/pentschev/pybench"&gt;Benchmark suite&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c"&gt;Numba CUDA JIT notebook&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9679/s9679-ucx-python-a-flexible-communication-library-for-python-applications.mp4"&gt;A talk on UCX&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://blog.dask.org/2019/06/09/ucx-dgx"&gt;A blogpost on UCX and Dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://www.anaconda.com/how-we-made-conda-faster-4-7/"&gt;Conda 4.7&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;script&gt;
  var spec = {
  "config": {
    "view": {
      "width": 300,
      "height": 200
    },
    "mark": {
      "tooltip": null
    },
    "axis": {
      "grid": false,
      "labelColor": "#666666",
      "labelFontSize": 16,
      "titleColor": "#666666",
      "titleFontSize": 20
    },
    "axisX": {
      "labelAngle": -30,
      "labelColor": "#666666",
      "labelFontSize": 0,
      "titleColor": "#666666",
      "titleFontSize": 0
    },
    "header": {
      "labelAngle": -20,
      "labelColor": "#666666",
      "labelFontSize": 16,
      "titleColor": "#666666",
      "titleFontSize": 20
    },
    "legend": {
      "fillColor": "#fefefe",
      "labelColor": "#666666",
      "labelFontSize": 18,
      "padding": 10,
      "strokeColor": "gray",
      "titleColor": "#666666",
      "titleFontSize": 18
    }
  },
  "data": {
    "name": "data-4957f64f65957150f8029f7df2e6936f"
  },
  "facet": {
    "column": {
      "type": "nominal",
      "field": "operation",
      "sort": {
        "field": "speedup",
        "op": "sum",
        "order": "descending"
      },
      "title": "Operation"
    }
  },
  "spec": {
    "layer": [
      {
        "mark": {
          "type": "bar",
          "fontSize": 18,
          "opacity": 1.0
        },
        "encoding": {
          "color": {
            "type": "nominal",
            "field": "size",
            "scale": {
              "domain": [
                "800MB",
                "8MB"
              ],
              "range": [
                "#7306ff",
                "#36c9dd"
              ]
            },
            "title": "Array Size"
          },
          "x": {
            "type": "nominal",
            "field": "size"
          },
          "y": {
            "type": "quantitative",
            "axis": {
              "title": "GPU Speedup Over CPU"
            },
            "field": "speedup",
            "scale": {
              "domain": [
                0,
                1000
              ],
              "type": "symlog"
            },
            "stack": null
          }
        },
        "height": 300,
        "width": 50
      },
      {
        "layer": [
          {
            "mark": {
              "type": "text",
              "dy": -5
            },
            "encoding": {
              "color": {
                "type": "nominal",
                "field": "size",
                "scale": {
                  "domain": [
                    "800MB",
                    "8MB"
                  ],
                  "range": [
                    "#7306ff",
                    "#36c9dd"
                  ]
                },
                "title": "Array Size"
              },
              "text": {
                "type": "quantitative",
                "field": "speedup"
              },
              "x": {
                "type": "nominal",
                "field": "size"
              },
              "y": {
                "type": "quantitative",
                "axis": {
                  "title": "GPU Speedup Over CPU"
                },
                "field": "speedup",
                "scale": {
                  "domain": [
                    0,
                    1000
                  ],
                  "type": "symlog"
                },
                "stack": null
              }
            },
            "height": 300,
            "width": 50
          },
          {
            "mark": {
              "type": "text",
              "dy": 7
            },
            "encoding": {
              "color": {
                "type": "nominal",
                "field": "size",
                "scale": {
                  "domain": [
                    "800MB",
                    "8MB"
                  ],
                  "range": [
                    "#7306ff",
                    "#36c9dd"
                  ]
                },
                "title": "Array Size"
              },
              "text": {
                "type": "quantitative",
                "field": "speedup"
              },
              "x": {
                "type": "nominal",
                "field": "size"
              },
              "y": {
                "type": "quantitative",
                "axis": {
                  "title": "GPU Speedup Over CPU"
                },
                "field": "speedup",
                "scale": {
                  "domain": [
                    0,
                    1000
                  ],
                  "type": "symlog"
                },
                "stack": null
              }
            },
            "height": 300,
            "width": 50
          }
        ]
      }
    ]
  },
  "$schema": "https://vega.github.io/schema/vega-lite/v3.3.0.json",
  "datasets": {
    "data-4957f64f65957150f8029f7df2e6936f": [
      {
        "operation": "FFT",
        "speedup": 5.3,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "FFT",
        "speedup": 210.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Sum",
        "speedup": 8.3,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Sum",
        "speedup": 66.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Standard Deviation",
        "speedup": 1.1,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Standard Deviation",
        "speedup": 3.5,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Elementwise",
        "speedup": 150.0,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Elementwise",
        "speedup": 270.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Matrix Multiplication",
        "speedup": 18.0,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Matrix Multiplication",
        "speedup": 11.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Array Slicing",
        "speedup": 3.6,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Array Slicing",
        "speedup": 190.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "SVD",
        "speedup": 1.5,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "SVD",
        "speedup": 17.0,
        "shape0": 10000,
        "shape1": 1000,
        "shape": "10000x1000",
        "size": "800MB"
      },
      {
        "operation": "Stencil",
        "speedup": 5.1,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Stencil",
        "speedup": 150.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      }
    ]
  }
};

  var embedOpt = {"mode": "vega-lite"};

  function showError(el, error){
      el.innerHTML = ('&lt;div class="error" style="color:red;"&gt;'
                      + '&lt;p&gt;JavaScript Error: ' + error.message + '&lt;/p&gt;'
                      + "&lt;p&gt;This usually means there's a typo in your chart specification. "
                      + "See the javascript console for the full traceback.&lt;/p&gt;"
                      + '&lt;/div&gt;');
      throw error;
  }
  vegaEmbed("#vis", spec, embedOpt)
    .catch(error =&gt; showError(document.getElementById("vis"), error));
&lt;/script&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/19/python-gpus-status-update/"/>
    <summary>This blogpost was delivered in talk form at the recent PASC
2019 conference.
Slides for that talk are
here.</summary>
    <category term="python" label="python"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-19T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/12/dask-on-hpc/</id>
    <title>Dask on HPC</title>
    <updated>2019-06-12T00:00:00+00:00</updated>
    <author>
      <name>Joe Hamman (NCAR)</name>
    </author>
    <content type="html">&lt;p&gt;We analyze large datasets on HPC systems with Dask, a parallel computing
library that integrates well with the existing Python software ecosystem, and
works comfortably with native HPC hardware.&lt;/p&gt;
&lt;p&gt;This article explains why this approach makes sense for us.
Our motivation is to share our experiences with our colleagues,
and to highlight opportunities for future work.&lt;/p&gt;
&lt;p&gt;We start with six reasons why we use Dask,
followed by seven issues that affect us today.&lt;/p&gt;
&lt;section id="reasons-why-we-use-dask"&gt;

&lt;section id="ease-of-use"&gt;
&lt;h2&gt;1. Ease of use&lt;/h2&gt;
&lt;p&gt;Dask extends libraries like Numpy, Pandas, and Scikit-learn, which are well-known APIs for scientists and engineers. It also extends simpler APIs for
multi-node multiprocessing. This makes it easy for our existing user base to
get up to speed.&lt;/p&gt;
&lt;p&gt;By abstracting the parallelism away from the user/developer, our analysis tools can be written by non-experts in computer science, such as the scientists
themselves, which lets our software engineers take on a supporting role rather than a leadership role.
Experience has shown that, with tools like Dask and Jupyter, scientists spend less time coding and more time thinking about science, as they should.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="smooth-hpc-integration"&gt;
&lt;h2&gt;2. Smooth HPC integration&lt;/h2&gt;
&lt;p&gt;With tools like &lt;a class="reference external" href="https://jobqueue.dask.org"&gt;Dask Jobqueue&lt;/a&gt; and &lt;a class="reference external" href="https://mpi.dask.org"&gt;Dask MPI&lt;/a&gt; there is no need for the boilerplate shell scripts commonly required by job queueing systems.&lt;/p&gt;
&lt;p&gt;Dask interacts natively with our existing job schedulers (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SLURM&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SGE&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LSF&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;PBS&lt;/span&gt;&lt;/code&gt;/…)
so there is no additional system to set up and manage between users and IT.
All the infrastructure that we need is already in place.&lt;/p&gt;
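&lt;p&gt;For example, on a SLURM system Dask Jobqueue composes and submits batch scripts roughly like the one below on our behalf. This is an illustrative sketch; the directives and worker options depend entirely on site configuration:&lt;/p&gt;

```bash
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p regular
#SBATCH -n 1
#SBATCH --cpus-per-task=24
#SBATCH --mem=100G
#SBATCH -t 01:00:00

# Each job starts one Dask worker that connects back to the scheduler
dask-worker tcp://192.168.0.1:8786 --nthreads 24 --memory-limit 100GB
```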
&lt;p&gt;Interactive analysis at scale is powerful, and lets
us use our existing infrastructure in new ways.
Auto scaling improves our occupancy and helps with acceptance by HPC operators / owners.
Dask’s resilience against the death of all or part of its workers offers new ways of leveraging job-preemption when co-locating classical HPC workloads with analytics jobs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="aimed-for-scientific-processing"&gt;
&lt;h2&gt;3. Aimed for Scientific Processing&lt;/h2&gt;
&lt;p&gt;In addition to being integrated with the Scipy and PyData software ecosystems,
Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like Xarray, which already have strong support for scientific data
formats and processing, and with the C/C++/Fortran codes that commonly underlie Python libraries.&lt;/p&gt;
&lt;p&gt;This native support is one of the major advantages that we’ve seen of Dask over Apache Spark.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="versatility-of-apis"&gt;
&lt;h2&gt;4. Versatility of APIs&lt;/h2&gt;
&lt;p&gt;And yet Dask is not designed for any particular workflow, but instead can
provide infrastructure to cover a variety of different problems within an
institution. Many different kinds of workloads are possible:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;You can handle NumPy arrays or Pandas DataFrames at scale, for numerical work or data analysis and cleaning,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can handle collections of arbitrary objects, like JSON records, text, or log files,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can express more arbitrary task or job scheduling workloads with Dask Delayed, or real time and reactive processing with Dask Futures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
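&lt;p&gt;As a minimal sketch of the task-scheduling style (with toy stand-in functions, not a real workload), a pipeline written with Dask Delayed looks like ordinary Python: each call is recorded lazily and the resulting task graph runs in parallel on &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute()&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;

```python
import dask

# Toy stand-ins for load / clean / summarize stages; each decorated call
# builds a node in the task graph instead of running immediately.
@dask.delayed
def load(i):
    return list(range(i))

@dask.delayed
def clean(data):
    return [x for x in data if x % 2 == 0]

@dask.delayed
def summarize(parts):
    return sum(len(p) for p in parts)

parts = [clean(load(i)) for i in range(5)]
total = summarize(parts)          # nothing has run yet
result = total.compute()          # executes the whole graph
print(result)                     # 6
```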
&lt;p&gt;Dask covers and simplifies many of the HPC workflows we’ve seen over the years. Workflows that were previously implemented with job arrays, simple MPI (e.g. mpi4py), or plain bash scripts are often easier for our users to express with Dask.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="versatility-of-infrastructure"&gt;
&lt;h2&gt;5. Versatility of Infrastructure&lt;/h2&gt;
&lt;p&gt;Dask is compatible with laptops, servers, HPC systems, and cloud computing. Moving between these environments requires very little code adaptation, which reduces the burden of rewriting analyses as they migrate between systems, such as from a laptop to a supercomputer, or from a supercomputer to the cloud.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Local machines&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# HPC Job Schedulers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SLURMCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SGECluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SLURMCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;default&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ABCD1234&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Hadoop/Spark clusters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_yarn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YARNCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;environment.tar.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_vcores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cloud/Kubernetes clusters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_kubernetes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KubeCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KubeCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod_spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask is more than just a tool to us; it is a gateway to thinking about a completely different way of providing computing infrastructure to our users. Dask opens up the door to cloud computing technologies (such as elastic scaling and object storage) and makes us rethink what an HPC center should really look like.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cost-and-collaboration"&gt;
&lt;h2&gt;6. Cost and Collaboration&lt;/h2&gt;
&lt;p&gt;Dask is free and open source, which means we do not have to rebalance our budget and staff to meet the immediate need for data analysis tools.
We don’t have to pay for licenses, and we can change the code when necessary. The HPC community has good representation among Dask developers, so it’s easy for us to participate and our concerns are well understood.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/12/dask-on-hpc.md&lt;/span&gt;, line 100)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="what-needs-work"&gt;
&lt;h1&gt;What needs work&lt;/h1&gt;
&lt;section id="heterogeneous-resources-handling"&gt;
&lt;h2&gt;1. Heterogeneous resources handling&lt;/h2&gt;
&lt;p&gt;Often we want to include different kinds of HPC nodes in the same deployment.
This includes situations like the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Workers with low or high memory,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers with GPUs,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers from different node pools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dask provides some support for this heterogeneity already, but not enough.
We see two major opportunities for improvement.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Tools like Dask-Jobqueue should make it easier to manage multiple worker
pools within the same cluster. Currently the deployment solution assumes
homogeneity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It should be easier for users to specify which parts of a computation
require different hardware. The solution today works, but requires more
detail from the user than is ideal.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
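&lt;p&gt;The works-but-verbose mechanism referred to above is Dask’s abstract worker resources. A sketch, here on an in-process cluster for illustration; on a real deployment each worker would instead be started with something like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-worker&lt;/span&gt; &lt;span class="pre"&gt;--resources&lt;/span&gt; &lt;span class="pre"&gt;"GPU=1"&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;

```python
from dask.distributed import Client, LocalCluster

# Declare an abstract "GPU" resource on a local, in-process cluster.
cluster = LocalCluster(n_workers=1, processes=False, resources={"GPU": 1})
client = Client(cluster)

def square(x):
    return x ** 2

# Only workers advertising a GPU resource may run this task; the user
# must annotate every such task explicitly, which is the verbosity we
# would like to reduce.
future = client.submit(square, 4, resources={"GPU": 1})
result = future.result()
print(result)                     # 16
client.close()
cluster.close()
```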
&lt;/section&gt;
&lt;section id="coarse-grained-diagnostics-and-history"&gt;
&lt;h2&gt;2. Coarse-Grained Diagnostics and History&lt;/h2&gt;
&lt;p&gt;Dask provides a number of profiling tools that deliver real-time diagnostics at the individual task-level, but there is no way today to analyze or profile your Dask application at a coarse-grained level, and no built-in way to track performance over long periods of time.&lt;/p&gt;
&lt;p&gt;Having more tools to analyze bulk performance would be helpful when making design decisions and future architecture choices.&lt;/p&gt;
&lt;p&gt;Having the ability to persist the history of computations (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute()&lt;/span&gt;&lt;/code&gt; calls)
and of the tasks executed on a scheduler would also be helpful for tracking problems and identifying potential performance improvements.&lt;/p&gt;
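&lt;p&gt;One possible building block already exists for the single-machine schedulers: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.diagnostics.Profiler&lt;/span&gt;&lt;/code&gt; records per-task timings, which could be persisted after each &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute()&lt;/span&gt;&lt;/code&gt; call. A sketch (the JSON log format below is our own invention, not a Dask convention):&lt;/p&gt;

```python
import json
import dask.array as da
from dask.diagnostics import Profiler

# Run a small computation under the profiler on the local scheduler.
x = da.ones((1000, 1000), chunks=(250, 250))
with Profiler() as prof:
    total = x.sum().compute()

# prof.results holds one record per executed task:
# (key, task, start_time, end_time, worker_id).
history = [
    {"task": str(r.key), "start": r.start_time, "stop": r.end_time}
    for r in prof.results
]
# e.g. append this to a run log for later coarse-grained analysis
log_line = json.dumps({"ntasks": len(history)})
print(total, len(history))
```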
&lt;/section&gt;
&lt;section id="scheduler-performance-on-large-graphs"&gt;
&lt;h2&gt;3. Scheduler Performance on Large Graphs&lt;/h2&gt;
&lt;p&gt;HPC users want to analyze Petabyte datasets on clusters of thousands of large nodes.&lt;/p&gt;
&lt;p&gt;While Dask can theoretically handle this scale, it does tend to slow down,
reducing the pleasure of interactive large-scale computing. Handling millions of tasks can lead to tens of seconds of latency before a computation actually starts. This is perfectly fine for our Dask batch jobs, but frustrates interactive Jupyter users.&lt;/p&gt;
&lt;p&gt;Much of this slowdown is due to task-graph construction time and centralized scheduling, both of which can be accelerated through a variety of means. We expect that, with some cleverness, we can increase the scale at which Dask continues to run smoothly by another order of magnitude.&lt;/p&gt;
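&lt;p&gt;A rough way to see where that latency comes from is to count tasks before anything runs; graph construction and scheduling costs grow with this number, so choosing larger chunks is the usual first mitigation (the sizes below are arbitrary):&lt;/p&gt;

```python
import time
import dask.array as da

start = time.perf_counter()
x = da.ones((20000, 20000), chunks=(500, 500))   # 40 x 40 = 1600 chunks
y = (x + x.T).mean()                             # still lazy: no data touched
n_tasks = len(y.__dask_graph__())                # tasks the scheduler must track
elapsed = time.perf_counter() - start
print(n_tasks, round(elapsed, 3))
```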
&lt;/section&gt;
&lt;section id="launch-batch-jobs-with-mpi"&gt;
&lt;h2&gt;4. &lt;del&gt;Launch Batch Jobs with MPI&lt;/del&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This issue was resolved while we prepared this blogpost.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Most Dask workflows today are interactive. People log into a Jupyter notebook, import Dask, and then Dask asks the job scheduler (like SLURM, PBS, …) for resources dynamically. This is great because Dask can fit into small gaps in the schedule and release workers when they’re not needed, giving users a pleasant interactive experience while lessening the load on the cluster.&lt;/p&gt;
&lt;p&gt;However not all jobs are interactive. Often scientists want to submit a large job similar to how they submit MPI jobs. They submit a single job script with the necessary resources, walk away, and the resource manager runs that job when those resources become available (which may be many hours from now). While not as novel as the interactive workloads, these workloads are critical to common processes, and important to support.&lt;/p&gt;
&lt;p&gt;This point was raised by Kevin Paul at NCAR during discussion of this blogpost. Between when we started planning and when we released this blogpost, Kevin had already solved the problem by providing &lt;a class="reference external" href="https://dask-mpi.readthedocs.org"&gt;dask-mpi&lt;/a&gt;, a project that launches Dask with normal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpirun&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpiexec&lt;/span&gt;&lt;/code&gt; commands, making it easy to deploy Dask anywhere that MPI is available.&lt;/p&gt;
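&lt;p&gt;A batch deployment with dask-mpi then looks roughly like the sketch below: the script is submitted once through the normal MPI launcher, one rank becomes the scheduler, one runs the client code, and the remaining ranks become workers (see the dask-mpi documentation for current details).&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# batch_job.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()        # splits MPI ranks into scheduler, client, and workers
client = Client()   # connects to the scheduler started above

# ... ordinary Dask code here ...
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;launched inside an ordinary job script with something like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpirun&lt;/span&gt; &lt;span class="pre"&gt;-np&lt;/span&gt; &lt;span class="pre"&gt;36&lt;/span&gt; &lt;span class="pre"&gt;python&lt;/span&gt; &lt;span class="pre"&gt;batch_job.py&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;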
&lt;/section&gt;
&lt;section id="more-data-formats"&gt;
&lt;h2&gt;5. More Data Formats&lt;/h2&gt;
&lt;p&gt;Dask works well today with bread-and-butter scientific data formats like HDF5, Grib, and NetCDF, as well as common data science formats like CSV, JSON, Parquet, ORC, and so on.&lt;/p&gt;
&lt;p&gt;However, the space of data formats is vast and Dask users find themselves struggling a little, or even solving the data ingestion problem manually for a number of common formats in different domains:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Remote sensing datasets: GeoTIFF, Jpeg2000,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Astronomical data: FITS, VOTable,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;… and so on&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Supporting these isn’t hard (indeed many of us have built our own support for them in Dask), but it would be handy to have a high quality centralized solution.&lt;/p&gt;
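&lt;p&gt;The manual pattern mentioned above usually amounts to wrapping a per-file reader in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt; and stitching the pieces into one array. A sketch, where &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_tile&lt;/span&gt;&lt;/code&gt; is a stand-in for a real GeoTIFF or FITS reader:&lt;/p&gt;

```python
import numpy as np
import dask
import dask.array as da

def read_tile(i):
    # Stand-in for a real format reader returning one tile of known
    # shape and dtype (e.g. one GeoTIFF window or one FITS HDU).
    return np.full((100, 100), i, dtype="float64")

tiles = [
    da.from_delayed(dask.delayed(read_tile)(i), shape=(100, 100), dtype="float64")
    for i in range(4)
]
stack = da.stack(tiles)           # lazy (4, 100, 100) array, one task per tile
print(stack.shape, float(stack.mean().compute()))
```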
&lt;/section&gt;
&lt;section id="link-with-deep-learning"&gt;
&lt;h2&gt;6. Link with Deep Learning&lt;/h2&gt;
&lt;p&gt;Many of our institutions are excited to leverage recent advances in deep learning and integrate powerful tools like Keras, TensorFlow, and PyTorch and powerful hardware like GPUs into our workflows.&lt;/p&gt;
&lt;p&gt;However, we often find that our data and architecture look a bit different from what we find in standard deep learning tutorials. We like using Dask for data ingestion, cleanup, and pre-processing, but would like to establish better practices and smooth tooling to transition from scientific workflows on HPC using Dask to deep learning as efficiently as possible.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;For more information, see &lt;a class="reference external" href="https://github.com/pangeo-data/pangeo/issues/567"&gt;this github
issue&lt;/a&gt; for an example topic.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="more-calculation-guidelines"&gt;
&lt;h2&gt;7. More calculation guidelines&lt;/h2&gt;
&lt;p&gt;While there are tools to analyse and diagnose computations interactively, and
a decent set of examples for common Dask calculations, trial and error is still the norm before a big HPC computation settles into an optimized workflow.&lt;/p&gt;
&lt;p&gt;We should develop more guidelines and strategies for performing large-scale computation, and continue to foster the community around Dask, as projects such as Pangeo already do. Note that these guidelines may be infrastructure dependent.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/12/dask-on-hpc/"/>
    <summary>We analyze large datasets on HPC systems with Dask, a parallel computing
library that integrates well with the existing Python software ecosystem, and
works comfortably with native HPC hardware.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/09/ucx-dgx/</id>
    <title>Experiments in High Performance Networking with UCX and DGX</title>
    <updated>2019-06-09T00:00:00+00:00</updated>
    <author>
      <name>Rick Zamora</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This post is about experimental and rapidly changing software.
Code examples in this post should not be relied upon to work in the future.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 12)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;

&lt;p&gt;This post talks about connecting UCX, a high performance networking library, to
Dask, a parallel Python library, to accelerate communication-heavy workloads,
particularly when using GPUs.&lt;/p&gt;
&lt;p&gt;Additionally, we do this work on a DGX, a high-end multi-CPU multi-GPU machine
with a complex internal network. Working in this context forced
improvements in setting up Dask in heterogeneous situations, targeting
different network cards, CPU sockets, GPUs, and so on.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 23)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="motivation"&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;Many distributed computing workloads are communication-bound.
This is common in cases like the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Dataframe joins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning algorithms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complex array computations&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Communication becomes a bigger bottleneck as we accelerate our computation,
such as when we use GPUs for computing.&lt;/p&gt;
&lt;p&gt;Historically, high performance communication was only available using MPI, or
with custom solutions. This post describes an effort to get close to the
communication bandwidth of MPI while still maintaining the ease of
programmability and accessibility of a dynamic system like Dask.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 40)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="ucx-python-and-dask"&gt;
&lt;h1&gt;UCX, Python, and Dask&lt;/h1&gt;
&lt;p&gt;To get high performance networking in Dask, we wrapped UCX with Python and
then connected that to Dask.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://www.openucx.org/"&gt;OpenUCX&lt;/a&gt; project provides a uniform API around
various high performance networking libraries like InfiniBand, traditional
networking protocols like TCP/shared memory, and GPU-specific protocols like
NVLink. It is a layer beneath something like OpenMPI (the main user of OpenUCX
today) that figures out which networking system to use.&lt;/p&gt;
&lt;a href="http://www.openucx.org/wp-content/uploads/2015/07/ucx-architecture-1024x505.jpg"&gt;
&lt;img src="http://www.openucx.org/wp-content/uploads/2015/07/ucx-architecture-1024x505.jpg"
     width="100%" /&gt;&lt;/a&gt;
&lt;p&gt;Python users today don’t have much access to these network libraries, except
through MPI, which is sometimes not ideal. (&lt;a class="reference external" href="https://pypi.org/search/?q=infiniband"&gt;Try searching for “infiniband” on
PyPI.&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;This led us to create &lt;a class="reference external" href="https://github.com/rapidsai/ucx-py/"&gt;UCX-Py&lt;/a&gt;.
UCX-Py is a Python wrapper around the UCX C library, which provides a Pythonic
API, both with blocking syntax appropriate for traditional HPC programs, as
well as a non-blocking &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;async/await&lt;/span&gt;&lt;/code&gt; syntax for more concurrent programs (like
Dask).
For more information on UCX I recommend watching Akshay’s &lt;a class="reference external" href="https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9679-ucx-python%3a+a+flexible+communication+library+for+python+applications"&gt;UCX
talk&lt;/a&gt;
from the GPU Technology Conference 2019.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: UCX-Py was primarily developed by &lt;a class="reference external" href="https://github.com/Akshay-Venkatesh/"&gt;Akshay Venkatesh&lt;/a&gt; (UCX, NVIDIA),
&lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom Augspurger&lt;/a&gt; (Dask, Pandas, Anaconda),
and &lt;a class="reference external" href="https://github.com/quasiben/"&gt;Ben Zaitlen&lt;/a&gt; (NVIDIA, RAPIDS, Dask).&lt;/em&gt;&lt;/p&gt;
&lt;video width="560" height="315" controls&gt;
    &lt;source src="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9679/s9679-ucx-python-a-flexible-communication-library-for-python-applications.mp4"
            type="video/mp4"&gt;
&lt;/video&gt;
&lt;p&gt;We then &lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/distributed/comm/ucx.py"&gt;extended Dask communications to optionally use UCX&lt;/a&gt;.
If you have UCX and UCX-Py installed, then you can use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ucx://&lt;/span&gt;&lt;/code&gt; protocol in
addresses or the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;--protocol&lt;/span&gt; &lt;span class="pre"&gt;ucx&lt;/span&gt;&lt;/code&gt; flag when starting things up, something like
this.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ dask-scheduler --protocol ucx
Scheduler started at ucx://127.0.0.1:8786

$ dask-worker ucx://127.0.0.1:8786
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ucx://127.0.0.1:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 95)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="experiment"&gt;
&lt;h1&gt;Experiment&lt;/h1&gt;
&lt;p&gt;We modified our &lt;a class="reference external" href="https://github.com/mrocklin/dask-gpu-benchmarks/blob/master/cupy-svd.ipynb"&gt;SVD with Dask and CuPy
benchmark&lt;/a&gt;
to use the UCX protocol for inter-process communication and ran it on
half of a DGX machine, using four GPUs. Here is a minimal implementation of the
UCX-enabled code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DGX&lt;/span&gt;

&lt;span class="c1"&gt;# Define DGX cluster and client&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DGX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create random data&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Perform distributed SVD&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;By using UCX the overall communication times are reduced by an order of
magnitude. To produce the task-stream figures below, the benchmark was run on a
DGX-1 with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;CUDA_VISIBLE_DEVICES=[0,1,2,3]&lt;/span&gt;&lt;/code&gt;. It is clear that the red task
bars, corresponding to inter-process communication, are significantly
compressed. Communications that were taking 500ms-1s before now take around 20ms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_lcc_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;strong&gt;After UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_dgx_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 139)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="diving-into-the-details"&gt;
&lt;h1&gt;Diving into the Details&lt;/h1&gt;
&lt;p&gt;On a GPU using NVLink we can get somewhere between 5-10 GB/s throughput between
pairs of GPUs. On a CPU this drops down to 1-2 GB/s (which seems well below
optimal).
These speeds can affect all Dask workloads (array, dataframe, xarray, ML, …),
but when the proper hardware is present, other bottlenecks may occur,
such as serialization when dealing with text or JSON-like data.&lt;/p&gt;
&lt;p&gt;This, of course, depends on this fancy networking hardware being present.
In the GPU example above we’re mostly relying on NVLink, but we would also get
improved performance on an HPC InfiniBand network, or even on a single laptop
using shared memory transports.&lt;/p&gt;
&lt;p&gt;The example above was run on a DGX machine, which includes all of these
transports and more (as well as numerous GPUs).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 156)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dgx"&gt;
&lt;h1&gt;DGX&lt;/h1&gt;
&lt;p&gt;The test machine used above was a
&lt;a class="reference external" href="https://www.nvidia.com/en-us/data-center/dgx-1/"&gt;DGX-1&lt;/a&gt;, which has eight GPUs,
two CPU sockets, four Infiniband network cards, and a complex NVLink
arrangement. This is a good example of non-uniform hardware. Certain CPUs
are closer to certain GPUs and network cards, and understanding this proximity
has an order-of-magnitude effect on performance. This situation isn’t unique
to DGX machines. The same situation arises when we have …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Multiple workers in one node, with several nodes in a cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple nodes in one rack, with several racks in a data center&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple data centers, such as is the case with hybrid cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Working with the DGX was interesting because it forced us to start thinking
about heterogeneity, and making it easier to specify complex deployment scenarios
with Dask.&lt;/p&gt;
&lt;p&gt;Here is a diagram showing how the GPUs, CPUs, and Infiniband
cards are connected to each other in a DGX-1:&lt;/p&gt;
&lt;a href="https://docs.nvidia.com/dgx/bp-dgx/index.html#networking"&gt;
  &lt;img src="https://docs.nvidia.com/dgx/bp-dgx/graphics/networks.png"
         width="100%" /&gt;
&lt;/a&gt;
&lt;p&gt;And here the output of nvidia-smi showing the NVLink, networking, and CPU affinity
structure (this is mostly orthogonal to the structure displayed above).&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ nvidia-smi  topo -m
     GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7   ib0   ib1   ib2   ib3
GPU0   X    NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX   SYS   PHB   SYS
GPU1  NV1    X    NV2   NV1   SYS   NV2   SYS   SYS   PIX   SYS   PHB   SYS
GPU2  NV1   NV2    X    NV2   SYS   SYS   NV1   SYS   PHB   SYS   PIX   SYS
GPU3  NV2   NV1   NV2    X    SYS   SYS   SYS   NV1   PHB   SYS   PIX   SYS
GPU4  NV2   SYS   SYS   SYS    X    NV1   NV1   NV2   SYS   PIX   SYS   PHB
GPU5  SYS   NV2   SYS   SYS   NV1    X    NV2   NV1   SYS   PIX   SYS   PHB
GPU6  SYS   SYS   NV1   SYS   NV1   NV2    X    NV2   SYS   PHB   SYS   PIX
GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2    X    SYS   PHB   SYS   PIX
ib0   PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS    X    SYS   PHB   SYS
ib1   SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS    X    SYS   PHB
ib2   PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB   SYS    X    SYS
ib3   SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS   PHB   SYS    X

    CPU Affinity
GPU0  0-19,40-59
GPU1  0-19,40-59
GPU2  0-19,40-59
GPU3  0-19,40-59
GPU4  20-39,60-79
GPU5  20-39,60-79
GPU6  20-39,60-79
GPU7  20-39,60-79

Legend:

  X    = Self
  SYS  = Traverse PCIe as well as the SMP interconnect between NUMA nodes
  NODE = Traverse PCIe as well as the interconnect between PCIe Host Bridges
  PHB  = Traverse PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Traverse multiple PCIe switches (without PCIe Host Bridge)
  PIX  = Traverse a single PCIe switch
  NV#  = Traverse a bonded set of # NVLinks
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
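The regular pattern in the output above (GPUs 0-3 sit on the first CPU socket, GPUs 4-7 on the second, and GPUs are paired up on network cards) can be captured in a couple of small helpers. This is an illustrative sketch based on the table, not code from dask-cuda; the `gpu // 2` interface convention follows the DGX function later in this post:

```python
def cpu_affinity(gpu: int) -> list:
    """CPU cores closest to a DGX-1 GPU, per ``nvidia-smi topo -m``.

    GPUs 0-3 hang off the first CPU socket (cores 0-19 plus their
    hyperthread siblings 40-59); GPUs 4-7 hang off the second.
    """
    if gpu < 4:
        return list(range(0, 20)) + list(range(40, 60))
    return list(range(20, 40)) + list(range(60, 80))


def infiniband_interface(gpu: int, prefix: str = "ib") -> str:
    """Network interface paired with a GPU (two GPUs per card)."""
    return prefix + str(gpu // 2)
```

For example, `infiniband_interface(5)` gives `"ib2"`, and `cpu_affinity(7)` includes core 79.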
&lt;p&gt;The DGX was originally designed for deep learning
applications. The complex network infrastructure above is used effectively by
specialized NVIDIA networking libraries like
&lt;a class="reference external" href="https://developer.nvidia.com/nccl"&gt;NCCL&lt;/a&gt;, which know how to route
traffic correctly, but it is a challenge for other more general purpose
systems like Dask to adapt to.&lt;/p&gt;
&lt;p&gt;Fortunately, in meeting this challenge we were able to clean up a number of
related issues in Dask. In particular we can now:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Specify a more heterogeneous worker configuration when starting up a local cluster
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2675"&gt;dask/distributed #2675&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learn bandwidth over time
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2658"&gt;dask/distributed #2658&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add Worker plugins to help handle things like CPU affinity (though this is
quite general)
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2453"&gt;dask/distributed #2453&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
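Worker plugins like those in point 3 are duck-typed: any object with a `setup` method can be passed to a worker, which calls `setup` once at startup. Here is a minimal hypothetical plugin (names are invented for illustration, not part of dask-cuda) that sets environment variables; the `CPUAffinity` class in the code below follows the same pattern:

```python
import os


class EnvSetter:
    """A hypothetical worker plugin that sets environment variables.

    Worker plugins are duck-typed: the worker simply calls
    ``plugin.setup(worker)`` once when it starts up.
    """

    def __init__(self, env):
        self.env = dict(env)

    def setup(self, worker=None):
        os.environ.update(self.env)


plugin = EnvSetter({"MY_FLAG": "1"})
plugin.setup()  # on a real cluster the worker calls this for you
```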
&lt;p&gt;With these changes we’re now able to describe most of the DGX structure as
configuration in the Python function below:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SpecCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.worker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TOTAL_MEMORY&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda.local_cuda_cluster&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cuda_visible_devices&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;CPUAffinity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; A Worker plugin to pin CPU affinity &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sched_setaffinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;affinity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="c1"&gt;# See nvidia-smi topo -m&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;DGX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dashboard_address&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;:8787&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threads_per_worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; A Local Cluster for a DGX 1 machine&lt;/span&gt;

&lt;span class="sd"&gt;    NVIDIA&amp;#39;s DGX-1 machine has a complex architecture mapping CPUs,&lt;/span&gt;
&lt;span class="sd"&gt;    GPUs, and network hardware.  This function creates a local cluster&lt;/span&gt;
&lt;span class="sd"&gt;    that tries to respect this hardware as much as possible.&lt;/span&gt;

&lt;span class="sd"&gt;    It creates one Dask worker process per GPU, and assigns each worker&lt;/span&gt;
&lt;span class="sd"&gt;    process the correct CPU cores and Network interface cards to&lt;/span&gt;
&lt;span class="sd"&gt;    maximize performance.&lt;/span&gt;

&lt;span class="sd"&gt;    That being said, things aren&amp;#39;t perfect.  Today a DGX has very high&lt;/span&gt;
&lt;span class="sd"&gt;    performance between certain sets of GPUs and not others.  A Dask DGX&lt;/span&gt;
&lt;span class="sd"&gt;    cluster that uses only certain tightly coupled parts of the computer&lt;/span&gt;
&lt;span class="sd"&gt;    will have significantly higher bandwidth than a deployment on the&lt;/span&gt;
&lt;span class="sd"&gt;    entire thing.&lt;/span&gt;

&lt;span class="sd"&gt;    Parameters&lt;/span&gt;
&lt;span class="sd"&gt;    ----------&lt;/span&gt;
&lt;span class="sd"&gt;    interface: str&lt;/span&gt;
&lt;span class="sd"&gt;        The interface prefix for the infiniband networking cards.  This is&lt;/span&gt;
&lt;span class="sd"&gt;        often &amp;quot;ib&amp;quot;` or &amp;quot;bond&amp;quot;.  We will add the numeric suffix 0,1,2,3 as&lt;/span&gt;
&lt;span class="sd"&gt;        appropriate.  Defaults to &amp;quot;ib&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;    dashboard_address: str&lt;/span&gt;
&lt;span class="sd"&gt;        The address for the scheduler dashboard.  Defaults to &amp;quot;:8787&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;    CUDA_VISIBLE_DEVICES: str&lt;/span&gt;
&lt;span class="sd"&gt;        String like ``&amp;quot;0,1,2,3&amp;quot;`` or ``[0, 1, 2, 3]`` to restrict&lt;/span&gt;
&lt;span class="sd"&gt;        activity to different GPUs&lt;/span&gt;

&lt;span class="sd"&gt;    Examples&lt;/span&gt;
&lt;span class="sd"&gt;    --------&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; from dask_cuda import DGX&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; from dask.distributed import Client&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; cluster = DGX(interface=&amp;#39;ib&amp;#39;)&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; client = Client(cluster)&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;0,1,2,3,4,5,6,7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;memory_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOTAL_MEMORY&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;

    &lt;span class="n"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cuda_visible_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;ii&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="s2"&gt;&amp;quot;UCX_TLS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;rc,cuda_copy,cuda_ipc&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;interface&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;protocol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ucx&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;ncores&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;threads_per_worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;preload&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dask_cuda.initialize_context&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;dashboard_address&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;:0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;plugins&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CPUAffinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;affinity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;silence_logs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;memory_limit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ii&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;interface&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;protocol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ucx&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;dashboard_address&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dashboard_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SpecCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
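The `cuda_visible_devices` helper imported above gives each worker its own device ordering, rotating the list so that worker *i*'s GPU comes first. A rough pure-Python sketch of that behavior (consult the dask-cuda source for the real implementation):

```python
def cuda_visible_devices(i, visible):
    """Rotate ``visible`` so entry ``i`` comes first, as a comma-separated string.

    Sketch of the helper dask_cuda uses to build each worker's
    CUDA_VISIBLE_DEVICES value; not the actual dask_cuda code.
    """
    rotated = visible[i:] + visible[:i]
    return ",".join(map(str, rotated))
```

For example, `cuda_visible_devices(2, [0, 1, 2, 3])` returns `"2,3,0,1"`, so the worker responsible for GPU 2 sees that GPU as device 0.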
&lt;p&gt;However, we never got the NVLink structure down. The Dask scheduler currently
still assumes uniform bandwidths between workers. We’ve started to take small
steps towards changing this, but we’re not there yet (this will also be useful
for people who want to think about in-rack or cross-data-center
deployments).&lt;/p&gt;
&lt;p&gt;As usual, in solving a highly specific problem, we were able to implement a number
of lingering general features, which then made our specific problem easy to
write down.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 373)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;There has been significant effort over the last few months to make everything
above work. In particular we …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Modified UCX to support client-server workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrapped UCX with UCX-Py and designed a Python async-await friendly interface&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrapped UCX-Py with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hooked everything together to make generic workloads function well&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is quite nice, especially for communication-heavy workloads.
However, there is still plenty to do. This section details what we’re thinking
about now to continue this work.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Routing within complex networks&lt;/strong&gt;:
If you restrict yourself to four of the eight GPUs in a DGX, you can get 5-12 GB/s
between pairs of GPUs. For some workloads this can be significant. It
makes the system feel much more like a single unit than a bunch of isolated
machines.&lt;/p&gt;
&lt;p&gt;However we still can’t get great performance across the whole DGX because
there are many GPU-pairs that are not connected by NVLink, and so we get 10x
slower speeds. These dominate communication costs if you naively try to use
the full DGX.&lt;/p&gt;
&lt;p&gt;This might be solved either by:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Teaching Dask to avoid these communications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teaching UCX to route communications like these through a chain of
multiple NVLink connections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoiding complex networks altogether. Newer systems like the DGX-2 use
NVSwitch, which provides uniform connectivity, with each GPU connected
to every other GPU.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Edit&lt;/em&gt;: I’ve since learned that UCX should be able to handle this. We
should still get PCIe speeds (around 4-7 GB/s) even when we don’t have
NVLink once an upstream bug gets fixed. Hooray!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt; We can get 1-2 GB/s across InfiniBand, which isn’t bad, but also
isn’t the full 5-8 GB/s that we were hoping for. This deserves more serious
profiling to determine what is going wrong. The current guess is that this
has to do with memory allocations.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000000000&lt;/span&gt;  &lt;span class="c1"&gt;# 1 GB&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;248&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;223&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;472&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;470&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;&amp;lt;----- Around 2 GB/s.  Slower than I expected&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Probably we’re just doing something dumb here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Package UCX:&lt;/strong&gt; Currently I’m building the UCX and UCX-Py libraries from
source (see appendix below for instructions). Ideally these would become
conda packages. &lt;a class="reference external" href="https://github.com/jakirkham"&gt;John Kirkham&lt;/a&gt; (Conda Forge,
NVIDIA, Dask) is taking a look at this along with the UCX developers from
Mellanox.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/rapidsai/ucx-py/issues/65"&gt;ucx-py #65&lt;/a&gt; for
more information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learn Heterogeneous Bandwidths:&lt;/strong&gt; In order to make good scheduling
decisions Dask needs to estimate how long it will take to move data between
machines. This question is now becoming much more complex, and depends on
both the source and destination machines (the network topology), the data
type (NumPy array, GPU array, Pandas Dataframe with text), and more. In
complex situations our bandwidths can span a 100x range (100 MB/s to 10
GB/s).&lt;/p&gt;
&lt;p&gt;Dask will have to develop more complex models for bandwidth, and
learn these over time.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/dask/distributed/issues/2743"&gt;dask/distributed
#2743&lt;/a&gt; for more
information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support other GPU libraries:&lt;/strong&gt; To send GPU data around we need to teach
Dask how to serialize Python objects into GPU buffers. There is code in
the dask/distributed repository to do this for Numba, CuPy, and RAPIDS cuDF
objects, but we’ve really only tested CuPy seriously. We should expand
this by some of the following steps:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Try a distributed Dask cuDF join computation&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/dask/distributed/pull/2746"&gt;dask/distributed #2746&lt;/a&gt; for initial work here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teach Dask to serialize array GPU libraries, like PyTorch and
TensorFlow, or possibly anything that supports the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__cuda_array_interface__&lt;/span&gt;&lt;/code&gt; protocol.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track down communication failures:&lt;/strong&gt; We still occasionally get
unexplained communication failures. We should stress test this system to
discover rough corners.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TCP&lt;/strong&gt;: Groups with high performing TCP networks can’t yet make use of UCX+Dask (though they can use either one individually).&lt;/p&gt;
&lt;p&gt;Using UCX in a client-server mode, as we do with
Dask, currently requires access to RDMA libraries, which are often missing on systems
without specialized networking hardware like InfiniBand.&lt;/p&gt;
&lt;p&gt;This is in progress at &lt;a class="reference external" href="https://github.com/openucx/ucx/pull/3570"&gt;openucx/ucx #3570&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commodity Hardware&lt;/strong&gt;: Currently this code is only really useful on
high-performance Linux systems that have InfiniBand or NVLink. However,
it would be nice to also use this on more commodity systems, including
personal laptop computers using TCP and shared memory.&lt;/p&gt;
&lt;p&gt;Currently Dask uses TCP for inter-process communication on a single machine.
Using UCX on a personal computer would give us access to shared memory
speeds, which tend to be an order of magnitude faster.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/openucx/ucx/issues/3663"&gt;openucx/ucx #3663&lt;/a&gt; for more
information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tune Performance:&lt;/strong&gt; The 5-10 GB/s bandwidths that we see with NVLink
today are sub-optimal. With UCX-Py alone we’re able to get something like
15 GB/s on large message transfers. We should benchmark and tune our
implementation to see what is taking up the extra time. Until things work
more robustly though, this is a secondary priority.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
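&lt;p&gt;&lt;em&gt;To make the serialization item above concrete, here is a minimal, illustrative sketch of the dictionary that the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__cuda_array_interface__&lt;/span&gt;&lt;/code&gt; protocol exposes. The FakeDeviceArray class and its pointer value are made up for illustration; real implementations such as CuPy, Numba, and cuDF return an actual device pointer in the data field.&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative only: a stand-in object exposing the
# __cuda_array_interface__ dictionary that CuPy, Numba, and cuDF
# implement for real device memory.  The pointer here is fake.
class FakeDeviceArray:
    def __init__(self, shape, ptr=0):
        self._shape = shape
        self._ptr = ptr  # a real implementation stores a device pointer

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self._shape,        # tuple of ints
            "typestr": "|u1",            # unsigned bytes; no byte-order prefix
            "data": (self._ptr, False),  # (device pointer, read-only flag)
            "version": 2,
        }


arr = FakeDeviceArray((4, 4))
iface = arr.__cuda_array_interface__
print(sorted(iface.keys()))
# prints ['data', 'shape', 'typestr', 'version']
```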
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 493)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="appendix-setup"&gt;
&lt;h1&gt;Appendix: Setup&lt;/h1&gt;
&lt;p&gt;Performing these experiments currently depends on development branches in a few
repositories. This section includes my current setup.&lt;/p&gt;
&lt;section id="create-conda-environment"&gt;
&lt;h2&gt;Create Conda Environment&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.7&lt;/span&gt; &lt;span class="n"&gt;libtool&lt;/span&gt; &lt;span class="n"&gt;cmake&lt;/span&gt; &lt;span class="n"&gt;automake&lt;/span&gt; &lt;span class="n"&gt;autoconf&lt;/span&gt; &lt;span class="n"&gt;cython&lt;/span&gt; &lt;span class="n"&gt;bokeh&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="n"&gt;ipython&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note: for some reason using conda-forge makes the autogen step below fail.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="set-up-ucx"&gt;
&lt;h2&gt;Set up UCX&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Clone UCX repository and get branch
git clone https://github.com/openucx/ucx
cd ucx
git remote add Akshay-Venkatesh git@github.com:Akshay-Venkatesh/ucx.git
git remote update Akshay-Venkatesh
git checkout ucx-cuda

# Build
git clean -xfd
export CUDA_HOME=/usr/local/cuda-9.2/
./autogen.sh
mkdir build
cd build
../configure --prefix=$CONDA_PREFIX --enable-debug --with-cuda=$CUDA_HOME --enable-mt --disable-cma CPPFLAGS=&amp;quot;-I/usr/local/cuda-9.2/include&amp;quot;
make -j install

# Verify
ucx_info -d
which ucx_info  # verify that this is in the conda environment

# Verify that we see NVLink speeds
ucx_perftest -t tag_bw -m cuda -s 1048576 -n 1000 &amp;amp; ucx_perftest dgx15 -t tag_bw -m cuda -s 1048576 -n 1000
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="set-up-ucx-py"&gt;
&lt;h2&gt;Set up UCX-Py&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git clone git@github.com:rapidsai/ucx-py
cd ucx-py

export UCX_PATH=$CONDA_PREFIX
make install
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="set-up-dask"&gt;
&lt;h2&gt;Set up Dask&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;clone&lt;/span&gt; &lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="nd"&gt;@github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;clone&lt;/span&gt; &lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="nd"&gt;@github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;distributed&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="optionally-set-up-cupy"&gt;
&lt;h2&gt;Optionally set up cupy&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cuda92&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="optionally-set-up-cudf"&gt;
&lt;h2&gt;Optionally set up cudf&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;rapidsai&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nightly&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt; &lt;span class="n"&gt;cudatoolkit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;9.2&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="optionally-set-up-jupyterlab"&gt;
&lt;h2&gt;Optionally set up JupyterLab&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;ipykernel&lt;/span&gt; &lt;span class="n"&gt;jupyterlab&lt;/span&gt; &lt;span class="n"&gt;nb_conda_kernels&lt;/span&gt; &lt;span class="n"&gt;nodejs&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For the Dask dashboard:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask_labextension&lt;/span&gt;
&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;labextension&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;labextension&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="my-benchmark"&gt;
&lt;h2&gt;My Benchmark&lt;/h2&gt;
&lt;p&gt;I’ve been using the following benchmark to test communication. It allocates a
chunked Dask array, and then adds it to its transpose, which forces a lot of
communication, but not much computation.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pprint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.utils&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;format_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_bytes&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="c1"&gt;# Set up workers on the local machine&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;DGX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asynchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asynchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

            &lt;span class="c1"&gt;# Create a simple random array&lt;/span&gt;
            &lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;128 MiB&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;chunks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Add X to its transpose, forcing computation&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Collect, aggregate, and print peer-to-peer bandwidths&lt;/span&gt;
            &lt;span class="n"&gt;incoming_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;dask_worker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dask_worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incoming_transfer_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;bandwidths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;incoming_logs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;total&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;bandwidths&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;who&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bandwidth&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;bandwidths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;format_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/s&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bandwidths&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bandwidths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note: most of this example is just getting back diagnostics, which can be
easily ignored. Also, you can drop the async/await code if you like. I think
that there should probably be more examples in the world using Dask with
async/await syntax, so I decided to leave it in.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/09/ucx-dgx/"/>
    <summary>This post is about experimental and rapidly changing software.
Code examples in this post should not be relied upon to work in the future.</summary>
    <category term="python" label="python"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/04/09/numba-stencil/</id>
    <title>Composing Dask Array with Numba Stencils</title>
    <updated>2019-04-09T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;In this post we explore four array computing technologies, and how they
work together to achieve powerful results.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Numba’s stencil decorator to craft localized compute kernels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numba’s Just-In-Time (JIT) compiler for array computing in Python&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask Array for parallelizing array computations across many chunks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NumPy’s Generalized Universal Functions (gufuncs) to tie everything
together&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the end we’ll show how a novice developer can write a small amount of Python
to efficiently run localized computations on large amounts of data. In
particular we’ll write a simple function to smooth images and apply that in
parallel across a large stack of images.&lt;/p&gt;
&lt;p&gt;Here is the full code; we’ll dive into it piece by piece below.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;


&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:])],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;(n, m) -&amp;gt; (n, m)&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# If you want fake data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;int8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# If you have actual data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/path/to/*.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# dask.array&amp;lt;transpose, shape=(1000000, 1000, 1000), dtype=int8, chunksize=(125, 1000, 1000)&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note: the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;smooth&lt;/span&gt;&lt;/code&gt; function above is more commonly referred to as the 2D mean filter in the image processing community.&lt;/p&gt;
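&lt;p&gt;&lt;em&gt;As a toy illustration (pure Python, no Numba, not meant for real workloads), the mean filter computes the following; the zeroed border mirrors numba.stencil’s default of filling boundary outputs with zero:&lt;/em&gt;&lt;/p&gt;

```python
# Pure-Python illustration of the 3x3 mean filter that _smooth computes.
# Border cells stay zero, mirroring numba.stencil's default boundary
# handling.  For real workloads use the compiled version above.
def mean_filter_3x3(x):
    n, m = len(x), len(x[0])
    out = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    total += x[i + di][j + dj]
            out[i][j] = total // 9  # integer division, matching the int8 kernel
    return out


image = [[9] * 4 for _ in range(4)]
print(mean_filter_3x3(image))
# prints [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
```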
&lt;p&gt;Now, let’s break this down a bit.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 59)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="numba-stencils"&gt;

&lt;p&gt;&lt;strong&gt;Docs:&lt;/strong&gt;: https://numba.pydata.org/numba-doc/dev/user/stencil.html&lt;/p&gt;
&lt;p&gt;Many array computing functions operate only on a local region of the array.
This is common in image processing, signals processing, simulation, the
solution of differential equations, anomaly detection, time series analysis,
and more. Typically we write code that looks like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Or something similar. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;numba.stencil&lt;/span&gt;&lt;/code&gt; decorator makes this easier to
write: you just describe what happens at each element, and Numba handles the
rest.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 92)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="numba-jit"&gt;
&lt;h1&gt;Numba JIT&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Docs:&lt;/strong&gt; http://numba.pydata.org/&lt;/p&gt;
&lt;p&gt;When we run this function on a NumPy array, we find that it is slow, operating
at Python speeds.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;527&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;44.1&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But if we JIT compile this function with Numba, then it runs more quickly.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;njit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;70.8&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;6.38&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For those counting, that’s roughly 7,000x faster!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this function already exists as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scipy.ndimage.uniform_filter&lt;/span&gt;&lt;/code&gt;, which
operates at the same speed.&lt;/em&gt;&lt;/p&gt;
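As a point of reference, the same 3x3 mean can also be written with plain NumPy slicing, no compilation required. This is a sketch, not code from the post; like the Numba stencil, it leaves a zero border of one pixel:

```python
import numpy as np

def smooth_numpy(x):
    # Vectorized 3x3 mean filter: add up the nine shifted views of the
    # interior, then integer-divide by 9, leaving a zero border.
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = (x[:-2, :-2] + x[:-2, 1:-1] + x[:-2, 2:] +
                       x[1:-1, :-2] + x[1:-1, 1:-1] + x[1:-1, 2:] +
                       x[2:, :-2] + x[2:, 1:-1] + x[2:, 2:]) // 9
    return out

x = np.ones((100, 100), dtype='int8')
y = smooth_numpy(x)
```

This runs at NumPy speed rather than Python speed, though the nine temporary arrays cost more memory than the compiled stencil does.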
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 121)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-array"&gt;
&lt;h1&gt;Dask Array&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Docs:&lt;/strong&gt; https://docs.dask.org/en/latest/array.html&lt;/p&gt;
&lt;p&gt;In these applications people often have many such arrays, and they want to apply
this function to all of them. In principle they could do this with a for
loop.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;glob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;skimage.io&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/path/to/*.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imsave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.out.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If they then wanted to run this in parallel, they might use the
multiprocessing or concurrent.futures modules. If they wanted to run it
across a cluster, they could rewrite their code with PySpark or some other
system.&lt;/p&gt;
&lt;p&gt;Or, they could use Dask array, which will handle both the pipelining and the
parallelism (single machine or on a cluster) all while still looking mostly
like a NumPy array.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/path/to/*.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# a large lazy array of all of our images&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;int8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And then, because each chunk of a Dask array is just a NumPy array, we
can use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; function to apply this function across all of our
images, and then save them out.&lt;/p&gt;
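Here is a minimal runnable sketch of that pattern, using a plain-NumPy stand-in for the compiled `smooth`. Note that `map_blocks` applies the function to each chunk independently, so chunk edges never see their neighbors; `da.map_overlap` exists for stencils that need a halo of neighboring data:

```python
import numpy as np
import dask.array as da

def smooth(x):
    # Plain-NumPy 3x3 mean with a zero border, standing in for the
    # Numba-compiled version from the post.
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = sum(x[1 + i:x.shape[0] - 1 + i, 1 + j:x.shape[1] - 1 + j]
                          for i in (-1, 0, 1) for j in (-1, 0, 1)) // 9
    return out

x = da.ones((2000, 2000), chunks=(1000, 1000), dtype='int8')
y = x.map_blocks(smooth, dtype='int8')  # lazy: nothing computed yet
result = y.compute()                    # runs smooth on each chunk in parallel
```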
&lt;p&gt;This is fine, but let’s go a bit further and discuss generalized universal
functions from NumPy.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 161)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="generalized-universal-functions"&gt;
&lt;h1&gt;Generalized Universal Functions&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Numba Docs:&lt;/strong&gt; https://numba.pydata.org/numba-doc/dev/user/vectorize.html&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NumPy Docs:&lt;/strong&gt; https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api.generalized-ufuncs.html&lt;/p&gt;
&lt;p&gt;A generalized universal function (gufunc) is a normal function that has been
annotated with typing and dimension information. For example we can redefine
our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;smooth&lt;/span&gt;&lt;/code&gt; function as a gufunc as follows:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:])],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;(n, m) -&amp;gt; (n, m)&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This function knows that it consumes a 2d array of int8’s and produces a 2d
array of int8’s of the same dimensions.&lt;/p&gt;
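NumPy itself offers the same dimension-contract idea through `np.vectorize` with a `signature` argument. It runs at Python speed rather than compiled speed, but it shows what the `(n, m) -> (n, m)` annotation buys you. This sketch uses a plain-NumPy `_smooth` stand-in rather than the Numba one:

```python
import numpy as np

def _smooth(x):
    # Plain-NumPy 3x3 mean with a zero border, standing in for the stencil.
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = sum(x[1 + i:x.shape[0] - 1 + i, 1 + j:x.shape[1] - 1 + j]
                          for i in (-1, 0, 1) for j in (-1, 0, 1)) // 9
    return out

# Declares: consume an (n, m) array, produce an (n, m) array.  Any extra
# leading dimensions are looped over automatically.
smooth = np.vectorize(_smooth, signature='(n,m)->(n,m)')

stack = np.ones((4, 10, 10), dtype='int8')  # four 10x10 "images"
result = smooth(stack)                      # applied image by image
```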
&lt;p&gt;This sort of annotation is a small change, but it gives other systems like Dask
enough information to use it intelligently. Rather than call functions like
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt;, we can just use the function directly, as if our Dask Array was
just a very large NumPy array.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Before gufuncs&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;int8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After gufuncs&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is nice. If you write library code with gufunc semantics then that code
just works with systems like Dask, without you having to build in explicit
support for parallel computing. This makes the lives of users much easier.&lt;/p&gt;
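A quick way to see this kind of dispatch in action without Numba: plain NumPy ufuncs already route through the same protocol machinery, so applying one to a Dask array returns another lazy Dask array instead of computing eagerly (a sketch):

```python
import numpy as np
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))
y = np.negative(x)          # a NumPy ufunc, dispatched to Dask -- still lazy

result = y.compute()        # only now does any arithmetic happen
```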
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 200)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="finished-result"&gt;
&lt;h1&gt;Finished result&lt;/h1&gt;
&lt;p&gt;Let’s see the full example one more time.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;


&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:])],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;(n, m) -&amp;gt; (n, m)&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;int8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This code is decently approachable by novice users. They may not understand
the internal details of gufuncs or Dask arrays or JIT compilation, but they can
probably copy-paste-and-modify the example above to suit their needs.&lt;/p&gt;
&lt;p&gt;The parts they do want to change, like the stencil computation or how they
construct an array from their own data, are easy to change.&lt;/p&gt;
&lt;p&gt;This workflow is efficient and scalable, using low-level compiled code and
potentially clusters of thousands of computers.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 236)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-could-be-better"&gt;
&lt;h1&gt;What could be better&lt;/h1&gt;
&lt;p&gt;There are a few things that could make this workflow nicer.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;It would be nice not to have to specify dtypes in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;guvectorize&lt;/span&gt;&lt;/code&gt;, but
instead specialize to types as they arrive.
&lt;a class="reference external" href="https://github.com/numba/numba/issues/2979"&gt;numba/numba #2979&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support GPU accelerators for the stencil computations using
&lt;a class="reference external" href="https://numba.pydata.org/numba-doc/dev/cuda/index.html"&gt;numba.cuda.jit&lt;/a&gt;.
Stencil computations are obvious candidates for GPU acceleration, and this
is a good accessible point where novice users can specify what they want in
a way that is sufficiently constrained for automated systems to rewrite it
as CUDA somewhat easily.
&lt;a class="reference external" href="https://github.com/numba/numba/issues/3915"&gt;numba/numba 3915&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It would have been nicer to be able to apply the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;#64;guvectorize&lt;/span&gt;&lt;/code&gt; decorator
directly on top of the stencil function like this.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Rather than have two functions.
&lt;a class="reference external" href="https://github.com/numba/numba/issues/3914"&gt;numba/numba #3914&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You may have noticed that our guvectorize function had to assign its result into an
out parameter.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:])],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;(n, m) -&amp;gt; (n, m)&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;It would have been nicer, perhaps, to just return the output&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/numba/numba/issues/3916"&gt;numba/numba #3916&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The dask-image library could use an &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;imsave&lt;/span&gt;&lt;/code&gt; function&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-image/issues/110"&gt;dask/dask-image #110&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 290)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="aspirational-result"&gt;
&lt;h1&gt;Aspirational Result&lt;/h1&gt;
&lt;p&gt;With all of these, we might then be able to write the code above as follows&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# This is aspirational&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guvectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:])],&lt;/span&gt;
    &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;(n, m) -&amp;gt; (n, m)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gpu&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/path/to/*.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imsave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/path/to/out/*.png&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/04/09/numba-stencil.md&lt;/span&gt;, line 316)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="update-now-with-gpus"&gt;
&lt;h1&gt;Update: Now with GPUs!&lt;/h1&gt;
&lt;p&gt;After writing this blogpost I did a small update where I used
&lt;a class="reference external" href="https://numba.pydata.org/numba-doc/dev/cuda/index.html"&gt;numba.cuda.jit&lt;/a&gt;
to implement the same smooth function on a GPU to achieve a 200x speedup with
only a modest increase in code complexity.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c"&gt;That notebook is here&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/04/09/numba-stencil/"/>
    <summary>In this post we explore four array computing technologies, and how they
work together to achieve powerful results.</summary>
    <category term="dask" label="dask"/>
    <category term="numba" label="numba"/>
    <published>2019-04-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/03/27/dask-cuml/</id>
    <title>cuML and Dask hyperparameter optimization</title>
    <updated>2019-03-27T00:00:00+00:00</updated>
    <author>
      <name>Benjamin Zaitlen</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/27/dask-cuml.md&lt;/span&gt;, line 10)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="setup"&gt;

&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;DGX-1 Workstation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Host Memory: 512 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GPU Tesla V100 x 8&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cudf 0.6&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuml 0.6&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dask 1.1.4&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/quasiben/a96ce952b7eb54356f7f8390319473e4"&gt;Jupyter notebook&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;TLDR; Hyper-parameter Optimization is functional but slow with cuML&lt;/strong&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/27/dask-cuml.md&lt;/span&gt;, line 22)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="cuml-and-dask-hyper-parameter-optimization"&gt;
&lt;h1&gt;cuML and Dask Hyper-parameter Optimization&lt;/h1&gt;
&lt;p&gt;cuML is an open source GPU accelerated machine learning library primarily
developed at NVIDIA which mirrors the &lt;a class="reference external" href="https://scikit-learn.org/"&gt;Scikit-Learn API&lt;/a&gt;.
The current suite of algorithms includes GLMs, Kalman Filtering, clustering,
and dimensionality reduction. Many of these machine learning algorithms use
hyper-parameters. These are parameters used during the model training process
but are not “learned” during training. Often these parameters are
coefficients or penalty thresholds, and finding the “best” hyper-parameter can be
computationally costly. In the PyData community, we often reach for Scikit-Learn’s
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"&gt;GridSearchCV&lt;/a&gt;
or
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV"&gt;RandomizedSearchCV&lt;/a&gt;
for easy definition of the search space for hyper-parameters – this is called hyper-parameter
optimization. Within the Dask community, &lt;a class="reference external" href="https://dask-ml.readthedocs.io/en/latest/"&gt;Dask-ML&lt;/a&gt; has incrementally improved the efficiency of hyper-parameter optimization by leveraging both Scikit-Learn and Dask to use multi-core and
distributed schedulers: &lt;a class="reference external" href="https://dask-ml.readthedocs.io/en/latest/hyper-parameter-search.html"&gt;Grid and RandomizedSearch with DaskML&lt;/a&gt;.&lt;/p&gt;
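&lt;p&gt;Stripped of the machinery, grid search itself is a small loop: fit one model per candidate parameter value and keep the best scorer. A dependency-free sketch over a single hypothetical alpha parameter (real GridSearchCV also cross-validates, and Dask-ML runs the fits in parallel):&lt;/p&gt;

```python
def grid_search(make_model, alphas, X, y):
    """Fit one model per candidate alpha; return the best (alpha, score).

    A toy sketch: real implementations score on held-out folds rather
    than the training data, and parallelize the independent fits.
    """
    best_score, best_alpha = float("-inf"), None
    for alpha in alphas:
        model = make_model(alpha=alpha).fit(X, y)
        score = model.score(X, y)
        if score > best_score:
            best_score, best_alpha = score, alpha
    return best_alpha, best_score
```

&lt;p&gt;Because every fit is independent, this loop parallelizes trivially, which is exactly what Dask-ML exploits.&lt;/p&gt;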
&lt;p&gt;With the newly created drop-in replacement for Scikit-Learn, cuML, we experimented with Dask’s GridSearchCV. In the upcoming 0.6 release of cuML, the estimators are serializable and are functional within the Scikit-Learn/dask-ml framework, but slow compared with Scikit-Learn estimators. And while speeds are slow now, we know how to boost performance, have filed several issues, and hope to show performance gains in future releases.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;All code and timing measurements can be found in this &lt;a class="reference external" href="https://gist.github.com/quasiben/a96ce952b7eb54356f7f8390319473e4"&gt;Jupyter notebook&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/27/dask-cuml.md&lt;/span&gt;, line 43)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="fast-fitting"&gt;
&lt;h1&gt;Fast Fitting!&lt;/h1&gt;
&lt;p&gt;cuML is fast! But finding that speed requires developing a bit of GPU knowledge and some
intuition. For example, there is a non-zero cost to moving data from host to GPU, and when data is “small” there is little to no performance gain. “Small” currently means roughly less than 100 MB.&lt;/p&gt;
&lt;p&gt;In the following example we use the &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html"&gt;diabetes data&lt;/a&gt;
set provided by sklearn and linearly fit the data with &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html"&gt;RidgeRegression&lt;/a&gt;&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[ \min\limits_w ||y - Xw||^2_2 + \alpha ||w||^2_2\]&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html"&gt;&lt;strong&gt;alpha&lt;/strong&gt;&lt;/a&gt; is the hyper-parameter and we initially set to 1.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cuml&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cumlRidge&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dcv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.externals.joblib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parallel_backend&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diabetes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;diabetes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fit_intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;normalize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cholesky&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cumlRidge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;eig&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_ridge&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The above ran with a single timing measurement of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn Ridge: 28 ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuML Ridge: 1.12 s&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the data is quite small, ~28 KB. Increasing the size to ~2.8 GB and re-running, we see significant gains:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;dup_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cholesky&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dup_cu_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cumlRidge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;eig&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# move data from host to device&lt;/span&gt;
&lt;span class="n"&gt;record_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fea&lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dup_data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dup_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;gdf_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gdf_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dup_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;#sklearn&lt;/span&gt;
&lt;span class="n"&gt;dup_ridge&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dup_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dup_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# cuml&lt;/span&gt;
&lt;span class="n"&gt;dup_cu_ridge&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gdf_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gdf_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With new timing measurements of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn Ridge: 4.82 s ± 694 ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuML Ridge: 450 ms ± 47.6 ms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With more data we clearly see faster fitting times, but the time to move data to the GPU (through cuDF)
was 19.7 s. This cost of data movement is one of the reasons RAPIDS/cuDF was developed: keep data
on the GPU and avoid moving it back and forth.&lt;/p&gt;
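&lt;p&gt;For reference, the ridge objective shown earlier has a closed-form solution via the normal equations, w = (XᵀX + αI)⁻¹ Xᵀy. A plain NumPy sketch, independent of the Cholesky and eigendecomposition solvers that the Scikit-Learn and cuML estimators above were configured to use (intercept handling omitted for brevity):&lt;/p&gt;

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    # Minimize ||y - Xw||^2 + alpha * ||w||^2 by solving the
    # normal equations: (X^T X + alpha * I) w = X^T y
    # (intercept handling omitted for brevity)
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)
```

&lt;p&gt;Larger alpha shrinks the coefficients toward zero, which is why searching over alpha, as in the experiments below, matters.&lt;/p&gt;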
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/27/dask-cuml.md&lt;/span&gt;, line 108)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hyper-parameter-optimization-experiments"&gt;
&lt;h1&gt;Hyper-Parameter Optimization Experiments&lt;/h1&gt;
&lt;p&gt;So moving data to the GPU can be costly, but once it is there, larger data sizes yield significant performance
gains. Naively, we thought, “well, we have GPU machine learning, and we have distributed hyper-parameter optimization…
we &lt;em&gt;should&lt;/em&gt; have distributed, GPU-accelerated hyper-parameter optimization!”&lt;/p&gt;
&lt;p&gt;Scikit-Learn assumes a specific but well-defined API for estimators over which it will perform hyper-parameter
optimization. Most estimators/classifiers in Scikit-Learn look like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;DummyEstimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=...&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;get_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;set_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
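&lt;p&gt;Any object that provides those methods will work; the interface is duck-typed. A minimal pure-Python toy (a mean predictor with one made-up shift hyper-parameter, not a real cuML or Scikit-Learn class) shows how small the contract is:&lt;/p&gt;

```python
class MeanEstimator:
    """Toy estimator exposing the Scikit-Learn-style interface."""

    def __init__(self, shift=0.0):
        self.shift = shift  # a made-up hyper-parameter

    def fit(self, X, y=None):
        self.mean_ = sum(y) / len(y) + self.shift
        return self  # fit returns self by convention

    def predict(self, X):
        return [self.mean_ for _ in X]

    def score(self, X, y=None):
        # Negative mean squared error, so higher is better.
        errors = [(p - t) ** 2 for p, t in zip(self.predict(X), y)]
        return -sum(errors) / len(errors)

    def get_params(self, deep=True):
        return {"shift": self.shift}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self
```

&lt;p&gt;Search tools clone and reconfigure estimators through get_params and set_params, which is why plugging cuML into the search machinery mostly came down to filling gaps in exactly those methods.&lt;/p&gt;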
&lt;p&gt;When we started experimenting with hyper-parameter optimization, we found a few holes in the API. These were
resolved, mostly by matching argument structures and adding various getters/setters.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;get_params and set_params (&lt;a class="reference external" href="https://github.com/rapidsai/cuml/pull/271"&gt;#271&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;fix/clf-solver (&lt;a class="reference external" href="https://github.com/rapidsai/cuml/pull/318"&gt;#318&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;map fit_transform to sklearn implementation (&lt;a class="reference external" href="https://github.com/rapidsai/cuml/pull/330"&gt;#330&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fea get params small changes (&lt;a class="reference external" href="https://github.com/rapidsai/cuml/pull/322"&gt;#322&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the holes plugged, we tested again. Using the same diabetes data set, we now perform hyper-parameter optimization
and searching over many alpha parameters for the best &lt;em&gt;scoring&lt;/em&gt; alpha.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cholesky&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cumlRidge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;eig&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cu_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cu_clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Again, reminding ourselves that the data is small (~28 KB), we don’t expect cuML to perform faster than Scikit-Learn; instead, we
want to demonstrate functionality. We also tried swapping in Dask-ML’s implementation of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt;
(which adheres to the same API as Scikit-Learn) to use all of the GPUs we have available in parallel.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cholesky&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cumlRidge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;eig&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dcv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cu_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dcv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cu_clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Timing Measurements:&lt;/p&gt;
&lt;div class="pst-scrollable-table-container"&gt;&lt;table class="table"&gt;
&lt;thead&gt;
&lt;tr class="row-odd"&gt;&lt;th class="head"&gt;&lt;p&gt;GridSearchCV&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;sklearn-Ridge&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;cuml-ridge&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class="row-even"&gt;&lt;td&gt;&lt;p&gt;&lt;strong&gt;Scikit-Learn&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;88.4 ms ± 6.11 ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;6.51 s ± 132 ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-odd"&gt;&lt;td&gt;&lt;p&gt;&lt;strong&gt;Dask-ML&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;873 ms ± 347 ms&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;740 ms ± 142 ms&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Unsurprisingly, we see that GridSearchCV and Ridge regression from Scikit-Learn are the fastest in this context.
There is a cost to distributing work and data and, as we previously mentioned, to moving data from host to device.&lt;/p&gt;
&lt;section id="how-does-performance-scale-as-we-scale-data"&gt;
&lt;h2&gt;How does performance scale as we scale data?&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;two_dup_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;two_dup_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;three_dup_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;three_dup_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="n"&gt;cu_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dcv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cu_clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;two_dup_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;two_dup_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cu_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dcv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cu_clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cu_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;three_dup_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;three_dup_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dcv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;three_dup_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;three_dup_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Timing Measurements:&lt;/p&gt;
&lt;div class="pst-scrollable-table-container"&gt;&lt;table class="table"&gt;
&lt;thead&gt;
&lt;tr class="row-odd"&gt;&lt;th class="head"&gt;&lt;p&gt;Data (MB)&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;cuML+Dask-ML&lt;/p&gt;&lt;/th&gt;
&lt;th class="head"&gt;&lt;p&gt;sklearn+Dask-ML&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class="row-even"&gt;&lt;td&gt;&lt;p&gt;2.8 MB&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;13.8s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-odd"&gt;&lt;td&gt;&lt;p&gt;28 MB&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;1min 17s&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;4.87 s&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;cuML + dask-ml (Distributed GridSearchCV) does significantly &lt;em&gt;worse&lt;/em&gt; as data sizes increase! Why? Primarily, two reasons:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Non-optimized movement of data between host and device, compounded by N devices and the size of
the parameter space&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scoring methods are not yet implemented in cuML&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Below is the Dask graph for the grid search:&lt;/p&gt;
&lt;p&gt;
  &lt;a href="/images/cuml_grid.svg"&gt;
    &lt;img src="/images/cuml_grid.svg" width="90%"&gt;
  &lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;There are 50 (cv=5 times 10 parameters for alpha) instances of chunking up our test data set and scoring performance. That means we move data back and forth between host and device 50 times for fitting and 50 times for scoring. That’s not great, but it’s also very solvable: build scoring functions for GPUs!&lt;/p&gt;
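&lt;p&gt;Because r² is just array arithmetic, a GPU-resident scorer is conceptually straightforward. A minimal sketch (shown here with NumPy; with CuPy installed, swapping the import keeps the computation on the device):&lt;/p&gt;

```python
import numpy as np  # with CuPy installed, `import cupy as np` runs the same code on the GPU

def r2_score(y_true, y_pred):
    """Coefficient of determination written as pure array arithmetic,
    so the data never needs to leave the device it already lives on."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - ss_res / ss_tot

score = r2_score([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

&lt;p&gt;A scorer like this would avoid the host round-trips described above, because both fitting and scoring would stay on the device.&lt;/p&gt;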
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/27/dask-cuml.md&lt;/span&gt;, line 230)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="immediate-future-work"&gt;
&lt;h1&gt;Immediate Future Work&lt;/h1&gt;
&lt;p&gt;We know the problems; GitHub issues have been filed, and we are working on them. Come help!&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Built In Scorers (&lt;a class="reference external" href="https://github.com/rapidsai/cuml/issues/242"&gt;#242&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeviceNDArray as input data (&lt;a class="reference external" href="https://github.com/rapidsai/cuml/issues/369"&gt;#369&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Communication with UCX (&lt;a class="reference external" href="https://github.com/dask/distributed/issues/2344"&gt;#2344&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/03/27/dask-cuml/"/>
    <summary>Pairing cuML GPU estimators with Scikit-Learn and Dask-ML GridSearchCV for hyper-parameter optimization</summary>
    <category term="GPU" label="GPU"/>
    <category term="RAPIDS" label="RAPIDS"/>
    <category term="dask" label="dask"/>
    <published>2019-03-27T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/03/18/dask-nep18/</id>
    <title>Dask and the __array_function__ protocol</title>
    <updated>2019-03-18T00:00:00+00:00</updated>
    <author>
      <name>Peter Andreas Entschev</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/18/dask-nep18.md&lt;/span&gt;, line 10)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;Dask is a versatile tool for parallel analytics, but one issue still limits
its reach: allowing it to transparently work with
&lt;a class="reference external" href="https://www.numpy.org/"&gt;NumPy&lt;/a&gt;-like libraries. We have previously discussed
how to work with
&lt;a class="reference external" href="http://blog.dask.org/2019/01/03/dask-array-gpus-first-steps"&gt;GPU Dask Arrays&lt;/a&gt;,
but that was limited to the array’s member methods that share a NumPy-like
interface, for example the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.sum()&lt;/span&gt;&lt;/code&gt; method; calling general functionality
from the NumPy library was still not possible. NumPy recently addressed this issue
in &lt;a class="reference external" href="https://www.numpy.org/neps/nep-0018-array-function-protocol.html"&gt;NEP-18&lt;/a&gt;
with the introduction of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol. In short, the
protocol lets a NumPy function call dispatch to the appropriate NumPy-like
library implementation, depending on the array type given as input. Dask can
thus remain agnostic of such libraries, internally calling just the NumPy
function, which automatically dispatches to the appropriate library
implementation, for example
&lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt; or &lt;a class="reference external" href="https://sparse.pydata.org/"&gt;Sparse&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To understand the end goal of this change, consider the following
example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now suppose we want to speed up the SVD computation of a Dask array by offloading
that work to a CUDA-capable GPU. Ultimately, we want to simply replace the NumPy
array &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; with a CuPy array and let NumPy do its magic via the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol, dispatching the appropriate CuPy linear algebra
operations under the hood:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We could do the same for a Sparse array, or any other NumPy-like array that
supports the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol and the computation that we are
trying to perform. In the next section, we will take a look at the potential
performance benefits that the protocol helps us leverage.&lt;/p&gt;
&lt;p&gt;Note that the features described in this post are still experimental, and some are
still under development and review. For a more detailed discussion on the
actual progress of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt;, please refer to the &lt;a class="reference internal" href="#issues"&gt;&lt;span class="xref myst"&gt;Issues section&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/18/dask-nep18.md&lt;/span&gt;, line 70)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performance"&gt;
&lt;h1&gt;Performance&lt;/h1&gt;
&lt;p&gt;Before going any further, note that the following hardware and software were used for all
performance results described in this post:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;CPU: 6-core (12-threads) Intel Core i7-7800X &amp;#64; 3.50GHz&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Main memory: 16 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GPU: NVIDIA Quadro GV100&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenBLAS 0.2.18&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuBLAS 9.2.174&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuSOLVER 9.2.148&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s now look at an example of the potential performance benefits of the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol with Dask when using CuPy as a backend. We
start by creating all the arrays that we will later use for computing an SVD.
Note that the focus here is compute performance, so this example ignores the
time spent creating the arrays or copying them between CPU and GPU.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;As seen above, we have four arrays:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt;: a NumPy array in main memory;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;y&lt;/span&gt;&lt;/code&gt;: a CuPy array in GPU memory;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dx&lt;/span&gt;&lt;/code&gt;: a NumPy array wrapped in a Dask array;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dy&lt;/span&gt;&lt;/code&gt;: a &lt;em&gt;copy&lt;/em&gt; of a CuPy array wrapped in a Dask array; wrapping a CuPy
array in a Dask array as a view (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;asarray=True&lt;/span&gt;&lt;/code&gt;) is not supported yet.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="compute-svd-on-a-numpy-array"&gt;
&lt;h2&gt;Compute SVD on a NumPy array&lt;/h2&gt;
&lt;p&gt;We start by computing the SVD of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; using NumPy; it is
processed on the CPU in a single thread:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The timing information I obtained after that looks like the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;347&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Over 3 minutes seems a bit too slow, so now the question is: Can we do better,
and more importantly, without having to change our entire code?&lt;/p&gt;
&lt;p&gt;The answer to this question is: Yes, we can.&lt;/p&gt;
&lt;p&gt;Let’s look now at other results.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="compute-svd-on-the-numpy-array-wrapped-in-dask-array"&gt;
&lt;h2&gt;Compute SVD on the NumPy array wrapped in Dask array&lt;/h2&gt;
&lt;p&gt;First of all, this is what you had to do &lt;em&gt;before&lt;/em&gt; the introduction of the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The code above could be prohibitive for many projects, since one needs to
call the proper library’s dispatcher in addition to passing the correct
array. In other words, one would need to find all NumPy calls in the code and
replace them with the correct library’s function calls, depending on the input
array type. With &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt;, the same NumPy function can be
called, using the Dask array &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dx&lt;/span&gt;&lt;/code&gt; as input:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note: Dask defers computation until results are consumed, so we need to
call the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.compute()&lt;/span&gt;&lt;/code&gt; function on the result arrays to actually compute
them.&lt;/p&gt;
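This laziness is easy to see on a toy array. The following is a small sketch (not the benchmark above; it assumes Dask is installed and uses a much smaller array) showing that the NumPy call only builds a task graph until `dask.compute()` is invoked:

```python
import numpy as np
import dask
import dask.array as da

x = np.random.random((1000, 100))
# Chunk along the first axis only ("tall-and-skinny"), which Dask's
# SVD implementation requires.
dx = da.from_array(x, chunks=(500, 100))

# The NumPy function dispatches to Dask and returns lazy Dask arrays.
u, s, v = np.linalg.svd(dx)
print(type(u))  # still a dask array; nothing has been computed yet

# Computation happens only when we explicitly ask for concrete results.
u, s, v = dask.compute(u, s, v)
```

After `dask.compute()`, `u`, `s`, and `v` are plain NumPy arrays that reconstruct `x`.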
&lt;p&gt;Let’s now take a look at the timing information:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;460&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, without changing any code besides wrapping the NumPy array as a
Dask array, we see a speedup of more than 2x. Not too bad. But let’s go back to
our previous question: Can we do better?&lt;/p&gt;
&lt;/section&gt;
&lt;section id="compute-svd-on-the-cupy-array"&gt;
&lt;h2&gt;Compute SVD on the CuPy array&lt;/h2&gt;
&lt;p&gt;We can do the same as for the Dask array now and simply call NumPy’s SVD
function on the CuPy array &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;y&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The timing information we get now is the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;17.3&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.81&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;19.1&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;19.1&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We now see a 4-5x speedup with no change in internal calls whatsoever! This is
exactly the sort of benefit that we expect to leverage with the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol, speeding up existing code, for free!&lt;/p&gt;
&lt;p&gt;Let’s go back to our original question one last time: Can we do better?&lt;/p&gt;
&lt;/section&gt;
&lt;section id="compute-svd-on-the-cupy-array-wrapped-in-dask-array"&gt;
&lt;h2&gt;Compute SVD on the CuPy array wrapped in Dask array&lt;/h2&gt;
&lt;p&gt;We can now take advantage of Dask’s data chunk splitting &lt;em&gt;and&lt;/em&gt;
the CuPy GPU implementation, in an attempt to keep our GPU as busy as
possible. This remains as simple as:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For which we get the following timing:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;8.97&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;653&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.62&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.45&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This gives us another 2x speedup over the single-threaded CuPy SVD computation.&lt;/p&gt;
&lt;p&gt;To conclude, we started from over 3 minutes and are now down to under 10
seconds by simply dispatching the work on a different array.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/18/dask-nep18.md&lt;/span&gt;, line 214)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="application"&gt;
&lt;h1&gt;Application&lt;/h1&gt;
&lt;p&gt;We will now talk a bit about potential applications of the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol. For this, we will discuss the
&lt;a class="reference external" href="https://dask-glm.readthedocs.io/"&gt;Dask-GLM&lt;/a&gt; library, used for fitting
Generalized Linear Models on large datasets. It’s built on top of Dask and
offers an API compatible with &lt;a class="reference external" href="https://scikit-learn.org/"&gt;scikit-learn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before the introduction of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol, we would have
needed to rewrite most of its internal implementation for each and every
NumPy-like library that we wished to use as a backend: one specialization of
the implementation for Dask, another for CuPy, and yet another for Sparse.
Now, thanks to the functionality these libraries share through a compatible
interface, we don’t have to change the implementation at all; we simply pass a
different array type as input.&lt;/p&gt;
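As a concrete, hypothetical illustration of this style (our own example, not Dask-GLM's actual code), a function written purely against the NumPy API stays backend-agnostic, because every `np.*` call dispatches to the library that owns the input:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance columns. Only NumPy API calls are used,
    so __array_function__ routes each call to X's own library
    (NumPy, Dask, CuPy, Sparse, ...)."""
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(standardize(X))
```

Calling `standardize` with a Dask or CuPy array instead of `X` would run the same code on that backend, with no changes to the function body.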
&lt;section id="example-with-scikit-learn"&gt;
&lt;h2&gt;Example with scikit-learn&lt;/h2&gt;
&lt;p&gt;To demonstrate this new capability, let’s consider the following
scikit-learn example (based on the original example
&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py"&gt;here&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;

&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

&lt;span class="c1"&gt;# x from 0 to N&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# y = a*x + b with noise&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# create a linear regression model&lt;/span&gt;
&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We can then fit the model,&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;and obtain its time measurements:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;3.4&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.4&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.3&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We can then use it for prediction on some test data,&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# predict y from the data&lt;/span&gt;
&lt;span class="n"&gt;x_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_new&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And also check its time measurements:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;1.16&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;680&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.84&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.58&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And finally plot the results:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# plot the results&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;black&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_facecolor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.42&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/dask-nep18-linreg.png"&gt;
&lt;/section&gt;
&lt;section id="example-with-dask-glm"&gt;
&lt;h2&gt;Example with Dask-GLM&lt;/h2&gt;
&lt;p&gt;The only thing we have to change from the code before is the first block, where
we import libraries and create arrays:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_glm.estimators&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

&lt;span class="c1"&gt;# x from 0 to N&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# y = a*x + b with noise&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# create a linear regression model&lt;/span&gt;
&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lbfgs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The rest of the code, as well as the resulting plot, looks just like the
previous scikit-learn example, so we omit it here for brevity. Note also that
we could have called &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LinearRegression()&lt;/span&gt;&lt;/code&gt; without any arguments, but for this example
we chose the
&lt;a class="reference external" href="https://docs.scipy.org/doc/scipy/reference/optimize.minimize-lbfgsb.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;lbfgs&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;
solver, which converges reasonably fast.&lt;/p&gt;
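For intuition on what an LBFGS-style solver does, here is a small standalone sketch (using SciPy's L-BFGS-B as a stand-in; this is an assumption for illustration, not Dask-GLM's internal code) that fits the same line by minimizing the least-squares loss:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = 1000 * rng.random((400, 1))
y = 0.5 * x + 1.0 + rng.normal(size=x.shape)

# Least-squares loss over theta = [slope, intercept].
def loss(theta):
    a, b = theta
    return np.sum((y - (a * x + b)) ** 2)

# Analytic gradient of the loss, for faster and more precise convergence.
def grad(theta):
    a, b = theta
    r = y - (a * x + b)
    return np.array([-2 * np.sum(r * x), -2 * np.sum(r)])

# L-BFGS is a quasi-Newton method; on this smooth convex loss it
# converges in a handful of iterations.
res = minimize(loss, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(res.x)  # approximately [0.5, 1.0]
```

The recovered slope and intercept match the `a` and `b` used to generate the noisy data.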
&lt;p&gt;We can also have a look at the timing results for fitting, followed by those
for predicting with Dask-GLM:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Fitting&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;9.66&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;116&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.78&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.94&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;

&lt;span class="c1"&gt;# Predicting&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;130&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;327&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;457&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.06&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If instead we want CuPy to do the computing, we have to change only three
lines: importing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy&lt;/span&gt;&lt;/code&gt; instead of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;numpy&lt;/span&gt;&lt;/code&gt;, and the two lines where we create the
random arrays, replacing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;np.random&lt;/span&gt;&lt;/code&gt; with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy.random&lt;/span&gt;&lt;/code&gt;. The
block should then look like this:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_glm.estimators&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

&lt;span class="c1"&gt;# x from 0 to N&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# y = a*x + b with noise&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# create a linear regression model&lt;/span&gt;
&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lbfgs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And the timing results we obtain in this scenario are:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Fitting&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;151&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;40.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;191&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;190&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;

&lt;span class="c1"&gt;# Predicting&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;1.91&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;778&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.69&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.37&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
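&lt;p&gt;The timings above come from IPython’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;%time&lt;/span&gt;&lt;/code&gt; magic. As a rough, self-contained sketch of how such a measurement could be reproduced in plain Python (using a NumPy least-squares solve as a stand-in for the Dask-GLM estimator, so the numbers will differ):&lt;/p&gt;

```python
import time
import numpy as np

# Stand-in data shaped like the post's example: y = 0.5*x + 1.0 + noise
rng = np.random.default_rng(0)
x = 1000 * rng.random((40000, 1))
y = 0.5 * x[:, 0] + 1.0 + rng.normal(size=40000)

# Fit a line with ordinary least squares, timing just the solve
X = np.hstack([x, np.ones((40000, 1))])  # add an intercept column
t0 = time.perf_counter()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
elapsed = time.perf_counter() - t0

print(f"slope={coef[0]:.3f} intercept={coef[1]:.3f} ({elapsed * 1000:.1f} ms)")
```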
&lt;p&gt;For the simple example chosen for this post, scikit-learn outperforms Dask-GLM
with both NumPy and CuPy arrays. Several factors may contribute to this,
and while we did not dive deeply into understanding their exact
causes and extent, some likely possibilities are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;scikit-learn may be using solvers that converge faster;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask-GLM is entirely built on top of Dask, while scikit-learn may be
heavily optimized internally;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Too many synchronization steps and data transfers may occur for small
datasets with CuPy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="performance-for-different-dask-glm-solvers"&gt;
&lt;h2&gt;Performance for different Dask-GLM solvers&lt;/h2&gt;
&lt;p&gt;To verify how Dask-GLM with NumPy arrays compares against CuPy arrays, we
also benchmarked logistic regression with the different Dask-GLM solvers. The results
below were obtained from training datasets of 10&lt;sup&gt;2&lt;/sup&gt;,
10&lt;sup&gt;3&lt;/sup&gt;, …, 10&lt;sup&gt;6&lt;/sup&gt; samples with 100 dimensions each, and a matching
number of test samples.&lt;/p&gt;
&lt;p&gt;Note: we are intentionally omitting results for Dask arrays, as we have
identified a &lt;a class="reference external" href="https://github.com/dask/dask-glm/issues/78"&gt;potential bug&lt;/a&gt; that
causes Dask arrays not to converge.&lt;/p&gt;
&lt;img src="/images/dask-nep18-fitting.png"&gt;
&lt;p&gt;From the results in the graphs above we can see that CuPy can be one
order of magnitude faster than NumPy for fitting with any of the Dask-GLM
solvers. Note that both axes use a logarithmic scale for
easier visualization.&lt;/p&gt;
&lt;p&gt;Another interesting effect is that convergence may take longer
for small numbers of samples. However, as we would normally hope, the compute
time required to converge scales linearly with the number of samples.&lt;/p&gt;
&lt;img src="/images/dask-nep18-prediction.png"&gt;
&lt;p&gt;Prediction with CuPy, as seen above, stays mostly constant across all
solvers and is around 3-4 orders of magnitude faster than NumPy.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="issues"&gt;
&lt;h1&gt;Issues&lt;/h1&gt;
&lt;p&gt;In this section we describe the work, completed and still ongoing
since February 2019, towards enabling the features described in the previous
sections. If you are not interested in the details, feel free to
skip this section.&lt;/p&gt;
&lt;section id="fixed-issues"&gt;
&lt;h2&gt;Fixed Issues&lt;/h2&gt;
&lt;p&gt;Since early February 2019, substantial progress has been made towards deeper
support of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol in the different projects.
This trend is ongoing and will continue through March. Below is a list
of issues that have been fixed or are in the process of review:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol dependencies fixed in
&lt;a class="reference external" href="https://github.com/cupy/cupy/issues/2029"&gt;CuPy PR #2029&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask issues using CuPy backend with mean() and moment()
&lt;a class="reference external" href="https://github.com/dask/dask/issues/4481"&gt;Dask Issue #4481&lt;/a&gt;, fixed in
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4513"&gt;Dask PR #4513&lt;/a&gt; and
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4519"&gt;Dask PR #4519&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace aliased NumPy functions in SciPy that may not be available in
libraries like CuPy, fixed in
&lt;a class="reference external" href="https://github.com/scipy/scipy/pull/9888"&gt;SciPy PR #9888&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allow creation of arbitrary shaped arrays, using the input array as
reference for the new array to be created, under review in
&lt;a class="reference external" href="https://github.com/numpy/numpy/issues/13043"&gt;NumPy PR #13043&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multithreading with CuPy first identified in
&lt;a class="reference external" href="https://github.com/dask/dask/issues/4487"&gt;Dask Issue #4487&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/cupy/cupy/issues/2045"&gt;CuPy Issue #2045&lt;/a&gt; and
&lt;a class="reference external" href="https://github.com/cupy/cupy/issues/1109"&gt;CuPy Issue #1109&lt;/a&gt;, now under
review in &lt;a class="reference external" href="https://github.com/cupy/cupy/pull/2053"&gt;CuPy PR #2053&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calling Dask’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;flatnonzero()&lt;/span&gt;&lt;/code&gt; on CuPy array missing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy.compress()&lt;/span&gt;&lt;/code&gt;,
first identified in
&lt;a class="reference external" href="https://github.com/dask/dask/issues/4497"&gt;Dask Issue #4497&lt;/a&gt;, under review
in &lt;a class="reference external" href="https://github.com/dask/dask/pull/4548"&gt;Dask PR #4548&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask support for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt;, under review in
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4567"&gt;Dask PR #4567&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
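&lt;p&gt;Most of the issues above revolve around NEP-18’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt; protocol. For readers unfamiliar with it, here is a minimal sketch of how a NumPy-like container can intercept NumPy functions (the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MyArray&lt;/span&gt;&lt;/code&gt; class is purely illustrative, not from any of the projects above):&lt;/p&gt;

```python
import numpy as np

class MyArray:
    """Tiny NEP-18 sketch: intercept NumPy functions via __array_function__."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap MyArray arguments, call the NumPy implementation,
        # and re-wrap the result so the backend type is preserved.
        unwrapped = [a.data if isinstance(a, MyArray) else a for a in args]
        return MyArray(func(*unwrapped, **kwargs))

# np.sum dispatches to MyArray.__array_function__ (NumPy >= 1.17)
result = np.sum(MyArray([1, 2, 3]))
print(type(result).__name__, result.data)  # MyArray 6
```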
&lt;/section&gt;
&lt;section id="known-issues"&gt;
&lt;h2&gt;Known Issues&lt;/h2&gt;
&lt;p&gt;Currently, one of the biggest issues we are tackling relates to the
&lt;a class="reference external" href="https://github.com/dask/dask/issues/4490"&gt;Dask issue #4490&lt;/a&gt; we first
identified when calling Dask’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;diag()&lt;/span&gt;&lt;/code&gt; on a CuPy array. This requires some
change on the Dask &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Array&lt;/span&gt;&lt;/code&gt; class, and subsequent changes throughout large
parts of the Dask codebase. I will not go into too much detail here, but the
way we are handling this issue is by adding a new attribute &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta&lt;/span&gt;&lt;/code&gt; to Dask
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Array&lt;/span&gt;&lt;/code&gt;, replacing the simple &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dtype&lt;/span&gt;&lt;/code&gt; that exists today. This
new attribute holds not only the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dtype&lt;/span&gt;&lt;/code&gt; information, but also an empty
array of the backend type used to create the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Array&lt;/span&gt;&lt;/code&gt; in the first place,
allowing us to reconstruct arrays of that backend type internally, without
having to know explicitly whether it is a NumPy, CuPy, Sparse, or any other
NumPy-like array. For additional details, please refer to &lt;a class="reference external" href="https://github.com/dask/dask/issues/2977"&gt;Dask Issue
#2977&lt;/a&gt;.&lt;/p&gt;
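&lt;p&gt;The idea can be sketched as follows (a simplified illustration; the class and method names here are not Dask’s actual internals): keeping a zero-length “meta” array around lets us create new arrays of the same backend type through NumPy’s own API, which dispatches to the backend via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

class TinyArray:
    """Simplified sketch of the _meta idea (not Dask's real Array class)."""

    def __init__(self, data):
        self.data = data
        # _meta: an empty array of the same backend type and dtype
        self._meta = data[:0]

    def new_empty(self, shape):
        # Reconstruct an array of the original backend type without
        # checking explicitly for NumPy, CuPy, Sparse, etc.; with a CuPy
        # meta, np.empty_like would return a CuPy array via NEP-18.
        return np.empty_like(self._meta, shape=shape)

a = TinyArray(np.arange(6, dtype="float64"))
out = a.new_empty((2, 3))
print(type(out).__name__, out.shape, out.dtype)  # ndarray (2, 3) float64
```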
&lt;p&gt;We have identified some more issues with ongoing discussions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Using Sparse as a Dask backend, discussed in
&lt;a class="reference external" href="https://github.com/dask/dask/issues/4523"&gt;Dask Issue #4523&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calling Dask’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fix()&lt;/span&gt;&lt;/code&gt; on CuPy array depends on &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_wrap__&lt;/span&gt;&lt;/code&gt;,
discussed in &lt;a class="reference external" href="https://github.com/dask/dask/issues/4496"&gt;Dask Issue #4496&lt;/a&gt;
and &lt;a class="reference external" href="https://github.com/cupy/cupy/issues/589"&gt;CuPy Issue #589&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allow coercing of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_function__&lt;/span&gt;&lt;/code&gt;, discussed in
&lt;a class="reference external" href="https://github.com/numpy/numpy/issues/12974"&gt;NumPy Issue #12974&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;There are several possibilities for a richer experience with Dask; some that
could be particularly interesting in the short and mid term are:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Use &lt;a class="reference external" href="https://github.com/rapidsai/dask-cudf"&gt;Dask-cuDF&lt;/a&gt; alongside
Dask-GLM to present interesting, realistic applications of the whole
ecosystem;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More comprehensive examples and benchmarks for Dask-GLM;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/linear_model.html"&gt;more models in
Dask-GLM&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A deeper look into the Dask-GLM versus scikit-learn performance;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Profile CuPy’s performance on matrix-matrix multiplication operations
(GEMM) compared to matrix-vector multiplication operations (GEMV) for
distributed Dask operation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/03/18/dask-nep18/"/>
    <category term="CuPy" label="CuPy"/>
    <category term="Dask" label="Dask"/>
    <category term="Dask-GLM" label="Dask-GLM"/>
    <category term="Sparse" label="Sparse"/>
    <published>2019-03-18T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/03/04/building-gpu-groupbys/</id>
    <title>Building GPU Groupby-Aggregations for Dask</title>
    <updated>2019-03-04T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/04/building-gpu-groupbys.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations
like the following to work well.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This post describes the kind of work we had to do as a model for future
development.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="plan"&gt;
&lt;h1&gt;Plan&lt;/h1&gt;
&lt;p&gt;As outlined in a previous post, &lt;a class="reference internal" href="../../../2019/01/13/dask-cudf-first-steps.html"&gt;&lt;span class="xref myst"&gt;Dask, Pandas, and GPUs: first
steps&lt;/span&gt;&lt;/a&gt;, our plan to produce
distributed GPU dataframes was to combine &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;Dask
DataFrame&lt;/a&gt; with
&lt;a class="reference external" href="https://rapids.ai"&gt;cudf&lt;/a&gt;. In particular, we had to&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;change Dask DataFrame so that it would parallelize not just around the
Pandas DataFrames that it works with today, but around anything that looked
enough like a Pandas DataFrame&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;change cuDF so that it would look enough like a Pandas DataFrame to fit
within the algorithms in Dask DataFrame&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="changes"&gt;
&lt;h1&gt;Changes&lt;/h1&gt;
&lt;p&gt;On the Dask side this mostly meant:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Replacing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;isinstance(df,&lt;/span&gt; &lt;span class="pre"&gt;pd.DataFrame)&lt;/span&gt;&lt;/code&gt; checks with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_dataframe_like(df)&lt;/span&gt;&lt;/code&gt;
checks (after defining a suitable
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_dataframe_like&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_series_like&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_index_like&lt;/span&gt;&lt;/code&gt; functions)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoiding some more exotic functionality in Pandas, and instead trying to
use more common functionality that we can expect to be in most DataFrame
implementations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
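&lt;p&gt;The duck-typing check can be sketched roughly as follows (a simplified illustration; Dask’s actual &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_dataframe_like&lt;/span&gt;&lt;/code&gt; checks a different set of attributes):&lt;/p&gt;

```python
import pandas as pd

def is_dataframe_like(df):
    # Simplified sketch: accept anything that quacks like a DataFrame,
    # rather than checking isinstance(df, pd.DataFrame).
    typ = type(df)
    looks_like_df = all(
        hasattr(typ, name) for name in ("groupby", "head", "merge", "mean")
    ) and all(hasattr(df, name) for name in ("dtypes", "columns"))
    # Exclude Series-like objects, which also define groupby/head/mean
    looks_like_series = hasattr(typ, "dtype") and hasattr(typ, "name")
    return looks_like_df and not looks_like_series

print(is_dataframe_like(pd.DataFrame({"x": [1, 2]})))  # True
print(is_dataframe_like(pd.Series([1, 2])))            # False
```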
&lt;p&gt;On the cuDF side this means making dozens of tiny changes to align the cuDF API
to the Pandas API, and to add in missing features.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dask Changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4359"&gt;Remove explicit pandas checks and provide cudf lazy registration #4359&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4375"&gt;Replace isinstance(…, pandas) with is_dataframe_like #4375&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4395"&gt;Add has_parallel_type&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4396"&gt;Lazily register more cudf functions and move to backends file #4396&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4418"&gt;Avoid checking against types in is_dataframe_like #4418&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4470"&gt;Replace cudf-specific code with dask-cudf import #4470&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4482"&gt;Avoid groupby.agg(callable) in groupby-var #4482&lt;/a&gt; – this one is notable in that by simplifying our Pandas usage we actually got a significant speedup on the Pandas side.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cuDF Changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/529"&gt;Build DataFrames from CUDA array libraries #529&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/534"&gt;Groupby AttributeError&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/556"&gt;Support comparison operations on Indexes #556&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/568"&gt;Support byte ranges in read_csv (and other formats) #568&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/824"&gt;Allow “df.index = some_index” #824&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/828"&gt;Support indexing on groupby objects #828&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/831"&gt;Support df.reset_index(drop=True) #831&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/879"&gt;Add Series.groupby #879&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/880"&gt;Support Dataframe/Series groupby level=0 #880&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/900"&gt;Implement division on DataFrame objects #900&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/934"&gt;Groupby objects aren’t indexable by column names #934&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/937"&gt;Support comparisons on index operations #937&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/944"&gt;Add DataFrame.rename #944&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/967"&gt;Set the index of a dataframe/series #967&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/968"&gt;Support concat(…, axis=1) #968&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/969"&gt;Support indexing with a pandas index from columns #969&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/970"&gt;Support indexing a dataframe with another boolean dataframe #970&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I don’t really expect anyone to go through all of those issues, but my hope is
that by skimming over the issue titles people will get a sense for the kinds of
changes we’re making here. It’s a large number of small things.&lt;/p&gt;
&lt;p&gt;Also, kudos to &lt;a class="reference external" href="https://github.com/thomcom"&gt;Thomson Comer&lt;/a&gt; who solved most of
the cuDF issues above.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="there-are-still-some-pending-issues"&gt;
&lt;h1&gt;There are still some pending issues&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/1055"&gt;Square Root #1055&lt;/a&gt;, needed for groupby-std&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/483"&gt;cuDF needs multi-index support for columns #483&lt;/a&gt;, needed for:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;gropuby&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sum&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;], &amp;#39;&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;: [&amp;#39;&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;, &amp;#39;&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;]})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="but-things-mostly-work"&gt;
&lt;h1&gt;But things mostly work&lt;/h1&gt;
&lt;p&gt;Generally, though, things work pretty well today:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cudf&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;yellow_tripdata_2016-*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;passenger_count&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="mf"&gt;0.625424&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="mf"&gt;4.976895&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="mf"&gt;4.470014&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="mf"&gt;5.955262&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;    &lt;span class="mf"&gt;4.328076&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;    &lt;span class="mf"&gt;3.079661&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;    &lt;span class="mf"&gt;2.998077&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;    &lt;span class="mf"&gt;3.147452&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;    &lt;span class="mf"&gt;5.165570&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;    &lt;span class="mf"&gt;5.916169&lt;/span&gt;
&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;float64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="experience"&gt;
&lt;h1&gt;Experience&lt;/h1&gt;
&lt;p&gt;First, most of this work was handled by the cuDF developers (as may be
evident from the relative lengths of the issue lists above). When we started
this process it felt like a never-ending stream of tiny issues: we weren’t
able to see the next set of issues until we had finished the current set.
Fortunately, most of them were pretty easy to fix, and the work seemed to
get a bit easier as we went on.&lt;/p&gt;
&lt;p&gt;Additionally, many things beyond groupby-aggregations now work as a result
of the changes above. From the perspective of someone accustomed to Pandas,
the cuDF library is starting to feel more reliable: we hit missing
functionality less frequently when using cuDF for other operations.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-s-next"&gt;
&lt;h1&gt;What’s next?&lt;/h1&gt;
&lt;p&gt;More recently we’ve been working on the various join/merge operations in Dask
DataFrame, like indexed joins on a sorted column, joins between large and small
dataframes (a common special case), and so on. Getting these algorithms from
the mainline Dask DataFrame codebase to work with cuDF is producing a
similar set of issues to what we saw above with groupby-aggregations, but so
far the list is much smaller. We hope this trend continues as we move on
to other sets of functionality, like I/O, time-series
operations, rolling windows, and so on.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/03/04/building-gpu-groupbys/"/>
    <category term="GPU" label="GPU"/>
    <category term="RAPIDS" label="RAPIDS"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-03-04T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/31/dask-mpi-experiment/</id>
    <title>Running Dask and MPI programs together</title>
    <updated>2019-01-31T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">
&lt;section id="executive-summary"&gt;

&lt;p&gt;We present an experiment on how to pass data from a loosely coupled parallel
computing system like Dask to a tightly coupled parallel computing system like
MPI.&lt;/p&gt;
&lt;p&gt;We give motivation and a complete, digestible example.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/193a9671f1536b9d13524214798da4a8"&gt;Here is a gist of the code and results&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="motivation"&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: Nothing in this post is polished or production ready. This is an
experiment designed to start conversation. No long-term support is offered.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We often get the following question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;How do I use Dask to pre-process my data,
but then pass those results to a traditional MPI application?&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;You might want to do this because you’re supporting legacy code written
in MPI, or because your computation requires tightly coupled parallelism of the
sort that only MPI can deliver.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="first-solution-write-to-disk"&gt;
&lt;h1&gt;First solution: Write to disk&lt;/h1&gt;
&lt;p&gt;The simplest approach, of course, is to write your Dask results to disk and
then load them back from disk with MPI. Depending on the cost of your
computation relative to data loading, this might be a great choice.&lt;/p&gt;
&lt;p&gt;For the rest of this blogpost we’re going to assume that it’s not.&lt;/p&gt;
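Still, as a rough sketch of this write-to-disk approach (the array, chunk sizes, and paths here are illustrative, not from the post), the hand-off might look like:

```python
# Illustrative sketch of the write-to-disk handoff; sizes and paths are
# made up for this example.
import os

import dask.array as da

# Dask side: preprocess, then write one .npy file per chunk to shared storage
os.makedirs("shared/results", exist_ok=True)
x = da.random.random(1_000_000, chunks=(100_000,))
x = (x * 2).persist()
da.to_npy_stack("shared/results", x)

# MPI side (pseudocode): each rank reads its own chunk back from disk
#   rank = MPI.COMM_WORLD.Get_rank()
#   chunk = numpy.load("shared/results/%d.npy" % rank)
```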
&lt;/section&gt;
&lt;section id="second-solution"&gt;
&lt;h1&gt;Second solution&lt;/h1&gt;
&lt;p&gt;We have a trivial MPI library written with &lt;a class="reference external" href="https://mpi4py.readthedocs.io/en/stable/"&gt;MPI4Py&lt;/a&gt;
in which each rank just prints out all the data it was given. In principle,
though, it could call into C++ code and do arbitrary MPI things.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# my_mpi_lib.py&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;mpi4py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MPI&lt;/span&gt;

&lt;span class="n"&gt;comm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MPI&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMM_WORLD&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;print_data_and_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Fake function that mocks out how an MPI function should operate&lt;/span&gt;

&lt;span class="sd"&gt;    -   It takes in a list of chunks of data that are present on this machine&lt;/span&gt;
&lt;span class="sd"&gt;    -   It does whatever it wants to with this data and MPI&lt;/span&gt;
&lt;span class="sd"&gt;        Here for simplicity we just print the data and print the rank&lt;/span&gt;
&lt;span class="sd"&gt;    -   Maybe it returns something&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;comm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get_rank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;on rank:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In our dask program we’re going to use Dask normally to load in data, do some
preprocessing, and then hand off all of that data to each MPI rank, which will
call the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;print_data_and_rank&lt;/span&gt;&lt;/code&gt; function above to initialize the MPI
computation.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# my_dask_script.py&lt;/span&gt;

&lt;span class="c1"&gt;# Set up Dask workers from within an MPI job using the dask_mpi project&lt;/span&gt;
&lt;span class="c1"&gt;# See https://dask-mpi.readthedocs.io/en/latest/&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_mpi&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;initialize&lt;/span&gt;
&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;futures_of&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Use Dask Array to &amp;quot;load&amp;quot; data (actually just create random data here)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find out where data is on each worker&lt;/span&gt;
&lt;span class="c1"&gt;# TODO: This could be improved on the Dask side to reduce boiler plate&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;toolz&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="n"&gt;key_to_part_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;futures_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;who_has&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;who_has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;worker_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;who_has&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;worker_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_to_part_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="c1"&gt;# Call an MPI-enabled function on the list of data present on each worker&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;my_mpi_lib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;print_data_and_rank&lt;/span&gt;

&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;print_data_and_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;list_of_parts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;list_of_parts&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;worker_map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then we can call this mix of Dask and an MPI program using normal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpirun&lt;/span&gt;&lt;/code&gt; or
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpiexec&lt;/span&gt;&lt;/code&gt; commands.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;mpirun&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="n"&gt;my_dask_script&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="what-just-happened"&gt;
&lt;h1&gt;What just happened&lt;/h1&gt;
&lt;p&gt;So MPI started up and ran our script.
The &lt;a class="reference external" href="https://dask-mpi.readthedocs.io/en/latest/"&gt;dask-mpi&lt;/a&gt; project sets up a Dask
scheduler on rank 0, runs our client code on rank 1, and runs Dask workers on ranks 2+.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Rank 0: Runs a Dask scheduler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rank 1: Runs our script&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ranks 2+: Run Dask workers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our script then created a Dask array; in practice it would presumably read in
data from some source and do more complex Dask manipulations before continuing.&lt;/p&gt;
&lt;p&gt;We then wait until all of the Dask work has finished and the cluster is in a
quiet state, and query the state in the scheduler to find out where all of that
data lives. That’s this code here:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Find out where data is on each worker&lt;/span&gt;
&lt;span class="c1"&gt;# TODO: This could be improved on the Dask side to reduce boiler plate&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;toolz&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="n"&gt;key_to_part_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;futures_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;who_has&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;who_has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;worker_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;who_has&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;worker_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_to_part_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Admittedly, this code is gross, and not particularly friendly or obvious to
non-Dask experts (or even to Dask experts themselves; I had to steal this from the
&lt;a class="reference external" href="http://ml.dask.org/xgboost.html"&gt;Dask XGBoost project&lt;/a&gt;, which does
the same trick).&lt;/p&gt;
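That boilerplate could be wrapped in a small helper. The function below is a hypothetical sketch, not part of Dask’s public API:

```python
# Hypothetical helper wrapping the worker-mapping boilerplate shown above.
# Not part of Dask's public API; a sketch only.
from collections import defaultdict


def parts_by_worker(client, futures):
    """Group a list of futures by the worker address that holds each one."""
    key_to_part = {str(f.key): f for f in futures}
    worker_map = defaultdict(list)
    for key, workers in client.who_has(futures).items():
        worker_map[workers[0]].append(key_to_part[key])
    return dict(worker_map)


# Usage with the script above (x is the persisted Dask array):
#   futures = [client.submit(print_data_and_rank, parts, workers=worker)
#              for worker, parts in parts_by_worker(client, futures_of(x)).items()]
```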
&lt;p&gt;But after that we just call our MPI library’s initialization function,
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;print_data_and_rank&lt;/span&gt;&lt;/code&gt;, on all of our data using Dask’s
&lt;a class="reference external" href="http://docs.dask.org/en/latest/futures.html"&gt;Futures interface&lt;/a&gt;.
That function gets the data directly from local memory (the Dask workers and
MPI ranks live in the same processes) and does whatever the MPI application
wants.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future work&lt;/h1&gt;
&lt;p&gt;This could be improved in a few ways:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The “gross” code referred to above could probably be placed into some
library code to make this pattern easier for people to use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ideally the Dask part of the computation wouldn’t also have to be managed
by MPI, but could maybe start up MPI on its own.&lt;/p&gt;
&lt;p&gt;You could imagine Dask running on something like Kubernetes doing highly
dynamic work, scaling up and down as necessary. Then it would get to a
point where it needed to run some MPI code so it would, itself, start up
MPI on its worker processes and run the MPI application on its data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We haven’t really said anything about resilience here. My guess is that
this isn’t hard to do (Dask has plenty of mechanisms to build complex
inter-task relationships) but I didn’t solve it above.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/193a9671f1536b9d13524214798da4a8"&gt;Here is a gist of the code and results&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/31/dask-mpi-experiment/"/>
    <category term="MPI" label="MPI"/>
    <published>2019-01-31T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/29/cudf-joins/</id>
    <title>Single-Node Multi-GPU Dataframe Joins</title>
    <updated>2019-01-29T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">
&lt;section id="summary"&gt;

&lt;p&gt;We experiment with single-node multi-GPU joins using cuDF and Dask. We find
that the in-GPU computation is faster than communication. We also present
context and plans for near-future work, including improving high performance
communication in Dask with UCX.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/6e2c33c33b32bc324e3965212f202f66"&gt;Here is a notebook of the experiment in this post&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In a recent post we showed how Dask + cuDF could accelerate reading CSV files
using multiple GPUs in parallel. That operation quickly became bound by the
speed of our disk after we added a few GPUs. Now we try a very different kind
of operation, multi-GPU joins.&lt;/p&gt;
&lt;p&gt;This workload can be communication-heavy, especially if the column on which we
are joining is not sorted nicely, and so provides a good example at the other
extreme from parsing CSV.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="benchmark"&gt;
&lt;h1&gt;Benchmark&lt;/h1&gt;
&lt;section id="construct-random-data-using-the-cpu"&gt;
&lt;h2&gt;Construct random data using the CPU&lt;/h2&gt;
&lt;p&gt;Here we use Dask array and Dask dataframe to construct two random tables with a
shared &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;id&lt;/span&gt;&lt;/code&gt; column. We can play with the number of rows of each table and the
number of keys to make the join challenging in a variety of ways.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;n_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000000000&lt;/span&gt;
&lt;span class="n"&gt;n_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000000&lt;/span&gt;

&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;n_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;

&lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="send-to-the-gpus"&gt;
&lt;h2&gt;Send to the GPUs&lt;/h2&gt;
&lt;p&gt;We have two Dask dataframes composed of many Pandas dataframes of our random
data. We now map the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf.from_pandas&lt;/span&gt;&lt;/code&gt; function across these to make a Dask
dataframe of cuDF dataframes.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cudf&lt;/span&gt;

&lt;span class="n"&gt;gleft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gright&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# persist data in device memory&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What’s nice here is that there wasn’t any special
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask_pandas_dataframe_to_dask_cudf_dataframe&lt;/span&gt;&lt;/code&gt; function. Dask composed nicely
with cuDF. We didn’t need to do anything special to support it.&lt;/p&gt;
&lt;p&gt;We also persisted the data in device memory.&lt;/p&gt;
&lt;p&gt;After this, simple operations are easy and fast and use our eight GPUs.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# this takes 250ms&lt;/span&gt;
&lt;span class="go"&gt;500004719.254711&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="join"&gt;
&lt;h2&gt;Join&lt;/h2&gt;
&lt;p&gt;We’ll use standard Pandas syntax to merge the datasets, persist the result in
RAM, and then wait for the computation to finish&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gright&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# this is lazy&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="profile-and-analyze-results"&gt;
&lt;h1&gt;Profile and analyze results&lt;/h1&gt;
&lt;p&gt;We now look at the Dask diagnostic plots for this computation.&lt;/p&gt;
&lt;section id="task-stream-and-communication"&gt;
&lt;h2&gt;Task stream and communication&lt;/h2&gt;
&lt;p&gt;When we look at Dask’s task stream plot we see that each of our eight threads
(each of which manages a single GPU) spent most of its time in communication
(red is communication time). The actual merge and concat tasks are quite fast
relative to the data transfer time.&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/dask-cudf-joins.html"
        width="800"
        height="400"&gt;&lt;/iframe&gt;
&lt;p&gt;That’s not too surprising. For this computation I’ve turned off any attempt to
communicate between devices (more on this below) so the data is being moved
from the GPU to the CPU memory, then serialized and put onto a TCP socket.
We’re moving tens of GB on a single machine, so we’re seeing about 1GB/s total
throughput of the system, which is typical for TCP-on-localhost in Python.&lt;/p&gt;
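As a back-of-envelope check, the transfer time implied by that throughput can be sketched as follows (the data volume here is a hypothetical stand-in for the post’s “tens of GB”, not a measured figure):

```python
# rough cost model for the localhost TCP transfer described above;
# data_gb is an assumed figure standing in for "tens of GB"
data_gb = 20
throughput_gb_per_s = 1.0  # typical for TCP-on-localhost in Python
transfer_seconds = data_gb / throughput_gb_per_s
print(transfer_seconds)  # tens of seconds spent just moving bytes
```

This is why the task stream above is dominated by red: communication, not the merge itself, sets the wall-clock time.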
&lt;/section&gt;
&lt;section id="flamegraph-of-computation"&gt;
&lt;h2&gt;Flamegraph of computation&lt;/h2&gt;
&lt;p&gt;We can also look more deeply at the computational costs in Dask’s
flamegraph-style plot. This shows which lines of our functions were taking up
the most time (down to the Python level at least).&lt;/p&gt;
&lt;iframe src="http://matthewrocklin.com/raw-host/dask-cudf-join-profile.html"
        width="800"
        height="400"&gt;&lt;/iframe&gt;
&lt;p&gt;This &lt;a class="reference external" href="http://www.brendangregg.com/flamegraphs.html"&gt;Flame graph&lt;/a&gt; shows which
lines of cudf code we spent time on while computing (excluding the main
communication costs mentioned above). It may be interesting for those trying
to further optimize performance. It shows that most of our costs are in memory
allocation. Like communication, this has actually also been fixed in RAPIDS’
optional memory management pool, it just isn’t default yet, so I didn’t use it
here.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="plans-for-efficient-communication"&gt;
&lt;h1&gt;Plans for efficient communication&lt;/h1&gt;
&lt;p&gt;The cuDF library actually has a decent approach to single-node multi-GPU
communication that I’ve intentionally turned off for this experiment. That
approach cleverly used Dask to communicate device pointer information using
Dask’s normal channels (this is small and fast) and then used that information
to initiate a side-channel communication for the bulk of the data. This
approach was effective, but somewhat fragile. I’m inclined to move on from it
in favor of …&lt;/p&gt;
&lt;p&gt;UCX. The &lt;a class="reference external" href="http://www.openucx.org/"&gt;UCX&lt;/a&gt; project provides a single API that
wraps around several transports like TCP, Infiniband, shared memory, and also
GPU-specific transports. UCX claims to find the best way to communicate data
between two points given the hardware available. If Dask were able to use this
for communication then it would provide both efficient GPU-to-GPU communication
on a single machine, and also efficient inter-machine communication when
efficient networking hardware like Infiniband was present, even outside the
context of GPUs.&lt;/p&gt;
&lt;p&gt;There is some work we need to do here:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We need to make a Python wrapper around UCX&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need to make an optional &lt;a class="reference external" href="https://distributed.dask.org/en/latest/communications.html"&gt;Dask Comm&lt;/a&gt;
around this ucx-py library that allows users to specify endpoints like
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ucx://path-to-scheduler&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need to make Python memoryview-like objects that refer to device memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
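Dask already selects its communication backend from the address scheme (for example tcp:// or inproc://), so once a UCX comm exists, choosing it would be a matter of the URL prefix. A minimal sketch of that scheme-based dispatch idea, where the ucx entry represents the planned, not-yet-released backend:

```python
# sketch of scheme-based comm selection; "ucx" is the hypothetical
# backend discussed above, alongside Dask's existing transports
def comm_backend(address: str) -> str:
    scheme, _, _ = address.partition("://")
    known = {"tcp", "inproc", "ucx"}
    if scheme not in known:
        raise ValueError(f"unknown comm scheme: {scheme}")
    return scheme

print(comm_backend("ucx://path-to-scheduler"))  # -> ucx
```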
&lt;p&gt;This work is already in progress by &lt;a class="reference external" href="https://github.com/Akshay-Venkatesh"&gt;Akshay
Venkatesh&lt;/a&gt;, who works on UCX, and &lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom
Augspurger&lt;/a&gt;, a core Dask/Pandas developer. I
suspect that they’ll write about it soon. I’m looking forward to seeing what
comes of it, both for Dask and for high performance Python generally.&lt;/p&gt;
&lt;p&gt;It’s worth pointing out that this effort won’t just help GPU users. It should
help anyone on advanced networking hardware, including the mainstream
scientific HPC community.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="id1"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;Single-node multi-GPU joins have a lot of promise. In fact, earlier RAPIDS
developers got this running much faster than I was able to do above through the
clever communication tricks I briefly mentioned. The main purpose of this post
is to provide a benchmark for joins that we can use in the future, and to
highlight when communication can be essential in parallel computing.&lt;/p&gt;
&lt;p&gt;Now that GPUs have accelerated the computation time of each of our chunks of
work we increasingly find that other systems become the bottleneck. We didn’t
care as much about communication before because computational costs were
comparable. Now that computation is an order of magnitude cheaper, other
aspects of our stack become much more important.&lt;/p&gt;
&lt;p&gt;I’m looking forward to seeing where this goes.&lt;/p&gt;
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help!&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging, high-impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. NVIDIA’s RAPIDS team is looking to hire engineers for Dask
development with GPUs and other data analytics library development projects.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/29/cudf-joins/"/>
    <summary>We benchmark multi-GPU joins on a single machine with Dask and cuDF, profile the results, and outline plans for efficient GPU-to-GPU communication with UCX.</summary>
    <category term="GPU" label="GPU"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-01-29T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/23/dask-1.1.0/</id>
    <title>Dask Release 1.1.0</title>
    <updated>2019-01-23T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;I’m pleased to announce the release of Dask version 1.1.0. This is a major
release with bug fixes and new features. The last release was 1.0.0 on
2018-11-29.
This blogpost outlines notable changes since the last release.&lt;/p&gt;
&lt;p&gt;You can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install dask[complete] --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="notable-changes"&gt;

&lt;p&gt;A lot of work has happened over the last couple months, and we encourage people
to look through the changelog to get a sense of the kinds of incremental
changes that developers are working on.&lt;/p&gt;
&lt;p&gt;There are also a few notable changes in this release that we’ll highlight here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Support for the recent Numpy 1.16 and Pandas 0.24 releases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for Pandas Extension Arrays (see &lt;a class="reference internal" href="../../2019/01/22/dask-extension-arrays/"&gt;&lt;span class="doc std std-doc"&gt;Tom Augspurger’s post on the topic&lt;/span&gt;&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High level graph in Dask dataframe and operator fusion in simple cases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increased support for other libraries that look enough like Numpy and Pandas
to work within Dask Array/Dataframe&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="support-for-numpy-1-16-and-pandas-0-24"&gt;
&lt;h1&gt;Support for Numpy 1.16 and Pandas 0.24&lt;/h1&gt;
&lt;p&gt;Both Numpy and Pandas have been evolving quickly over the last few months.
We’re excited about the changes to extensibility arriving in both libraries.
The Dask array/dataframe submodules have been updated to work well with these
recent changes.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="pandas-extension-arrays"&gt;
&lt;h1&gt;Pandas Extension Arrays&lt;/h1&gt;
&lt;p&gt;In particular, Dask Dataframe supports Pandas Extension arrays,
meaning that it’s easier to use third-party Pandas packages like CyberPandas or
Fletcher in parallel with Dask Dataframe.&lt;/p&gt;
&lt;p&gt;For more information see &lt;a class="reference internal" href="../../2019/01/22/dask-extension-arrays/"&gt;&lt;span class="doc std std-doc"&gt;Tom Augspurger’s post&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="high-level-graphs-in-dask-dataframe"&gt;
&lt;h1&gt;High Level Graphs in Dask Dataframe&lt;/h1&gt;
&lt;p&gt;For a while Dask array has had some high level graphs for “atop” operations
(elementwise, broadcasting, transpose, tensordot, reductions), which allow for
reduced overhead and task fusion on computations within this class.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;  &lt;span class="c1"&gt;# These operations get fused to a single task&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We’ve renamed &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;atop&lt;/span&gt;&lt;/code&gt; to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;blockwise&lt;/span&gt;&lt;/code&gt; to be a bit more generic, and have also
started applying it to Dask Dataframe, which helps to reduce overhead
substantially when doing computations with many simple operations.&lt;/p&gt;
&lt;p&gt;This still needs to be improved to increase the class of cases where it works,
but we’re already seeing nice speedups on previously unseen workloads.&lt;/p&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.atop&lt;/span&gt;&lt;/code&gt; function has been deprecated in favor of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.blockwise&lt;/span&gt;&lt;/code&gt;. There
is now also a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.blockwise&lt;/span&gt;&lt;/code&gt; which shares a common code path.&lt;/p&gt;
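A small example of the renamed API (assuming a dask installation; the doubling function is just an illustration, not part of the release notes):

```python
import numpy as np
import dask.array as da

x = da.arange(10, chunks=5)
# blockwise maps a function over corresponding blocks; "i" names the
# single index shared by the input and output arrays
y = da.blockwise(lambda block: 2 * block, "i", x, "i", dtype=x.dtype)
assert (y.compute() == 2 * np.arange(10)).all()
```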
&lt;/section&gt;
&lt;section id="non-pandas-dataframe-and-non-numpy-array-types"&gt;
&lt;h1&gt;Non-Pandas dataframe and Non-Numpy array types&lt;/h1&gt;
&lt;p&gt;We’re working to make Dask a bit more agnostic to the types of in-memory array
and dataframe objects that it can manipulate. Rather than having Dask Array be
a grid of Numpy arrays and Dask Dataframe be a sequence of Pandas dataframes,
we’re relaxing that constraint to a grid of &lt;em&gt;Numpy-like&lt;/em&gt; arrays and a sequence
of &lt;em&gt;Pandas-like&lt;/em&gt; dataframes.&lt;/p&gt;
&lt;p&gt;This is an ongoing effort that has targeted alternate backends like
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scipy.sparse&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pydata/sparse&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf&lt;/span&gt;&lt;/code&gt; and other systems.&lt;/p&gt;
&lt;p&gt;There were some recent posts on
&lt;a class="reference internal" href="../../2019/01/03/dask-array-gpus-first-steps/"&gt;&lt;span class="doc std std-doc"&gt;arrays&lt;/span&gt;&lt;/a&gt; and
&lt;a class="reference internal" href="../../2019/01/13/dask-cudf-first-steps/"&gt;&lt;span class="doc std std-doc"&gt;dataframes&lt;/span&gt;&lt;/a&gt; that show proofs of
concept for this with GPUs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;There have been several releases since the last time we had a release blogpost.
The following people contributed to the dask/dask repository since the 0.19.0
release on September 5th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Antonino Ingargiola&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Armin Berres&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bart Broere&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Carlos Valiente&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Li&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Saxton&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;David Hoese&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Diane Trout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Damien Garaud&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elliott Sales de Andrade&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eric Wolak&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gábor Lipták&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guido Imperiale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guillaume Eynard-Bontemps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Itamar Turner-Trauring&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jan Koch&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Javad&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jendrik Jördening&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jonathan Fraine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Johnnie Gray&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Julia Signell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Justin Dennison&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;M. Farrajota&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marco Neumann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mark Harfouche&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Markus Gonser&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthias Bussonnier&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mina Farid&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Paul Vecchio&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prabakaran Kumaresshan&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rahul Vaidya&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stephan Hoyer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stuart Berg&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TakaakiFuruse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Takahiro Kojima&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yu Feng&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zhenqing Li&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;#64;milesial&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;#64;samc0de&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;#64;slnguyen&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following people contributed to the dask/distributed repository since the 0.19.0
release on September 5th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Adam Klein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brett Naul&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Farrell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Diane Trout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dirk Petersen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eric Ma&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gaurav Sheni&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guillaume Eynard-Bontemps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marius van Niekerk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Michael Wheeler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MikeG&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NotSqrt&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Peter Killick&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Roy Wedge&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Russ Bubley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stephan Hoyer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;#64;tjb900&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Rochette&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&amp;#64;fjetter&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/23/dask-1.1.0/"/>
    <summary>I’m pleased to announce the release of Dask version 1.1.0. This is a major
release with bug fixes and new features. The last release was 1.0.0 on
2018-11-29.
This blogpost outlines notable changes since the last release.</summary>
    <category term="release" label="release"/>
    <published>2019-01-23T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/22/dask-extension-arrays/</id>
    <title>Extension Arrays in Dask DataFrame</title>
    <updated>2019-01-22T00:00:00+00:00</updated>
    <author>
      <name>Tom Augspurger</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;section id="summary"&gt;

&lt;p&gt;Dask DataFrame works well with pandas’ new Extension Array interface, including
third-party extension arrays. This lets Dask&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;easily support pandas’ new extension arrays, like their new &lt;a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support"&gt;nullable integer
array&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;support third-party extension arrays, like &lt;a class="reference external" href="https://cyberpandas.readthedocs.io"&gt;cyberpandas’s&lt;/a&gt;
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IPArray&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
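For example, pandas’ nullable integer dtype keeps missing values without the classic silent cast to float (assuming pandas &gt;= 0.24 is installed):

```python
import pandas as pd

# "Int64" (capital I) is the nullable extension dtype; plain "int64" is NumPy's
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)         # Int64
print(s.isna().sum())  # 1
```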
&lt;/section&gt;
&lt;section id="background"&gt;
&lt;h1&gt;Background&lt;/h1&gt;
&lt;p&gt;Pandas 0.23 introduced the &lt;a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.24/extending.html#extension-types"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ExtensionArray&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;, a way to store things other
than a simple NumPy array in a DataFrame or Series. Internally pandas uses this
for data types that aren’t handled natively by NumPy like datetimes with
timezones, Categorical, or (the new!) nullable integer arrays.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2000&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;periods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;US/Central&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="go"&gt;0   2000-01-01 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;1   2000-01-02 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;2   2000-01-03 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;3   2000-01-04 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;dtype: datetime64[ns, US/Central]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; has always supported the extension types that pandas defines.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Dask Series Structure:&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;0    datetime64[ns, US/Central]&lt;/span&gt;
&lt;span class="go"&gt;2                           ...&lt;/span&gt;
&lt;span class="go"&gt;3                           ...&lt;/span&gt;
&lt;span class="go"&gt;dtype: datetime64[ns, US/Central]&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: from_pandas, 2 tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="the-challenge"&gt;
&lt;h1&gt;The Challenge&lt;/h1&gt;
&lt;p&gt;Newer versions of pandas allow third-party libraries to write custom extension
arrays. These arrays can be placed inside a DataFrame or Series, and work
just as well as any extension array defined within pandas itself. However,
third-party extension arrays provide a slight challenge for Dask.&lt;/p&gt;
&lt;p&gt;Recall: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; is lazy. We use a familiar pandas-like API to build up
a task graph, rather than executing immediately. But if Dask DataFrame is lazy,
then how do things like the following work?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="go"&gt;Index([&amp;#39;B&amp;#39;], dtype=&amp;#39;object&amp;#39;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf[['B']]&lt;/span&gt;&lt;/code&gt; (lazily) selects the column &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'B'&lt;/span&gt;&lt;/code&gt; from the dataframe. But accessing
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.columns&lt;/span&gt;&lt;/code&gt; &lt;em&gt;immediately&lt;/em&gt; returns a pandas Index object with just the selected
columns.&lt;/p&gt;
&lt;p&gt;No real computation has happened (you could just as easily swap out the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_pandas&lt;/span&gt;&lt;/code&gt; for a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.read_parquet&lt;/span&gt;&lt;/code&gt; on a larger-than-memory dataset, and the
behavior would be the same). Dask is able to do these kinds of “metadata-only”
computations, where the output depends only on the columns and the dtypes,
without executing the task graph. Internally, Dask does this by keeping a pair
of dummy pandas DataFrames on each Dask DataFrame.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;
&lt;span class="go"&gt;Empty DataFrame&lt;/span&gt;
&lt;span class="go"&gt;Columns: [A, B]&lt;/span&gt;
&lt;span class="go"&gt;Index: []&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta_nonempty&lt;/span&gt;
&lt;span class="go"&gt;ddf._meta_nonempty&lt;/span&gt;
&lt;span class="go"&gt;   A  B&lt;/span&gt;
&lt;span class="go"&gt;0  1  1&lt;/span&gt;
&lt;span class="go"&gt;1  1  1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We need the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt; because some operations in pandas behave differently
on an empty DataFrame than on a non-empty one (either by design or,
occasionally, because of a bug in pandas).&lt;/p&gt;
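To make the purpose of the non-empty dummy concrete, here is a small illustration (not Dask's internal code) of how running an operation on a tiny stand-in frame reveals the output dtype without touching any real data:

```python
import pandas as pd

# A tiny stand-in frame, analogous in spirit to Dask's ``_meta_nonempty``
# (an illustration only, not Dask's actual implementation).
meta = pd.DataFrame({"A": [1], "B": [1.0]})

# Running the operation on the dummy data reveals the output dtype
# without ever computing on the full dataset.
result = meta["A"] + meta["B"]
print(result.dtype)  # float64
```

The same trick generalizes: column selection, arithmetic, and many reductions can all be "dry-run" on the dummy frame to predict the metadata of the real result.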
&lt;p&gt;The issue with third-party extension arrays is that Dask doesn’t know what
values to put in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;. We’re quite happy to do it for each NumPy
dtype and each of pandas’ own extension dtypes. But any third-party library
could create an ExtensionArray for any type, and Dask would have no way of
knowing what’s a valid value for it.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 104)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-solution"&gt;
&lt;h1&gt;The Solution&lt;/h1&gt;
&lt;p&gt;Rather than Dask guessing what values to use for the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;, extension
array authors (or users) can register their extension dtype with Dask. Once
registered, Dask will be able to generate the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;, and things
should work fine from there. For example, we can register the dummy &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DecimalArray&lt;/span&gt;&lt;/code&gt;
that pandas uses for testing (this isn’t part of pandas’ public API) with Dask.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;decimal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas.tests.extension.decimal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DecimalDtype&lt;/span&gt;

&lt;span class="c1"&gt;# The actual registration that would be done in the 3rd-party library&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe.extensions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_array_nonempty&lt;/span&gt;


&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecimalDtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_from_sequence&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NaN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                                       &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now users of that extension type can place those arrays inside a Dask DataFrame
or Series.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                                      &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)])})&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                     A&lt;/span&gt;
&lt;span class="go"&gt;npartitions=1&lt;/span&gt;
&lt;span class="go"&gt;0              decimal&lt;/span&gt;
&lt;span class="go"&gt;2                  ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: from_pandas, 1 tasks&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;
&lt;span class="go"&gt;A    decimal&lt;/span&gt;
&lt;span class="go"&gt;dtype: object&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And from there, the usual operations work just as they would in pandas.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;random&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choices&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                                              &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                                             &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                   &lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,))})&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;In [35]: ddf.groupby(&amp;quot;A&amp;quot;).B.mean().compute()&lt;/span&gt;
&lt;span class="go"&gt;Out[35]:&lt;/span&gt;
&lt;span class="go"&gt;A&lt;/span&gt;
&lt;span class="go"&gt;1.0    1.50&lt;/span&gt;
&lt;span class="go"&gt;2.0    1.48&lt;/span&gt;
&lt;span class="go"&gt;Name: B, dtype: float64&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 165)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-real-lesson"&gt;
&lt;h1&gt;The Real Lesson&lt;/h1&gt;
&lt;p&gt;It’s neat that Dask now supports extension arrays. But to me, the exciting thing
is just how little work this took. The
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4379/files"&gt;PR&lt;/a&gt; implementing support for
third-party extension arrays is quite short: it just defines the object that
third parties register with, and uses it to generate the data when the dtype is
detected. Supporting the three new extension arrays in pandas 0.24.0
(&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IntegerArray&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;PeriodArray&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IntervalArray&lt;/span&gt;&lt;/code&gt;) takes a handful of lines
of code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Interval&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;IntervalArray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_breaks&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;closed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;closed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;period_array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2001&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_IntegerDtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;integer_array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask benefits directly from improvements made to pandas. Dask didn’t have to
build out a new parallel extension array interface, and reimplement all the new
extension arrays using the parallel interface. We just re-used what pandas
already did, and it fits into the existing Dask structure.&lt;/p&gt;
&lt;p&gt;For third-party extension array authors, like &lt;a class="reference external" href="https://cyberpandas.readthedocs.io"&gt;cyberpandas&lt;/a&gt;, the
work is similarly minimal. They don’t need to re-implement everything from the
ground up, just to play well with Dask.&lt;/p&gt;
&lt;p&gt;This highlights the importance of one of the Dask project’s core values: working
with the community. If you visit &lt;a class="reference external" href="https://dask.org"&gt;dask.org&lt;/a&gt;, you’ll see
phrases like&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Integrates with existing projects&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Built with the broader community&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;At the start of Dask, the developers &lt;em&gt;could&lt;/em&gt; have gone off and re-written pandas
or NumPy from scratch to be parallel friendly (though we’d probably still be
working on that part today, since that’s such a massive undertaking). Instead,
the Dask developers worked with the community, occasionally nudging it in
directions that would help out Dask. For example, many places in pandas &lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2015/03/10/PyData-GIL"&gt;held
the GIL&lt;/a&gt;, preventing
thread-based parallelism. Rather than abandoning pandas, the Dask and pandas
developers worked together to release the GIL where possible when it was a
bottleneck for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt;. This benefited Dask and anyone else trying to
do thread-based parallelism with pandas DataFrames.&lt;/p&gt;
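As a rough sketch of what those GIL releases enable (illustrative code, not from the original post): once pandas operations release the GIL, per-chunk work mapped across a thread pool can genuinely overlap on multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Four pandas chunks, standing in for the partitions of a larger dataframe.
chunks = [pd.DataFrame({"x": range(i, i + 1000)}) for i in range(0, 4000, 1000)]

# Because many pandas operations release the GIL, these per-chunk sums
# can run concurrently in threads rather than serializing on the interpreter.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(lambda df: df["x"].sum(), chunks))

total = sum(partial_sums)  # same answer as summing the whole column at once
```

This is essentially the pattern Dask's threaded scheduler relies on when it runs `dask.dataframe` tasks in threads.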
&lt;p&gt;And now, when pandas introduces new features like nullable integers,
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; just needs to register it as an extension type and immediately
benefits from it. And third-party extension array authors can do the same for
their extension arrays.&lt;/p&gt;
&lt;p&gt;If you’re writing an ExtensionArray, make sure to add it to the &lt;a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.24/ecosystem.html#extension-data-types"&gt;pandas
ecosystem&lt;/a&gt; page, and register it with Dask!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/22/dask-extension-arrays/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="dask" label="dask"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-01-22T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/13/dask-cudf-first-steps/</id>
    <title>Dask, Pandas, and GPUs: first steps</title>
    <updated>2019-01-13T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;

&lt;p&gt;We’re building a distributed GPU Pandas dataframe out of
&lt;a class="reference external" href="https://github.com/rapidsai/cudf"&gt;cuDF&lt;/a&gt; and
&lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;Dask Dataframe&lt;/a&gt;.
This effort is young.&lt;/p&gt;
&lt;p&gt;This post describes the current situation,
our general approach,
and gives examples of what does and doesn’t work today.
We end with some notes on scaling performance.&lt;/p&gt;
&lt;p&gt;You can also view the experiment in this post as
&lt;a class="reference external" href="https://gist.github.com/mrocklin/4b1b80d1ae07ec73f75b2a19c8e90e2e"&gt;a notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And here is a table of results:&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
    &lt;th&gt;Bandwidth&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 3min 14s &lt;/td&gt;
      &lt;td&gt; 50 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight CPU Cores &lt;/th&gt;
      &lt;td&gt; 58s &lt;/td&gt;
      &lt;td&gt; 170 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 35s &lt;/td&gt;
      &lt;td&gt; 285 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 11s &lt;/td&gt;
      &lt;td&gt; 900 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 5s &lt;/td&gt;
      &lt;td&gt; 2000 MB/s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 63)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="building-blocks-cudf-and-dask"&gt;
&lt;h1&gt;Building Blocks: cuDF and Dask&lt;/h1&gt;
&lt;p&gt;Building a distributed GPU-backed dataframe is a large endeavor.
Fortunately we’re starting on a good foundation and
can assemble much of this system from existing components:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://github.com/rapidsai/cudf"&gt;cuDF&lt;/a&gt; library aims to implement the
Pandas API on the GPU. It gets good speedups on standard operations like
reading CSV files, filtering and aggregating columns, joins, and so on.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cudf&lt;/span&gt;  &lt;span class="c1"&gt;# looks and feels like Pandas, but runs on the GPU&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;myfile.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Alice&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;cuDF is part of the growing &lt;a class="reference external" href="https://rapids.ai"&gt;RAPIDS initiative&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;Dask Dataframe&lt;/a&gt;
library provides parallel algorithms around the Pandas API. It composes
large operations like distributed groupbys or distributed joins from a task
graph of many smaller single-node groupbys or joins (and many
&lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-api.html"&gt;other operations&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;  &lt;span class="c1"&gt;# looks and feels like Pandas, but runs in parallel&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;myfile.*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Alice&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://distributed.dask.org"&gt;Dask distributed task scheduler&lt;/a&gt;
provides general-purpose parallel execution given complex task graphs.
It’s good for adding multi-node computing into an existing codebase.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Given these building blocks,
our approach is to make the cuDF API close enough to Pandas that
we can reuse the Dask Dataframe algorithms.&lt;/p&gt;
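A toy sketch of what "reuse the algorithms" means in practice (illustrative names, pandas-only here, not Dask's internal code): the parallel algorithms only call through the pandas API surface, so any partition type implementing that surface, whether a pandas or a cuDF DataFrame, can flow through by duck typing.

```python
import pandas as pd

# A partition-wise mean written purely against the pandas API surface.
# A cuDF DataFrame exposing the same methods could be substituted unchanged.
def filtered_mean(partitions, name):
    filtered = [p[p["name"] == name] for p in partitions]
    total = sum(f["value"].sum() for f in filtered)   # combine partial sums
    count = sum(len(f) for f in filtered)             # and partial counts
    return total / count

partitions = [
    pd.DataFrame({"name": ["Alice", "Bob"], "value": [1.0, 2.0]}),
    pd.DataFrame({"name": ["Alice"], "value": [3.0]}),
]
filtered_mean(partitions, "Alice")  # 2.0
```

The closer cuDF's API tracks pandas, the more of Dask Dataframe's existing algorithms work this way without modification.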
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 105)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="benefits-and-challenges-to-this-approach"&gt;
&lt;h1&gt;Benefits and Challenges to this approach&lt;/h1&gt;
&lt;p&gt;This approach has a few benefits:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;We get to reuse the parallel algorithms found in Dask Dataframe originally designed for Pandas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It consolidates the development effort within a single codebase so that
future effort spent on CPU Dataframes benefits GPU Dataframes and vice
versa. Maintenance costs are shared.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By building code that works equally with two DataFrame implementations
(CPU and GPU) we establish conventions and protocols that will
make it easier for other projects to do the same, either with these two
Pandas-like libraries, or with future Pandas-like libraries.&lt;/p&gt;
&lt;p&gt;This approach also aims to demonstrate that the ecosystem should support Pandas-like
libraries, rather than just Pandas. For example, if
(when?) the Arrow library develops a computational system then we’ll be in
a better place to roll that in as well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When doing any refactor we tend to clean up existing code.&lt;/p&gt;
&lt;p&gt;For example, to make dask dataframe ready for a new GPU Parquet reader
we end up &lt;a class="reference external" href="https://github.com/dask/dask/pull/4336"&gt;refactoring and simplifying our Parquet I/O logic&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The approach also has some drawbacks. Namely, it places API pressure on cuDF to match Pandas, which means:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Slight differences in API now cause larger problems, such as these:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/251"&gt;Join column ordering differs rapidsai/cudf #251&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/483#issuecomment-453218151"&gt;Groupby aggregation column ordering differs rapidsai/cudf #483&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuDF has some pressure on it to repeat what some believe to be mistakes in
the Pandas API.&lt;/p&gt;
&lt;p&gt;For example, cuDF today supports missing values arguably more sensibly than
Pandas. Should cuDF have to revert to the old way of doing things
just to match Pandas semantics? Dask Dataframe will probably need
to be more flexible in order to handle evolution and small differences
in semantics.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 146)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="alternatives"&gt;
&lt;h1&gt;Alternatives&lt;/h1&gt;
&lt;p&gt;We could also write a new dask-dataframe-style project around cuDF that deviates
from the Pandas/Dask Dataframe API. Until recently this
has actually been the approach, and the
&lt;a class="reference external" href="https://github.com/rapidsai/dask-cudf"&gt;dask-cudf&lt;/a&gt; project did exactly this.
This was probably a good choice early on to get started and prototype things.
The project was able to implement a wide range of functionality including
groupby-aggregations, joins, and so on using
&lt;a class="reference external" href="https://docs.dask.org/en/latest/delayed.html"&gt;dask delayed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We’re redoing this now on top of dask dataframe though, which means that we’re
losing some functionality that dask-cudf already had, but hopefully the
functionality that we add now will be more stable and established on a firmer
base.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 162)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="status-today"&gt;
&lt;h1&gt;Status Today&lt;/h1&gt;
&lt;p&gt;Today very little works, but what does is decently smooth.&lt;/p&gt;
&lt;p&gt;Here is a simple example that reads some data from many CSV files,
picks out a column,
and does some aggregations.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cudf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# runs on eight local GPUs&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data/nyc/many/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# wrap around many CSV files&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;184464740&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Also note, NYC Taxi ridership is significantly less than it was a few years ago&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 186)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-i-m-excited-about-in-the-example-above"&gt;
&lt;h1&gt;What I’m excited about in the example above&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;All of the infrastructure surrounding the cuDF code, like the cluster setup,
diagnostics, JupyterLab environment, and so on, came for free, like any
other new Dask project.&lt;/p&gt;
&lt;p&gt;Here is an image of my JupyterLab setup&lt;/p&gt;
&lt;a href="https://matthewrocklin.com/blog/images/dask-cudf-environment.png"&gt;
  &lt;img src="https://matthewrocklin.com/blog/images/dask-cudf-environment.png"
       alt="Dask + CUDA + cuDF JupyterLab environment"
       width="70%"&gt;
&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;gdf&lt;/span&gt;&lt;/code&gt; object is actually just a normal Dask DataFrame. We didn’t have to
write new &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__repr__&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__add__&lt;/span&gt;&lt;/code&gt;, or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.sum()&lt;/span&gt;&lt;/code&gt; implementations, and many
functions we didn’t explicitly consider probably work well today (though many
others don’t).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’re more tightly integrated with other systems. For example, if
we wanted to convert our dask-cudf dataframe to a dask-pandas dataframe then
we would just use the cuDF &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;to_pandas&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;gdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We don’t have to write anything special like a separate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.to_dask_dataframe&lt;/span&gt;&lt;/code&gt;
method or handle other special cases.&lt;/p&gt;
&lt;p&gt;Dask parallelism is orthogonal to the choice of CPU or GPU.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s easy to switch hardware. By avoiding separate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-cudf&lt;/span&gt;&lt;/code&gt; code paths
it’s easier to add cuDF to an existing Dask+Pandas codebase to run on GPUs,
or to remove cuDF and use Pandas if we want our code to be runnable without GPUs.&lt;/p&gt;
&lt;p&gt;There are more examples of this in the scaling section below.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 224)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-s-wrong-with-the-example-above"&gt;
&lt;h1&gt;What’s wrong with the example above&lt;/h1&gt;
&lt;p&gt;In general the answer is &lt;strong&gt;many small things&lt;/strong&gt;.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf.read_csv&lt;/span&gt;&lt;/code&gt; function doesn’t yet support reading chunks from a
single CSV file, and so doesn’t work well with very large CSV files. We
had to split our large CSV files into many smaller CSV files first with
normal Dask+Pandas:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;few-large/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;many-small/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;(See &lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/568"&gt;rapidsai/cudf #568&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Many operations that used to work in dask-cudf like groupby-aggregations
and joins no longer work. We’re going to need to slightly modify many cuDF
APIs over the next couple of months to more closely match their Pandas
equivalents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I ran the timing cell twice because it currently takes a few seconds to
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;import&lt;/span&gt; &lt;span class="pre"&gt;cudf&lt;/span&gt;&lt;/code&gt; today.
&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/627"&gt;rapidsai/cudf #627&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We had to make Dask Dataframe a bit more flexible and assume less about its
constituent dataframes being exactly Pandas dataframes. (see
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4359"&gt;dask/dask #4359&lt;/a&gt; and
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4375"&gt;dask/dask #4375&lt;/a&gt; for examples).
I suspect that there will be many more small changes like
these necessary in the future.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These problems are representative of dozens more similar issues. They are
all fixable and indeed, many are actively being fixed today by the &lt;a class="reference external" href="https://github.com/rapidsai/cudf/graphs/contributors"&gt;good folks
working on RAPIDS&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 262)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="near-term-schedule"&gt;
&lt;h1&gt;Near Term Schedule&lt;/h1&gt;
&lt;p&gt;The RAPIDS group is currently busy working to release 0.5, which includes some
of the fixes necessary to run the example above, and also many unrelated
stability improvements. This will probably keep them busy for a week or two
during which I don’t expect to see much Dask + cuDF work going on other than
planning.&lt;/p&gt;
&lt;p&gt;After that, Dask parallelism support will be a top priority, so
I look forward to seeing some rapid progress here.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 273)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="scaling-results"&gt;
&lt;h1&gt;Scaling Results&lt;/h1&gt;
&lt;p&gt;In &lt;a class="reference internal" href="../../2019/01/03/dask-array-gpus-first-steps/"&gt;&lt;span class="doc std std-doc"&gt;my last post about combining Dask Array with CuPy&lt;/span&gt;&lt;/a&gt;,
a GPU-accelerated Numpy,
we saw impressive speedups from using many GPUs on a simple problem that
manipulated some simple random data.&lt;/p&gt;
&lt;section id="dask-array-cupy-on-random-data"&gt;
&lt;h2&gt;Dask Array + CuPy on Random Data&lt;/h2&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1 min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That exercise was easy to scale because it was almost entirely bound by the
computation of creating random data.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-dataframe-cudf-on-csv-data"&gt;
&lt;h2&gt;Dask DataFrame + cuDF on CSV data&lt;/h2&gt;
&lt;p&gt;We did a similar study on the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_csv&lt;/span&gt;&lt;/code&gt; example above, which is bound mostly
by reading CSV data from disk and then parsing it. You can see a notebook
available
&lt;a class="reference external" href="https://gist.github.com/mrocklin/4b1b80d1ae07ec73f75b2a19c8e90e2e"&gt;here&lt;/a&gt;. We
have similar (though less impressive) numbers to present.&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
    &lt;th&gt;Bandwidth&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 3min 14s &lt;/td&gt;
      &lt;td&gt; 50 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight CPU Cores &lt;/th&gt;
      &lt;td&gt; 58s &lt;/td&gt;
      &lt;td&gt; 170 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 35s &lt;/td&gt;
      &lt;td&gt; 285 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 11s &lt;/td&gt;
      &lt;td&gt; 900 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 5s &lt;/td&gt;
      &lt;td&gt; 2000 MB/s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;The bandwidth numbers were computed by noting that the data was around 10 GB on disk&lt;/em&gt;&lt;/p&gt;
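&lt;p&gt;As a quick sanity check on the table, the bandwidth column is just that approximate on-disk size divided by the elapsed time. A small sketch of the arithmetic (the 10 GB figure is the rough estimate quoted above):&lt;/p&gt;

```python
# Bandwidth implied by the CSV-reading timings: size on disk / time.
data_mb = 10_000  # roughly 10 GB of CSV data, expressed in MB

times_s = {
    "Single CPU Core": 3 * 60 + 14,
    "Eight CPU Cores": 58,
    "Forty CPU Cores": 35,
    "One GPU": 11,
    "Eight GPUs": 5,
}

# Each entry comes out close to the (rounded) value in the table.
bandwidth = {name: data_mb / t for name, t in times_s.items()}
for name, mbps in bandwidth.items():
    print(name, round(mbps), "MB/s")
```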
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/13/dask-cudf-first-steps.md&lt;/span&gt;, line 359)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="analysis"&gt;
&lt;h1&gt;Analysis&lt;/h1&gt;
&lt;p&gt;First, I want to emphasize again that it’s easy to test a wide variety of
architectures using this setup because of the Pandas API compatibility between
all of the different projects. We’re seeing a wide range of performance (40x
span) across a variety of different hardware with a wide range of cost points.&lt;/p&gt;
&lt;p&gt;Second, note that this problem scales less well than our
&lt;a class="reference internal" href="../../2019/01/03/dask-array-gpus-first-steps/"&gt;&lt;span class="doc std std-doc"&gt;previous example with CuPy&lt;/span&gt;&lt;/a&gt;,
both on CPU and GPU.
I suspect that this is because this example is also bound by I/O and not just
number-crunching. While the jump from single-CPU to single-GPU is large, the
jump from single-CPU to many-CPU or single-GPU to many-GPU is not as large as
we would have liked. For GPUs, for example, we got only around a 2x speedup when we
used 8x as many GPUs.&lt;/p&gt;
&lt;p&gt;At first one might think that this is because we’re saturating disk read speeds.
However two pieces of evidence go against that guess:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;NVIDIA folks familiar with my current hardware inform me that they’re able to get
much higher I/O throughput when they’re careful&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPU scaling is similarly poor, despite the fact that it’s obviously not
reaching full I/O bandwidth&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Instead, it’s likely that we’re just not treating our disks and I/O pipelines
carefully enough.&lt;/p&gt;
&lt;p&gt;We might think more carefully about data locality within a
single machine. Alternatively, we might just choose to use a smaller machine,
or many smaller machines. My team has been asking me to start playing with
some systems cheaper than a DGX, so I may experiment with those soon. It may be
that for data-loading and pre-processing workloads the previous wisdom of “pack
as much computation as you can into a single box” no longer holds
(without more work on our part, that is).&lt;/p&gt;
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging, high-impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. NVIDIA’s RAPIDS team is looking to hire engineers for Dask
development with GPUs and other data analytics library development projects.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/13/dask-cudf-first-steps/"/>
    <category term="GPU" label="GPU"/>
    <category term="Pandas" label="Pandas"/>
    <published>2019-01-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps/</id>
    <title>GPU Dask Arrays, first steps</title>
    <updated>2019-01-03T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;The following code creates and manipulates 2 TB of randomly generated data.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;On a single CPU, this computation takes two hours.&lt;/p&gt;
&lt;p&gt;On an eight-GPU single-node system this computation takes nineteen seconds.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 24)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="combine-dask-array-with-cupy"&gt;

&lt;p&gt;Actually this computation isn’t that impressive.
It’s a simple workload,
for which most of the time is spent creating and destroying random data.
The computation and communication patterns are simple,
reflecting the simplicity commonly found in data processing workloads.&lt;/p&gt;
&lt;p&gt;What &lt;em&gt;is&lt;/em&gt; impressive is that we were able to create a distributed parallel GPU
array quickly by composing these four existing libraries:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt; provides a partial implementation of
Numpy on the GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask Array&lt;/a&gt; provides chunked
algorithms on top of Numpy-like libraries like Numpy and CuPy.&lt;/p&gt;
&lt;p&gt;This enables us to operate on more data than we could fit in memory
by operating on that data in chunks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://distributed.dask.org"&gt;Dask distributed&lt;/a&gt; task scheduler runs
those algorithms in parallel, easily coordinating work across many CPU
cores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://github.com/rapidsai/dask-cuda"&gt;Dask CUDA&lt;/a&gt; to extend Dask
distributed with GPU support.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These tools already exist. We had to connect them together with a small amount
of glue code and minor modifications. By mashing these tools together we can
quickly build and switch between different architectures to explore what is
best for our application.&lt;/p&gt;
&lt;p&gt;For this example we relied on the following changes upstream:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/cupy/cupy/pull/1689"&gt;cupy/cupy #1689: Support Numpy arrays as seeds in RandomState&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4041"&gt;dask/dask #4041 Make da.RandomState accessible to other modules&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2432"&gt;dask/distributed #2432: Add LocalCUDACluster&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 62)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="comparison-among-single-multi-cpu-gpu"&gt;
&lt;h1&gt;Comparison among single/multi CPU/GPU&lt;/h1&gt;
&lt;p&gt;We can now easily run some experiments on different architectures. This is
easy because …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We can switch between CPU and GPU by switching between Numpy and CuPy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can switch between single/multi-CPU-core and single/multi-GPU
by switching between Dask’s different task schedulers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These libraries allow us to quickly judge the costs of this computation for
the following hardware choices:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Single-threaded CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-threaded CPU with 40 cores (80 hyperthreads)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single-GPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-GPU on a single machine with 8 GPUs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We present code for these four choices below,
but first,
we present a table of results.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1 min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="setup"&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="c1"&gt;# generate chunked dask arrays of mamy numpy random arrays&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nbytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2 TB&lt;/span&gt;
&lt;span class="c1"&gt;# 2000.0&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="cpu-timing"&gt;
&lt;h2&gt;CPU timing&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="single-gpu-timing"&gt;
&lt;h2&gt;Single GPU timing&lt;/h2&gt;
&lt;p&gt;We switch from CPU to GPU by changing our data source to generate CuPy arrays
rather than NumPy arrays. Everything else should more or less work the same
without special handling for CuPy.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;(This actually isn’t true yet, many things in dask.array will break for
non-NumPy arrays, but we’re working on it actively both within Dask, within
NumPy, and within the GPU array libraries. Regardless, everything in this
example works fine.)&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# generate chunked dask arrays of mamy cupy random arrays&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- we specify cupy here&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="multi-gpu-timing"&gt;
&lt;h2&gt;Multi GPU timing&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And again, here are the results:&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1 min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;First, this was my first time playing with a 40-core system. I was surprised
to see that many cores, and pleased to see that Dask’s normal threaded
scheduler happily saturates them.&lt;/p&gt;
&lt;img src="https://matthewrocklin.com/blog/images/python-gil-8000-percent.png" width="100%"&gt;
&lt;p&gt;Later on, though, utilization dipped to around 5000-6000%, and if you do the
math you’ll see that we’re not getting a full 40x speedup. My &lt;em&gt;guess&lt;/em&gt; is that
performance would improve if we were to play with some mixture of threads and
processes, like having ten processes with eight threads each.&lt;/p&gt;
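&lt;p&gt;A thread/process mixture like that can be sketched with the distributed scheduler’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt;. The worker and thread counts here are hypothetical and worth benchmarking rather than a recommendation, and a small deterministic array stands in for the 500,000 x 500,000 one above:&lt;/p&gt;

```python
# A sketch of the guessed-at thread/process mixture; the worker and thread
# counts are hypothetical, not a tuned recommendation.
import dask.array as da
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Ten worker processes with eight threads each, rather than one big
    # 40-thread pool in a single process
    cluster = LocalCluster(n_workers=10, threads_per_worker=8)
    client = Client(cluster)

    # A small stand-in for the 500,000 x 500,000 array in the post; with
    # ones, the exact answer is 2 * 4000 * 4000 = 32,000,000
    x = da.ones((8000, 8000), chunks=(1000, 1000))
    total = (x + 1)[::2, ::2].sum().compute()
    print(int(total))

    client.close()
    cluster.close()
```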
&lt;p&gt;The jump from the biggest multi-core CPU to a single GPU is still an order of
magnitude though. The jump to multi-GPU is another order of magnitude, and
brings the computation down to 19s, which is short enough that I’m willing to
wait for it to finish before walking away from my computer.&lt;/p&gt;
&lt;p&gt;Actually, it’s quite fun to watch on the dashboard (especially after you’ve
been waiting for three hours for the sequential solution to run):&lt;/p&gt;
&lt;blockquote class="imgur-embed-pub"
            lang="en"
            data-id="a/6hkPPwA"&gt;
&lt;a href="//imgur.com/6hkPPwA"&gt;&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src="//s.imgur.com/min/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 221)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This computation was simple, but the range in architecture just explored was
extensive. We swapped out the underlying architecture from CPU to GPU (which
had an entirely different codebase) and tried both multi-core CPU parallelism
as well as multi-GPU many-core parallelism.&lt;/p&gt;
&lt;p&gt;We did this in less than twenty lines of code, making this experiment something
that an undergraduate student or other novice could perform at home.
We’re approaching a point where experimenting with multi-GPU systems is
accessible to non-experts (at least for array computing).&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/57be0ca4143974e6015732d0baacc1cb"&gt;Here is a notebook for the experiment above&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 235)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="room-for-improvement"&gt;
&lt;h1&gt;Room for improvement&lt;/h1&gt;
&lt;p&gt;We can work to expand the computation above in a variety of directions.
There is a ton of work we still have to do to make this reliable.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use more complex array computing workloads&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Dask Array algorithms were designed first around Numpy. We’ve only
recently started making them more generic to other kinds of arrays (like
GPU arrays, sparse arrays, and so on). As a result there are still many
bugs when exploring these non-Numpy workloads.&lt;/p&gt;
&lt;p&gt;For example, if you were to switch &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; in the computation above
you would get an error, because our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; implementation contains an
easy-to-fix bug that assumes Numpy arrays specifically.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Pandas and cuDF instead of Numpy and CuPy&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The cuDF library aims to reimplement the Pandas API on the GPU,
much like how CuPy reimplements the NumPy API.
Using Dask DataFrame with cuDF will require some work on both sides,
but is quite doable.&lt;/p&gt;
&lt;p&gt;I believe that there is plenty of low-hanging fruit here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve and move LocalCUDACluster&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCUDACluster&lt;/span&gt;&lt;/code&gt; class used above is an experimental &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; type
that creates as many workers locally as you have GPUs, and assigns each
worker to prefer a different GPU. This makes it easy for people to load
balance across GPUs on a single-node system without thinking too much about
it. This appears to be a common pain-point in the ecosystem today.&lt;/p&gt;
&lt;p&gt;However, the LocalCUDACluster probably shouldn’t live in the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask/distributed&lt;/span&gt;&lt;/code&gt; repository (it seems too CUDA-specific), so it will probably
move to a separate dask-cuda repository. Additionally, there are still many
questions about how to handle concurrency on top of GPUs, balancing between
CPU cores and GPU cores, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-node computation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There’s no reason that we couldn’t accelerate computations like these
further by using multiple multi-GPU nodes. This is doable today with
manual setup, but we should also improve the existing deployment solutions
&lt;a class="reference external" href="https://kubernetes.dask.org"&gt;dask-kubernetes&lt;/a&gt;,
&lt;a class="reference external" href="https://yarn.dask.org"&gt;dask-yarn&lt;/a&gt;, and
&lt;a class="reference external" href="https://jobqueue.dask.org"&gt;dask-jobqueue&lt;/a&gt;, to make this easier for
non-experts who want to use a cluster of multi-GPU resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expense&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The machine I ran this on is expensive. Well, it’s nowhere near as
expensive to own and operate as the traditional cluster you would need
for these kinds of results, but it’s still well beyond the price point of a
hobbyist or student.&lt;/p&gt;
&lt;p&gt;It would be useful to run this on a more modest system to get a sense of
the tradeoffs at more reasonable price points. I should probably also
learn more about provisioning GPUs on the cloud.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
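&lt;p&gt;The device-assignment trick that LocalCUDACluster automates (item 3 above) can be sketched in plain Python. The helper below is hypothetical, but the idea, rotating &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;/code&gt; so each worker lists its own GPU first, is roughly the mechanism the experimental implementation uses:&lt;/p&gt;

```python
# Hypothetical helper sketching what LocalCUDACluster automates: one worker
# per GPU, each pointed at a different preferred device by rotating
# CUDA_VISIBLE_DEVICES so the worker's own GPU appears first in the list.
def visible_devices(worker_index, n_gpus):
    """Return a CUDA_VISIBLE_DEVICES string for one worker."""
    order = [(worker_index + i) % n_gpus for i in range(n_gpus)]
    return ",".join(str(device) for device in order)

# On an eight-GPU machine, worker 0 prefers GPU 0, worker 1 prefers GPU 1, ...
for worker in range(8):
    print(worker, visible_devices(worker, 8))
```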
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help!&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging, high-impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. The NVIDIA corporation is hiring around the use of Dask
with GPUs.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a fairly generic posting. If you’re interested but the posting doesn’t
seem to fit, then please apply anyway and we’ll tweak things.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps/"/>
    <summary>The following code creates and manipulates 2 TB of randomly generated data.</summary>
    <category term="GPU" label="GPU"/>
    <category term="array" label="array"/>
    <category term="cupy" label="cupy"/>
    <published>2019-01-03T00:00:00+00:00</published>
  </entry>
</feed>
