Dask Working Notes - Posts tagged GPU

Easy CPU/GPU Arrays and Dataframes

2023-02-02T00:00:00+00:00

This article was originally posted on the RAPIDS blog.

It’s now easy to switch between CPU (NumPy / Pandas) and GPU (CuPy / cuDF) in Dask. As of Dask 2022.10.0, users can optionally select the backend engine for input IO and data creation. In the short term, the goal of the backend-configuration system is to enable Dask users to write code that will run on both CPU and GPU systems.

The preferred backend can be configured using the array.backend and dataframe.backend options with the standard Dask configuration system:

import dask

dask.config.set({"array.backend": "cupy"})
dask.config.set({"dataframe.backend": "cudf"})

To see how users can easily switch between NumPy and CuPy, let’s start by creating an array of ones:

>>> with dask.config.set({"array.backend": "cupy"}):
...     darr = da.ones(10, chunks=(5,))  # Get cupy-backed collection
...
>>> darr
dask.array<ones_like, shape=(10,), dtype=float64, chunksize=(5,), chunktype=cupy.ndarray>

The chunktype informs us that the array is constructed with cupy.ndarray objects instead of numpy.ndarray objects.

We’ve also improved the user experience for random array creation. Previously, if a user wanted to create a CuPy-backed Dask array, they were required to define an explicit RandomState object in Dask using CuPy. For example, the following code worked prior to Dask 2022.10.0, but seems rather verbose:

>>> import cupy
>>> import dask.array as da
>>> rs = da.random.RandomState(RandomState=cupy.random.RandomState)
>>>
>>> darr = rs.randint(0, 3, size=(10, 20), chunks=(2, 5))
>>> darr
dask.array<randint, shape=(10, 20), dtype=int64, chunksize=(2, 5), chunktype=cupy.ndarray>

Now, we can leverage the array.backend configuration to create a CuPy-backed dask array for random data:

>>> with dask.config.set({"array.backend": "cupy"}):
...     darr = da.random.randint(0, 3, size=(10, 20), chunks=(2, 5))  # Get cupy-backed collection
...
>>> darr
dask.array<randint, shape=(10, 20), dtype=int64, chunksize=(2, 5), chunktype=cupy.ndarray>

Using array.backend is significantly easier and much more ergonomic. It supports all basic array creation methods, including ones, zeros, empty, full, arange, and random.

Note: from_array, from_zarr, and from_tiledb have not yet been implemented with this functionality.

Dispatching for Dataframe Creation

When creating Dask DataFrames backed by either Pandas or cuDF, the starting point is often one of the input I/O methods: read_csv, read_parquet, etc. We’ll start by constructing a dataframe on the fly with from_dict:

>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     data = {"a": range(10), "b": range(10)}
...     ddf = dd.from_dict(data, npartitions=2)
...
>>> ddf
<dask_cudf.DataFrame | 2 tasks | 2 npartitions>

Here we can tell we have a cuDF backed dataframe and we are using dask-cudf because the repr shows us the type: <dask_cudf.DataFrame | 2 tasks | 2 npartitions>. Let’s also demonstrate the read functionality by generating CSV and Parquet data.

ddf.to_csv('example.csv', single_file=True)
ddf.to_parquet('example.parquet')

Now we are simply repeating the config setting but instead using the read_csv and read_parquet methods:

>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     ddf = dd.read_csv('example.csv')
...     print(type(ddf))
...
<class 'dask_cudf.core.DataFrame'>
>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     ddf = dd.read_parquet('example.parquet')
...     type(ddf)
...
<class 'dask_cudf.core.DataFrame'>
>>>

Why is this Useful?

As hardware changes in exciting and exotic ways (GPUs, TPUs, IPUs, etc.), we want to provide the same interface and treat hardware as an abstraction. For example, many PyTorch workflows start with the following:

device = 'cuda' if torch.cuda.is_available() else 'cpu'

And what follows is typically standard, hardware-agnostic PyTorch. This is incredibly powerful, as the user (in most cases) should not care what hardware underlies the source. As such, it enables the user to develop PyTorch anywhere and everywhere. The new Dask backend selection configurations give users a similar freedom.
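In the same spirit, a Dask script can choose its backends once at the top. The sketch below is an illustration rather than an official Dask recipe: the helper name `pick_backend` is hypothetical, and it only checks whether a module can be imported, not whether a working GPU is actually present.

```python
from importlib.util import find_spec


def pick_backend(preferred: str, fallback: str) -> str:
    """Return `preferred` if that module can be imported, else `fallback`."""
    return preferred if find_spec(preferred) is not None else fallback


# Build the two configuration options described earlier in this post.
backend_config = {
    "array.backend": pick_backend("cupy", "numpy"),
    "dataframe.backend": pick_backend("cudf", "pandas"),
}
print(backend_config)
```

Passing `backend_config` to `dask.config.set` would then leave the rest of the script unchanged on either kind of machine.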

Conclusion

Our long-term goal of this feature is to enable Dask users to use any backend library in dask.array and dask.dataframe, as long as that library conforms to the minimal “array” or “dataframe” standard defined by the data-api consortium, respectively.

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU-acceleration to your project, please reach out on GitHub or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.

Large SVDs

2020-05-13T00:00:00+00:00

We perform Singular Value Decomposition (SVD) calculations on large datasets.

We vary the computation by using both exact and approximate methods, and by using both CPUs and GPUs.

In the end we compute an approximate SVD of 200GB of simulated data on a multi-GPU machine in 15-20 seconds. Then we run this on a dataset stored in the cloud, where we find that I/O is, predictably, a major bottleneck.

SVD - The simple case

Dask arrays contain a relatively sophisticated SVD algorithm that works in the tall-and-skinny or short-and-fat cases, but not so well in the roughly-square case. It works by taking QR decompositions of each block of the array, combining the R matrices, doing another smaller SVD on those, and then performing some matrix multiplication to get back to the full result. It’s numerically stable and decently fast, assuming that the intermediate R matrices of the QR decompositions mostly fit in memory.

The memory constraints here are that if you have an n by m tall and skinny array (n >> m) cut into k blocks then you need to have about m**2 * k space. This is true in many cases, including typical PCA machine learning workloads, where you have tabular data with a few columns (hundreds at most) and many rows.
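To make the block-wise algorithm and its memory estimate concrete, here is a minimal NumPy sketch. It is not Dask's implementation: the function names `tsqr_svd` and `r_factors_nbytes` are hypothetical, the blocks are processed in a plain Python loop, and every block is assumed to have at least as many rows as columns.

```python
import numpy as np


def tsqr_svd(blocks):
    """SVD of a tall-and-skinny matrix given as a list of row blocks."""
    # 1. QR-decompose each block independently: block_i = Q_i @ R_i.
    qrs = [np.linalg.qr(block) for block in blocks]
    m = blocks[0].shape[1]

    # 2. Stack the small (m x m) R factors and take one small SVD of the stack.
    r_stacked = np.vstack([r for _, r in qrs])  # shape (k * m, m)
    u_small, s, vt = np.linalg.svd(r_stacked, full_matrices=False)

    # 3. Multiply back through each Q_i to recover the matching rows of U.
    u = np.vstack([
        q @ u_small[i * m:(i + 1) * m]
        for i, (q, _) in enumerate(qrs)
    ])
    return u, s, vt


def r_factors_nbytes(m, k, itemsize=8):
    """Approximate memory held by k stacked (m x m) R factors."""
    return m * m * k * itemsize


rng = np.random.default_rng(0)
x = rng.random((1_000, 20))
u, s, vt = tsqr_svd(np.array_split(x, 4))

print(np.allclose((u * s) @ vt, x))   # the factors reproduce x
print(r_factors_nbytes(m=20, k=16))   # 20 columns, 16 blocks of float64
```

For 20 columns and 16 blocks the stacked R factors need only 51,200 bytes, which is why the intermediate step fits comfortably in memory for tall-and-skinny data.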

It’s easy to use and quite robust.

import dask.array as da

x = da.random.random((10000000, 20))
x

	Array	Chunk
Bytes	1.60 GB	100.00 MB
Shape	(10000000, 20)	(625000, 20)
Count	16 Tasks	16 Chunks
Type	float64	numpy.ndarray


u, s, v = da.linalg.svd(x)

This works fine in the short-and-fat case too (when you have far more columns than rows), but we always assume that one of your dimensions is unchunked and that the other dimension has chunks that are quite a bit longer. Otherwise, things might not fit into memory.

Approximate SVD

If your dataset is large in both dimensions then the algorithm above won’t work as is. However, if you don’t need exact results, or if you only need a few of the components, then there are a number of excellent approximation algorithms.

Dask array has one of these approximation algorithms implemented in the da.linalg.svd_compressed function. And with it we can compute the approximate SVD of very large matrices.
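For intuition about how approximation algorithms of this kind work, here is a generic randomized range-finder sketch in NumPy. It is an assumption-laden illustration, not a copy of `da.linalg.svd_compressed`: the function name `randomized_svd` and the parameters `n_oversamples` and `n_power_iter` are chosen here for the example. Note that each power iteration is another pass over the data.

```python
import numpy as np


def randomized_svd(x, k, n_oversamples=10, n_power_iter=2, seed=0):
    """Approximate the top-k SVD of x via a random projection."""
    rng = np.random.default_rng(seed)

    # Project x onto a small random subspace (first pass over the data).
    omega = rng.standard_normal((x.shape[1], k + n_oversamples))
    y = x @ omega

    # Optional power iterations sharpen the subspace (extra passes over x).
    for _ in range(n_power_iter):
        q, _ = np.linalg.qr(y)
        y = x @ (x.T @ q)

    # Orthonormal basis for the sampled range, then a small exact SVD.
    q, _ = np.linalg.qr(y)
    b = q.T @ x
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_b)[:, :k], s[:k], vt[:k]


# Toy input: an exactly rank-5 matrix, so five components capture everything.
rng = np.random.default_rng(1)
low_rank = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
u, s, vt = randomized_svd(low_rank, k=5)

exact = np.linalg.svd(low_rank, compute_uv=False)[:5]
print(np.allclose(s, exact, rtol=1e-6))
```

The expensive work is a handful of matrix products against the full array, and the only exact SVD is taken on a matrix with `k + n_oversamples` rows.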

We were recently working on a problem (explained below) and found that we were still running out of memory when dealing with this algorithm. There were two challenges that we ran into:

1. The algorithm requires multiple passes over the data, but the Dask task scheduler was keeping the input matrix in memory after it had been loaded once in order to avoid recomputation. Things still worked, but Dask had to move the data to disk and back repeatedly, which reduced performance significantly.

   We resolved this by including explicit recomputation steps in the algorithm.

2. Related chunks of data would be loaded at different times, and so would need to stick around longer than necessary to wait for their associated chunks.

   We resolved this by engaging task fusion as an optimization pass.

Before diving further into the technical solution we quickly provide the use case that was motivating this work.

Application - Genomics

Many studies are using genome sequencing to study genetic variation between different individuals within a species. These include studies of human populations, but also other species such as mice, mosquitoes or disease-causing parasites. These studies will, in general, find a large number of sites in the genome sequence where individuals differ from each other. For example, humans have more than 100 million variable sites in the genome, and modern studies like the UK BioBank are working towards sequencing the genomes of 1 million individuals or more.

In diploid species like humans, mice or mosquitoes, each individual carries two genome sequences, one inherited from each parent. At each of the 100 million variable genome sites there will be two or more “alleles” that a single genome might carry. One way to think about this is via the Punnett square, which represents the different possible genotypes that one individual might carry at one of these variable sites:

In the above there are three possible genotypes: AA, Aa, and aa. For computational genomics, these genotypes can be encoded as 0, 1, or 2. In a study of a species with M genetic variants assayed in N individual samples, we can represent these genotypes as an (M x N) array of integers. For a modern human genetics study, the scale of this array might approach (100 million x 1 million). (Although in practice, the size of the first dimension (number of variants) can be reduced somewhat, by at least an order of magnitude, because many genetic variants will carry little information and/or be correlated with each other.)
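As a small illustration of that encoding, the snippet below turns genotype calls into an integer array. The specific mapping (counting copies of the `a` allele, so AA is 0, Aa is 1, aa is 2) and the toy calls are assumptions made for this example.

```python
import numpy as np

# Toy data: 2 variants (rows) assayed in 3 samples (columns).
genotype_calls = [
    ["AA", "Aa", "aa"],
    ["Aa", "Aa", "AA"],
]


def encode(call: str) -> int:
    """Encode a genotype as the number of 'a' alleles it carries."""
    return call.count("a")


genotypes = np.array(
    [[encode(call) for call in row] for row in genotype_calls],
    dtype="uint8",
)
print(genotypes.shape)
print(genotypes)
```

At study scale the same layout simply grows to millions of rows and columns, which is what makes the array so large.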

These genetic differences are not random, but carry information about patterns of genetic similarity and shared ancestry between individuals, because of the way they have been inherited through many generations. A common task is to perform a dimensionality reduction analysis on these data, such as a principal components analysis (SVD), to identify genetic structure reflecting these differences in degree of shared ancestry. This is an essential part of discovering genetic variants associated with different diseases, and for learning more about the genetic history of populations and species.
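A toy simulation shows how an SVD can surface this kind of structure. Everything here is made up for illustration: two simulated populations with deliberately extreme allele frequencies (0.1 and 0.9), 500 variants, and 20 samples per population.

```python
import numpy as np

rng = np.random.default_rng(42)
n_variants, n_per_pop = 500, 20

# Genotypes are counts of 0, 1 or 2 alleles, drawn per population.
pop_a = rng.binomial(2, 0.1, size=(n_variants, n_per_pop))
pop_b = rng.binomial(2, 0.9, size=(n_variants, n_per_pop))
genotypes = np.hstack([pop_a, pop_b]).astype("float64")  # variants x samples

# Centre each variant, then take the SVD; rows of vt describe the samples.
centered = genotypes - genotypes.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# The sign of the first component separates the two simulated populations.
print(np.sign(pc1[:n_per_pop]))
print(np.sign(pc1[n_per_pop:]))
```

Real data is far noisier and the structure far subtler, but the computation has the same shape: centre the genotype matrix, then look at the leading singular vectors.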

Reducing the time taken to compute an analysis such as SVD, like all science, allows for exploring larger datasets and testing more hypotheses in less time. Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding.

We want to run an experiment in less time than it takes to make a cup of tea

Performant SVDs w/ Dask

Now that we have that scientific background, let’s transition back to talking about computation.

To stop Dask from holding onto the data we intentionally trigger computation as we build up the graph. This is a bit atypical in Dask calculations (we prefer to build up as much of the computation as possible before computing) but useful given the multiple-pass nature of this problem. This was a fairly easy change, and is available in dask/dask #5041.

Additionally, we found that it was helpful to turn on moderately wide task fusion.

import dask
dask.config.set({"optimization.fuse.ave-width": 5})

Then things work fine.

We’re going to try this SVD on a few different choices of hardware including:

A MacBook Pro
A DGX-2, an NVIDIA workstation with 16 high-end GPUs and fast interconnect
A twenty-node cluster on AWS

MacBook Pro

We can happily perform an SVD on a 20GB array on a MacBook Pro.

import dask.array as da

x = da.random.random(size=(1_000_000, 20_000), chunks=(20_000, 5_000))

u, s, v = da.linalg.svd_compressed(x, k=10, compute=True)
v.compute()

This call is no longer entirely lazy, and it recomputes x a couple times, but it works, and it works using only a few GB of RAM on a consumer laptop.

It takes around 2min 30s to compute that on a laptop. That’s great! It was super easy to try out, didn’t require any special hardware or setup, and in many cases is fast enough. By working locally we can iterate quickly.

Now that things work, we can experiment with different hardware.

Adding GPUs (a 15 second SVD)

Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA

We can dramatically increase performance by using a multi-GPU machine. NVIDIA and other manufacturers now make machines with multiple GPUs co-located in the same physical box. In the following section, we will run the calculations on a DGX2, a machine with 16 GPUs and fast interconnect between the GPUs.

Below is almost the same code, running in significantly less time, but with the following changes:

We increase the size of the array by a factor of 10x
We switch out NumPy for CuPy, a GPU NumPy implementation
We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (NVLink will dramatically decrease transfer time between workers)

On a DGX-2 we can calculate an SVD on a 200GB Dask array in 10 to 15 seconds.

The full notebook is here, but the relevant code snippets are below:

# Some GPU specific setup
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(...)
client = Client(cluster)

import cupy
import dask.array as da
rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# Create the data and run the SVD as normal
x = rs.randint(0, 3, size=(10_000_000, 20_000),
               chunks=(20_000, 5_000), dtype="uint8")
x = x.persist()

u, s, v = da.linalg.svd_compressed(x, k=10, seed=rs)
v.compute()

To see this run, we recommend viewing this screencast:

Read dataset from Disk

While impressive, the computation above is mostly bound by generating random data and then performing matrix calculations. GPUs are good at both of these things.

In practice though, our input array won’t be randomly generated; it will come from some dataset stored on disk or, increasingly commonly, in the cloud. To make things more realistic we perform a similar calculation with data stored in Zarr format in GCS.

In this Zarr SVD example, we load a 25GB GCS backed data set onto a DGX2, run a few processing steps, then perform an SVD. The combination of preprocessing and SVD calculations ran in 18.7 sec and the data loading took 14.3 seconds.

Again, on a DGX2, from data loading to SVD we are running in less time than it would take to make a cup of tea. However, the data loading can be accelerated. From GCS we are reading data into the main memory of the machine (host memory), uncompressing the zarr bits, then moving the data from host memory to the GPU (device memory). Passing data back and forth between host and device memory can significantly decrease performance. Reading directly into the GPU, bypassing host memory, would improve the overall pipeline.

And so we come back to a common lesson of high performance computing:

High performance computing isn’t about doing one thing exceedingly well, it’s about doing nothing poorly.

In this case GPUs made our computation fast enough that we now need to focus on other components of our system, notably disk bandwidth, and a direct reader for Zarr data to GPU memory.

Cloud

Disclaimer: one of the authors (Matthew Rocklin) works for Coiled Computing

We can also run this on the cloud with any number of frameworks. In this case we used the Coiled Cloud service to deploy on AWS.

from coiled_cloud import Cloud, Cluster
cloud = Cloud()

cloud.create_cluster_type(
    organization="friends",
    name="genomics",
    worker_cpu=4,
    worker_memory="16 GiB",
    worker_environment={
        "OMP_NUM_THREADS": 1,
        "OPENBLAS_NUM_THREADS": 1,
        # "EXTRA_PIP_PACKAGES": "zarr"
    },
)

cluster = Cluster(
    organization="friends",
    typename="genomics",
    n_workers=20,
)

from dask.distributed import Client
client = Client(cluster)

# then proceed as normal

Using 20 machines with a total of 80 CPU cores on a dataset that was 10x larger than the MacBook Pro example we were able to run in about the same amount of time. This shows near optimal scaling for this problem, which is nice to see given how complex the SVD calculation is.

A screencast of this problem is viewable here.

Compression

One of the easiest ways for us to improve performance is to reduce the size of this data through compression. This data is highly compressible for two reasons:

The real-world data itself has structure and repetition (although the random play data does not)
We’re storing entries that take on only four values. We’re using eight-bit integers when we only needed two-bit integers

Let’s solve the second problem first.

Compression with bit twiddling

Ideally Numpy would have a two-bit integer datatype. Unfortunately it doesn’t, and this is hard because memory in computers is generally thought of in full bytes.

To work around this we can use bit arithmetic to shove four values into a single byte. Here are functions that do that, assuming that our array is two-dimensional and that the last dimension is divisible by four.

import numpy as np

def compress(x: np.ndarray) -> np.ndarray:
    out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
    out += x[:, 0::4]
    out += x[:, 1::4] << 2
    out += x[:, 2::4] << 4
    out += x[:, 3::4] << 6
    return out


def decompress(out: np.ndarray) -> np.ndarray:
    back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
    back[:, 0::4] = out & 0b00000011
    back[:, 1::4] = (out & 0b00001100) >> 2
    back[:, 2::4] = (out & 0b00110000) >> 4
    back[:, 3::4] = (out & 0b11000000) >> 6
    return back
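To make the bit layout concrete, here is the same packing applied to a single byte in plain Python. The four input values are arbitrary examples; each must fit in two bits (0 to 3).

```python
values = [2, 0, 1, 3]  # four two-bit values

# Each value is shifted into its own pair of bits within one byte.
packed = values[0] | (values[1] << 2) | (values[2] << 4) | (values[3] << 6)
print(packed, format(packed, "08b"))  # 210 11010010

# Unpacking shifts each pair back down and masks off the other bits.
unpacked = [(packed >> shift) & 0b11 for shift in (0, 2, 4, 6)]
print(unpacked)  # [2, 0, 1, 3]
```

Reading the binary string from right to left in pairs gives back the original values: `10`, `00`, `01`, `11`.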

Then, we can use these functions along with Dask to store our data in a compressed state, but lazily decompress on-demand.

x = x.map_blocks(compress).persist().map_blocks(decompress)

That’s it. We compress each block of our data and store that in memory. However, the output variable that we have, x, will decompress each chunk before we operate on it, so we don’t need to worry about handling compressed blocks.

Compression with Zarr

A slightly more general but probably less efficient route would be to compress our arrays with a proper compression library like Zarr.

The example below shows how we do this in practice.

import zarr
import numpy as np
from numcodecs import Blosc
compressor = Blosc(cname='lz4', clevel=3, shuffle=Blosc.BITSHUFFLE)


x = x.map_blocks(zarr.array, compressor=compressor).persist().map_blocks(np.array)

Additionally, if we’re using the dask-distributed scheduler then we want to make sure that the Blosc compression library doesn’t use additional threads. That way we don’t have parallel calls of a parallel library, which can cause some contention.

def set_no_threads_blosc():
    """ Stop blosc from using multiple threads """
    import numcodecs
    numcodecs.blosc.use_threads = False

# Run on all workers that are currently connected
client.run(set_no_threads_blosc)

This approach is more general, and probably a good trick to have up one’s sleeve, but it doesn’t work on GPUs. That is why we ended up going with the bit-twiddling approach one section above, which uses an API that is uniformly accessible within the NumPy and CuPy libraries.

Final Thoughts

In this post we did a few things, all around a single important problem in genomics.

We learned a bit of science
We translated a science problem into a computational problem, and in particular into a request to perform large singular value decompositions
We used a canned algorithm in dask.array that performed pretty well, assuming that we’re comfortable going over the array in a few passes
We then tried that algorithm on three architectures quickly
1. A MacBook Pro
2. A multi-GPU machine
3. A cluster in the cloud
Finally we talked about some tricks to pack more data into the same memory with compression

This problem was nice in that we got to dive deep into a technical science problem, and yet also try a bunch of architectures quickly to investigate hardware choices that we might make in the future.

We used several technologies here today, made by several different communities and companies. It was great to see how they all worked together seamlessly to provide a flexible-yet-consistent experience.

cuML and Dask hyperparameter optimization

2019-03-27T00:00:00+00:00

DGX-1 Workstation
Host Memory: 512 GB
GPU Tesla V100 x 8
cudf 0.6
cuml 0.6
dask 1.1.4
Jupyter notebook

TL;DR: Hyper-parameter optimization is functional but slow with cuML

cuML and Dask Hyper-parameter Optimization

cuML is an open source GPU accelerated machine learning library primarily developed at NVIDIA which mirrors the Scikit-Learn API. The current suite of algorithms includes GLMs, Kalman Filtering, clustering, and dimensionality reduction. Many of these machine learning algorithms use hyper-parameters. These are parameters used during the model training process but are not “learned” during the training. Often these parameters are coefficients or penalty thresholds and finding the “best” hyper-parameter can be computationally costly. In the PyData community, we often reach for Scikit-Learn’s GridSearchCV or RandomizedSearchCV for easy definition of the search space for hyper-parameters – this is called hyper-parameter optimization. Within the Dask community, Dask-ML has incrementally improved the efficiency of hyper-parameter optimization by leveraging both Scikit-Learn and Dask to use multi-core and distributed schedulers: Grid and RandomizedSearch with DaskML.

With the newly created drop-in replacement for Scikit-Learn, cuML, we experimented with Dask’s GridSearchCV. In the upcoming 0.6 release of cuML, the estimators are serializable and are functional within the Scikit-Learn/dask-ml framework, but slow compared with Scikit-Learn estimators. And while speeds are slow now, we know how to boost performance, have filed several issues, and hope to show performance gains in future releases.

All code and timing measurements can be found in this Jupyter notebook.

Fast Fitting!

cuML is fast! But finding that speed requires developing a bit of GPU knowledge and some intuition. For example, there is a non-zero cost of moving data from host to device (CPU memory to GPU memory) and, when data is “small”, there is little to no performance gain. “Small” currently might mean less than 100MB.

In the following example we use the diabetes data set provided by sklearn and linearly fit the data with RidgeRegression.

\[ \min_w \|y - Xw\|^2_2 + \alpha \|w\|^2_2 \]

alpha is the hyper-parameter, and we initially set it to 1.
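Before turning to the library calls, it may help to see that this objective has a closed-form minimizer, w = (XᵀX + alpha·I)⁻¹ Xᵀy. The NumPy sketch below is an illustration on synthetic data only: it omits the intercept term, the function name `ridge_closed_form` is hypothetical, and it is not how the Scikit-Learn or cuML solvers are implemented.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Minimise ||y - Xw||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic regression problem with known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

alpha = 1.0
w = ridge_closed_form(X, y, alpha)

# At the minimum the gradient of the objective vanishes.
gradient = 2 * X.T @ (X @ w - y) + 2 * alpha * w
print(np.allclose(gradient, 0.0, atol=1e-8))
```

A larger alpha shrinks w towards zero, which is why the best value has to be searched for rather than learned during fitting.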

import numpy as np
from cuml import Ridge as cumlRidge
import dask_ml.model_selection as dcv
from sklearn import datasets, linear_model
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import train_test_split, GridSearchCV

diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)

fit_intercept = True
normalize = False
alpha = np.array([1.0])

ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

ridge.fit(X_train, y_train)
cu_ridge.fit(X_train, y_train)

The above ran with a single timing measurement of:

Scikit-Learn Ridge: 28 ms
cuML Ridge: 1.12 s

But the data is quite small, ~28KB. Increasing the size to ~2.8GB and re-running we see significant gains:

import cudf

# dup_data and dup_train are the diabetes features and targets duplicated up to ~2.8GB
dup_ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
dup_cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

# move data from host to device
record_data = (('fea%d'%i, dup_data[:,i]) for i in range(dup_data.shape[1]))
gdf_data = cudf.DataFrame(record_data)
gdf_train = cudf.DataFrame(dict(train=dup_train))

#sklearn
dup_ridge.fit(dup_data, dup_train)

# cuml
dup_cu_ridge.fit(gdf_data, gdf_train.train)

With new timing measurements of:

Scikit-Learn Ridge: 4.82 s ± 694 ms
cuML Ridge: 450 ms ± 47.6 ms

With more data we clearly see faster fitting times, but the time to move data to the GPU (through cuDF) was 19.7s. This cost of data movement is one of the reasons why RAPIDS/cuDF was developed – keep data on the GPU and avoid having to move back and forth.
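A little arithmetic on the timings above shows why keeping data on the GPU matters. This is a back-of-the-envelope sketch: it uses only the mean timings quoted in this post, ignores their variance, and assumes the data stays on the GPU across repeated fits.

```python
import math

transfer_s = 19.7   # one-off host-to-device copy through cuDF
cpu_fit_s = 4.82    # mean Scikit-Learn Ridge fit
gpu_fit_s = 0.450   # mean cuML Ridge fit

# A single GPU fit, counting the transfer, is slower than the CPU fit.
single_fit_gpu_total = transfer_s + gpu_fit_s

# Each additional fit on the GPU saves this much time over the CPU.
saving_per_fit = cpu_fit_s - gpu_fit_s
break_even_fits = math.ceil(transfer_s / saving_per_fit)

print(f"one fit including transfer: {single_fit_gpu_total:.2f} s")
print(f"fits needed to amortise the transfer: {break_even_fits}")
```

Under these assumptions the transfer pays for itself after five fits, which is the regime that hyper-parameter searches are meant to operate in.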

Hyper-Parameter Optimization Experiments

So moving to the GPU can be costly, but once there, with larger data sizes, we gain significant performance optimizations. Naively, we thought, “well, we have GPU machine learning, we have distributed hyper-parameter optimization… we should have distributed, GPU-accelerated, hyper-parameter optimization!”

Scikit-Learn assumes a specific but well-defined API for estimators over which it will perform hyper-parameter optimization. Most estimators/classifiers in Scikit-Learn look like the following:

from sklearn.base import BaseEstimator


class DummyEstimator(BaseEstimator):
    def __init__(self, params=...):
        ...

    def fit(self, X, y=None):
        ...

    def predict(self, X):
        ...

    def score(self, X, y=None):
        ...

    def get_params(self, deep=True):
        ...

    def set_params(self, **params):
        ...
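The getters and setters matter because a search loop uses them to copy an estimator with different hyper-parameters. The dependency-free sketch below illustrates that idea only: `MeanRegressor` and `clone_with` are hypothetical names, and this is not Scikit-Learn's own cloning code.

```python
class MeanRegressor:
    """Toy estimator: predicts the training-target mean plus `shift`."""

    def __init__(self, shift=0.0):
        self.shift = shift

    def get_params(self, deep=True):
        return {"shift": self.shift}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ + self.shift for _ in X]


def clone_with(estimator, **overrides):
    """Build a fresh estimator from existing hyper-parameters plus overrides."""
    params = {**estimator.get_params(), **overrides}
    return type(estimator)(**params)


base = MeanRegressor()
candidate = clone_with(base, shift=1.0).fit([[0], [1]], [2.0, 4.0])
print(candidate.predict([[5]]))  # [4.0]
```

An estimator that lacks `get_params` or `set_params`, or whose constructor arguments do not match them, breaks this copy step.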

When we started experimenting with hyper-parameter optimization, we found a few holes in the API. These were resolved, mostly by matching argument structure and adding various getters/setters.

get_params and set_params (#271)
fix/clf-solver (#318)
map fit_transform to sklearn implementation (#330)
Fea get params small changes (#322)

With holes plugged up we tested again. Using the same diabetes data set, we are now performing hyper-parameter optimization and searching over many alpha parameters for the best scoring alpha.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)

Again, reminding ourselves that the data is small (~28KB), we don’t expect to observe cuML performing faster than Scikit-Learn. Instead, we want to demonstrate functionality. Additionally, we also tried swapping in Dask-ML’s implementation of GridSearchCV (which adheres to the same API as Scikit-Learn) to use all of the GPUs we have available in parallel.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)

Timing Measurements:

GridSearchCV	sklearn-Ridge	cuml-ridge
Scikit-Learn	88.4 ms ± 6.11 ms	6.51 s ± 132 ms
Dask-ML	873 ms ± 347 ms	740 ms ± 142 ms

Unsurprisingly, we see that GridSearchCV and Ridge Regression from Scikit-Learn are the fastest in this context. There is a cost to distributing work and data and, as we previously mentioned, to moving data from host to device.

How does performance scale as we scale data?

two_dup_data = np.array(np.vstack([X_train]*int(1e2)))
two_dup_train = np.array(np.hstack([y_train]*int(1e2)))
three_dup_data = np.array(np.vstack([X_train]*int(1e3)))
three_dup_train = np.array(np.hstack([y_train]*int(1e3)))

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(two_dup_data, two_dup_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(three_dup_data, three_dup_train)

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(three_dup_data, three_dup_train)

Timing Measurements:

Data (MB)	cuML+Dask-ML	sklearn+Dask-ML
2.8 MB	13.8s
28 MB	1min 17s	4.87 s

cuML + dask-ml (Distributed GridSearchCV) does significantly worse as data sizes increase! Why? Primarily, two reasons:

Non optimized movement of data between host and device compounded by N devices and the size of the parameter space
Scoring methods are not implemented in cuML

Below is the Dask graph for the GridSearch

There are 50 (cv=5 times 10 parameters for alpha) instances of chunking up our test data set and scoring performance. That means 50 times we are moving data back and forth between host and device for fitting and 50 times for scoring. That’s not great, but it’s also very solvable – build scoring functions for GPUs!
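The count of 50 falls straight out of the nested loop that a grid search performs. The NumPy sketch below reproduces that loop on synthetic data, using the same alpha grid as above and five folds; the helper functions and the data are assumptions for illustration, not the Dask-ML implementation.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without an intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

alphas = np.logspace(-3, -1, 10)               # 10 candidate alphas
folds = np.array_split(np.arange(len(X)), 5)   # cv=5

n_fits = 0
mean_scores = []
for alpha in alphas:
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        w = ridge_fit(X[train_idx], y[train_idx], alpha)        # one fit
        scores.append(r2_score(y[test_idx], X[test_idx] @ w))   # one score
        n_fits += 1
    mean_scores.append(np.mean(scores))

best_alpha = alphas[int(np.argmax(mean_scores))]
print(n_fits, best_alpha)
```

Every pass through the inner loop is one fit and one score, so any per-call data movement is paid 50 times for each of them.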

Immediate Future Work

We know the problems, GH Issues have been filed, and we are working on these issues – come help!

Built In Scorers (#242)
DeviceNDArray as input data (#369)
Communication with UCX (#2344)

Building GPU Groupby-Aggregations for Dask

2019-03-04T00:00:00+00:00

We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations like the following to work well.

df.groupby('x').y.mean()

This post describes the kind of work we had to do as a model for future development.

Plan

As outlined in a previous post, Dask, Pandas, and GPUs: first steps, our plan to produce distributed GPU dataframes was to combine Dask DataFrame with cudf. In particular, we had to

change Dask DataFrame so that it would parallelize not just around the Pandas DataFrames that it works with today, but around anything that looked enough like a Pandas DataFrame
change cuDF so that it would look enough like a Pandas DataFrame to fit within the algorithms in Dask DataFrame

Changes

On the Dask side this mostly meant:

Replacing isinstance(df, pd.DataFrame) checks with is_dataframe_like(df) checks (after defining suitable is_dataframe_like/is_series_like/is_index_like functions)
Avoiding some more exotic functionality in Pandas, and instead trying to use more common functionality that we can expect to be in most DataFrame implementations
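The idea behind such a check is duck typing: decide by what an object can do rather than by its class. The sketch below is a hypothetical stand-in, not Dask's actual `is_dataframe_like`; the function name, the attribute list, and the `FakeGPUFrame` class are all chosen for illustration.

```python
def looks_like_dataframe(obj) -> bool:
    """Hypothetical duck-typing check: judge by attributes, not by class."""
    required = ("dtypes", "columns", "groupby", "head", "merge")
    return all(hasattr(obj, name) for name in required)


class FakeGPUFrame:
    """Stand-in for a non-Pandas dataframe implementation."""

    dtypes = {}
    columns = []

    def groupby(self, by):
        ...

    def head(self, n=5):
        ...

    def merge(self, other):
        ...


print(looks_like_dataframe(FakeGPUFrame()))  # True
print(looks_like_dataframe([1, 2, 3]))       # False
```

A check like this lets any library that exposes the expected attributes pass through the same code path as Pandas, without Dask needing to import or know about that library.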

On the cuDF side this means making dozens of tiny changes to align the cuDF API to the Pandas API, and to add in missing features.

Dask Changes:
- Remove explicit pandas checks and provide cudf lazy registration #4359
- Replace isinstance(…, pandas) with is_dataframe_like #4375
- Add has_parallel_type
- Lazily register more cudf functions and move to backends file #4396
- Avoid checking against types in is_dataframe_like #4418
- Replace cudf-specific code with dask-cudf import #4470
- Avoid groupby.agg(callable) in groupby-var #4482 – this one is notable in that by simplifying our Pandas usage we actually got a significant speedup on the Pandas side.
cuDF Changes:

I don’t really expect anyone to go through all of those issues, but my hope is that by skimming over the issue titles people will get a sense for the kinds of changes we’re making here. It’s a large number of small things.

Also, kudos to Thomson Comer who solved most of the cuDF issues above.

There are still some pending issues:

Square Root #1055, needed for groupby-std
cuDF needs multi-index support for columns #483, needed for:
```
groupby.agg({'x': ['sum', 'mean'], 'y': ['min', 'max']})
```

But things mostly work

But generally things work pretty well today:

In [1]: import dask_cudf

In [2]: df = dask_cudf.read_csv('yellow_tripdata_2016-*.csv')

In [3]: df.groupby('passenger_count').trip_distance.mean().compute()
Out[3]: <cudf.Series nrows=10 >

In [4]: _.to_pandas()
Out[4]:
0    0.625424
1    4.976895
2    4.470014
3    5.955262
4    4.328076
5    3.079661
6    2.998077
7    3.147452
8    5.165570
9    5.916169
dtype: float64

Experience

First, most of this work was handled by the cuDF developers (which may be evident from the relative lengths of the issue lists above). When we started this process it felt like a never-ending stream of tiny issues. We weren’t able to see the next set of issues until we had finished the current set. Fortunately, most of them were pretty easy to fix. Additionally, as we went on, it seemed to get a bit easier over time.

Additionally, as a result of the changes above, many things other than groupby-aggregations now work. From the perspective of someone accustomed to Pandas, the cuDF library is starting to feel more reliable. We hit missing functionality less frequently when using cuDF on other operations.

What’s next?

More recently we’ve been working on the various join/merge operations in Dask DataFrame like indexed joins on a sorted column, joins between large and small dataframes (a common special case) and so on. Getting these algorithms from the mainline Dask DataFrame codebase to work with cuDF is resulting in a similar set of issues to what we saw above with groupby-aggregations, but so far the list is much smaller. We hope that this is a trend as we continue on to other sets of functionality into the future like I/O, time-series operations, rolling windows, and so on.

Single-Node Multi-GPU Dataframe Joins

2019-01-29T00:00:00+00:00

We experiment with single-node multi-GPU joins using cuDF and Dask. We find that the in-GPU computation is faster than communication. We also present context and plans for near-future work, including improving high performance communication in Dask with UCX.

Here is a notebook of the experiment in this post

Introduction

In a recent post we showed how Dask + cuDF could accelerate reading CSV files using multiple GPUs in parallel. That operation quickly became bound by the speed of our disk after we added a few GPUs. Now we try a very different kind of operation, multi-GPU joins.

This workload can be communication-heavy, especially if the column on which we are joining is not sorted nicely, and so provides a good example on the other extreme from parsing CSV.
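To see why an unsorted join column forces communication, here is a minimal pure-Python sketch of a hash-partitioned join. It is an illustration under assumptions, not the Dask or cuDF implementation: `partition_by_key`, `hash_join`, and the use of `key % n_workers` as the hash are hypothetical choices made for this example.

```python
from collections import defaultdict


def partition_by_key(rows, n_workers):
    """Group (key, value) rows by destination worker, chosen as key % n_workers."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[key % n_workers].append((key, value))
    return parts


def hash_join(left, right, n_workers):
    """Inner-join two lists of (id, value) rows after shuffling both by id."""
    left_parts = partition_by_key(left, n_workers)
    right_parts = partition_by_key(right, n_workers)
    out = []
    for w in range(n_workers):
        # Each worker builds a lookup table from its share of the right table...
        lookup = defaultdict(list)
        for key, y in right_parts[w]:
            lookup[key].append(y)
        # ...and probes it with its share of the left table.
        for key, x in left_parts[w]:
            for y in lookup[key]:
                out.append((key, x, y))
    return out


left = [(3, 0.1), (1, 0.2), (2, 0.3), (3, 0.4)]
right = [(2, 10.0), (3, 20.0), (5, 30.0)]
joined = sorted(hash_join(left, right, n_workers=2))
```

Any row whose key maps to a worker other than the one currently holding it has to be sent across. With random ids that is most of the data, which is what makes the shuffle expensive.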

Benchmark

Construct random data using the CPU

Here we use Dask array and Dask dataframe to construct two random tables with a shared id column. We can play with the number of rows of each table and the number of keys to make the join challenging in a variety of ways.

import dask.array as da
import dask.dataframe as dd

n_rows = 1000000000
n_keys = 5000000

left = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1)

n_rows = 10000000

right = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='y'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1)

Send to the GPUs

We have two Dask dataframes composed of many Pandas dataframes of our random data. We now map the cudf.from_pandas function across these to make a Dask dataframe of cuDF dataframes.

import dask
import cudf

gleft = left.map_partitions(cudf.from_pandas)
gright = right.map_partitions(cudf.from_pandas)

gleft, gright = dask.persist(gleft, gright)  # persist data in device memory

What’s nice here is that there wasn’t any special dask_pandas_dataframe_to_dask_cudf_dataframe function. Dask composed nicely with cuDF. We didn’t need to do anything special to support it.

We also persisted the data in device memory.

After this, simple operations are easy and fast and use our eight GPUs.

>>> gleft.x.sum().compute()  # this takes 250ms
500004719.254711

Join

We’ll use standard Pandas syntax to merge the datasets, persist the result in RAM, and then wait.

out = gleft.merge(gright, on=['id'])  # this is lazy

Profile and analyze results

We now look at the Dask diagnostic plots for this computation.

Task stream and communication

When we look at Dask’s task stream plot we see that each of our eight threads (each of which manages a single GPU) spent most of its time in communication (red is communication time). The actual merge and concat tasks are quite fast relative to the data transfer time.

That’s not too surprising. For this computation I’ve turned off any attempt to communicate between devices (more on this below) so the data is being moved from the GPU to the CPU memory, then serialized and put onto a TCP socket. We’re moving tens of GB on a single machine, so we’re seeing about 1GB/s total throughput of the system, which is typical for TCP-on-localhost in Python.
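As a rough check on the data volume, the sketch below estimates the size of the two tables built earlier. It assumes 8 bytes per value (float64 for the random column, int64 for `id`; the dtypes are an assumption, not stated in the post) and ignores index and serialization overhead.

```python
BYTES_PER_VALUE = 8  # assumed: float64 / int64 columns

left_rows, right_rows = 1_000_000_000, 10_000_000
columns_per_table = 2  # one random value column plus the id column

left_gb = left_rows * columns_per_table * BYTES_PER_VALUE / 1e9
right_gb = right_rows * columns_per_table * BYTES_PER_VALUE / 1e9

throughput_gb_per_s = 1.0  # the approximate figure quoted above
seconds_to_move_once = (left_gb + right_gb) / throughput_gb_per_s

print(f"left: {left_gb:.2f} GB, right: {right_gb:.2f} GB")
print(f"one full pass at 1 GB/s: about {seconds_to_move_once:.0f} s")
```

Under these assumptions the left table alone is about 16 GB, so a single pass over the data at roughly 1 GB/s already takes on the order of tens of seconds.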

Flamegraph of computation

We can also look more deeply at the computational costs in Dask’s flamegraph-style plot. This shows which lines of our functions were taking up the most time (down to the Python level at least).

This flame graph shows which lines of cudf code we spent time on while computing (excluding the main communication costs mentioned above). It may be interesting for those trying to further optimize performance. It shows that most of our costs are in memory allocation. Like communication, this has actually also been fixed in RAPIDS’ optional memory management pool; it just isn’t the default yet, so I didn’t use it here.

Plans for efficient communication

The cuDF library actually has a decent approach to single-node multi-GPU communication that I’ve intentionally turned off for this experiment. That approach cleverly used Dask to communicate device pointer information using Dask’s normal channels (this is small and fast) and then used that information to initiate a side-channel communication for the bulk of the data. This approach was effective, but somewhat fragile. I’m inclined to move on from it in favor of …

UCX. The UCX project provides a single API that wraps around several transports like TCP, Infiniband, shared memory, and also GPU-specific transports. UCX claims to find the best way to communicate data between two points given the hardware available. If Dask were able to use this for communication then it would provide both efficient GPU-to-GPU communication on a single machine, and also efficient inter-machine communication when efficient networking hardware like Infiniband was present, even outside the context of GPUs.

There is some work we need to do here:

We need to make a Python wrapper around UCX
We need to make an optional Dask Comm around this ucx-py library that allows users to specify endpoints like ucx://path-to-scheduler
We need to make Python memoryview-like objects that refer to device memory
…

This work is already in progress by Akshay Venkatesh, who works on UCX, and Tom Augspurger, a core Dask/Pandas developer. I suspect that they’ll write about it soon. I’m looking forward to seeing what comes of it, both for Dask and for high performance Python generally.

It’s worth pointing out that this effort won’t just help GPU users. It should help anyone on advanced networking hardware, including the mainstream scientific HPC community.

Summary

Single-node Multi-GPU joins have a lot of promise. In fact, earlier RAPIDS developers got this running much faster than I was able to do above through the clever communication tricks I briefly mentioned. The main purpose of this post is to provide a benchmark for joins that we can use in the future, and to highlight when communication can be essential in parallel computing.

Now that GPUs have accelerated the computation time of each of our chunks of work we increasingly find that other systems become the bottleneck. We didn’t care as much about communication before because computational costs were comparable. Now that computation is an order of magnitude cheaper, other aspects of our stack become much more important.

I’m looking forward to seeing where this goes.

Come help!

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

If you’re interested in being paid to focus more on these topics, then consider applying for a job. NVIDIA’s RAPIDS team is looking to hire engineers for Dask development with GPUs and other data analytics library development projects.

Senior Library Software Engineer - RAPIDS

Dask, Pandas, and GPUs: first steps

2019-01-13T00:00:00+00:00

We’re building a distributed GPU Pandas dataframe out of cuDF and Dask Dataframe. This effort is young.

This post describes the current situation, our general approach, and gives examples of what does and doesn’t work today. We end with some notes on scaling performance.

You can also view the experiment in this post as a notebook.

And here is a table of results:

Architecture	Time	Bandwidth
Single CPU Core	3min 14s	50 MB/s
Eight CPU Cores	58s	170 MB/s
Forty CPU Cores	35s	285 MB/s
One GPU	11s	900 MB/s
Eight GPUs	5s	2000 MB/s

Building Blocks: cuDF and Dask

Building a distributed GPU-backed dataframe is a large endeavor. Fortunately we’re starting on a good foundation and can assemble much of this system from existing components:

The cuDF library aims to implement the Pandas API on the GPU. It gets good speedups on standard operations like reading CSV files, filtering and aggregating columns, joins, and so on.
```
import cudf  # looks and feels like Pandas, but runs on the GPU

df = cudf.read_csv('myfile.csv')
df = df[df.name == 'Alice']
df.groupby('id').value.mean()
```
cuDF is part of the growing RAPIDS initiative.
The Dask Dataframe library provides parallel algorithms around the Pandas API. It composes large operations like distributed groupbys or distributed joins from a task graph of many smaller single-node groupbys or joins accordingly (and many other operations).
```
import dask.dataframe as dd  # looks and feels like Pandas, but runs in parallel

df = dd.read_csv('myfile.*.csv')
df = df[df.name == 'Alice']
df.groupby('id').value.mean().compute()
```
The Dask distributed task scheduler provides general-purpose parallel execution given complex task graphs. It’s good for adding multi-node computing into an existing codebase.

Given these building blocks, our approach is to make the cuDF API close enough to Pandas that we can reuse the Dask Dataframe algorithms.
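The idea of assembling a large groupby from many smaller ones can be sketched without either library. The following is a minimal pure-Python illustration of a groupby-mean computed as per-partition partial results followed by a combine step. The function names `partial_sums` and `combine` are hypothetical, and Dask Dataframe's real implementation builds a task graph and differs in detail.

```python
from collections import defaultdict


def partial_sums(partition):
    """Per-partition step: running [sum, count] of the value for each id."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def combine(partials):
    """Merge the per-partition results and finish the mean."""
    total = defaultdict(lambda: [0.0, 0])
    for acc in partials:
        for key, (s, n) in acc.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


# Two partitions of (id, value) rows, standing in for two small dataframes.
partitions = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0), ("b", 6.0), ("a", 5.0)],
]
means = combine(partial_sums(p) for p in partitions)
```

The per-partition step only needs an ordinary single-node groupby, which is why a library with a Pandas-like API can in principle be swapped in for Pandas at that level.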

Benefits and Challenges to this approach

This approach has a few benefits:

We get to reuse the parallel algorithms found in Dask Dataframe originally designed for Pandas.
It consolidates the development effort within a single codebase so that future effort spent on CPU Dataframes benefits GPU Dataframes and vice versa. Maintenance costs are shared.
By building code that works equally with two DataFrame implementations (CPU and GPU) we establish conventions and protocols that will make it easier for other projects to do the same, either with these two Pandas-like libraries, or with future Pandas-like libraries.

This approach also aims to demonstrate that the ecosystem should support Pandas-like libraries, rather than just Pandas. For example, if (when?) the Arrow library develops a computational system then we’ll be in a better place to roll that in as well.
When doing any refactor we tend to clean up existing code.

For example, to make dask dataframe ready for a new GPU Parquet reader we end up refactoring and simplifying our Parquet I/O logic.

The approach also has some drawbacks. Namely, it places API pressure on cuDF to match Pandas so:

Slight differences in API now cause larger problems, such as these:
- Join column ordering differs rapidsai/cudf #251
- Groupby aggregation column ordering differs rapidsai/cudf #483
cuDF has some pressure on it to repeat what some believe to be mistakes in the Pandas API.

For example, cuDF today supports missing values arguably more sensibly than Pandas. Should cuDF have to revert to the old way of doing things just to match Pandas semantics? Dask Dataframe will probably need to be more flexible in order to handle evolution and small differences in semantics.

Alternatives

We could also write a new dask-dataframe-style project around cuDF that deviates from the Pandas/Dask Dataframe API. Until recently this has actually been the approach, and the dask-cudf project did exactly this. This was probably a good choice early on to get started and prototype things. The project was able to implement a wide range of functionality including groupby-aggregations, joins, and so on using dask delayed.

We’re redoing this now on top of dask dataframe though, which means that we’re losing some functionality that dask-cudf already had, but hopefully the functionality that we add now will be more stable and established on a firmer base.

Status Today

Today very little works, but what does is decently smooth.

Here is a simple example that reads some data from many CSV files, picks out a column, and does some aggregations.

from dask_cuda import LocalCUDACluster
import dask_cudf
from dask.distributed import Client

cluster = LocalCUDACluster()  # runs on eight local GPUs
client = Client(cluster)

gdf = dask_cudf.read_csv('data/nyc/many/*.csv')  # wrap around many CSV files

>>> gdf.passenger_count.sum().compute()
184464740

Also note, NYC Taxi ridership is significantly less than it was a few years ago.

What I’m excited about in the example above

All of the infrastructure surrounding the cuDF code, like the cluster setup, diagnostics, JupyterLab environment, and so on, came for free, like any other new Dask project.

Here is an image of my JupyterLab setup
Our df object is actually just a normal Dask Dataframe. We didn’t have to write new __repr__, __add__, or .sum() implementations, and probably many functions we didn’t think about work well today (though also many don’t).
We’re tightly integrated and more connected to other systems. For example, if we wanted to convert our dask-cudf-dataframe to a dask-pandas-dataframe then we would just use the cuDF to_pandas function:
```
df = df.map_partitions(cudf.DataFrame.to_pandas)
```
We don’t have to write anything special like a separate .to_dask_dataframe method or handle other special cases.

Dask parallelism is orthogonal to the choice of CPU or GPU.
It’s easy to switch hardware. By avoiding separate dask-cudf code paths it’s easier to add cuDF to an existing Dask+Pandas codebase to run on GPUs, or to remove cuDF and use Pandas if we want our code to be runnable without GPUs.

There are more examples of this in the scaling section below.

What’s wrong with the example above

In general the answer is many small things.

The cudf.read_csv function doesn’t yet support reading chunks from a single CSV file, and so doesn’t work well with very large CSV files. We had to split our large CSV files into many smaller CSV files first with normal Dask+Pandas:
```
import dask.dataframe as dd
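(dd.read_csv('few-large/*.csv')    # read the few large files lazily
   .repartition(npartitions=100)   # split into 100 smaller partitions
   .to_csv('many-small/*.csv', index=False))  # write one CSV file per partition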
```
(See rapidsai/cudf #568)
Many operations that used to work in dask-cudf like groupby-aggregations and joins no longer work. We’re going to need to slightly modify many cuDF APIs over the next couple of months to more closely match their Pandas equivalents.
I ran the timing cell twice because it currently takes a few seconds to import cudf. (See rapidsai/cudf #627)
We had to make Dask Dataframe a bit more flexible and assume less about its constituent dataframes being exactly Pandas dataframes (see dask/dask #4359 and dask/dask #4375 for examples). I suspect that there will be many more small changes like these necessary in the future.

These problems are representative of dozens more similar issues. They are all fixable and indeed, many are actively being fixed today by the good folks working on RAPIDS.
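The kind of change made in dask/dask #4375, replacing `isinstance` checks against Pandas with a duck-typed check, can be illustrated with a small sketch. The function below is a simplified stand-in written for this example, not Dask's actual `is_dataframe_like`; the attribute list and the `FakeGPUFrame` class are assumptions.

```python
def looks_like_dataframe(obj):
    """Duck-typed check: does obj expose a few dataframe-style attributes?"""
    required = ("columns", "dtypes", "groupby", "merge")
    return all(hasattr(obj, name) for name in required)


class FakeGPUFrame:
    """Stand-in for a non-Pandas dataframe type (hypothetical)."""
    columns = ["x", "id"]
    dtypes = {"x": "float64", "id": "int64"}

    def groupby(self, by):
        raise NotImplementedError

    def merge(self, other, on=None):
        raise NotImplementedError


frame_ok = looks_like_dataframe(FakeGPUFrame())  # accepted without being Pandas
list_ok = looks_like_dataframe([1, 2, 3])        # rejected: no dataframe attributes
```

Checking behavior rather than type is what lets the same code path accept Pandas, cuDF, or any future Pandas-like object.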

Near Term Schedule

The RAPIDS group is currently busy working to release 0.5, which includes some of the fixes necessary to run the example above, and also many unrelated stability improvements. This will probably keep them busy for a week or two during which I don’t expect to see much Dask + cuDF work going on other than planning.

After that, Dask parallelism support will be a top priority, so I look forward to seeing some rapid progress here.

Scaling Results

In my last post about combining Dask Array with CuPy, a GPU-accelerated Numpy, we saw impressive speedups from using many GPUs on a simple problem that manipulated some simple random data.

Dask Array + CuPy on Random Data

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1 min 37s
Eight GPUs	19s

That exercise was easy to scale because it was almost entirely bound by the computation of creating random data.

Dask DataFrame + cuDF on CSV data

We did a similar study on the read_csv example above, which is bound mostly by reading CSV data from disk and then parsing it. You can see a notebook available here. We have similar (though less impressive) numbers to present.

Architecture	Time	Bandwidth
Single CPU Core	3min 14s	50 MB/s
Eight CPU Cores	58s	170 MB/s
Forty CPU Cores	35s	285 MB/s
One GPU	11s	900 MB/s
Eight GPUs	5s	2000 MB/s

The bandwidth numbers were computed by noting that the data was around 10 GB on disk.
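That calculation is simple enough to reproduce. The sketch below divides an assumed 10 GB (the post only says "around 10 GB") by each timing in the table; the results land close to the rounded bandwidth figures shown.

```python
DATA_BYTES = 10e9  # "around 10 GB on disk"; the exact size is an assumption

timings_s = {
    "Single CPU Core": 3 * 60 + 14,
    "Eight CPU Cores": 58,
    "Forty CPU Cores": 35,
    "One GPU": 11,
    "Eight GPUs": 5,
}

# Bandwidth in MB/s = bytes / seconds / 1e6
bandwidth_mb_s = {name: DATA_BYTES / t / 1e6 for name, t in timings_s.items()}

for name, mb_s in bandwidth_mb_s.items():
    print(f"{name:<16} {mb_s:7.0f} MB/s")
```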

Analysis

First, I want to emphasize again that it’s easy to test a wide variety of architectures using this setup because of the Pandas API compatibility between all of the different projects. We’re seeing a wide range of performance (40x span) across a variety of different hardware with a wide range of cost points.

Second, note that this problem scales less well than our previous example with CuPy, both on CPU and GPU. I suspect that this is because this example is also bound by I/O and not just number-crunching. While the jump from single-CPU to single-GPU is large, the jump from single-CPU to many-CPU or single-GPU to many-GPU is not as large as we would have liked. For GPUs for example we got around a 2x speedup when we added 8x as many GPUs.

At first one might think that this is because we’re saturating disk read speeds. However two pieces of evidence go against that guess:

NVIDIA folks familiar with my current hardware inform me that they’re able to get much higher I/O throughput when they’re careful
The CPU scaling is similarly poor, despite the fact that it’s obviously not reaching full I/O bandwidth

Instead, it’s likely that we’re just not treating our disks and IO pipelines carefully.

We might consider thinking more carefully about data locality within a single machine. Alternatively, we might just choose to use a smaller machine, or many smaller machines. My team has been asking me to start playing with some cheaper systems than a DGX; I may experiment with those soon. It may be that for data-loading and pre-processing workloads the previous wisdom of “pack as much computation as you can into a single box” no longer holds (without us doing more work, that is).

Come help

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

Senior Library Software Engineer - RAPIDS

GPU Dask Arrays, first steps

2019-01-03T00:00:00+00:00

The following code creates and manipulates 2 TB of randomly generated data.

import dask.array as da

rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

On a single CPU, this computation takes two hours.

On an eight-GPU single-node system this computation takes nineteen seconds.

Actually this computation isn’t that impressive. It’s a simple workload, for which most of the time is spent creating and destroying random data. The computation and communication patterns are simple, reflecting the simplicity commonly found in data processing workloads.

What is impressive is that we were able to create a distributed parallel GPU array quickly by composing these four existing libraries:

CuPy provides a partial implementation of Numpy on the GPU.
Dask Array provides chunked algorithms on top of Numpy-like libraries like Numpy and CuPy.

This enables us to operate on more data than we could fit in memory by operating on that data in chunks.
The Dask distributed task scheduler runs those algorithms in parallel, easily coordinating work across many CPU cores.
Dask CUDA extends Dask distributed with GPU support.

These tools already exist. We had to connect them together with a small amount of glue code and minor modifications. By mashing these tools together we can quickly build and switch between different architectures to explore what is best for our application.
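To make the idea of a chunked algorithm concrete, here is a minimal pure-Python sketch that evaluates `(x + 1)[::2, ::2].sum()` one chunk at a time on a small array, never holding more than one chunk in memory. It is an illustration only, not how Dask Array is implemented; the helper names are hypothetical, and it assumes even chunk sizes that divide the array shape exactly so that slicing each chunk matches slicing the whole array.

```python
import random


def make_chunk(rows, cols, seed):
    """Create one chunk of normally distributed data (mean 10, std 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(10, 1) for _ in range(cols)] for _ in range(rows)]


def chunk_result(chunk):
    """Apply (x + 1)[::2, ::2].sum() to a single chunk."""
    return sum(value + 1 for row in chunk[::2] for value in row[::2])


def chunked_sum(shape, chunk_shape):
    """Sum over all chunks, creating and discarding one chunk at a time."""
    total = 0.0
    seed = 0
    for _ in range(0, shape[0], chunk_shape[0]):
        for _ in range(0, shape[1], chunk_shape[1]):
            total += chunk_result(make_chunk(*chunk_shape, seed))
            seed += 1
    return total


total = chunked_sum(shape=(40, 40), chunk_shape=(10, 10))
n_selected = (40 // 2) * (40 // 2)  # elements kept by the [::2, ::2] slice
print(total / n_selected)           # close to 11, the mean of x + 1
```

Because each chunk is independent, the inner call is also the natural unit of work to hand to a scheduler, on CPU or GPU.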

For this example we relied on the following changes upstream:

Comparison among single/multi CPU/GPU

We can now easily run some experiments on different architectures. This is easy because …

We can switch between CPU and GPU by switching between Numpy and CuPy.
We can switch between single/multi-CPU-core and single/multi-GPU by switching between Dask’s different task schedulers.

These libraries allow us to quickly judge the costs of this computation for the following hardware choices:

Single-threaded CPU
Multi-threaded CPU with 40 cores (80 H/T)
Single-GPU
Multi-GPU on a single machine with 8 GPUs

We present code for these four choices below, but first, we present a table of results.

Results

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1 min 37s
Eight GPUs	19s

Setup

import cupy
import dask.array as da

# generate chunked dask arrays of many numpy random arrays
rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

print(x.nbytes / 1e9)  # 2 TB
# 2000.0
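The same figure, along with the size and number of chunks, follows from simple arithmetic on the shape. This sketch assumes 8-byte float64 elements, which is the dtype that `normal` produces by default.

```python
shape = (500_000, 500_000)
chunks = (10_000, 10_000)
itemsize = 8  # bytes per float64 element

total_gb = shape[0] * shape[1] * itemsize / 1e9      # whole array, in GB
chunk_mb = chunks[0] * chunks[1] * itemsize / 1e6    # one chunk, in MB
n_chunks = (shape[0] // chunks[0]) * (shape[1] // chunks[1])

print(total_gb, chunk_mb, n_chunks)
```

So the 2 TB array is split into 2500 chunks of 800 MB each, and only a handful of those need to exist at any one time.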

CPU timing

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

Single GPU timing

We switch from CPU to GPU by changing our data source to generate CuPy arrays rather than NumPy arrays. Everything else should more or less work the same without special handling for CuPy.

(This actually isn’t true yet; many things in dask.array will break for non-NumPy arrays, but we’re actively working on it within Dask, within NumPy, and within the GPU array libraries. Regardless, everything in this example works fine.)

# generate chunked dask arrays of many cupy random arrays
rs = da.random.RandomState(RandomState=cupy.random.RandomState)  # <-- we specify cupy here
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')

Multi GPU timing

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

(x + 1)[::2, ::2].sum().compute()

And again, here are the results:

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1 min 37s
Eight GPUs	19s

First, this is my first time playing with a 40-core system. I was surprised to see that many cores. I was also pleased to see that Dask’s normal threaded scheduler happily saturates many cores.

Although later on it did dive down to around 5000-6000%, and if you do the math you’ll see that we’re not getting a 40x speedup. My guess is that performance would improve if we were to play with some mixture of threads and processes, like having ten processes with eight threads each.

The jump from the biggest multi-core CPU to a single GPU is still an order of magnitude though. The jump to multi-GPU is another order of magnitude, and brings the computation down to 19s, which is short enough that I’m willing to wait for it to finish before walking away from my computer.
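The ratios implied by the results table can be checked directly. The sketch below converts each timing to seconds and computes the speedup of every row relative to the row before it.

```python
timings_s = {
    "Single CPU Core": 2 * 3600 + 39 * 60,  # 2hr 39min
    "Forty CPU Cores": 11 * 60 + 30,        # 11min 30s
    "One GPU": 1 * 60 + 37,                 # 1min 37s
    "Eight GPUs": 19,
}

names = list(timings_s)
# Speedup of each row relative to the row before it.
step_speedups = {
    f"{a} -> {b}": timings_s[a] / timings_s[b]
    for a, b in zip(names, names[1:])
}

for step, ratio in step_speedups.items():
    print(f"{step}: {ratio:.1f}x")
```

Forty cores give roughly a 14x speedup over one core rather than 40x, while the steps to one GPU and then to eight GPUs are roughly 7x and 5x respectively.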

Actually, it’s quite fun to watch on the dashboard (especially after you’ve been waiting for three hours for the sequential solution to run):

Conclusion

This computation was simple, but the range in architecture just explored was extensive. We swapped out the underlying architecture from CPU to GPU (which had an entirely different codebase) and tried both multi-core CPU parallelism as well as multi-GPU many-core parallelism.

We did this in less than twenty lines of code, making this experiment something that an undergraduate student or other novice could perform at home. We’re approaching a point where experimenting with multi-GPU systems is approachable to non-experts (at least for array computing).

Here is a notebook for the experiment above

Room for improvement

We can work to expand the computation above in a variety of directions. There is a ton of work we still have to do to make this reliable.

Use more complex array computing workloads

The Dask Array algorithms were designed first around Numpy. We’ve only recently started making them more generic to other kinds of arrays (like GPU arrays, sparse arrays, and so on). As a result there are still many bugs when exploring these non-Numpy workloads.

For example, if you were to switch sum for mean in the computation above, you would get an error, because our mean computation contains an easy-to-fix error that assumes Numpy arrays exactly.
Use Pandas and cuDF instead of Numpy and CuPy

The cuDF library aims to reimplement the Pandas API on the GPU, much like how CuPy reimplements the NumPy API. Using Dask DataFrame with cuDF will require some work on both sides, but is quite doable.

I believe that there is plenty of low-hanging fruit here.
Improve and move LocalCUDACluster

The LocalCUDACluster class used above is an experimental Cluster type that creates as many workers locally as you have GPUs, and assigns each worker to prefer a different GPU. This makes it easy for people to load balance across GPUs on a single-node system without thinking too much about it. This appears to be a common pain-point in the ecosystem today.

However, the LocalCUDACluster probably shouldn’t live in the dask/distributed repository (it seems too CUDA specific) so will probably move to some dask-cuda repository. Additionally there are still many questions about how to handle concurrency on top of GPUs, balancing between CPU cores and GPU cores, and so on.
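One plausible way to give each worker a preferred GPU is through the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA applications use to decide which devices they see and in what order. The sketch below is an assumption about the mechanism, not the class's actual code, and `visible_devices_for_worker` is a hypothetical helper.

```python
def visible_devices_for_worker(worker_index, n_gpus):
    """Rotate the device list so that worker i sees GPU i first."""
    devices = list(range(n_gpus))
    rotated = devices[worker_index:] + devices[:worker_index]
    return ",".join(str(d) for d in rotated)


n_gpus = 8
# Environment each worker process would be started with (illustrative only).
worker_env = {
    i: {"CUDA_VISIBLE_DEVICES": visible_devices_for_worker(i, n_gpus)}
    for i in range(n_gpus)
}

for i, env in worker_env.items():
    print(i, env["CUDA_VISIBLE_DEVICES"])
```

With this scheme every worker still sees all devices, but each one defaults to a different GPU, which spreads the load without any user configuration.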
Multi-node computation

There’s no reason that we couldn’t accelerate computations like these further by using multiple multi-GPU nodes. This is doable today with manual setup, but we should also improve the existing deployment solutions dask-kubernetes, dask-yarn, and dask-jobqueue, to make this easier for non-experts who want to use a cluster of multi-GPU resources.
Expense

The machine I ran this on is expensive. Well, it’s nowhere close to as expensive to own and operate as a traditional cluster that you would need for these kinds of results, but it’s still well beyond the price point of a hobbyist or student.

It would be useful to run this on a more budget system to get a sense of the tradeoffs on more reasonably priced systems. I should probably also learn more about provisioning GPUs on the cloud.

Come help!

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

If you’re interested in being paid to focus more on these topics, then consider applying for a job. The NVIDIA corporation is hiring around the use of Dask with GPUs.

Senior Library Software Engineer - RAPIDS

That’s a fairly generic posting. If you’re interested but the posting doesn’t seem to fit, then please apply anyway and we’ll tweak things.