Dask Working Notes - Posts tagged python

Load Large Image Data with Dask Array

2019-06-20T00:00:00+00:00

This post explores simple workflows to load large stacks of image data with Dask array.

In particular, we start with a directory full of TIFF files of images like the following:

$ ls raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif

and show how to stitch these together into large lazy arrays using the dask-image library

>>> import dask_image.imread
>>> x = dask_image.imread.imread('raw/*.tif')

or by writing your own Dask delayed image reader function.

	Array	Chunk
Bytes	3.16 GB	316.15 MB
Shape	(2010, 1024, 768)	(201, 1024, 768)
Count	30 Tasks	10 Chunks
Type	uint16	numpy.ndarray

Eventually we’ll be able to perform complex calculations on this Dask array.

Disclaimer: we’re not going to produce rendered images like the above in this post. These were created with NVidia IndeX, a completely separate tool chain from what is being discussed here. This post covers the first step of image loading.

Series Overview

A common case in fields that acquire large amounts of imaging data is to write out smaller acquisitions into many small files. These files can tile a larger space, sub-sample from a larger time period, and may contain multiple channels. The acquisition techniques themselves are often state of the art and constantly pushing the envelope in terms of how large a field of view can be acquired, at what resolution, and at what quality.

Once acquired this data presents a number of challenges. Algorithms often designed and tested to work on very small pieces of this data need to be scaled up to work on the full dataset. It might not be clear at the outset what will actually work and so exploration still plays a very big part of the whole process.

Historically this analytical process has involved a lot of custom code. Often the analytical process is stitched together by a series of scripts, possibly in several different languages, that write various intermediate results to disk. Thanks to advances in modern tooling, these processes can be significantly improved. In this series of blogposts, we will outline ways for image scientists to leverage different tools to move towards a high-level, friendly, cohesive, interactive analytical pipeline.

Post Overview

This post in particular focuses on loading and managing large stacks of image data in parallel from Python.

Loading large image data can be a complex and often unique problem. Different groups may choose to store this across many files on disk, a commodity or custom database solution, or they may opt to store it in the cloud. Not all datasets within the same group may be treated the same for a variety of reasons. In short, this means loading data is a hard and expensive problem.

Despite data being stored in many different ways, often groups want to reapply the same analytical pipeline to these datasets. However if the data pipeline is tightly coupled to a particular way of loading the data for later analytical steps, it may be very difficult if not impossible to reuse an existing pipeline. In other words, there is friction between the loading and analysis steps, which frustrates efforts to make things reusable.

Having a modular and general way to load data makes it easy to present data stored differently in a standard way. Further, having a standard way to present data to analytical pipelines allows that part of the pipeline to focus on what it does best: analysis! In general, this should decouple these two components in a way that improves the experience of users involved in all parts of the pipeline.

We will use image data generously provided by Gokul Upadhyayula at the Advanced Bioimaging Center at UC Berkeley and discussed in this paper (preprint), though the workloads presented here should work for any kind of imaging data, or array data generally.

Load image data with Dask

Let’s start again with our image data from the top of the post:

$ ls /path/to/files/raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif

Load a single sample image with Scikit-Image

To load a single image, we use imageio (Scikit-Image is used further below only to display slices):

>>> import glob
>>> filenames = glob.glob("/path/to/files/raw/*.tif")
>>> len(filenames)
597

>>> import imageio
>>> sample = imageio.imread(filenames[0])
>>> sample.shape
(201, 1024, 768)

Each filename corresponds to some 3d chunk of a larger image. We can look at a few 2d slices of this single 3d chunk to get some context.

import matplotlib.pyplot as plt
import skimage.io
plt.figure(figsize=(10, 10))
skimage.io.imshow(sample[:, :, 0])

plt.figure(figsize=(10, 10))
skimage.io.imshow(sample[:, 0, :])

plt.figure(figsize=(10, 10))
skimage.io.imshow(sample[0, :, :])

Investigate Filename Structure

These are slices from only one chunk of a much larger aggregate image. Our interest here is combining the pieces into a large image stack. It is common to see a naming structure in the filenames, such as the pattern below. Each filename may then indicate a channel, time step, and spatial location, with placeholders like <i> standing for numeric values (possibly with units). Individual filenames may carry more or less information and may notate it differently than we have.

mydata_ch<i>_<j>t_<k>x_<l>y_<m>z.tif
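A naming pattern like this can be pulled apart with a regular expression. The following is a minimal sketch, assuming filenames follow exactly the illustrative pattern above; the helper name `parse_location` and the example filename are hypothetical, and real datasets (including the one in this post) will need their own pattern.

```python
import re

# Matches the illustrative pattern mydata_ch<i>_<j>t_<k>x_<l>y_<m>z.tif
LOCATION_RE = re.compile(
    r"_ch(?P<channel>\d+)_(?P<t>\d+)t_(?P<x>\d+)x_(?P<y>\d+)y_(?P<z>\d+)z\.tif$"
)


def parse_location(filename):
    """Return the numeric fields encoded in a filename as a dict of ints."""
    match = LOCATION_RE.search(filename)
    if match is None:
        raise ValueError(f"filename does not follow the expected pattern: {filename}")
    return {name: int(value) for name, value in match.groupdict().items()}


# Hypothetical filename, used only to exercise the parser
location = parse_location("mydata_ch2_0005t_001x_000y_013z.tif")
```

Raising on a non-matching name is a deliberate choice here: silently skipping files tends to leave holes in the stitched array that are hard to notice later.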

In principle with NumPy we might allocate a giant array and then iteratively load images and place them into the giant array.

full_array = np.empty((..., ..., ..., ..., ...), dtype=sample.dtype)

for fn in filenames:
    img = imageio.imread(fn)
    index = get_location_from_filename(fn)  # We need to write this function
    full_array[index, :, :, :] = img

However, if our data is large then we can’t load it all into memory at once into a single NumPy array like this, and instead we need to be a bit more clever to handle it efficiently. One approach here is to use Dask, which handles larger-than-memory workloads easily.

Lazily load images with Dask Array

Now we learn how to lazily load and stitch together image data with Dask array. We’ll start with simple examples first and then move onto the full example with this more complex dataset afterwards.

We can delay the imageio.imread calls with Dask Delayed.

import dask
import dask.array as da

lazy_arrays = [dask.delayed(imageio.imread)(fn) for fn in filenames]
lazy_arrays = [da.from_delayed(x, shape=sample.shape, dtype=sample.dtype)
               for x in lazy_arrays]

Note: here we’re assuming that all of the images have the same shape and dtype as the sample file that we loaded above. This is not always the case. See the dask_image note below in the Future Work section for an alternative.
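To see that wrapping a reader this way really defers the work, here is a small sketch that swaps the real reader for a stand-in. The function `fake_imread`, the tiny shape, and the filenames are all hypothetical and exist only for illustration; the pattern of `dask.delayed` followed by `da.from_delayed` is the same as above.

```python
import dask
import dask.array as da
import numpy as np

read_log = []  # records each time the stand-in reader actually runs


def fake_imread(filename):
    """Hypothetical stand-in for imageio.imread: returns a small constant volume."""
    read_log.append(filename)
    return np.full((2, 3, 4), len(filename), dtype=np.uint16)


names = ["a.tif", "bb.tif", "ccc.tif"]
lazy = [
    da.from_delayed(dask.delayed(fake_imread)(fn), shape=(2, 3, 4), dtype=np.uint16)
    for fn in names
]

reads_before = len(read_log)  # building the lazy arrays reads nothing
first = lazy[0].compute()     # computing one array triggers one read
reads_after = len(read_log)
```

Because the shape and dtype are supplied by hand, Dask does not check them until the data is actually loaded, which is why the assumption in the note above matters.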

We haven’t yet stitched these together. We have hundreds of single-chunk Dask arrays, each of which lazily loads a single 3d chunk of data from disk. Let’s look at a single array.

>>> lazy_arrays[0]

	Array	Chunk
Bytes	316.15 MB	316.15 MB
Shape	(201, 1024, 768)	(201, 1024, 768)
Count	2 Tasks	1 Chunks
Type	uint16	numpy.ndarray

This is a lazy 3-dimensional Dask array of a single 300MB chunk of data. That chunk is created by loading in a particular TIFF file. Normally Dask arrays are composed of many chunks. We can concatenate many of these single-chunked Dask arrays into a multi-chunked Dask array with functions like da.concatenate and da.stack.

Here we concatenate the first ten Dask arrays along a few axes, to get an easier-to-understand picture of how this looks. Take a look at how the shape changes as we change the axis= parameter, both in the table on the left and the image on the right.

da.concatenate(lazy_arrays[:10], axis=0)

	Array	Chunk
Bytes	3.16 GB	316.15 MB
Shape	(2010, 1024, 768)	(201, 1024, 768)
Count	30 Tasks	10 Chunks
Type	uint16	numpy.ndarray

da.concatenate(lazy_arrays[:10], axis=1)

	Array	Chunk
Bytes	3.16 GB	316.15 MB
Shape	(201, 10240, 768)	(201, 1024, 768)
Count	30 Tasks	10 Chunks
Type	uint16	numpy.ndarray

da.concatenate(lazy_arrays[:10], axis=2)

	Array	Chunk
Bytes	3.16 GB	316.15 MB
Shape	(201, 1024, 7680)	(201, 1024, 768)
Count	30 Tasks	10 Chunks
Type	uint16	numpy.ndarray

Or, if we wanted to make a new dimension, we would use da.stack. In this case note that we’ve run out of easily visible dimensions, so you should take note of the listed shape in the table input on the left more than the picture on the right. Notice that we’ve stacked these 3d images into a 4d image.

da.stack(lazy_arrays[:10])

	Array	Chunk
Bytes	3.16 GB	316.15 MB
Shape	(10, 201, 1024, 768)	(1, 201, 1024, 768)
Count	30 Tasks	10 Chunks
Type	uint16	numpy.ndarray

These are the common case situations, where you have a single axis along which you want to stitch images together.
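The shape and chunk arithmetic shown in the tables can be checked quickly on small in-memory arrays. This is only a sketch with made-up sizes; the three `pieces` stand in for the per-file arrays.

```python
import numpy as np
import dask.array as da

# Three small single-chunk arrays standing in for the per-file arrays
pieces = [
    da.from_array(np.zeros((2, 3, 4), dtype=np.uint16), chunks=(2, 3, 4))
    for _ in range(3)
]

along_0 = da.concatenate(pieces, axis=0)  # grows the first axis
along_2 = da.concatenate(pieces, axis=2)  # grows the last axis
stacked = da.stack(pieces)                # adds a new leading axis

print(along_0.shape, along_0.chunks)
print(along_2.shape, along_2.chunks)
print(stacked.shape, stacked.chunks)
```

In each case the original pieces survive as individual chunks, so no data is copied or rechunked just by stitching.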

Full example

This works fine for combining along a single axis. However, if we need to combine across multiple axes, we need to perform multiple concatenate steps. Fortunately there is a simpler option, da.block, which can concatenate along multiple axes at once if you give it a nested list of Dask arrays.

a = da.block([[lazy_array_00, lazy_array_01],
              [lazy_array_10, lazy_array_11]])
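As a small illustration of how the nesting maps onto axes, here is a sketch with four tiny tiles. The helper `tile` and the fill values are hypothetical; the inner lists are joined along the last axis and the outer list along the axis before it, as with np.block.

```python
import numpy as np
import dask.array as da


def tile(value):
    """Hypothetical helper: a single-chunk 2x3 array filled with one value."""
    return da.from_array(np.full((2, 3), value), chunks=(2, 3))


grid = da.block([[tile(0), tile(1)],
                 [tile(10), tile(11)]])

result = grid.compute()
print(grid.shape, grid.chunks)
print(result)
```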

We now do the following:

Parse each filename to learn where it should live in the larger array
See how many files are in each of our relevant dimensions
Allocate a NumPy object-dtype array of the appropriate size, where each element of this array will hold a single-chunk Dask array
Go through our filenames and insert the proper Dask array into the right position
Call da.block on the result

This code is a bit complex, but shows what this looks like in a real-world setting:

import os
import numpy as np

# Get various dimensions

fn_comp_sets = dict()
for fn in filenames:
    for i, comp in enumerate(os.path.splitext(fn)[0].split("_")):
        fn_comp_sets.setdefault(i, set())
        fn_comp_sets[i].add(comp)
fn_comp_sets = list(map(sorted, fn_comp_sets.values()))

remap_comps = [
    dict(map(reversed, enumerate(fn_comp_sets[2]))),
    dict(map(reversed, enumerate(fn_comp_sets[4])))
]

# Create an empty object array to organize each chunk that loads a TIFF
a = np.empty(tuple(map(len, remap_comps)) + (1, 1, 1), dtype=object)

for fn, x in zip(filenames, lazy_arrays):
    channel = int(fn[fn.index("_ch") + 3:].split("_")[0])
    stack = int(fn[fn.index("_stack") + 6:].split("_")[0])

    a[channel, stack, 0, 0, 0] = x

# Stitch together the many blocks into a single array
a = da.block(a.tolist())

	Array	Chunk
Bytes	188.74 GB	316.15 MB
Shape	(3, 199, 201, 1024, 768)	(1, 1, 201, 1024, 768)
Count	2985 Tasks	597 Chunks
Type	uint16	numpy.ndarray

That’s a roughly 190 GB logical array, composed of around 600 chunks, each about 300 MB in size. We can now do normal NumPy-like computations on this array using Dask Array, but we’ll save that for a future post.

>>> # array computations would work fine, and would run in low memory
>>> # but we'll save actual computation for future posts
>>> a.sum().compute()
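The sizes reported in the table above follow directly from the shape and dtype. As a quick check, assuming 2 bytes per uint16 value and decimal units (1 MB = 10^6 bytes):

```python
from math import prod

chunk_shape = (201, 1024, 768)   # one TIFF file
n_channels, n_stacks = 3, 199    # leading two axes of the full array
itemsize = 2                     # bytes per uint16 value

chunk_bytes = prod(chunk_shape) * itemsize
n_chunks = n_channels * n_stacks
total_bytes = chunk_bytes * n_chunks

print(f"{chunk_bytes / 1e6:.2f} MB per chunk")
print(f"{n_chunks} chunks")
print(f"{total_bytes / 1e9:.2f} GB in total")
```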

Save Data

To simplify data loading in the future, we store this in a large chunked array format like Zarr using the to_zarr method.

a.to_zarr("mydata.zarr")

We may add additional information about the image data as attributes. Storing the data this way both makes things simpler for future users (they can read the full dataset with a single line using da.from_zarr) and much more performant, because Zarr is an analysis-ready format that is efficiently encoded for computation.

Zarr uses the Blosc library for compression by default. For scientific imaging data, we can optionally pass compression options that provide a good tradeoff between compression ratio and speed.

from numcodecs import Blosc
a.to_zarr("mydata.zarr", compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE))

Future Work

The workload above is generic and straightforward. It works well in simple cases and also extends well to more complex cases, provided you’re willing to write some for-loops and parsing code around your custom logic. It works on a single small-scale laptop as well as a large HPC or Cloud cluster. If you have a function that turns a filename into a NumPy array, you can generate a large lazy Dask array using that function, Dask Delayed, and Dask Array.

Dask Image

However, we can make things a bit easier for users if we specialize a bit. For example the Dask Image library has a parallel image reader function, which automates much of our work above in the simple case.

>>> import dask_image.imread
>>> x = dask_image.imread.imread('raw/*.tif')

Similarly libraries like Xarray have readers for other file formats, like GeoTIFF.

As domains do more and more work like what we did above they tend to write down common patterns into domain-specific libraries, which then increases the accessibility and user base of these tools.

GPUs

If we have special hardware lying around like a few GPUs, we can move the data over to it and perform computations with a library like CuPy, which mimics NumPy very closely. We thus benefit from the same operations listed above, but with the added performance of GPUs behind them.

import cupy as cp
a_gpu = a.map_blocks(cp.asarray)

Computation

Finally, in future blogposts we plan to talk about how to compute on our large Dask arrays using common image-processing workloads like overlapping stencil functions, segmentation and deconvolution, and integrating with other libraries like ITK.

Python and GPUs: A Status Update

2019-06-19T00:00:00+00:00

This blogpost was delivered in talk form at the recent PASC 2019 conference. Slides for that talk are here.

We’re improving the state of scalable GPU computing in Python.

This post lays out the current status, and describes future work. It also summarizes and links to several other blogposts from recent months that drill down into different topics for the interested reader.

Broadly, we briefly cover the following categories:

Python libraries written in CUDA like CuPy and RAPIDS
Python-CUDA compilers, specifically Numba
Scaling these libraries out with Dask
Network communication with UCX
Packaging with Conda

Performance of GPU accelerated Python Libraries

Probably the easiest way for a Python programmer to get access to GPU performance is to use a GPU-accelerated Python library. These provide a set of common operations that are well tuned and integrate well together.

Many users know libraries for deep learning like PyTorch and TensorFlow, but there are several others for more general-purpose computing. These tend to copy the APIs of popular Python projects:

Numpy on the GPU: CuPy
Numpy on the GPU (again): Jax
Pandas on the GPU: RAPIDS cuDF
Scikit-Learn on the GPU: RAPIDS cuML

These libraries build GPU-accelerated variants of popular Python libraries like NumPy, Pandas, and Scikit-Learn. In order to better understand the relative performance differences, Peter Entschev recently put together a benchmark suite to help with comparisons. He has produced the following image showing the relative speedup between GPU and CPU:

There are lots of interesting results there. Peter goes into more depth on this in his blogpost.

More broadly though, we see that there is variability in performance. Our mental model for what is fast and slow on the CPU doesn’t necessarily carry over to the GPU. Fortunately though, due to consistent APIs, users that are familiar with Python can easily experiment with GPU acceleration without learning CUDA.

Numba: Compiling Python to CUDA

See also this recent blogpost about Numba stencils and the attached GPU notebook

The built-in operations in GPU libraries like CuPy and RAPIDS cover most common operations. However, in real-world settings we often find messy situations that require writing a little bit of custom code. Switching down to C/C++/CUDA in these cases can be challenging, especially for users that are primarily Python developers. This is where Numba can come in.

Python has this same problem on the CPU as well. Users often couldn’t be bothered to learn C/C++ to write fast custom code. To address this there are tools like Cython or Numba, which let Python programmers write fast numeric code without learning much beyond the Python language.

For example, Numba accelerates the for-loop style code below about 500x on the CPU, from slow Python speeds up to fast C/Fortran speeds.

import numba  # We added these two lines for a 500x speedup

@numba.jit    # We added these two lines for a 500x speedup
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total

The ability to drop down to low-level performant code without context switching out of Python is useful, particularly if you don’t already know C/C++ or have a compiler chain set up for you (which is the case for most Python users today).

This benefit is even more pronounced on the GPU. While many Python programmers know a little bit of C, very few of them know CUDA. Even if they did, they would probably have difficulty in setting up the compiler tools and development environment.

Enter numba.cuda.jit, Numba’s backend for CUDA. Numba.cuda.jit allows Python users to author, compile, and run CUDA code, written in Python, interactively without leaving a Python session. Here is an image of writing a stencil computation that smoothes a 2d-image, all from within a Jupyter Notebook:

Here is a simplified comparison of Numba CPU/GPU code to compare programming style. The GPU code gets a 200x speed improvement over a single CPU core.

CPU – 600 ms

import numba
import numpy as np

@numba.jit
def _smooth(x):
    out = np.empty_like(x)
    for i in range(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                         x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) // 9

    return out

or if we use the fancy numba.stencil decorator …

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9

GPU – 3 ms

from numba import cuda

@cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
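To sanity-check either kernel on small inputs, it can help to have a plain NumPy reference. The sketch below is not from the original comparison: `smooth_reference` is a hypothetical helper that computes the same 3x3 sum divided by 9 using array slicing, and it assumes the one-pixel border should simply be left at zero, whereas the hand-written loops above leave the border of `out` untouched.

```python
import numpy as np


def smooth_reference(x):
    """3x3 box smoothing of the interior of a 2d integer array; border stays zero."""
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = (
        x[:-2, :-2] + x[:-2, 1:-1] + x[:-2, 2:] +
        x[1:-1, :-2] + x[1:-1, 1:-1] + x[1:-1, 2:] +
        x[2:, :-2] + x[2:, 1:-1] + x[2:, 2:]
    ) // 9
    return out


# A linear ramp is unchanged by a symmetric 3x3 average, which makes it easy to check
x = np.arange(25, dtype=np.int64).reshape(5, 5)
smoothed = smooth_reference(x)
```

Note that with narrow integer types such as uint16 the nine-term sum can overflow before the division, so a wider dtype is the safer choice when comparing results.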

Numba.cuda.jit has been out in the wild for years. It’s accessible, mature, and fun to play with. If you have a machine with a GPU in it and some curiosity then we strongly recommend that you try it out.

conda install numba
# or
pip install numba

>>> import numba.cuda

Scaling with Dask

As mentioned in previous blogposts (1, 2, 3, 4), we’ve been generalizing Dask to operate not just with NumPy arrays and Pandas dataframes, but with anything that looks enough like NumPy (like CuPy or Sparse or Jax) or enough like Pandas (like RAPIDS cuDF), so that we can scale those libraries out too. This is working out well. Here is a brief video showing Dask array computing an SVD in parallel, and seeing what happens when we swap out the NumPy library for CuPy.

We see that there is about a 10x speed improvement on the computation. Most importantly, we were able to switch between a CPU implementation and a GPU implementation with a small one-line change, but continue using the sophisticated algorithms with Dask Array, like its parallel SVD implementation.

We also saw a relative slowdown in communication. In general almost all non-trivial Dask + GPU work today is becoming communication-bound. We’ve gotten fast enough at computation that the relative importance of communication has grown significantly. We’re working to resolve this with our next topic, UCX.

Communication with UCX

See this talk by Akshay Venkatesh or view the slides

Also see this recent blogpost about UCX and Dask

We’ve been integrating the OpenUCX library into Python with UCX-Py. UCX provides uniform access to transports like TCP, InfiniBand, shared memory, and NVLink. UCX-Py is the first time that many of these transports have been easily accessible from the Python language.

Using UCX and Dask together we’re able to get significant speedups. Here are traces of the SVD computation from the previous section, before and after adding UCX:

Before UCX:

After UCX:

There is still a great deal to do here though (the blogpost linked above has several items in the Future Work section).

People can try out UCX and UCX-Py with highly experimental conda packages:

conda create -n ucx -c conda-forge -c jakirkham/label/ucx cudatoolkit=9.2 ucx-proc=*=gpu ucx ucx-py python=3.7

We hope that this work will also affect non-GPU users on HPC systems with Infiniband, or even users on consumer hardware due to the easy access to shared memory communication.

Packaging

In an earlier blogpost we discussed the challenges around installing the wrong versions of CUDA enabled packages that don’t match the CUDA driver installed on the system. Fortunately due to recent work from Stan Seibert and Michael Sarahan at Anaconda, Conda 4.7 now has a special cuda meta-package that is set to the version of the installed driver. This should make it much easier for users in the future to install the correct package.

Conda 4.7 was just released, and comes with many new features other than the cuda meta-package. You can read more about it here.

conda update conda

There is still plenty of work to do in the packaging space today. Everyone who builds conda packages does it their own way, resulting in headache and heterogeneity. This is largely due to not having centralized infrastructure to build and test CUDA enabled packages, like we have in Conda Forge. Fortunately, the Conda Forge community is working together with Anaconda and NVIDIA to help resolve this, though that will likely take some time.

Summary

This post gave an update on the status of some of the efforts behind GPU computing in Python. It also provided a variety of links for further reading. We include them below if you would like to learn more:

Slides
Numpy on the GPU: CuPy
Numpy on the GPU (again): Jax
Pandas on the GPU: RAPIDS cuDF
Scikit-Learn on the GPU: RAPIDS cuML
Benchmark suite
Numba CUDA JIT notebook
A talk on UCX
A blogpost on UCX and Dask
Conda 4.7

Experiments in High Performance Networking with UCX and DGX

2019-06-09T00:00:00+00:00

This post is about experimental and rapidly changing software. Code examples in this post should not be relied upon to work in the future.

This post talks about connecting UCX, a high performance networking library, to Dask, a parallel Python library, to accelerate communication-heavy workloads, particularly when using GPUs.

Additionally, we do this work on a DGX, a high-end multi-CPU multi-GPU machine with a complex internal network. Working in this context was a good way to force improvements in setting up Dask in heterogeneous situations targeting different network cards, CPU sockets, GPUs, and so on.

Motivation

Many distributed computing workloads are communication-bound. This is common in cases like the following:

Dataframe joins
Machine learning algorithms
Complex array computations

Communication becomes a bigger bottleneck as we accelerate our computation, such as when we use GPUs for computing.

Historically, high performance communication was only available using MPI, or with custom solutions. This post describes an effort to get close to the communication bandwidth of MPI while still maintaining the ease of programmability and accessibility of a dynamic system like Dask.

UCX, Python, and Dask

To get high performance networking in Dask, we wrapped UCX with Python and then connected that to Dask.

The OpenUCX project provides a uniform API around various high performance networking libraries like InfiniBand, traditional networking protocols like TCP/shared memory, and GPU-specific protocols like NVLink. It is a layer beneath something like OpenMPI (the main user of OpenUCX today) that figures out which networking system to use.

Python users today don’t have much access to these network libraries, except through MPI, which is sometimes not ideal. (Try searching for “infiniband” on PyPI.)

This led us to create UCX-Py. UCX-Py is a Python wrapper around the UCX C library, which provides a Pythonic API, both with blocking syntax appropriate for traditional HPC programs, as well as a non-blocking async/await syntax for more concurrent programs (like Dask). For more information on UCX I recommend watching Akshay’s UCX talk from the GPU Technology Conference 2019.

Note: UCX-Py was primarily developed by Akshay Venkatesh (UCX, NVIDIA), Tom Augspurger (Dask, Pandas, Anaconda), and Ben Zaitlen (NVIDIA, RAPIDS, Dask).

We then extended Dask communications to optionally use UCX. If you have UCX and UCX-Py installed, then you can use the ucx:// protocol in addresses or the --protocol ucx flag when starting things up, something like this.

$ dask-scheduler --protocol ucx
Scheduler started at ucx://127.0.0.1:8786

$ dask-worker ucx://127.0.0.1:8786

>>> from dask.distributed import Client
>>> client = Client('ucx://127.0.0.1:8786')

Experiment

We modified our SVD with Dask and CuPy benchmark to use the UCX protocol for inter-process communication and ran it on half of a DGX machine, using four GPUs. Here is a minimal implementation of the UCX-enabled code:

import cupy
import dask
import dask.array
from dask.distributed import Client, wait
from dask_cuda import DGX

# Define DGX cluster and client
cluster = DGX(CUDA_VISIBLE_DEVICES=[0, 1, 2, 3])
client = Client(cluster)

# Create random data
rs = dask.array.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random((1000000, 1000), chunks=(10000, 1000))
x = x.persist()

# Perform distributed SVD
u, s, v = dask.array.linalg.svd(x)
u, s, v = dask.persist(u, s, v)
_ = wait([u, s, v])

By using UCX the overall communication times are reduced by an order of magnitude. To produce the task-stream figures below, the benchmark was run on a DGX-1 with CUDA_VISIBLE_DEVICES=[0,1,2,3]. It is clear that the red task bars, corresponding to inter-process communication, are significantly compressed. Communications that were taking 500ms-1s before now take around 20ms.

Before UCX:

After UCX:

Diving into the Details

On a GPU using NVLink we can get somewhere between 5-10 GB/s throughput between pairs of GPUs. On a CPU this drops down to 1-2 GB/s (which seems well below optimal). These speeds can affect all Dask workloads (array, dataframe, xarray, ML, …), but when the proper hardware is present, other bottlenecks may occur, such as serialization when dealing with text or JSON-like data.

This, of course, depends on this fancy networking hardware being present. On the GPU example above we’re mostly relying on NVLink, but we would also get improved performance on an HPC InfiniBand network or even on a single laptop machine using shared memory transports.

The example above was run on a DGX machine, which includes all of these transports and more (as well as numerous GPUs).
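As a rough back-of-the-envelope check on what the bandwidths quoted above mean in practice, the sketch below estimates how long it takes to move one chunk the size of the benchmark's input blocks. It assumes float64 data in chunks of shape (10000, 1000), and it ignores latency and serialization entirely; `transfer_ms` is a hypothetical helper, not part of Dask or UCX.

```python
# One (10000, 1000) chunk, assuming 8 bytes per float64 value
chunk_bytes = 10_000 * 1_000 * 8


def transfer_ms(n_bytes, gb_per_s):
    """Idealized transfer time in milliseconds at a given bandwidth in GB/s."""
    return n_bytes / (gb_per_s * 1e9) * 1e3


nvlink_ms = (transfer_ms(chunk_bytes, 10), transfer_ms(chunk_bytes, 5))
cpu_ms = (transfer_ms(chunk_bytes, 2), transfer_ms(chunk_bytes, 1))

print(f"chunk size: {chunk_bytes / 1e6:.0f} MB")
print(f"5-10 GB/s: {nvlink_ms[0]:.0f}-{nvlink_ms[1]:.0f} ms per chunk")
print(f"1-2 GB/s:  {cpu_ms[0]:.0f}-{cpu_ms[1]:.0f} ms per chunk")
```

Under these assumptions a single chunk moves in roughly 8 to 16 ms at the higher rates and 40 to 80 ms at the lower ones, which is the same order as the communication times reported in the experiment.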

DGX

The test machine used above was a DGX-1, which has eight GPUs, two CPU sockets, four Infiniband network cards, and a complex NVLink arrangement. This is a good example of non-uniform hardware. Certain CPUs are closer to certain GPUs and network cards, and understanding this proximity has an order-of-magnitude effect on performance. This situation isn’t unique to DGX machines. The same situation arises when we have …

Multiple workers in one node, with several nodes in a cluster
Multiple nodes in one rack, with several racks in a data center
Multiple data centers, such as is the case with hybrid cloud

Working with the DGX was interesting because it forced us to start thinking about heterogeneity, and making it easier to specify complex deployment scenarios with Dask.

Here is a diagram showing how the GPUs, CPUs, and Infiniband cards are connected to each other in a DGX-1:

And here is the output of nvidia-smi showing the NVLink, networking, and CPU affinity structure (this is mostly orthogonal to the structure displayed above).

$ nvidia-smi  topo -m
     GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7   ib0   ib1   ib2   ib3
GPU0   X    NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX   SYS   PHB   SYS
GPU1  NV1    X    NV2   NV1   SYS   NV2   SYS   SYS   PIX   SYS   PHB   SYS
GPU2  NV1   NV2    X    NV2   SYS   SYS   NV1   SYS   PHB   SYS   PIX   SYS
GPU3  NV2   NV1   NV2    X    SYS   SYS   SYS   NV1   PHB   SYS   PIX   SYS
GPU4  NV2   SYS   SYS   SYS    X    NV1   NV1   NV2   SYS   PIX   SYS   PHB
GPU5  SYS   NV2   SYS   SYS   NV1    X    NV2   NV1   SYS   PIX   SYS   PHB
GPU6  SYS   SYS   NV1   SYS   NV1   NV2    X    NV2   SYS   PHB   SYS   PIX
GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2    X    SYS   PHB   SYS   PIX
ib0   PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS    X    SYS   PHB   SYS
ib1   SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS    X    SYS   PHB
ib2   PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB   SYS    X    SYS
ib3   SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS   PHB   SYS    X

    CPU Affinity
GPU0  0-19,40-59
GPU1  0-19,40-59
GPU2  0-19,40-59
GPU3  0-19,40-59
GPU4  20-39,60-79
GPU5  20-39,60-79
GPU6  20-39,60-79
GPU7  20-39,60-79

Legend:

  X    = Self
  SYS  = Traverse PCIe as well as the SMP interconnect between NUMA nodes
  NODE = Traverse PCIe as well as the interconnect between PCIe Host Bridges
  PHB  = Traverse PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Traverse multiple PCIe switches (without PCIe Host Bridge)
  PIX  = Traverse a single PCIe switch
  NV#  = Traverse a bonded set of # NVLinks

The DGX was originally designed for deep learning applications. The complex network infrastructure above can be well used by specialized NVIDIA networking libraries like NCCL, which knows how to route things correctly, but is something of a challenge for other more general purpose systems like Dask to adapt to.
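The CPU affinity column in the nvidia-smi output above uses range strings such as 0-19,40-59, where both ends of each range are inclusive. A small helper can expand these into explicit core lists suitable for something like os.sched_setaffinity. This is a sketch only; `expand_core_ranges` is a hypothetical name and not part of Dask or dask-cuda.

```python
def expand_core_ranges(spec):
    """Expand a string like "0-19,40-59" into a list of core ids (ends inclusive)."""
    cores = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        last = last or first  # a bare "7" means the single core 7
        cores.extend(range(int(first), int(last) + 1))
    return cores


socket0 = expand_core_ranges("0-19,40-59")
socket1 = expand_core_ranges("20-39,60-79")
print(len(socket0), socket0[:3], socket0[-3:])
print(len(socket1), socket1[:3], socket1[-3:])
```

Each socket ends up with 40 logical cores, and the last core on the second socket is 79, which is easy to miss when writing the ranges by hand with Python's exclusive upper bounds.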

Fortunately, in meeting this challenge we were able to clean up a number of related issues in Dask. In particular we can now:

Specify a more heterogeneous worker configuration when starting up a local cluster dask/distributed #2675
Learn bandwidth over time dask/distributed #2658
Add Worker plugins to help handle things like CPU affinity (though this is quite general) dask/distributed #2453

With these changes we’re now able to describe most of the DGX structure as configuration in the Python function below:

import os

from dask.distributed import Nanny, SpecCluster, Scheduler
from distributed.worker import TOTAL_MEMORY

from dask_cuda.local_cuda_cluster import cuda_visible_devices


class CPUAffinity:
    """ A Worker plugin to pin CPU affinity """
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        os.sched_setaffinity(0, self.cores)


affinity = {  # See nvidia-smi topo -m
    0: list(range(0, 20)) + list(range(40, 60)),
    1: list(range(0, 20)) + list(range(40, 60)),
    2: list(range(0, 20)) + list(range(40, 60)),
    3: list(range(0, 20)) + list(range(40, 60)),
    4: list(range(20, 40)) + list(range(60, 80)),
    5: list(range(20, 40)) + list(range(60, 80)),
    6: list(range(20, 40)) + list(range(60, 80)),
    7: list(range(20, 40)) + list(range(60, 80)),
}

def DGX(
    interface="ib",
    dashboard_address=":8787",
    threads_per_worker=1,
    silence_logs=True,
    CUDA_VISIBLE_DEVICES=None,
    **kwargs
):
    """ A Local Cluster for a DGX 1 machine

    NVIDIA's DGX-1 machine has a complex architecture mapping CPUs,
    GPUs, and network hardware.  This function creates a local cluster
    that tries to respect this hardware as much as possible.

    It creates one Dask worker process per GPU, and assigns each worker
    process the correct CPU cores and Network interface cards to
    maximize performance.

    That being said, things aren't perfect.  Today a DGX has very high
    performance between certain sets of GPUs and not others.  A Dask DGX
    cluster that uses only certain tightly coupled parts of the computer
    will have significantly higher bandwidth than a deployment on the
    entire thing.

    Parameters
    ----------
    interface: str
        The interface prefix for the infiniband networking cards.  This is
        often "ib" or "bond".  We will add the numeric suffix 0,1,2,3 as
        appropriate.  Defaults to "ib".
    dashboard_address: str
        The address for the scheduler dashboard.  Defaults to ":8787".
    CUDA_VISIBLE_DEVICES: str
        String like ``"0,1,2,3"`` or ``[0, 1, 2, 3]`` to restrict
        activity to different GPUs

    Examples
    --------
    >>> from dask_cuda import DGX
    >>> from dask.distributed import Client
    >>> cluster = DGX(interface='ib')
    >>> client = Client(cluster)
    """
    if CUDA_VISIBLE_DEVICES is None:
        CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7")
    if isinstance(CUDA_VISIBLE_DEVICES, str):
        CUDA_VISIBLE_DEVICES = CUDA_VISIBLE_DEVICES.split(",")
    CUDA_VISIBLE_DEVICES = list(map(int, CUDA_VISIBLE_DEVICES))
    memory_limit = TOTAL_MEMORY / 8

    spec = {
        i: {
            "cls": Nanny,
            "options": {
                "env": {
                    "CUDA_VISIBLE_DEVICES": cuda_visible_devices(
                        ii, CUDA_VISIBLE_DEVICES
                    ),
                    "UCX_TLS": "rc,cuda_copy,cuda_ipc",
                },
                "interface": interface + str(i // 2),
                "protocol": "ucx",
                "ncores": threads_per_worker,
                "data": dict,
                "preload": ["dask_cuda.initialize_context"],
                "dashboard_address": ":0",
                "plugins": [CPUAffinity(affinity[i])],
                "silence_logs": silence_logs,
                "memory_limit": memory_limit,
            },
        }
        for ii, i in enumerate(CUDA_VISIBLE_DEVICES)
    }

    scheduler = {
        "cls": Scheduler,
        "options": {
            "interface": interface + str(CUDA_VISIBLE_DEVICES[0] // 2),
            "protocol": "ucx",
            "dashboard_address": dashboard_address,
        },
    }

    return SpecCluster(
        workers=spec,
        scheduler=scheduler,
        silence_logs=silence_logs,
        **kwargs
    )

However, we never got the NVLink structure down. The Dask scheduler currently still assumes uniform bandwidths between workers. We’ve started to make small steps towards changing this, but we’re not there yet (this will be useful as well for people that want to think about in-rack or cross-data-center deployments).

As usual, in solving a highly specific problem, we ended up implementing a number of long-pending general features, which then made our specific problem easy to write down.
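The `affinity` table in the function above was typed by hand from the `nvidia-smi topo -m` output. As a minimal sketch, assuming the range strings are copied from the CPU Affinity column shown earlier, the same table can be derived rather than written out. The helper name `parse_cpu_ranges` is hypothetical and not part of Dask or dask-cuda; the interface names follow the `interface + str(i // 2)` rule used in `DGX`.

```python
def parse_cpu_ranges(text):
    """Expand a string like "0-19,40-59" into a sorted list of core ids."""
    cores = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, stop = part.split("-")
            # nvidia-smi ranges are inclusive, Python ranges are not
            cores.extend(range(int(start), int(stop) + 1))
        else:
            cores.append(int(part))
    return sorted(cores)


# CPU Affinity column copied from the `nvidia-smi topo -m` output
topo = {
    0: "0-19,40-59",
    1: "0-19,40-59",
    2: "0-19,40-59",
    3: "0-19,40-59",
    4: "20-39,60-79",
    5: "20-39,60-79",
    6: "20-39,60-79",
    7: "20-39,60-79",
}

# Cores that each GPU's worker process should be pinned to
affinity = {gpu: parse_cpu_ranges(text) for gpu, text in topo.items()}

# Two GPUs share each InfiniBand card, hence the integer division by two
interfaces = {gpu: "ib" + str(gpu // 2) for gpu in topo}
```

Each GPU ends up with 40 cores, and GPUs 4 and 5 both map to `ib2`.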

Future Work

There has been significant effort over the last few months to make everything above work. In particular we …

Modified UCX to support client-server workloads
Wrapped UCX with UCX-Py and designed a Python async-await friendly interface
Wrapped UCX-Py with Dask
Hooked everything together to make generic workloads function well

The result is quite nice, especially for more communication heavy workloads. However there is still plenty to do. This section details what we’re thinking about now to continue this work.

Routing within complex networks: If you restrict yourself to four of the eight GPUs in a DGX, you can get 5-12 GB/s between pairs of GPUs. For some workloads this can be significant. It makes the system feel much more like a single unit than a bunch of isolated machines.

However we still can’t get great performance across the whole DGX because there are many GPU-pairs that are not connected by NVLink, and so we get 10x slower speeds. These dominate communication costs if you naively try to use the full DGX.

This might be solved either by:
1. Teaching Dask to avoid these communications
2. Teaching UCX to route communications like these through a chain of multiple NVLink connections
3. Avoiding complex networks altogether. Newer systems like the DGX-2 use NVSwitch, which provides uniform connectivity, with each GPU connected to every other GPU.
Edit: I’ve since learned that UCX should be able to handle this. We should still get PCIe speeds (around 4-7 GB/s) even when we don’t have NVLink once an upstream bug gets fixed. Hooray!
CPU: We can get 1-2 GB/s across InfiniBand, which isn’t bad, but also wasn’t the full 5-8 GB/s that we were hoping for. This deserves more serious profiling to determine what is going wrong. The current guess is that this has to do with memory allocations.
```
In [1]: %time _ = b'0' * 1000000000  # 1 GB
CPU times: user 248 ms, sys: 223 ms, total: 472 ms
Wall time: 470 ms   # <<----- Around 2 GB/s.  Slower than I expected
```
Probably we’re just doing something dumb here.
Package UCX: Currently I’m building the UCX and UCX-Py libraries from source (see appendix below for instructions). Ideally these would become conda packages. John Kirkham (Conda Forge, NVIDIA, Dask) is taking a look at this along with the UCX developers from Mellanox.

See ucx-py #65 for more information.
Learn Heterogeneous Bandwidths: In order to make good scheduling decisions Dask needs to estimate how long it will take to move data between machines. This question is now becoming much more complex, and depends on the source and destination machines (the network topology), the data type (NumPy array, GPU array, Pandas Dataframe with text), and more. In complex situations our bandwidths can span a 100x range (100 MB/s to 10 GB/s).

Dask will have to develop more complex models for bandwidth, and learn these over time.

See dask/distributed #2743 for more information.
Support other GPU libraries: To send GPU data around we need to teach Dask how to serialize Python objects into GPU buffers. There is code in the dask/distributed repository to do this for Numba, CuPy, and RAPIDS cuDF objects, but we’ve really only tested CuPy seriously. We should expand this by some of the following steps:
1. Try a distributed Dask cuDF join computation
  
  See dask/distributed #2746 for initial work here.
2. Teach Dask to serialize array GPU libraries, like PyTorch and TensorFlow, or possibly anything that supports the __cuda_array_interface__ protocol.
Track down communication failures: We still occasionally get unexplained communication failures. We should stress test this system to discover rough corners.
TCP: Groups with high performing TCP networks can’t yet make use of UCX+Dask (though they can use either one individually).

Using UCX in client-server mode, as we do with Dask, currently requires access to RDMA libraries, which are often missing on systems without networking hardware like InfiniBand.

This is in progress at openucx/ucx #3570
Commodity Hardware: Currently this code is only really useful on high performance Linux systems that have InfiniBand or NVLink. However, it would be nice to also use this on more commodity systems, including personal laptop computers using TCP and shared memory.

Currently Dask uses TCP for inter-process communication on a single machine. Using UCX on a personal computer would give us access to shared memory speeds, which tend to be an order of magnitude faster.

See openucx/ucx #3663 for more information.
Tune Performance: The 5-10 GB/s bandwidths that we see with NVLink today are sub-optimal. With UCX-Py alone we’re able to get something like 15 GB/s on large message transfers. We should benchmark and tune our implementation to see what is taking up the extra time. Until things work more robustly though, this is a secondary priority.
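To make the "Learn Heterogeneous Bandwidths" item above a bit more concrete, here is a minimal sketch of what a per-pair bandwidth model could look like: one exponentially weighted estimate for each (source, destination) pair instead of a single global number. This is not how Dask does it today; the class name `PairwiseBandwidth`, the smoothing factor, and the 100 MB/s default are all assumptions made for illustration.

```python
class PairwiseBandwidth:
    """Exponentially weighted bandwidth estimate per (source, destination) pair."""

    def __init__(self, default=100e6, alpha=0.2):
        self.default = default  # bytes/s assumed before any measurement exists
        self.alpha = alpha      # weight given to the newest observation
        self.estimates = {}     # (src, dst) -> bytes/s

    def observe(self, src, dst, nbytes, seconds):
        """Fold one measured transfer into the estimate for this pair."""
        measured = nbytes / seconds
        old = self.estimates.get((src, dst))
        if old is None:
            self.estimates[src, dst] = measured
        else:
            self.estimates[src, dst] = (1 - self.alpha) * old + self.alpha * measured

    def transfer_time(self, src, dst, nbytes):
        """Predicted seconds to move nbytes from src to dst."""
        return nbytes / self.estimates.get((src, dst), self.default)


bw = PairwiseBandwidth(default=100e6, alpha=0.5)
bw.observe("gpu0", "gpu1", 10_000_000_000, 1.0)  # an NVLink-connected pair
bw.observe("gpu0", "gpu1", 5_000_000_000, 1.0)   # a slower second measurement
fast = bw.transfer_time("gpu0", "gpu1", 1_000_000_000)
unknown = bw.transfer_time("gpu0", "gpu4", 1_000_000_000)  # falls back to default
```

A scheduler using something like this would see that moving 1 GB between the measured pair is far cheaper than between a pair it knows nothing about, which is the kind of distinction a single uniform bandwidth cannot express.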

Appendix: Setup

Performing these experiments depends currently on development branches in a few repositories. This section includes my current setup.

Create Conda Environment

conda create -n ucx python=3.7 libtool cmake automake autoconf cython bokeh pytest pkg-config ipython dask numba -y

Note: for some reason using conda-forge makes the autogen step below fail.

Set up UCX

# Clone UCX repository and get branch
git clone https://github.com/openucx/ucx
cd ucx
git remote add Akshay-Venkatesh git@github.com:Akshay-Venkatesh/ucx.git
git remote update Akshay-Venkatesh
git checkout ucx-cuda

# Build
git clean -xfd
export CUDA_HOME=/usr/local/cuda-9.2/
./autogen.sh
mkdir build
cd build
../configure --prefix=$CONDA_PREFIX --enable-debug --with-cuda=$CUDA_HOME --enable-mt --disable-cma CPPFLAGS="-I/usr/local/cuda-9.2/include"
make -j install

# Verify
ucx_info -d
which ucx_info  # verify that this is in the conda environment

# Verify that we see NVLink speeds
ucx_perftest -t tag_bw -m cuda -s 1048576 -n 1000 & ucx_perftest dgx15 -t tag_bw -m cuda -s 1048576 -n 1000

Set up UCX-Py

git clone git@github.com:rapidsai/ucx-py
cd ucx-py

export UCX_PATH=$CONDA_PREFIX
make install

Set up Dask

git clone git@github.com:dask/dask.git
cd dask
pip install -e .
cd ..

git clone git@github.com:dask/distributed.git
cd distributed
pip install -e .
cd ..

Optionally set up cupy

pip install cupy-cuda92==6

Optionally set up cudf

conda install -c rapidsai-nightly -c conda-forge -c numba cudf dask-cudf cudatoolkit=9.2

Optionally set up JupyterLab

conda install ipykernel jupyterlab nb_conda_kernels nodejs

For the Dask dashboard

pip install dask_labextension
jupyter labextension install dask-labextension

My Benchmark

I’ve been using the following benchmark to test communication. It allocates a chunked Dask array, and then adds it to its transpose, which forces a lot of communication, but not much computation.

from collections import defaultdict
import asyncio
import time
import numpy as np
from pprint import pprint
import cupy

import dask.array as da
from dask.distributed import Client, wait
from distributed.utils import format_time, format_bytes
from dask_cuda import DGX  # or use the DGX function defined earlier in this post

async def f():

    # Set up workers on the local machine
    async with DGX(asynchronous=True, silence_logs=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:

            # Create a simple random array
            rs = da.random.RandomState(RandomState=cupy.random.RandomState)
            x = rs.random((40000, 40000), chunks='128 MiB').persist()
            print(x.npartitions, 'chunks')
            await wait(x)

            # Add X to its transpose, forcing computation
            y = (x + x.T).sum()
            result = await client.compute(y)

            # Collect, aggregate, and print peer-to-peer bandwidths
            incoming_logs = await client.run(lambda dask_worker: dask_worker.incoming_transfer_log)
            bandwidths = defaultdict(list)
            for k, L in incoming_logs.items():
                for d in L:
                    if d['total'] > 1_000_000:
                        bandwidths[k, d['who']].append(d['bandwidth'])
            bandwidths = {
                (cluster.scheduler.workers[w1].name,
                    cluster.scheduler.workers[w2].name): [format_bytes(x) + '/s' for x in np.quantile(v, [0.25, 0.50, 0.75])]
                for (w1, w2), v in bandwidths.items()
            }
            pprint(bandwidths)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(f())

Note: most of this example is just getting back diagnostics, which can be easily ignored. Also, you can drop the async/await code if you like. I think that there should probably be more examples in the world using Dask with async/await syntax, so I decided to leave it in.
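The diagnostics part of the benchmark can also be tried on its own, without a cluster or a GPU. The sketch below repeats the same grouping and quartile logic using only the standard library. The function name `summarize` is hypothetical, the sample log entries are made up for illustration, and the entries are assumed to carry only the `total`, `who`, and `bandwidth` keys that the benchmark reads.

```python
from collections import defaultdict
from statistics import quantiles


def summarize(incoming_logs, min_bytes=1_000_000):
    """Group transfer bandwidths by (receiver, sender) and return quartiles."""
    grouped = defaultdict(list)
    for receiver, log in incoming_logs.items():
        for entry in log:
            if entry["total"] > min_bytes:  # ignore small messages
                grouped[receiver, entry["who"]].append(entry["bandwidth"])
    return {
        pair: quantiles(values, n=4, method="inclusive")
        for pair, values in grouped.items()
        if len(values) >= 2  # skip pairs with too few samples for quartiles
    }


# Made-up entries shaped like what the benchmark reads from each worker
sample_logs = {
    "w0": [
        {"total": 2_000_000, "who": "w1", "bandwidth": b}
        for b in (1e9, 2e9, 3e9, 4e9, 5e9)
    ]
    + [{"total": 10, "who": "w1", "bandwidth": 1.0}],  # filtered out as too small
}
summary = summarize(sample_logs)
```

Here `summary` maps `("w0", "w1")` to the 25th, 50th, and 75th percentile bandwidths of the five large transfers.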