Dask Working Notes - Posts by Matthew Rocklin

Dask Survey 2021, early anecdotes

2021-06-18T00:00:00+00:00

The annual Dask user survey is under way and currently accepting responses at dask.org/survey.

This post provides a preview into early results, focusing on anecdotal responses.

The Dask user survey helps developers focus and prioritize our larger efforts. It’s also a fascinating and rewarding dataset of anecdotal use cases of how people use Dask today. Thank you to everyone who has participated so far; you make a difference.

The survey is still open, and I encourage people to speak up about their experience. This blogpost is intended to encourage participation by giving you a sense for how it affects development, and by sharing user stories provided within the survey.

This article skips all of the quantitative data that we collect, and focuses in on direct feedback listed in the final comments. For a more quantitative analysis see the posts from previous years by Tom at 2020 Dask User Survey Results and 2019 Dask User Survey Results.

How can Dask Improve?

In this post we’re going to look at answers to this one question. This was a long-form response field asking “How can Dask Improve?”. Looking through some of the responses we see that a few of them fall into some common themes. I’ve grouped them here.

In each section we’ll include raw responses, followed up with a few comments from me in response.

Intermediate Documentation

More long-form content about the internals of Dask to understand when things don’t work and why. The “Hacking Dask” tutorial in the Dask 2021 summit was precisely the kind of content I really need, because 90% of my time with Dask is spent not understanding why I’m running out of memory and I feel like I’ve ready all the documentation pages 5 times already (although sometimes I also stumble upon a useful page I’ve never seen before).

There’s also a dearth of documentation of intermediate topics like blockwise in dask.array. (I think I ended up reverse engineering how it worked from docs, GitHub issue comments, reading the code, and black-box reverse engineering with different functions before I finally “got it”.)

Improve documentation and error messages to cover more of the 2nd-level problems that people run into beyond the first-level tutorial examples.

more examples for complex concepts (passing metadata to custom functions, for example). more examples/support for using dask arrays and cupy.

I think the hardest thing about Dask is debugging performance issues with dask delayed and complex mixing of other libraries and not knowing when things are being pickled or not. I am getting better at reading the performance reports, but I think that better documentation and tutorials surrounding understanding the reports would help me greater than new features. For example, make a tutorial that does some non-trivial dask-delayed work (ie not just computing a mean) that is written against best practices and show how the performance improves with each adopted best practice/explain why things were slow with each step. I think there could also be improvements to the performance reports to point out the slowest 5 parts of your code and what lines they are, and possibly relevant docs links.

Response

I really like this theme. We now have a solid community of intermediate-advanced Dask users that we should empower. We usually write materials that target the broad base of beginning users, but maybe we should rethink this a bit. There is a lot of good potential material that advanced users have around performance and debugging that could be fun to publish.

Documentation Organization

Documentation website is sometimes confusing to navigate, better separation of API and examples would help. Maybe this can inspire: https://documentation.divio.com/

I actually think Dask’s documentation is pretty good. But the docs could use some reorganizing – it is often difficult to find the relevant APIs. And there is an incredible amount of HPC insider knowledge that is required to launch a typical workflow - right now much of this knowledge is hidden in the github issues (which is great! but more of it could be pushed into the FAQs to make it more accessible).

More detailed documentation and examples. Start to finish examples that do not assume I know very much (about Dask, command line tools, Cloud technologies, Kubernetes, etc.).

I think an easier introduction to delayed/bags and additional examples for more complex use-cases could be helpful.

Response

We get alternating praise and scorn for our documentation. We have what I would call excellent reference documentation. In fact, if anyone wants to build a dynamic distributed task scheduler today I’m going to claim that distributed.dask.org is probably the most comprehensive reference out there.

However, we lack good narrative documentation, which is the concern raised by most of these comments. This is hard to do because Dask is used in so many different user narratives. It’s challenging to orient the Dask documentation around all of them simultaneously.

I appreciated the direct reference in the first comment to a website with a framework. In general I’d love to talk to people who lay out documentation semi-professionally and learn more.

Functionality

Here is a soup of various feature requests. There are a few themes among them.

Have a better pandas support (like multi-index), which can help me migrate my existing code to Dask.

I’d like to see better support for actors. I think having a remote object is a common use case.

Improve Dataframes - multi index!! More feature parity with Pandas API.

Maybe a little less machine learning, more “classical” big data applications (CDF, PDEs, particle physics etc.). Not everything is map-reducable.

Better database integration. Re-writing an SQL query in SQL Alchemy can be very impractical. Would also be great if there were better ways to ensure the process didn’t die from misjudging how much memory was needed per chunk.

Better diagnostic tools; what operations are bottlenecking a task graph? Support for multiindex.

I do work that regularly requires sorting a DataFrame by multiple columns. Pandas can do this single-core; H2O and Spark can do this multicore and distributed. But dask cannot sort_values() on multiple columns at all (such as df.sort_values([ "col1", "col2" ,"col3" ], ascending=False)).

Type-hints! It is very tedious using Dask in a huge ML-Application without even having the option to do some static type-checking.

Additionally it is very frustrating that Dask tries to mimic Pandas API, but then 40% of the API doesn’t work (isn’t implemented), or deviates so far from the Pandas API that some parameters aren’t implemented. Only way to find out about that is to read the docs. With some typehints one could mitigate much of this trial-and-error process when switching from Pandas to Dask.

It’s hard to track everything around dask!!! Actors are a bit unloved, but I find them super useful

Type annotations for all methods for better IDE (VSCode) support

I think the Actor model could use a little love

Response

Interesting trends, and not many that I would have expected:

MultiIndex (well, this was expected)
Actors
Type hinting for IDE support
SQL access

High Level Optimization

Needs better physical data independence. Manual data chunking, memory management, query optimization are all a big hassle. Automate those more.

Dask makes it easy for users with no parallel computing experience to scale up quickly (me), but we have no sense of how to judge our resource needs. It’d be great if Dask had some tools or tutorials that helped me judge the size of my problem (e.g. memory usage). These may already exist, but examples of how to do it may be hard to find.

Runtime Stability and Advanced Troubleshooting

Stability is the most important factor

I have answered no to the Long Term Support version of dask but often the really great opportunities are those that arre on demand. The problem is that when these fixes are released, their not well advertised and something under the hood has changed. So, it ends up breaking something else or my particular knowledge of the workings are no longer correct. Dask maintainers have a bit of a weird clique and it can feel as a newbie or a learner that your talked down to or in reality. They don’t have the time to help someone. So they should probably have some more maintainers answering some of the more mundane questions via the blog or via some other method, Things we have seen people do wrong or having difficulty in . A bit of basic, a bit of intermediate and a bit of advanced. If the underlying dask API has changed, then these should be updated with new posts with updates of what has changed. Showing a breakdown of doing it the hard way. So people can see what is done step by step with standard workflows that work. Then vs dask, with less boilerplate and/or speed improvement. If there are places where speed isn’t improved. Show that the difference of where it doesnt work alongside the workflow where it might.

We have long deployed dask clusters (weeks to months) and have noticed that they sometimes go into a wonky state. We’ve been unable to identify root cause(s). Redeployment is simple and easy when it does occur, but slightly annoying nonetheless.

My biggest pain point is the scheduler, as I tend to spend time writing infrastructure to manage the scheduler and breaking apart / rewriting tasks graphs to minimize impact on the scheduler.

As my answers make clear (and from previous conversations with Matt, James, and Genevieve) the biggest improvement I’d like to see is stable releases. Stable from both a runtime point of view (i.e. rock solid Dask distributed), and from an API point of view (so I don’t have to fix my code every couple of weeks). So a big +1 to LTS releases.

Better error handling/descriptions of errors, better interoperability between (slightly) different versions

If something goes wrong (in Dask, the batch system, or the interaction between Dask and the batch system), the problem is very opaque and difficult to diagnose. Dask needs significant additional documentation, and probably additional features, to make debugging easier and more transparent.

Better ways of getting out logs of worker memory usage, especially after dask crashes/failures. Ways of getting performance reports written to log files, rather than html files which don’t write if the dask client process fails.

Two big problems for me are when dask fails determining what when wrong and how to fix it.

Response

Stability definitely took a dive last December. I’m feeling good right now though. There is a lot of good work that should be merged in and released in the next few weeks that I think will significantly improve many of the common pain points.

However, there are still many significant improvements yet to be made. I in particular like the theme above in reporting and logging when things fail. We’re ok at this today, but there is a lot of room for growth.

What’s Next?

Do the views above fully express your thoughts on where Dask should go, or is there something missing?

Share your perspective at dask.org/survey. The whole process should take less than five minutes.

Stability of the Dask library

2021-05-21T00:00:00+00:00

Dask is moving fast these days. Sometimes we break things as a result.

Historically this hasn’t been a problem. According to our survey last year, most users were fairly happy with Dask’s stability.

However the last year has seen a lot of evolution of the project, which in turn causes code churn. This can cause friction for downstream users today, but also means more-than-incremental changes for the future. We’ve optimized a little bit for long-term growth over short-term stability.

There are two structural things driving some of these changes:

An increase in computational scale
An increase in organizational scale

Computational Scale

Dask today is used across a wider range of problems, a more diverse set of hardware, and at larger scales more routinely than before.

Addressing this increase in scale across many dimensions has caused us to redesign Dask’s internal infrastructure in several ways.

We’ve changed how Dask graphs are represented and communicated to the scheduler
We’ve pulled out Dask’s internal state machines and made them more formalized
We’ve rewritten large chunks of the scheduler in Cython
We’ve overhauled how we serialize messages that go between all Dask servers
We’re now tracking memory with much finer granularity than we did before
… and more

We’ve been doing all of these internal changes with minimal impact to the myriad of downstream user communities (Xarray, Prefect, RAPIDS, XGBoost, …). This is largely due to those downstream developer communities, who help to identify, isolate, and work through the subtle tremors that occur on the surface when we make these subsurface shifts.

Organizational scale

Historically Dask’s core was maintained by a relatively small set of people, mostly at Anaconda. There were dozens of developers that worked on various dask-foo projects, but only a small group that thought about things like serialization, state machines, and so on. In particular I personally tracked every issue and knew the entire project. Whenever a potential conflict arose I was usually able to identify it early.

This has all changed dramatically.

First, there are now several multi-company teams working on different parts of Dask internals.

Second, we’ve also taken some time to redesign parts of Dask internals to make them more maintainable. Dask scheduling is like a finely made clock. Historically parts of that clock were built and designed by individuals with a craftsman-like approach. Now we’re redesigning things with more of a group mindset. This results in more maintainable designs, but it also means that we’re taking apart the clock and putting it back together. It takes a little while to find all of the missing parts :)

How this affects you today

This all started around when we switched to Calendar Versioning at the end of last year (Dask version 2.30.1 rolled over into 2020.12.0 last December). You may have noticed:

an increased sensitivity to version mismatches (as we change the Dask protocol different versions of Dask can no longer talk to each other well)
releases with stability issues (2020.12 was particularly rough)
tighter pinning between dask and distributed versions during releases
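Version mismatches are easier to spot when they are listed explicitly. `Client.get_versions(check=True)` asks Dask to do the comparison itself. The sketch below shows the same idea by hand on a made-up dictionary. The layout of that dictionary (a `packages` mapping under `client`, `scheduler`, and each worker address) is an assumption about what `get_versions()` returns and may differ between releases; the addresses and version numbers are purely illustrative.

```python
def find_version_mismatches(versions):
    """Return {package: {version: [who, ...]}} for packages whose version differs."""
    nodes = {"client": versions["client"], "scheduler": versions["scheduler"]}
    nodes.update(versions.get("workers", {}))

    seen = {}
    for who, info in nodes.items():
        for package, version in info["packages"].items():
            seen.setdefault(package, {}).setdefault(version, []).append(who)

    # Keep only packages that appear with more than one version.
    return {pkg: by_version for pkg, by_version in seen.items() if len(by_version) > 1}


# Hypothetical data shaped like the assumed get_versions() output.
sample = {
    "client": {"packages": {"dask": "2021.4.0", "distributed": "2021.4.0"}},
    "scheduler": {"packages": {"dask": "2021.4.0", "distributed": "2021.4.0"}},
    "workers": {
        "tcp://10.0.0.1:40000": {
            "packages": {"dask": "2020.12.0", "distributed": "2020.12.0"}
        },
    },
}

mismatches = find_version_mismatches(sample)
for package, by_version in mismatches.items():
    print(package, by_version)
```

With a live cluster, `sample` would be replaced by `client.get_versions()`.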

How this will affect you

We’ve merged in a PR to change the default behavior when moving high level graphs to the scheduler for Dask Dataframes. This should result in much less delay when submitting large computations and almost no delay in optimization. It also opens up a conduit for us to send a lot more semantic information to the scheduler about your computation, which can result in new visualizations and smarter scheduling in the future.

It will also probably break some things.

To be clear, all tests pass among Dask, distributed, xarray, prefect, rapids, and other downstream projects. We’ve done our homework here, but almost certainly we’ve missed something.

This is only one of several larger changes happening in the coming months. We appreciate your patience and your engagement as we make some of these larger shifts. For better or worse end users are the final testing suite :)

Dask User Summit 2021

2021-03-03T00:00:00+00:00

Dask is organizing a user summit in mid-May. This will be a remote event focused on bringing together developers and users of Dask and the distributed PyData stack in different domains.

User Summits like this are particularly important for a project like Dask which serves such a diverse set of use cases. Dask’s user communities include industries like finance, government, health, geoscience, imaging, machine learning, and more. These communities often have very similar problems, but don’t often communicate with each other.

User summits provide a venue for disparate domains to connect over shared technology challenges. Often a solution designed for one domain is useful for others. As technologists, this sharing is critical in order to promote consistent and high quality software solutions across domains, rather than silo’ed solutions.

We organized a summit a year ago, focusing mainly on developers. This was a fantastic time and resulted in a surprising amount of consensus building and forward movement both in technological and domain-specific directions.

For more on our summit last year, see this post.

Organization

We’ve asked NumFOCUS to organize this event for us. NumFOCUS runs the highly successful and community oriented PyData conference series, and had great success with their remote-first PyData Global conference late last year.

Tickets are intended to be reasonably priced on a sliding scale, with assistance given to any in need.

Open CFP

I would like to encourage people to submit proposals to talk at summit.dask.org.

I would like to especially extend an invitation to those who are new to the Dask community, or new to speaking in general. This year we’re especially trying to highlight use cases of Dask, rather than developers pushing the technology forward (although these talks are of course welcome as well).

If you have an idea for a talk then please submit something and we’ll work together on making it fit. Alternatively, if you have a colleague that you think would enjoy or grow from speaking then I encourage you to encourage them as well.

Workshops

Finally, I’m excited about an experiment that we’re running this year with workshops. These are intended to be two-hour blocks of time dedicated to a particular topic, organized by a specific community member (perhaps you?). If you have a consistent theme for a set of 3-5 talks then this option gives you the ability to curate and control a dedicated block of the conference. You can invite your colleagues and collaborators. We’ll handle the conference infrastructure while you handle the content.

We stole this structure from workshops at larger academic conferences. We think that it fits Dask well specifically because of the federated nature of our community. We hope that it gives space for sub-communities to assemble and better establish cohesive working groups.

Themes in the past have included topics like Pangeo, RAPIDS, workflow management, imaging, and performance.

Apply to speak

Again, I encourage you and your colleagues to submit applications to speak this year in May. The proposal page is at https://summit.dask.org/present/#guidelines

Estimating Users

2020-01-14T00:00:00+00:00

People often ask me “How many people use Dask?”

As with any non-invasive open source software, the answer to this is “I don’t know”.

There are many possible proxies for user counts, like downloads, GitHub stars, and so on, but most of them are wildly incorrect. As a project maintainer who tries to find employment for other maintainers, I’m incentivized to take the highest number I can find, but that is somewhat dishonest. That number today is in the form of this likely false statement.

Dask has 50-100k daily downloads.

This number comes from looking at the Python Package Index (PyPI) (image from pypistats.org)

This is a huge number, but is almost certainly misleading. Common sense tells us that there are not 100k new Dask users every day.

If you dive in more deeply to numbers like these you will find that they are almost entirely due to automated processes. For example, of Dask’s 100k new users, a surprising number of them seem to be running Linux.

While it’s true that Dask is frequently run on Linux because it is a distributed library, it would be odd to see every machine in that deployment individually pip install dask. It’s more likely that these downloads are the result of automated systems, rather than individual users.

Anecdotally, if you get access to fine-grained download data, you find that a small set of IPs dominates download counts. These tend to come mostly from continuous integration services like Travis and Circle, from AWS, or from a few outliers in the world (sometimes people in China try to mirror everything).

Check Windows

So, in an effort to avoid this effect we start looking at just Windows downloads.

The magnitudes here seem more honest to me. These monthly numbers translate to about 1000 downloads a day (perhaps multiplied by two or three for OSX and Linux), which seems more in line with my expectations.
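The arithmetic behind that estimate is short enough to write out. The monthly figure below is hypothetical: the post only says that the monthly Windows numbers come to about 1000 downloads a day, which implies something near 30k a month.

```python
# Hypothetical monthly figure, back-derived from "about 1000 downloads a day".
monthly_windows_downloads = 30_000
days_per_month = 30

windows_per_day = monthly_windows_downloads / days_per_month

# Multiply by two or three to account for OSX and Linux users.
low_estimate = windows_per_day * 2
high_estimate = windows_per_day * 3

print(f"Windows only: ~{windows_per_day:.0f} downloads/day")
print(f"All platforms: ~{low_estimate:.0f} to ~{high_estimate:.0f} downloads/day")
```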

However even this is strange. The structure doesn’t match my personal experience. Why the big change in adoption in 2018? What is the big spike in 2019? Anecdotally maintainers did not notice a significant jump in users there. Instead, we’ve experienced smooth continuous growth of adoption over time (this is what most long-term software growth looks like). It’s also odd that there hasn’t been continued growth since 2018. Anecdotally Dask seems to have grown somewhat constantly over the last few years. Phase transitions like these don’t match observed reality (at least in so far as I personally have observed it).

Notebook for plot available here

Documentation views

My favorite metric is looking at weekly unique users to documentation.

This is an over-estimate of users because many people look at the documentation without using the project. This is also an under-estimate because many users don’t consult our documentation on a weekly basis (oh I wish).

This growth pattern matches my expectations and my experience with maintaining a project that has steadily gained traction over several years.

Plot taken from Google Analytics

Dependencies

It’s also important to look at dependencies of a project. For example many users in the earth and geo sciences use Dask through another project, Xarray. These users are much less likely to touch Dask directly, but often use Dask as infrastructure underneath the Xarray library. We should probably add in something like half of Xarray’s users as well.

Plot taken from Google Analytics, supplied by Joe Hamman from Xarray

Summary

Dask has either 100k new users every day (download counts) or something like 10k users total (weekly unique IPs). The 10k number sounds more likely to me, maybe bumping up to 15k due to dependencies. The fact is though that no one really knows.

Judging the use of community maintained OSS is important as we try to value its impact on society. This is also a fundamentally difficult problem. I hope that this post helps to highlight how these numbers may be misleading, and encourages us all to think more deeply about estimating impact.

Co-locating a Jupyter Server and Dask Scheduler

2019-09-13T00:00:00+00:00

If you want, you can have Dask set up a Jupyter notebook server for you, co-located with the Dask scheduler. There are many ways to do this, but this blog post lists two.

Sometimes people inside of large institutions have complex deployment pains. It takes them a while to stand up a process running on a machine in their cluster, with all of the appropriate networking ports open and such. In that situation, it can sometimes be nice to do this just once, say for Dask, rather than twice, say for Dask and for Jupyter.

Probably in these cases people should invest in a long term solution like JupyterHub, or one of its enterprise variants, but this blogpost gives a couple of hacks in the meantime.

Hack 1: Create a Jupyter server from a Python function call

If your Dask scheduler is already running, connect to it with a Client and run a Python function that starts up a Jupyter server.

from dask.distributed import Client

client = Client("scheduler-address:8786")

def start_jupyter_server():
    from notebook.notebookapp import NotebookApp
    app = NotebookApp()
    app.initialize([])  # add command line args here if you want

client.run_on_scheduler(start_jupyter_server)

If you have a complex networking setup (maybe you’re on the cloud or HPC and had to open up a port explicitly) then you might want to install jupyter-server-proxy (which Dask also uses by default if installed), and then go to http://scheduler-address:8787/proxy/8888 . The Dask dashboard can route your connection to Jupyter (Jupyter is also kind enough to do the same for Dask if it is the main service).

Hack 2: Preload script

This is also a great opportunity to learn about the various ways of adding custom startup and teardown. One such way is a preload script like the following:

# jupyter-preload.py
from notebook.notebookapp import NotebookApp

def dask_setup(scheduler):
    app = NotebookApp()
    app.initialize([])

dask-scheduler --preload jupyter-preload.py

That script will run at an appropriate time during scheduler startup. You can also put this into configuration

distributed:
  scheduler:
    preload: ["/path/to/jupyter-preload.py"]
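Preload scripts can define a `dask_teardown` function alongside `dask_setup`, which is where cleanup for anything started at startup would go. Here is a minimal sketch of the pair. The file name and the `lifecycle_events` list are hypothetical and exist only so the script can be exercised without a scheduler.

```python
# lifecycle-preload.py (hypothetical file name)
import logging

logger = logging.getLogger("preload-example")

# Illustrative bookkeeping only; a real preload script would not need this.
lifecycle_events = []


def dask_setup(scheduler):
    """Called while the scheduler starts up."""
    logger.info("Preload setup running on %s", type(scheduler).__name__)
    lifecycle_events.append("setup")
    # Start co-located services here, for example the NotebookApp above.


def dask_teardown(scheduler):
    """Called while the scheduler shuts down."""
    logger.info("Preload teardown running on %s", type(scheduler).__name__)
    lifecycle_events.append("teardown")
    # Stop or clean up whatever dask_setup started.
```

It would be passed to the scheduler the same way, with `--preload` or through the `preload` configuration value.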

Really though, you should use something else

This is mostly a hack. If you’re at an institution then you should ask for something like JupyterHub.

Or, you might also want to run this in a separate subprocess, so that Jupyter and the Dask scheduler don’t collide with each other. This shouldn’t be so much of a problem (they’re both pretty light weight), but isolating them probably makes sense.
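A rough sketch of the subprocess variant follows, assuming the classic `notebook` package is installed in the scheduler's environment. The function names and the default port are illustrative, not part of Dask.

```python
import subprocess
import sys


def build_jupyter_command(port=8888):
    """Build the command line for a standalone Jupyter notebook server."""
    return [
        sys.executable, "-m", "notebook",
        "--no-browser",       # the scheduler machine usually has no display
        f"--port={port}",
    ]


def start_jupyter_subprocess(port=8888):
    """Launch Jupyter in its own process and return that process's id."""
    proc = subprocess.Popen(build_jupyter_command(port))
    return proc.pid


# Usage against a running scheduler (not executed here):
#
#     from dask.distributed import Client
#     client = Client("scheduler-address:8786")
#     client.run_on_scheduler(start_jupyter_subprocess)
```

Because Jupyter now lives in a separate process, it does not share an event loop with the scheduler, but it also will not shut down automatically when the scheduler does.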

Thanks Nick!

Thanks to Nick Bollweg, who answered a question on this topic here

Dask on HPC: a case study

2019-08-28T00:00:00+00:00

Dask is deployed on traditional HPC machines with increasing frequency. In the past week I’ve personally helped four different groups get set up. This is a surprisingly individual process, because every HPC machine has its own idiosyncrasies. Each machine uses a job scheduler like SLURM/PBS/SGE/LSF/…, a network file system, and fast interconnect, but each of those sub-systems has slightly different policies on a machine-by-machine basis, which is where things get tricky.

Typically we can solve these problems in about 30 minutes if we have both:

Someone familiar with the machine, like a power-user or an IT administrator
Someone familiar with setting up Dask

These systems span a large range of scale. At different ends of this scale this week I’ve seen both:

A small in-house 24-node SLURM cluster for research work inside of a bio-imaging lab
Summit, the world’s most powerful supercomputer

In this post I’m going to share a few notes of what I went through in dealing with Summit, which was particularly troublesome. Hopefully this gives a sense for the kinds of situations that arise. These tips likely don’t apply to your particular system, but hopefully they give a flavor of what can go wrong, and the processes by which we track things down.

First, Summit is an IBM PowerPC machine, meaning that packages compiled on normal Intel chips won’t work. Fortunately, Anaconda maintains a download of their distribution that works well with the Power architecture, so that gave me a good starting point.

https://www.anaconda.com/distribution/#linux

Packages do seem to be a few months older than for the normal distribution, but I can live with that.

Install Dask-Jobqueue and configure basic information

We need to tell Dask how many cores and how much memory is on each machine. This process is fairly straightforward, is well documented at jobqueue.dask.org with an informative screencast, and even self-directing with error messages.

In [1]: from dask_jobqueue import PBSCluster
In [2]: cluster = PBSCluster()
ValueError: You must specify how many cores to use per job like ``cores=8``

I’m going to skip this section for now because, generally, novice users are able to handle this. For more information, consider watching this YouTube video (30m).

Invalid operations in the job script

So we make a cluster object with all of our information, we call .scale and we get some error message from the job scheduler.

from dask_jobqueue import LSFCluster
cluster = LSFCluster(
    cores=128,
    memory="600 GB",
    project="GEN119",
    walltime="00:30",
)
cluster.scale(3)  # ask for three nodes

Command:
bsub /tmp/tmp4874eufw.sh
stdout:

Typical usage:
  bsub [LSF arguments] jobscript
  bsub [LSF arguments] -Is $SHELL
  bsub -h[elp] [options]
  bsub -V

NOTES:
 * All jobs must specify a walltime (-W) and project id (-P)
 * Standard jobs must specify a node count (-nnodes) or -ln_slots. These jobs cannot specify a resource string (-R).
 * Expert mode jobs (-csm y) must specify a resource string and cannot specify -nnodes or -ln_slots.

stderr:
ERROR: Resource strings (-R) are not supported in easy mode. Please resubmit without a resource string.
ERROR: -n is no longer supported. Please request nodes with -nnodes.
ERROR: No nodes requested. Please request nodes with -nnodes.

Dask-Jobqueue tried to generate a sensible job script from the inputs that you provided, but the resource manager that you’re using may have additional policies that are unique to that cluster. We debug this by looking at the generated script, and comparing against scripts that are known to work on the HPC machine.

print(cluster.job_script())

#!/usr/bin/env bash

#BSUB -J dask-worker
#BSUB -P GEN119
#BSUB -n 128
#BSUB -R "span[hosts=1]"
#BSUB -M 600000
#BSUB -W 00:30
JOB_ID=${LSB_JOBID%.*}

/ccs/home/mrocklin/anaconda/bin/python -m distributed.cli.dask_worker tcp://scheduler:8786 --nthreads 16 --nprocs 8 --memory-limit 75.00GB --name name --nanny --death-timeout 60 --interface ib0 --interface ib0
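The comparison against a known-good script can be made less manual with a small diff of the scheduler directives. This is only a sketch: the `known_good` header is a hypothetical stand-in for a script that a colleague already runs on the machine, and `scheduler_directives` is an illustrative helper, not part of Dask-Jobqueue.

```python
import difflib


def scheduler_directives(script_text, prefix="#BSUB"):
    """Return the directive lines (e.g. '#BSUB -n 128') of a job script."""
    return [
        line.strip()
        for line in script_text.splitlines()
        if line.strip().startswith(prefix)
    ]


# Header copied from the generated script above.
generated = """\
#!/usr/bin/env bash
#BSUB -J dask-worker
#BSUB -P GEN119
#BSUB -n 128
#BSUB -R "span[hosts=1]"
#BSUB -M 600000
#BSUB -W 00:30
"""

# Hypothetical header from a script known to work on the machine.
known_good = """\
#!/usr/bin/env bash
#BSUB -J my-job
#BSUB -P GEN119
#BSUB -nnodes 1
#BSUB -W 00:30
"""

diff = list(difflib.unified_diff(
    scheduler_directives(known_good),
    scheduler_directives(generated),
    fromfile="known-good",
    tofile="generated",
    lineterm="",
))
print("\n".join(diff))
```

In practice `generated` would be the string returned by `cluster.job_script()`. Lines prefixed with `+` exist only in the generated script, and lines prefixed with `-` exist only in the known-good one.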

After comparing notes with existing scripts that we know to work on Summit, we modify keywords to add and remove certain lines in the header.

cluster = LSFCluster(
    cores=128,
    memory="500 GB",
    project="GEN119",
    walltime="00:30",
    job_extra=["-nnodes 1"],          # <--- new!
    header_skip=["-R", "-n ", "-M"],  # <--- new!
)

And when we call scale this seems to make LSF happy. It no longer dumps out large error messages.

>>> cluster.scale(3)  # things seem to pass
>>>
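One detail worth noticing is the trailing space in `"-n "`. The sketch below assumes that `header_skip` is applied as a plain substring match against each header line. That is an assumption about Dask-Jobqueue's behavior rather than a quotation of its code, and `apply_header_skip` is an illustrative helper. The position of the `job_extra` line within the header is also a guess.

```python
def apply_header_skip(header_lines, header_skip):
    """Drop every header line that contains one of the skip substrings."""
    return [
        line
        for line in header_lines
        if not any(skip in line for skip in header_skip)
    ]


header = [
    "#BSUB -J dask-worker",
    "#BSUB -P GEN119",
    "#BSUB -n 128",
    '#BSUB -R "span[hosts=1]"',
    "#BSUB -M 600000",
    "#BSUB -W 00:30",
    "#BSUB -nnodes 1",  # added through job_extra
]

# With the trailing space, "-n " removes "-n 128" but keeps "-nnodes 1".
kept = apply_header_skip(header, ["-R", "-n ", "-M"])
print("\n".join(kept))

# Without it, "-n" would also remove the "-nnodes 1" line we just added.
too_greedy = apply_header_skip(header, ["-R", "-n", "-M"])
```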

Workers don’t connect to the Scheduler

So things seem fine from LSF’s perspective, but when we connect up a client to our cluster we don’t see anything arriving.

>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> client
<Client: scheduler='tcp://10.41.0.34:41107' processes=0 cores=0>

Two things to check. First, have the jobs actually made it through the queue? Typically we use a resource manager operation, like qstat, squeue, or bjobs for this. Maybe our jobs are trapped in the queue?

$ bjobs
JOBID   USER       STAT   SLOTS    QUEUE       START_TIME    FINISH_TIME   JOB_NAME
600785  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
600786  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
600784  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker

Nope, it looks like they’re in a running state. Now we go and look at their logs. It can sometimes be tricky to track down the log files from your jobs, but your IT administrator should know where they are. Often they’re where you ran your job from, and have the Job ID in the filename.

$ cat dask-worker.600784.err
distributed.worker - INFO -       Start worker at: tcp://128.219.134.81:44053
distributed.worker - INFO -          Listening to: tcp://128.219.134.81:44053
distributed.worker - INFO -          dashboard at:       128.219.134.81:34583
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         16
distributed.worker - INFO -                Memory:                   75.00 GB
distributed.worker - INFO -       Local Directory: /autofs/nccs-svm1_home1/mrocklin/worker-ybnhk4ib
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
...

So the worker processes have started, but they’re having difficulty connecting to the scheduler. When we ask an IT administrator, they identify the address here as being on the wrong network interface:

128.219.134.74  <--- not accessible network address
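When there are many log files to look through, it can help to pull the address out programmatically. The following is a minimal sketch that assumes the log lines look like the excerpt above; the function name `scheduler_addresses` is made up for illustration and is not part of Dask.

```python
import re

# Matches lines such as:
# distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
WAITING = re.compile(r"Waiting to connect to:\s+(\S+)")

def scheduler_addresses(log_text):
    """Return the distinct scheduler addresses a worker log is waiting on."""
    seen = []
    for line in log_text.splitlines():
        match = WAITING.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

sample = """\
distributed.worker - INFO -       Start worker at: tcp://128.219.134.81:44053
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
"""
print(scheduler_addresses(sample))
```

The address that comes back is the one to show to whoever knows the network layout of the machine.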

So we run ifconfig, and find the infiniband network interface, ib0, which is more broadly accessible.

cluster = LSFCluster(
    cores=128,
    memory="500 GB",
    project="GEN119",
    walltime="00:30",
    job_extra=["-nnodes 1"],
    header_skip=["-R", "-n ", "-M"],
    interface="ib0",                    # <--- new!
)

We try this out and still, no luck :(

Interactive nodes

The expert user then says “Oh, our login nodes are pretty locked-down, let’s try this from an interactive compute node. Things tend to work better there”. We run some arcane bash command (I’ve never seen two of these that look alike so I’m going to omit it here), and things magically start working. Hooray!

We run a tiny Dask computation just to prove that we can do some work.

>>> client = Client(cluster)
>>> client.submit(lambda x: x + 1, 10).result()
11

Actually, it turns out that we were eventually able to get things running from the login nodes on Summit using a slightly different bsub command in LSF, but I’m going to omit details here because we’re fixing this in Dask and it’s unlikely to affect future users (I hope?). Locked-down login nodes remain a common cause of failed connections across a variety of systems. I’d estimate that they affect something like 30% of the systems that I interact with.

SSH Tunneling

It’s important to get the dashboard up and running so that you can see what’s going on. Typically we do this with SSH tunnelling. Most HPC people know how to do this and it’s covered in the Youtube screencast above, so I’m going to skip it here.

Jupyter Lab

Many interactive Dask users on HPC today are moving towards using JupyterLab. This choice gives them a notebook, terminals, file browser, and Dask’s dashboard all in a single web tab. This greatly reduces the number of times they have to SSH in, and, with the magic of web proxies, means that they only need to tunnel once.

I conda installed JupyterLab and a proxy library, and then tried to set up the Dask JupyterLab extension.

conda install jupyterlab
pip install jupyter-server-proxy  # to route dashboard through Jupyter's port

Next, we’re going to install the Dask Labextension into JupyterLab in order to get the Dask Dashboard directly into our Jupyter session. For that, we need nodejs in order to install things into JupyterLab. I thought that this was going to be a pain, given the Power architecture, but amazingly, this also seems to be in Anaconda’s default Power channel.

mrocklin@login2.summit $ conda install nodejs  # Thanks conda packaging devs!

Then I install Dask-Labextension, which is both a Python and a JavaScript package:

pip install dask_labextension
jupyter labextension install dask-labextension

Then I set up a password for my Jupyter sessions

jupyter notebook password

And run JupyterLab in a network friendly way

mrocklin@login2.summit $ jupyter lab --no-browser --ip="login2"

And set up a single SSH tunnel from my home machine to the login node

# Be sure to match the login node's hostname and the Jupyter port below

mrocklin@my-laptop $ ssh -L 8888:login2:8888 summit.olcf.ornl.gov

I can now connect to Jupyter from my laptop by navigating to http://localhost:8888 , run the cluster commands above in a notebook, and things work great. Additionally, thanks to jupyter-server-proxy, Dask’s dashboard is also available at http://localhost:8888/proxy/####/status , where #### is the port currently hosting Dask’s dashboard. You can probably find this by looking at cluster.dashboard_link. It defaults to 8787, but if you’ve started a bunch of Dask schedulers on your system recently it’s possible that that port is taken up and so Dask had to resort to using random ports.
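As a small illustration of the URL pattern just described, here is a sketch that builds the proxied address from a dashboard link. The helper name `proxied_dashboard_url` and the example link are assumptions for illustration only; the `/proxy/<port>/status` pattern is the one from the paragraph above.

```python
from urllib.parse import urlparse

def proxied_dashboard_url(dashboard_link, jupyter_url="http://localhost:8888"):
    """Rewrite a dashboard link so that it goes through Jupyter's proxy."""
    port = urlparse(dashboard_link).port
    if port is None:
        raise ValueError(f"no port found in {dashboard_link!r}")
    return f"{jupyter_url.rstrip('/')}/proxy/{port}/status"

# Hypothetical dashboard link on the default port
print(proxied_dashboard_url("http://10.41.0.34:8787/status"))
```

If Dask had to fall back to a random port, passing the actual link through the same function gives the matching proxied address.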

Configuration files

I don’t want to keep typing all of these commands, so now I put things into a single configuration file, and plop that file into ~/.config/dask/summit.yaml (any filename that ends in .yaml will do).

jobqueue:
  lsf:
    cores: 128
    processes: 8
    memory: 500 GB
    job-extra:
      - "-nnodes 1"
    interface: ib0
    header-skip:
      - "-R"
      - "-n "
      - "-M"

labextension:
  factory:
    module: "dask_jobqueue"
    class: "LSFCluster"
    args: []
    kwargs:
      project: your-project-id

Slow worker startup

Now that things are easier to use I find myself using the system more, and some other problems arise.

I notice that it takes a long time to start up a worker. It seems to hang intermittently during startup, so I add a few lines to distributed/__init__.py to print out the state of the main Python thread every second, to see where this is happening:

import threading, sys, time
from . import profile

main_thread = threading.get_ident()

def f():
    # Print the main thread's current call stack once per second
    while True:
        time.sleep(1)
        frame = sys._current_frames()[main_thread]
        print("".join(profile.call_stack(frame)))

thread = threading.Thread(target=f, daemon=True)
thread.start()
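The same trick can be used outside of the distributed package. Below is a sketch that relies only on the standard library, swapping distributed's profile.call_stack for traceback.format_stack; the function name `watch_main_thread` and its parameters are made up for this example.

```python
import sys
import threading
import time
import traceback

def watch_main_thread(interval=1.0, repeats=5, emit=print):
    """Periodically emit the main thread's call stack from a daemon thread."""
    main_id = threading.main_thread().ident

    def watch():
        for _ in range(repeats):
            time.sleep(interval)
            frame = sys._current_frames().get(main_id)
            if frame is not None:
                emit("".join(traceback.format_stack(frame)))

    watcher = threading.Thread(target=watch, daemon=True)
    watcher.start()
    return watcher

# Example: start the watcher, then run the code that appears to hang.
# watcher = watch_main_thread(interval=1.0, repeats=10)
```

Because the watcher is a daemon thread it will not keep the process alive, and limiting the number of repeats keeps the output manageable.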

This prints out a traceback that brings us to this code in Dask:

if is_locking_enabled():
    try:
        self._lock_path = os.path.join(self.dir_path + DIR_LOCK_EXT)
        assert not os.path.exists(self._lock_path)
        logger.debug("Locking %r...", self._lock_path)
        # Avoid a race condition before locking the file
        # by taking the global lock
        try:
            with workspace._global_lock():
                self._lock_file = locket.lock_file(self._lock_path)
                self._lock_file.acquire()

It looks like Dask is trying to use a file-based lock. Unfortunately some NFS systems don’t like file-based locks, or handle them very slowly. In the case of Summit, the home directory is actually mounted read-only from the compute nodes, so a file-based lock will simply fail. Looking up the is_locking_enabled function we see that it checks a configuration value.

def is_locking_enabled():
    return dask.config.get("distributed.worker.use-file-locking")

So we add that to our config file. At the same time I switch from the forkserver to spawn multiprocessing method (I thought that this might help, but it didn’t), which is relatively harmless.

distributed:
  worker:
    multiprocessing-method: spawn
    use-file-locking: False

jobqueue:
  lsf:
    cores: 128
    processes: 8
    memory: 500 GB
    job-extra:
      - "-nnodes 1"
    interface: ib0
    header-skip:
      - "-R"
      - "-n "
      - "-M"

labextension:
  factory:
    module: "dask_jobqueue"
    class: "LSFCluster"
    args: []
    kwargs:
      project: your-project-id
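To make the relationship between the nested YAML and the dotted key in is_locking_enabled concrete, here is a simplified stand-in for that kind of lookup. It is only an illustration: `get_dotted` is a hypothetical helper, and Dask's own dask.config.get handles more cases than this.

```python
def get_dotted(config, key, default=None):
    """Walk a nested dict using a dotted key such as 'a.b.c'."""
    node = config
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# The same structure as part of the YAML file, written as a Python dict
config = {
    "distributed": {
        "worker": {"multiprocessing-method": "spawn", "use-file-locking": False}
    },
    "jobqueue": {"lsf": {"cores": 128, "processes": 8, "interface": "ib0"}},
}

print(get_dotted(config, "distributed.worker.use-file-locking"))
print(get_dotted(config, "jobqueue.lsf.interface"))
```

Each level of indentation in the YAML file corresponds to one dot in the key.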

Conclusion

This post outlines many issues that I ran into when getting Dask to run on one specific HPC system. These problems aren’t universal, so you may not run into them, but they’re also not super-rare. Mostly my objective in writing this up is to give people a sense of the sorts of problems that arise when Dask and an HPC system interact.

None of the problems above are that serious. They’ve all happened before and they all have solutions that can be written down in a configuration file. Finding what the problem is though can be challenging, and often requires the combined expertise of individuals that are experienced with Dask and with that particular HPC system.

There are a few configuration files posted here jobqueue.dask.org/en/latest/configurations.html, which may be informative. The Dask Jobqueue issue tracker is also a fairly friendly place, full of both IT professionals and Dask experts.

Also, as a reminder, you don’t need to have an HPC machine in order to use Dask. Dask is conveniently deployable from other Cloud, Hadoop, and local systems. See the Dask setup documentation for more information.

Future work: GPUs

Summit is fast because it has a ton of GPUs. I’m going to work on that next, but that will probably cover enough content to fill up a whole other blogpost :)

Branches

For anyone playing along at home (or on Summit), I’m operating from the following development branches:

Although hopefully within a month of writing this article, everything should be in a nicely released state.

Dask and ITK for large scale image analysis

2019-08-09T00:00:00+00:00

This post explores using the ITK suite of image processing utilities in parallel with Dask Array.

We cover …

A simple but common example of applying deconvolution across a stack of 3d images
Tips on how to make these two libraries work well together
Challenges that we ran into and opportunities for future improvements.

A Worked Example

Let’s start with a full example applying Richardson Lucy deconvolution to a stack of light sheet microscopy data. This is the same data that we showed how to load in our last blogpost on image loading. You can access the data as tiff files from google drive here, and access the corresponding point spread function images here.

# Load our data from last time
import dask.array as da
imgs = da.from_zarr("AOLLSMData_m4_raw.zarr/", "data")

	Array	Chunk
Bytes	188.74 GB	316.15 MB
Shape	(3, 199, 201, 1024, 768)	(1, 1, 201, 1024, 768)
Count	598 Tasks	597 Chunks
Type	uint16	numpy.ndarray

This dataset has shape (3, 199, 201, 1024, 768):

3 fluorescence color channels,
199 time points,
201 z-slices,
1024 pixels in the y dimension, and
768 pixels in the x dimension.
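As a quick sanity check, the sizes in the table above follow directly from the shape, the chunk shape, and the two-byte uint16 element size. This sketch uses only the standard library and the numbers already shown.

```python
import math

shape = (3, 199, 201, 1024, 768)   # full array
chunk = (1, 1, 201, 1024, 768)     # one chunk
itemsize = 2                       # bytes per uint16 element

total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunk) * itemsize
n_chunks = math.prod(s // c for s, c in zip(shape, chunk))

print(f"total: {total_bytes / 1e9:.2f} GB")
print(f"chunk: {chunk_bytes / 1e6:.2f} MB")
print(f"chunks: {n_chunks}")
```

This reproduces the 188.74 GB total, the 316.15 MB chunk size, and the 597 chunks (one per channel and time point).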

# Load our Point Spread Function (PSF)
import dask.array.image
psf = dask.array.image.imread("AOLLSMData/m4/psfs_z0p1/*.tif")[:, None, ...]

	Array	Chunk
Bytes	2.48 MB	827.39 kB
Shape	(3, 1, 101, 64, 64)	(1, 1, 101, 64, 64)
Count	6 Tasks	3 Chunks
Type	uint16	numpy.ndarray

# Convert data to float32 for computation
import numpy as np
imgs = imgs.astype(np.float32)
# Note: the psf needs to be sampled with a voxel spacing
# consistent with the image's sampling
psf = psf.astype(np.float32)

# Apply Richardson-Lucy Deconvolution
def richardson_lucy_deconvolution(img, psf, iterations=1):
    """ Apply deconvolution to a single chunk of data """
    import itk

    img = img[0, 0, ...]  # remove leading two length-one dimensions
    psf = psf[0, 0, ...]  # remove leading two length-one dimensions

    image = itk.image_view_from_array(img)   # Convert to ITK object
    kernel = itk.image_view_from_array(psf)  # Convert to ITK object

    deconvolved = itk.richardson_lucy_deconvolution_image_filter(
        image,
        kernel_image=kernel,
        number_of_iterations=iterations
    )

    result = itk.array_from_image(deconvolved)  # Convert back to Numpy array
    result = result[None, None, ...]  # Add back the leading length-one dimensions

    return result

out = da.map_blocks(richardson_lucy_deconvolution, imgs, psf, dtype=np.float32)

# Create a local cluster of dask worker processes
# (this could also point to a distributed cluster if you have it)
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=20, threads_per_worker=1)
client = Client(cluster)  # now dask operations use this cluster by default

# Trigger computation and store
out.to_zarr("AOLLSMData_m4_raw.zarr", "deconvolved", overwrite=True)

So in the example above we …

Load data both from Zarr and TIFF files into multi-chunked Dask arrays
Construct a function to apply an ITK routine onto each chunk
Apply that function across the dask array with the dask.array.map_blocks function.
Store the result back into Zarr format

From the perspective of an imaging scientist, the new piece of technology here is the dask.array.map_blocks function. Given a Dask array composed of many NumPy arrays and a function, map_blocks applies that function across each block in parallel, returning a Dask array as a result. It’s a great tool whenever you want to apply an operation across many blocks in a simple fashion. Because Dask arrays are just made out of Numpy arrays it’s an easy way to compose Dask with the rest of the Scientific Python ecosystem.

Building the right function

However, in this case there are a few challenges to constructing the right Numpy -> Numpy function, due to idiosyncrasies in both ITK and Dask Array. Let’s look at our function again:

def richardson_lucy_deconvolution(img, psf, iterations=1):
    """ Apply deconvolution to a single chunk of data """
    import itk

    img = img[0, 0, ...]  # remove leading two length-one dimensions
    psf = psf[0, 0, ...]  # remove leading two length-one dimensions

    image = itk.image_view_from_array(img)   # Convert to ITK object
    kernel = itk.image_view_from_array(psf)  # Convert to ITK object

    deconvolved = itk.richardson_lucy_deconvolution_image_filter(
        image,
        kernel_image=kernel,
        number_of_iterations=iterations
    )

    result = itk.array_from_image(deconvolved)  # Convert back to Numpy array
    result = result[None, None, ...]  # Add back the leading length-one dimensions

    return result

out = da.map_blocks(richardson_lucy_deconvolution, imgs, psf, dtype=np.float32)

This is longer than we would like. Instead, we would have preferred to just use the itk function directly, without all of the steps before and after.

deconvolved = da.map_blocks(itk.richardson_lucy_deconvolution_image_filter, imgs, psf)

What were the extra steps in our function and why were they necessary?

Convert to and from ITK Image objects: ITK functions don’t consume and produce Numpy arrays, they consume and produce their own Image data structure. There are convenient functions to convert back and forth, so handling this is straightforward, but it does need to be handled each time. See ITK #1136 for a feature request that would remove the need for this step.
Unpack and pack singleton dimensions: Our Dask arrays have shapes like the following:
```
Array Shape: (3, 199, 201, 1024, 768)
Chunk Shape: (1,   1, 201, 1024, 768)
```
So our map_blocks function gets NumPy arrays of the chunk size, (1, 1, 201, 1024, 768). However, our ITK functions are meant to work on 3d arrays, not 5d arrays, so we need to remove those first two dimensions.
```
img = img[0, 0, ...]  # remove leading two length-one dimensions
psf = psf[0, 0, ...]  # remove leading two length-one dimensions
```
And then when we’re done, Dask expects to get back 5d arrays like what it provided, so we add these singleton dimensions back in
```
result = result[None, None, ...]  # Add back the leading length-one dimensions
```
Again, this is straightforward for users who are accustomed to NumPy slicing syntax, but does need to be done each time. This adds some friction to our development process, and is another step that can confuse users.

But if you’re comfortable working around things like this, then ITK and map_blocks can be a powerful combination if you want to parallelize out ITK operations across a cluster.
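The unpack and repack step described above can be factored into a small reusable decorator. The following is a sketch using only NumPy, with a toy function standing in for the ITK routine; the names `on_trailing_3d` and `add_kernel_mean` are made up for illustration.

```python
import functools
import numpy as np

def on_trailing_3d(func):
    """Let a 3d -> 3d function accept chunks with leading length-one axes."""
    @functools.wraps(func)
    def wrapper(*chunks, **kwargs):
        squeezed = []
        for chunk in chunks:
            if any(n != 1 for n in chunk.shape[:-3]):
                raise ValueError("leading dimensions must all have length one")
            squeezed.append(chunk.reshape(chunk.shape[-3:]))  # drop leading axes
        result = func(*squeezed, **kwargs)
        n_leading = chunks[0].ndim - 3
        return result[(None,) * n_leading + (Ellipsis,)]      # add them back
    return wrapper

@on_trailing_3d
def add_kernel_mean(img, psf):
    """Toy stand-in for a routine that only understands 3d arrays."""
    return img + psf.mean()

chunk = np.zeros((1, 1, 4, 5, 6), dtype=np.float32)
kernel = np.ones((1, 1, 2, 2, 2), dtype=np.float32)
print(add_kernel_mean(chunk, kernel).shape)
```

The wrapped function returns an array with the same number of dimensions as the chunk it was given, which is what map_blocks expects by default.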

Defining a Dask Cluster

Above we used dask.distributed.LocalCluster to set up 20 single-threaded workers on our local workstation:

from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=20, threads_per_worker=1)
client = Client(cluster)  # now dask operations use this cluster by default

If you had a distributed resource, this is where you would connect it. You would swap out LocalCluster with one of Dask’s other deployment options.

Also, we found that we needed to use many single-threaded processes rather than one multi-threaded process because ITK functions seem to still hold onto the GIL. This is fine, we just need to be aware of it so that we set up our Dask workers appropriately with one thread per process for maximum efficiency. See ITK #1134 for an active Github issue on this topic.

Serialization

We had some difficulty when using the ITK library across multiple processes, because the library itself didn’t serialize well. (If you don’t understand what that means, don’t worry). We solved a bit of this in ITK #1090, but some issues still remain.

We got around this by including the import in the function rather than outside of it.

def richardson_lucy_deconvolution(img, psf, iterations=1):
    import itk   # <--- we work around serialization issues by importing within the function

That way each task imports itk individually, and we sidestep this issue.

Trying Scikit-Image

We also tried out the Richardson Lucy deconvolution operation in Scikit-Image. Scikit-Image is known for being more Scipy/Numpy native, but not always as fast as ITK. Our experience confirmed this perception.

First, we were glad to see that the scikit-image function worked with map_blocks immediately without any packing/unpacking, dimensionality, or serialization issues:

import skimage.restoration

out = da.map_blocks(skimage.restoration.richardson_lucy, imgs, psf)  # just works

So all of that converting to and from image objects or removing and adding singleton dimensions isn’t necessary here.

In terms of performance we were also happy to see that Scikit-Image released the GIL, so we were able to get very high reported CPU utilization when using a small number of multi-threaded processes. However, even though CPU utilization was high, our parallel performance was poor enough that we stuck with the ITK solution, warts and all. More information about this is available in Github issue scikit-image #4083.

Note: sequentially on a single chunk, ITK ran in around 2 minutes while scikit-image ran in 3 minutes. It was only once we started parallelizing that things became slow.

Regardless, our goal in this experiment was to see how well ITK and Dask array played together. It was nice to see what smooth integration looks like, if only to motivate future development in ITK+Dask relations.

Numba GUFuncs

An alternative to da.map_blocks is to use Generalized Universal Functions (gufuncs). These are functions that have many magical properties, one of which is that they operate equally well on both NumPy and Dask arrays. If libraries like ITK or Scikit-Image make their functions into gufuncs then they work without users having to do anything special.

The easiest way to implement gufuncs today is with Numba. I did this on our wrapped richardson_lucy function, just to show how it could work, in case other libraries want to take this on in the future.

import itk
import numba
import numpy as np

@numba.guvectorize(
    ["float32[:,:,:], float32[:,:,:], float32[:,:,:]"],  # we have to specify types
    "(i,j,k),(a,b,c)->(i,j,k)",                          # and dimensionality explicitly
    forceobj=True,
)
def richardson_lucy_deconvolution(img, psf, out):
    # <---- no dimension unpacking!
    iterations = 1
    image = itk.image_view_from_array(np.ascontiguousarray(img))
    kernel = itk.image_view_from_array(np.ascontiguousarray(psf))

    deconvolved = itk.richardson_lucy_deconvolution_image_filter(
        image, kernel_image=kernel, number_of_iterations=iterations
    )
    out[:] = itk.array_from_image(deconvolved)

# Now this function works natively on either NumPy or Dask arrays
out = richardson_lucy_deconvolution(imgs, psf)  # <-- no map_blocks call!

Note that we’ve lost both the dimension unpacking and the map_blocks call. Our function now knows enough information about how it can broadcast that Dask can do the parallelization without being told what to do explicitly.
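To see what a signature like "(i,j,k),(a,b,c)->(i,j,k)" means without installing Numba or ITK, here is a NumPy-only sketch. It uses np.vectorize, which accepts the same signature string and loops over the leading dimensions in plain Python. It is not a true gufunc and says nothing about Dask support; it only illustrates how core dimensions and leading dimensions are separated. The toy function is an assumption for illustration.

```python
import numpy as np

def scale_by_kernel_sum(img, psf):
    """Toy 3d -> 3d stand-in for a deconvolution routine."""
    return img * psf.sum()

# The last three axes of each input are "core" dimensions handed to the function;
# any leading axes are broadcast against each other and looped over.
batched = np.vectorize(scale_by_kernel_sum, signature="(i,j,k),(a,b,c)->(i,j,k)")

imgs = np.ones((3, 2, 4, 5, 6), dtype=np.float32)
psf = np.ones((3, 1, 2, 2, 2), dtype=np.float32)
out = batched(imgs, psf)
print(out.shape)
```

The leading shapes (3, 2) and (3, 1) broadcast to (3, 2), so the toy function is called once per channel and time point with purely 3d inputs.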

This adds some burden onto library maintainers, but makes the user experience much more smooth.

GPU Acceleration

When doing some user research on image processing and Dask, almost everyone we interviewed said that they wanted faster deconvolution. This seemed to be a major pain point. Now we know why. It’s both very common, and very slow.

Running deconvolution on a single chunk of this size takes around 2-4 minutes, and we have hundreds of chunks in a single dataset. Multi-core parallelism can help a bit here, but this problem may also be ripe for GPU acceleration. Similar operations typically have 100x speedups on GPUs. This might be a more pragmatic solution than scaling out to large distributed clusters.
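A back-of-the-envelope estimate shows why this hurts. The sketch below takes the 597 chunks and 20 workers from the worked example and assumes perfectly parallel execution with no scheduling or communication overhead, which is optimistic.

```python
import math

def ideal_wall_minutes(n_chunks, minutes_per_chunk, n_workers):
    """Wall time if chunks run in perfectly parallel rounds with no overhead."""
    rounds = math.ceil(n_chunks / n_workers)
    return rounds * minutes_per_chunk

n_chunks = 597   # chunk count reported for the image array in this post
n_workers = 20   # worker count used in the LocalCluster example

for minutes in (2, 4):
    serial_hours = n_chunks * minutes / 60
    parallel_hours = ideal_wall_minutes(n_chunks, minutes, n_workers) / 60
    print(f"{minutes} min/chunk: {serial_hours:.1f} h serial, "
          f"{parallel_hours:.1f} h on {n_workers} workers")
```

Under these assumptions a single pass over the dataset takes roughly 20 to 40 hours on one core and 1 to 2 hours on 20 workers, for a single iteration of the algorithm.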

What’s next?

This experiment both …

Gives us an example that other imaging scientists can copy and modify to be effective with Dask and ITK together.
Highlights areas of improvement where developers from the different libraries can work to remove some of these rough interaction spots in the future.

It’s worth noting that Dask has done this with lots of libraries within the Scipy ecosystem, including Pandas, Scikit-Image, Scikit-Learn, and others.

We’re also going to continue with our imaging experiment, while these technical issues get worked out in the background. Next up, segmentation!

Python and GPUs: A Status Update

2019-06-19T00:00:00+00:00

This blogpost was delivered in talk form at the recent PASC 2019 conference. Slides for that talk are here.

We’re improving the state of scalable GPU computing in Python.

This post lays out the current status, and describes future work. It also summarizes and links to several other blogposts from recent months that drill down into different topics for the interested reader.

Broadly, we briefly cover the following categories:

Python libraries written in CUDA like CuPy and RAPIDS
Python-CUDA compilers, specifically Numba
Scaling these libraries out with Dask
Network communication with UCX
Packaging with Conda

Performance of GPU accelerated Python Libraries

Probably the easiest way for a Python programmer to get access to GPU performance is to use a GPU-accelerated Python library. These provide a set of common operations that are well tuned and integrate well together.

Many users know libraries for deep learning like PyTorch and TensorFlow, but there are several others for more general-purpose computing. These tend to copy the APIs of popular Python projects:

Numpy on the GPU: CuPy
Numpy on the GPU (again): Jax
Pandas on the GPU: RAPIDS cuDF
Scikit-Learn on the GPU: RAPIDS cuML

These libraries build GPU accelerated variants of popular Python libraries like NumPy, Pandas, and Scikit-Learn. In order to better understand the relative performance differences Peter Entschev recently put together a benchmark suite to help with comparisons. He has produced the following image showing the relative speedup between GPU and CPU:

There are lots of interesting results there. Peter goes into more depth on this in his blogpost.

More broadly though, we see that there is variability in performance. Our mental model for what is fast and slow on the CPU doesn’t necessarily carry over to the GPU. Fortunately though, due to consistent APIs, users that are familiar with Python can easily experiment with GPU acceleration without learning CUDA.

Numba: Compiling Python to CUDA

See also this recent blogpost about Numba stencils and the attached GPU notebook

The built-in operations in GPU libraries like CuPy and RAPIDS cover most common operations. However, in real-world settings we often find messy situations that require writing a little bit of custom code. Switching down to C/C++/CUDA in these cases can be challenging, especially for users that are primarily Python developers. This is where Numba can come in.

Python has this same problem on the CPU as well. Users often couldn’t be bothered to learn C/C++ to write fast custom code. To address this there are tools like Cython or Numba, which let Python programmers write fast numeric code without learning much beyond the Python language.

For example, Numba accelerates the for-loop style code below about 500x on the CPU, from slow Python speeds up to fast C/Fortran speeds.

import numba  # We added these two lines for a 500x speedup

@numba.jit    # We added these two lines for a 500x speedup
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total

The ability to drop down to low-level performant code without context switching out of Python is useful, particularly if you don’t already know C/C++ or have a compiler chain set up for you (which is the case for most Python users today).

This benefit is even more pronounced on the GPU. While many Python programmers know a little bit of C, very few of them know CUDA. Even if they did, they would probably have difficulty in setting up the compiler tools and development environment.

Enter numba.cuda.jit, Numba’s backend for CUDA. Numba.cuda.jit allows Python users to author, compile, and run CUDA code, written in Python, interactively without leaving a Python session. Here is an image of writing a stencil computation that smooths a 2d-image all from within a Jupyter Notebook:

Here is a simplified comparison of Numba CPU/GPU code to compare programming style. The GPU code gets a 200x speed improvement over a single CPU core.

CPU – 600 ms

@numba.jit
def _smooth(x):
    out = np.empty_like(x)
    for i in range(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                         x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) // 9

    return out

or if we use the fancy numba.stencil decorator …

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9

GPU – 3 ms

from numba import cuda

@cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
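When experimenting with kernels like these it helps to have a plain NumPy reference to compare against. The sketch below computes the same 3x3 neighborhood sum with array slicing; it is an illustration rather than part of the original benchmark, and it fills the border with zeros, whereas the versions above leave the border untouched.

```python
import numpy as np

def smooth_reference(x):
    """3x3 neighborhood sum, integer-divided by 9, on interior pixels only."""
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = (
        x[:-2, :-2] + x[:-2, 1:-1] + x[:-2, 2:]
        + x[1:-1, :-2] + x[1:-1, 1:-1] + x[1:-1, 2:]
        + x[2:, :-2] + x[2:, 1:-1] + x[2:, 2:]
    ) // 9
    return out

# On a linear ramp the 3x3 average equals the center pixel
x = np.arange(25, dtype=np.int64).reshape(5, 5)
print(smooth_reference(x)[1:-1, 1:-1])
```

Comparing the interior of this result with the output of the compiled versions is a cheap way to catch indexing mistakes.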

Numba.cuda.jit has been out in the wild for years. It’s accessible, mature, and fun to play with. If you have a machine with a GPU in it and some curiosity then we strongly recommend that you try it out.

conda install numba
# or
pip install numba

>>> import numba.cuda

Scaling with Dask

As mentioned in previous blogposts ( 1, 2, 3, 4 ) we’ve been generalizing Dask, to operate not just with Numpy arrays and Pandas dataframes, but with anything that looks enough like Numpy (like CuPy or Sparse or Jax) or enough like Pandas (like RAPIDS cuDF) to scale those libraries out too. This is working out well. Here is a brief video showing Dask array computing an SVD in parallel, and seeing what happens when we swap out the Numpy library for CuPy.

We see that there is about a 10x speed improvement on the computation. Most importantly, we were able to switch between a CPU implementation and a GPU implementation with a small one-line change, but continue using the sophisticated algorithms with Dask Array, like its parallel SVD implementation.

We also saw a relative slowdown in communication. In general almost all non-trivial Dask + GPU work today is becoming communication-bound. We’ve gotten fast enough at computation that the relative importance of communication has grown significantly. We’re working to resolve this with our next topic, UCX.

Communication with UCX

See this talk by Akshay Venkatesh or view the slides

Also see this recent blogpost about UCX and Dask

We’ve been integrating the OpenUCX library into Python with UCX-Py. UCX provides uniform access to transports like TCP, InfiniBand, shared memory, and NVLink. UCX-Py is the first time that access to many of these transports has been easily accessible from the Python language.

Using UCX and Dask together we’re able to get significant speedups. Here is a trace of the SVD computation from before, both before and after adding UCX:

Before UCX:

After UCX:

There is still a great deal to do here though (the blogpost linked above has several items in the Future Work section).

People can try out UCX and UCX-Py with highly experimental conda packages:

conda create -n ucx -c conda-forge -c jakirkham/label/ucx cudatoolkit=9.2 ucx-proc=*=gpu ucx ucx-py python=3.7

We hope that this work will also affect non-GPU users on HPC systems with Infiniband, or even users on consumer hardware due to the easy access to shared memory communication.

Packaging

In an earlier blogpost we discussed the challenges around installing the wrong versions of CUDA enabled packages that don’t match the CUDA driver installed on the system. Fortunately due to recent work from Stan Seibert and Michael Sarahan at Anaconda, Conda 4.7 now has a special cuda meta-package that is set to the version of the installed driver. This should make it much easier for users in the future to install the correct package.

Conda 4.7 was just released, and comes with many new features other than the cuda meta-package. You can read more about it here.

conda update conda

There is still plenty of work to do in the packaging space today. Everyone who builds conda packages does it their own way, resulting in headache and heterogeneity. This is largely due to not having centralized infrastructure to build and test CUDA enabled packages, like we have in Conda Forge. Fortunately, the Conda Forge community is working together with Anaconda and NVIDIA to help resolve this, though that will likely take some time.

Summary

This post gave an update of the status of some of the efforts behind GPU computing in Python. It also provided a variety of links for future reading. We include them below if you would like to learn more:

Slides
Numpy on the GPU: CuPy
Numpy on the GPU (again): Jax
Pandas on the GPU: RAPIDS cuDF
Scikit-Learn on the GPU: RAPIDS cuML
Benchmark suite
Numba CUDA JIT notebook
A talk on UCX
A blogpost on UCX and Dask
Conda 4.7

Experiments in High Performance Networking with UCX and DGX

2019-06-09T00:00:00+00:00

This post is about experimental and rapidly changing software. Code examples in this post should not be relied upon to work in the future.

This post talks about connecting UCX, a high performance networking library, to Dask, a parallel Python library, to accelerate communication-heavy workloads, particularly when using GPUs.

Additionally, we do this work on a DGX, a high-end multi-CPU multi-GPU machine with a complex internal network. Working in this context was good to force improvements in setting up Dask in heterogeneous situations targeting different network cards, CPU sockets, GPUs, and so on.

Motivation

Many distributed computing workloads are communication-bound. This is common in cases like the following:

Dataframe joins
Machine learning algorithms
Complex array computations

Communication becomes a bigger bottleneck as we accelerate our computation, such as when we use GPUs for computing.

Historically, high performance communication was only available using MPI, or with custom solutions. This post describes an effort to get close to the communication bandwidth of MPI while still maintaining the ease of programmability and accessibility of a dynamic system like Dask.

UCX, Python, and Dask

To get high performance networking in Dask, we wrapped UCX with Python and then connected that to Dask.

The OpenUCX project provides a uniform API around various high performance networking libraries like InfiniBand, traditional networking protocols like TCP/shared memory, and GPU-specific protocols like NVLink. It is a layer beneath something like OpenMPI (the main user of OpenUCX today) that figures out which networking system to use.

Python users today don’t have much access to these network libraries, except through MPI, which is sometimes not ideal. (Try searching for “infiniband” on PyPI.)

This led us to create UCX-Py. UCX-Py is a Python wrapper around the UCX C library, which provides a Pythonic API, both with blocking syntax appropriate for traditional HPC programs, as well as a non-blocking async/await syntax for more concurrent programs (like Dask). For more information on UCX I recommend watching Akshay’s UCX talk from the GPU Technology Conference 2019.

Note: UCX-Py was primarily developed by Akshay Venkatesh (UCX, NVIDIA), Tom Augspurger (Dask, Pandas, Anaconda), and Ben Zaitlen (NVIDIA, RAPIDS, Dask).

We then extended Dask communications to optionally use UCX. If you have UCX and UCX-Py installed, then you can use the ucx:// protocol in addresses or the --protocol ucx flag when starting things up, something like this.

$ dask-scheduler --protocol ucx
Scheduler started at ucx://127.0.0.1:8786

$ dask-worker ucx://127.0.0.1:8786

>>> from dask.distributed import Client
>>> client = Client('ucx://127.0.0.1:8786')

Experiment

We modified our SVD with Dask and CuPy benchmark to use the UCX protocol for inter-process communication and ran it on half of a DGX machine, using four GPUs. Here is a minimal implementation of the UCX-enabled code:

import cupy
import dask
import dask.array
from dask.distributed import Client, wait
from dask_cuda import DGX

# Define DGX cluster and client
cluster = DGX(CUDA_VISIBLE_DEVICES=[0, 1, 2, 3])
client = Client(cluster)

# Create random data
rs = dask.array.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random((1000000, 1000), chunks=(10000, 1000))
x = x.persist()

# Perform distributed SVD
u, s, v = dask.array.linalg.svd(x)
u, s, v = dask.persist(u, s, v)
_ = wait([u, s, v])

By using UCX the overall communication times are reduced by an order of magnitude. To produce the task-stream figures below, the benchmark was run on a DGX-1 with CUDA_VISIBLE_DEVICES=[0,1,2,3]. It is clear that the red task bars, corresponding to inter-process communication, are significantly compressed. Communications that were taking 500ms-1s before now take around 20ms.

Before UCX:

After UCX:

Diving into the Details

On a GPU using NVLink we can get somewhere between 5-10 GB/s throughput between pairs of GPUs. On a CPU this drops down to 1-2 GB/s (which seems well below optimal). These speeds can affect all Dask workloads (array, dataframe, xarray, ML, …), but when the proper hardware is present, other bottlenecks may occur, such as serialization when dealing with text or JSON-like data.
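To put these bandwidth numbers in context, here is a back-of-the-envelope sketch that estimates how long it takes to move one chunk of the SVD example above. It assumes 8-byte float64 elements (an assumption, the example does not state the dtype), ignores latency and serialization, and uses a helper name, transfer_seconds, that is made up for illustration.

```python
# Rough transfer-time estimate for one chunk of the SVD example.
# Assumption: elements are float64 (8 bytes each); latency is ignored.

def transfer_seconds(nbytes, bandwidth_bytes_per_s):
    """Idealized time to move nbytes at a constant bandwidth."""
    return nbytes / bandwidth_bytes_per_s


chunk_shape = (10000, 1000)  # chunks= from the SVD example
itemsize = 8                 # bytes per float64 element (assumed)
chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize

GB = 1e9
for label, bandwidth in [
    ("NVLink, 5 GB/s", 5 * GB),
    ("NVLink, 10 GB/s", 10 * GB),
    ("CPU, 1 GB/s", 1 * GB),
    ("CPU, 2 GB/s", 2 * GB),
]:
    ms = 1000 * transfer_seconds(chunk_bytes, bandwidth)
    print(f"{label}: {ms:.0f} ms per {chunk_bytes / 1e6:.0f} MB chunk")
```

Under these assumptions an 80 MB chunk takes roughly 8-16 ms over NVLink, which is in the same ballpark as the ~20ms communications seen in the task stream.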

This, of course, depends on this fancy networking hardware being present. On the GPU example above we’re mostly relying on NVLink, but we would also get improved performance on an HPC InfiniBand network or even on a single laptop machine using shared memory transports.

The examples above were run on a DGX machine, which includes all of these transports and more (as well as numerous GPUs).

DGX

The test machine used above was a DGX-1, which has eight GPUs, two CPU sockets, four Infiniband network cards, and a complex NVLink arrangement. This is a good example of non-uniform hardware. Certain CPUs are closer to certain GPUs and network cards, and understanding this proximity has an order-of-magnitude effect on performance. This situation isn’t unique to DGX machines. The same situation arises when we have …

Multiple workers in one node, with several nodes in a cluster
Multiple nodes in one rack, with several racks in a data center
Multiple data centers, such as is the case with hybrid cloud

Working with the DGX was interesting because it forced us to start thinking about heterogeneity, and making it easier to specify complex deployment scenarios with Dask.

Here is a diagram showing how the GPUs, CPUs, and Infiniband cards are connected to each other in a DGX-1:

And here is the output of nvidia-smi showing the NVLink, networking, and CPU affinity structure (this is mostly orthogonal to the structure displayed above).

$ nvidia-smi  topo -m
     GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7   ib0   ib1   ib2   ib3
GPU0   X    NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX   SYS   PHB   SYS
GPU1  NV1    X    NV2   NV1   SYS   NV2   SYS   SYS   PIX   SYS   PHB   SYS
GPU2  NV1   NV2    X    NV2   SYS   SYS   NV1   SYS   PHB   SYS   PIX   SYS
GPU3  NV2   NV1   NV2    X    SYS   SYS   SYS   NV1   PHB   SYS   PIX   SYS
GPU4  NV2   SYS   SYS   SYS    X    NV1   NV1   NV2   SYS   PIX   SYS   PHB
GPU5  SYS   NV2   SYS   SYS   NV1    X    NV2   NV1   SYS   PIX   SYS   PHB
GPU6  SYS   SYS   NV1   SYS   NV1   NV2    X    NV2   SYS   PHB   SYS   PIX
GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2    X    SYS   PHB   SYS   PIX
ib0   PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS    X    SYS   PHB   SYS
ib1   SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS    X    SYS   PHB
ib2   PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB   SYS    X    SYS
ib3   SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS   PHB   SYS    X

    CPU Affinity
GPU0  0-19,40-59
GPU1  0-19,40-59
GPU2  0-19,40-59
GPU3  0-19,40-59
GPU4  20-39,60-79
GPU5  20-39,60-79
GPU6  20-39,60-79
GPU7  20-39,60-79

Legend:

  X    = Self
  SYS  = Traverse PCIe as well as the SMP interconnect between NUMA nodes
  NODE = Traverse PCIe as well as the interconnect between PCIe Host Bridges
  PHB  = Traverse PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Traverse multiple PCIe switches (without PCIe Host Bridge)
  PIX  = Traverse a single PCIe switch
  NV#  = Traverse a bonded set of # NVLinks

The DGX was originally designed for deep learning applications. The complex network infrastructure above can be well used by specialized NVIDIA networking libraries like NCCL, which knows how to route things correctly, but is something of a challenge for other more general purpose systems like Dask to adapt to.
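To see how uneven this network is, the sketch below parses the GPU-to-GPU block of the nvidia-smi topo -m output shown above and counts which GPU pairs share an NVLink connection. The function name parse_topology is made up for illustration; only the table contents come from the output above.

```python
from itertools import combinations

# GPU-to-GPU block copied from the `nvidia-smi topo -m` output above.
TOPO = """
GPU0   X    NV1   NV1   NV2   NV2   SYS   SYS   SYS
GPU1  NV1    X    NV2   NV1   SYS   NV2   SYS   SYS
GPU2  NV1   NV2    X    NV2   SYS   SYS   NV1   SYS
GPU3  NV2   NV1   NV2    X    SYS   SYS   SYS   NV1
GPU4  NV2   SYS   SYS   SYS    X    NV1   NV1   NV2
GPU5  SYS   NV2   SYS   SYS   NV1    X    NV2   NV1
GPU6  SYS   SYS   NV1   SYS   NV1   NV2    X    NV2
GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2    X
"""


def parse_topology(text):
    """Return {(i, j): link} for every GPU pair with i < j."""
    rows = [line.split()[1:] for line in text.strip().splitlines()]
    return {(i, j): rows[i][j] for i, j in combinations(range(len(rows)), 2)}


links = parse_topology(TOPO)
nvlink = {pair for pair, link in links.items() if link.startswith("NV")}
no_nvlink = sorted(set(links) - nvlink)

print(len(nvlink), "GPU pairs with NVLink,", len(no_nvlink), "without")

# Is every pair among GPUs 0-3 directly connected by NVLink?
first_half = all(pair in nvlink for pair in combinations(range(4), 2))
print("GPUs 0-3 fully NVLink-connected:", first_half)
```

This is why the experiment above restricts itself to GPUs 0-3: within that group every pair has a direct NVLink connection, while a fair number of pairs across the whole machine fall back to the slower SYS path.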

Fortunately, in meeting this challenge we were able to clean up a number of related issues in Dask. In particular we can now:

Specify a more heterogeneous worker configuration when starting up a local cluster dask/distributed #2675
Learn bandwidth over time dask/distributed #2658
Add Worker plugins to help handle things like CPU affinity (though this is quite general) dask/distributed #2453

With these changes we’re now able to describe most of the DGX structure as configuration in the Python function below:

import os

from dask.distributed import Nanny, SpecCluster, Scheduler
from distributed.worker import TOTAL_MEMORY

from dask_cuda.local_cuda_cluster import cuda_visible_devices


class CPUAffinity:
    """ A Worker plugin to pin CPU affinity """
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        os.sched_setaffinity(0, self.cores)


affinity = {  # See nvidia-smi topo -m
    0: list(range(0, 20)) + list(range(40, 60)),
    1: list(range(0, 20)) + list(range(40, 60)),
    2: list(range(0, 20)) + list(range(40, 60)),
    3: list(range(0, 20)) + list(range(40, 60)),
    4: list(range(20, 40)) + list(range(60, 80)),
    5: list(range(20, 40)) + list(range(60, 80)),
    6: list(range(20, 40)) + list(range(60, 80)),
    7: list(range(20, 40)) + list(range(60, 80)),
}

def DGX(
    interface="ib",
    dashboard_address=":8787",
    threads_per_worker=1,
    silence_logs=True,
    CUDA_VISIBLE_DEVICES=None,
    **kwargs
):
    """ A Local Cluster for a DGX 1 machine

    NVIDIA's DGX-1 machine has a complex architecture mapping CPUs,
    GPUs, and network hardware.  This function creates a local cluster
    that tries to respect this hardware as much as possible.

    It creates one Dask worker process per GPU, and assigns each worker
    process the correct CPU cores and Network interface cards to
    maximize performance.

    That being said, things aren't perfect.  Today a DGX has very high
    performance between certain sets of GPUs and not others.  A Dask DGX
    cluster that uses only certain tightly coupled parts of the computer
    will have significantly higher bandwidth than a deployment on the
    entire thing.

    Parameters
    ----------
    interface: str
        The interface prefix for the infiniband networking cards.  This is
        often "ib" or "bond".  We will add the numeric suffix 0,1,2,3 as
        appropriate.  Defaults to "ib".
    dashboard_address: str
        The address for the scheduler dashboard.  Defaults to ":8787".
    CUDA_VISIBLE_DEVICES: str or list
        String like ``"0,1,2,3"`` or ``[0, 1, 2, 3]`` to restrict
        activity to different GPUs

    Examples
    --------
    >>> from dask_cuda import DGX
    >>> from dask.distributed import Client
    >>> cluster = DGX(interface='ib')
    >>> client = Client(cluster)
    """
    if CUDA_VISIBLE_DEVICES is None:
        CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7")
    if isinstance(CUDA_VISIBLE_DEVICES, str):
        CUDA_VISIBLE_DEVICES = CUDA_VISIBLE_DEVICES.split(",")
    CUDA_VISIBLE_DEVICES = list(map(int, CUDA_VISIBLE_DEVICES))
    memory_limit = TOTAL_MEMORY / 8

    spec = {
        i: {
            "cls": Nanny,
            "options": {
                "env": {
                    "CUDA_VISIBLE_DEVICES": cuda_visible_devices(
                        ii, CUDA_VISIBLE_DEVICES
                    ),
                    "UCX_TLS": "rc,cuda_copy,cuda_ipc",
                },
                "interface": interface + str(i // 2),
                "protocol": "ucx",
                "ncores": threads_per_worker,
                "data": dict,
                "preload": ["dask_cuda.initialize_context"],
                "dashboard_address": ":0",
                "plugins": [CPUAffinity(affinity[i])],
                "silence_logs": silence_logs,
                "memory_limit": memory_limit,
            },
        }
        for ii, i in enumerate(CUDA_VISIBLE_DEVICES)
    }

    scheduler = {
        "cls": Scheduler,
        "options": {
            "interface": interface + str(CUDA_VISIBLE_DEVICES[0] // 2),
            "protocol": "ucx",
            "dashboard_address": dashboard_address,
        },
    }

    return SpecCluster(
        workers=spec,
        scheduler=scheduler,
        silence_logs=silence_logs,
        **kwargs
    )

However, we never got the NVLink structure down. The Dask scheduler currently still assumes uniform bandwidths between workers. We’ve started to make small steps towards changing this, but we’re not there yet (this will be useful as well for people that want to think about in-rack or cross-data-center deployments).

As usual, in solving a highly specific problem, we were able to resolve a number of lingering general issues, which then made our specific problem easy to write down.
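The CPU affinity table in the configuration above was typed in by hand from the nvidia-smi topo -m output. Ranges in that output are inclusive, while Python’s range excludes its stop value, so hand-written tables are easy to get off by one. A small helper could build the table directly from the strings instead. This is only a sketch; parse_cpu_list is a hypothetical name, not part of Dask or dask-cuda.

```python
def parse_cpu_list(spec):
    """Expand a string like "0-19,40-59" into a list of core numbers."""
    cores = []
    for part in spec.split(","):
        if "-" in part:
            start, stop = part.split("-")
            # nvidia-smi ranges are inclusive, so add 1 to the stop value
            cores.extend(range(int(start), int(stop) + 1))
        else:
            cores.append(int(part))
    return cores


# Strings copied from the "CPU Affinity" column of the output above
socket0 = parse_cpu_list("0-19,40-59")    # GPUs 0-3
socket1 = parse_cpu_list("20-39,60-79")   # GPUs 4-7

affinity = {gpu: (socket0 if gpu < 4 else socket1) for gpu in range(8)}
print(len(affinity[0]), "cores for GPU 0;", len(affinity[7]), "cores for GPU 7")
```

The resulting dictionary has the same shape as the hand-written one, so it could be passed to the CPUAffinity plugin unchanged.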

Future Work

There has been significant effort over the last few months to make everything above work. In particular we …

Modified UCX to support client-server workloads
Wrapped UCX with UCX-Py and designed a Python async-await friendly interface
Wrapped UCX-Py with Dask
Hooked everything together to make generic workloads function well

The result is quite nice, especially for more communication heavy workloads. However there is still plenty to do. This section details what we’re thinking about now to continue this work.

Routing within complex networks: If you restrict yourself to four of the eight GPUs in a DGX, you can get 5-12 GB/s between pairs of GPUs. For some workloads this can be significant. It makes the system feel much more like a single unit than a bunch of isolated machines.

However we still can’t get great performance across the whole DGX because there are many GPU-pairs that are not connected by NVLink, and so we get 10x slower speeds. These dominate communication costs if you naively try to use the full DGX.

This might be solved either by:
1. Teaching Dask to avoid these communications
2. Teaching UCX to route communications like these through a chain of multiple NVLink connections
3. Avoiding complex networks altogether. Newer systems like the DGX-2 use NVSwitch, which provides uniform connectivity, with each GPU connected to every other GPU.
Edit: I’ve since learned that UCX should be able to handle this. We should still get PCIe speeds (around 4-7 GB/s) even when we don’t have NVLink once an upstream bug gets fixed. Hooray!
CPU: We can get 1-2 GB/s across InfiniBand, which isn’t bad, but also wasn’t the full 5-8 GB/s that we were hoping for. This deserves more serious profiling to determine what is going wrong. The current guess is that this has to do with memory allocations.
```
In [1]: %time _ = b'0' * 1000000000  # 1 GB
CPU times: user 248 ms, sys: 223 ms, total: 472 ms
Wall time: 470 ms   # <<----- Around 2 GB/s.  Slower than I expected
```
Probably we’re just doing something dumb here.
Package UCX: Currently I’m building the UCX and UCX-Py libraries from source (see appendix below for instructions). Ideally these would become conda packages. John Kirkham (Conda Forge, NVIDIA, Dask) is taking a look at this along with the UCX developers from Mellanox.

See ucx-py #65 for more information.
Learn Heterogeneous Bandwidths: In order to make good scheduling decisions Dask needs to estimate how long it will take to move data between machines. This question is now becoming much more complex, and depends on both the source and destination machines (the network topology), the data type (NumPy array, GPU array, Pandas Dataframe with text), and more. In complex situations our bandwidths can span a 100x range (100 MB/s to 10 GB/s).

Dask will have to develop more complex models for bandwidth, and learn these over time.

See dask/distributed #2743 for more information.
Support other GPU libraries: To send GPU data around we need to teach Dask how to serialize Python objects into GPU buffers. There is code in the dask/distributed repository to do this for Numba, CuPy, and RAPIDS cuDF objects, but we’ve really only tested CuPy seriously. We should expand this by some of the following steps:
1. Try a distributed Dask cuDF join computation
  
  See dask/distributed #2746 for initial work here.
2. Teach Dask to serialize array GPU libraries, like PyTorch and TensorFlow, or possibly anything that supports the __cuda_array_interface__ protocol.
Track down communication failures: We still occasionally get unexplained communication failures. We should stress test this system to discover rough corners.
TCP: Groups with high performing TCP networks can’t yet make use of UCX+Dask (though they can use either one individually).

Currently using UCX in a client-server mode as we’re doing with Dask requires access to RDMA libraries, which are often not found on systems without networking systems like InfiniBand. This means that groups with high performing TCP networks can’t make use of UCX+Dask.

This is in progress at openucx/ucx #3570
Commodity Hardware: Currently this code is only really useful on high performance Linux systems that have InfiniBand or NVLink. However, it would be nice to also use this on more commodity systems, including personal laptop computers using TCP and shared memory.

Currently Dask uses TCP for inter-process communication on a single machine. Using UCX on a personal computer would give us access to shared memory speeds, which tend to be an order of magnitude faster.

See openucx/ucx #3663 for more information.
Tune Performance: The 5-10 GB/s bandwidths that we see with NVLink today are sub-optimal. With UCX-Py alone we’re able to get something like 15 GB/s on large message transfers. We should benchmark and tune our implementation to see what is taking up the extra time. Until things work more robustly though, this is a secondary priority.
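As an illustration of the “Learn Heterogeneous Bandwidths” item above, here is a minimal sketch of a bandwidth model that keeps a separate estimate for each (source, destination) pair and updates it with an exponential moving average. This is not how Dask does it today; the class name, the default bandwidth, and the smoothing factor are all assumptions chosen for the example.

```python
from collections import defaultdict


class BandwidthEstimator:
    """Hypothetical per-link bandwidth model (not Dask's implementation)."""

    def __init__(self, default=100e6, alpha=0.25):
        self.alpha = alpha  # weight given to each new measurement
        # bytes/second for each (source, destination) pair
        self.estimates = defaultdict(lambda: default)

    def observe(self, src, dst, nbytes, seconds):
        """Fold one measured transfer into the running estimate."""
        measured = nbytes / seconds
        old = self.estimates[(src, dst)]
        self.estimates[(src, dst)] = (1 - self.alpha) * old + self.alpha * measured

    def transfer_time(self, src, dst, nbytes):
        """Predicted seconds to move nbytes from src to dst."""
        return nbytes / self.estimates[(src, dst)]


est = BandwidthEstimator(default=100e6, alpha=0.5)
est.observe("gpu0", "gpu1", nbytes=1e9, seconds=0.1)   # a 10 GB/s transfer
print(est.transfer_time("gpu0", "gpu1", 1e9))          # uses the learned estimate
print(est.transfer_time("gpu0", "gpu5", 1e9))          # falls back to the default
```

A real model would probably also need to key on the data type, as noted above, since text-heavy dataframes and GPU arrays serialize at very different speeds.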

Appendix: Setup

Performing these experiments depends currently on development branches in a few repositories. This section includes my current setup.

Create Conda Environment

conda create -n ucx python=3.7 libtool cmake automake autoconf cython bokeh pytest pkg-config ipython dask numba -y

Note: for some reason using conda-forge makes the autogen step below fail.

Set up UCX

# Clone UCX repository and get branch
git clone https://github.com/openucx/ucx
cd ucx
git remote add Akshay-Venkatesh git@github.com:Akshay-Venkatesh/ucx.git
git remote update Akshay-Venkatesh
git checkout ucx-cuda

# Build
git clean -xfd
export CUDA_HOME=/usr/local/cuda-9.2/
./autogen.sh
mkdir build
cd build
../configure --prefix=$CONDA_PREFIX --enable-debug --with-cuda=$CUDA_HOME --enable-mt --disable-cma CPPFLAGS="-I//usr/local/cuda-9.2/include"
make -j install

# Verify
ucx_info -d
which ucx_info  # verify that this is in the conda environment

# Verify that we see NVLink speeds
ucx_perftest -t tag_bw -m cuda -s 1048576 -n 1000 & ucx_perftest dgx15 -t tag_bw -m cuda -s 1048576 -n 1000

Set up UCX-Py

git clone git@github.com:rapidsai/ucx-py
cd ucx-py

export UCX_PATH=$CONDA_PREFIX
make install

Set up Dask

git clone git@github.com:dask/dask.git
cd dask
pip install -e .
cd ..

git clone git@github.com:dask/distributed.git
cd distributed
pip install -e .
cd ..

Optionally set up cupy

pip install cupy-cuda92==6

Optionally set up cudf

conda install -c rapidsai-nightly -c conda-forge -c numba cudf dask-cudf cudatoolkit=9.2

Optionally set up JupyterLab

conda install ipykernel jupyterlab nb_conda_kernels nodejs

For the Dask dashboard

pip install dask_labextension
jupyter labextension install dask-labextension

My Benchmark

I’ve been using the following benchmark to test communication. It allocates a chunked Dask array, and then adds it to its transpose, which forces a lot of communication, but not much computation.

from collections import defaultdict
import asyncio
import time
import numpy as np
from pprint import pprint
import cupy

import dask.array as da
from dask.distributed import Client, wait
from dask_cuda import DGX
from distributed.utils import format_time, format_bytes

async def f():

    # Set up workers on the local machine
    async with DGX(asynchronous=True, silence_logs=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:

            # Create a simple random array
            rs = da.random.RandomState(RandomState=cupy.random.RandomState)
            x = rs.random((40000, 40000), chunks='128 MiB').persist()
            print(x.npartitions, 'chunks')
            await wait(x)

            # Add X to its transpose, forcing computation
            y = (x + x.T).sum()
            result = await client.compute(y)

            # Collect, aggregate, and print peer-to-peer bandwidths
            incoming_logs = await client.run(lambda dask_worker: dask_worker.incoming_transfer_log)
            bandwidths = defaultdict(list)
            for k, L in incoming_logs.items():
                for d in L:
                    if d['total'] > 1_000_000:
                        bandwidths[k, d['who']].append(d['bandwidth'])
            bandwidths = {
                (cluster.scheduler.workers[w1].name,
                    cluster.scheduler.workers[w2].name): [format_bytes(x) + '/s' for x in np.quantile(v, [0.25, 0.50, 0.75])]
                for (w1, w2), v in bandwidths.items()
            }
            pprint(bandwidths)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(f())

Note: most of this example is just getting back diagnostics, which can be easily ignored. Also, you can drop the async/await code if you like. I think that there should probably be more examples in the world using Dask with async/await syntax, so I decided to leave it in.

Composing Dask Array with Numba Stencils

2019-04-09T00:00:00+00:00

In this post we explore four array computing technologies, and how they work together to achieve powerful results.

Numba’s stencil decorator to craft localized compute kernels
Numba’s Just-In-Time (JIT) compiler for array computing in Python
Dask Array for parallelizing array computations across many chunks
NumPy’s Generalized Universal Functions (gufuncs) to tie everything together

In the end we’ll show how a novice developer can write a small amount of Python to efficiently compute localized computation on large amounts of data. In particular we’ll write a simple function to smooth images and apply that in parallel across a large stack of images.

Here is the full code; we’ll dive into it piece by piece below.

import numba

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9


@numba.guvectorize(
    [(numba.int8[:, :], numba.int8[:, :])],
    '(n, m) -> (n, m)'
)
def smooth(x, out):
    out[:] = _smooth(x)


# If you want fake data
import dask.array as da
x = da.ones((1000000, 1000, 1000), chunks=('auto', -1, -1), dtype='int8')

# If you have actual data
import dask_image.imread
x = dask_image.imread.imread('/path/to/*.png')

y = smooth(x)
# dask.array<transpose, shape=(1000000, 1000, 1000), dtype=int8, chunksize=(125, 1000, 1000)>

Note: the smooth function above is more commonly referred to as the 2D mean filter in the image processing community.

Now, let’s break this down a bit

Docs: https://numba.pydata.org/numba-doc/dev/user/stencil.html

Many array computing functions operate only on a local region of the array. This is common in image processing, signals processing, simulation, the solution of differential equations, anomaly detection, time series analysis, and more. Typically we write code that looks like the following:

import numpy as np

def _smooth(x):
    out = np.empty_like(x)  # border cells are left uninitialized here
    for i in range(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                         x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) // 9

    return out

Or something similar. The numba.stencil decorator makes this a bit easier to write down. You just write down what happens on every element, and Numba handles the rest.

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9
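
To make the stencil semantics concrete, here is a pure-Python sketch (no Numba, no NumPy) of the same 3x3 mean on a tiny grid of nested lists. The relative indices in the stencil become offsets around each interior cell. Border cells have no complete neighborhood; as far as I understand, numba.stencil fills them with a constant that defaults to zero, and the sketch mirrors that. The name smooth_reference is made up for this example.

```python
def smooth_reference(x):
    """3x3 mean filter on a list of lists, written out by hand."""
    nrows, ncols = len(x), len(x[0])
    out = [[0] * ncols for _ in range(nrows)]  # border cells stay 0
    for i in range(1, nrows - 1):
        for j in range(1, ncols - 1):
            # Sum the cell and its eight neighbors, then integer-divide
            total = sum(
                x[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
            )
            out[i][j] = total // 9
    return out


grid = [
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [9, 9, 18, 9],
    [9, 9, 9, 9],
]
result = smooth_reference(grid)
for row in result:
    print(row)
```

Every interior cell has the 18 somewhere in its neighborhood, so each one averages to 10, while the border stays at zero.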

Numba JIT

Docs: http://numba.pydata.org/

When we run this function on a NumPy array, we find that it is slow, operating at Python speeds.

x = np.ones((100, 100))
%timeit _smooth(x)
527 ms ± 44.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But if we JIT compile this function with Numba, then it runs more quickly.

@numba.njit
def smooth(x):
    return _smooth(x)

%timeit smooth(x)
70.8 µs ± 6.38 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

For those counting, that’s over 1000x faster!

Note: this function already exists as scipy.ndimage.uniform_filter, which operates at the same speed.

Dask Array

Docs: https://docs.dask.org/en/latest/array.html

In these applications people often have many such arrays and they want to apply this function over all of them. In principle they could do this with a for loop.

from glob import glob
import skimage.io

for fn in glob('/path/to/*.png'):
    img = skimage.io.imread(fn)
    out = smooth(img)
    skimage.io.imsave(fn.replace('.png', '.out.png'), out)

If they wanted to then do this in parallel they would maybe use the multiprocessing or concurrent.futures modules. If they wanted to do this across a cluster then they could rewrite their code with PySpark or some other system.
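For comparison, the concurrent.futures route might look roughly like the sketch below. The process function is a stand-in that only computes the output filename; in practice it would read, smooth, and save the image as in the loop above. The filenames are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process(fn):
    """Stand-in for: read image `fn`, smooth it, save it, return the output name."""
    return fn.replace('.png', '.out.png')


filenames = ['a.png', 'b.png', 'c.png']  # hypothetical; use glob('/path/to/*.png')

# Map the function over all files using a small pool of workers
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(process, filenames))

print(outputs)
```

This works on one machine, but the user now manages the pool themselves, and moving to a cluster means switching tools.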

Or, they could use Dask array, which will handle both the pipelining and the parallelism (single machine or on a cluster) all while still looking mostly like a NumPy array.

import dask_image.imread
x = dask_image.imread.imread('/path/to/*.png')  # a large lazy array of all of our images
y = x.map_blocks(smooth, dtype='int8')

And then, because each chunk of a Dask array is just a NumPy array, we can use the map_blocks function to apply this function across all of our images, and then save them out.

This is fine, but let’s go a bit further, and discuss generalized universal functions from NumPy.

Generalized Universal Functions

Numba Docs: https://numba.pydata.org/numba-doc/dev/user/vectorize.html

NumPy Docs: https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api.generalized-ufuncs.html

A generalized universal function (gufunc) is a normal function that has been annotated with typing and dimension information. For example we can redefine our smooth function as a gufunc as follows:

@numba.guvectorize(
    [(numba.int8[:, :], numba.int8[:, :])],
    '(n, m) -> (n, m)'
)
def smooth(x, out):
    out[:] = _smooth(x)

This function knows that it consumes a 2d array of int8’s and produces a 2d array of int8’s of the same dimensions.

This sort of annotation is a small change, but it gives other systems like Dask enough information to use it intelligently. Rather than call functions like map_blocks, we can just use the function directly, as if our Dask Array was just a very large NumPy array.

# Before gufuncs
y = x.map_blocks(smooth, dtype='int8')

# After gufuncs
y = smooth(x)

This is nice. If you write library code with gufunc semantics then that code just works with systems like Dask, without you having to build in explicit support for parallel computing. This makes the lives of users much easier.
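To illustrate what the signature tells a system like Dask, here is a small pure-Python sketch. In a gufunc signature such as '(n, m) -> (n, m)', the named dimensions are the core dimensions, which are matched against the trailing axes of the input; any leading axes are loop dimensions that the function is applied over independently. The helper names below are made up, and the parsing only handles the first input of a simple signature.

```python
def core_ndim(signature):
    """Count the core dimensions of the first input in a gufunc signature."""
    inputs, _ = signature.split('->')
    first = inputs.strip().split(')')[0].lstrip('(')
    return len([name for name in first.split(',') if name.strip()])


def split_dims(shape, n_core):
    """Split an array shape into (loop dimensions, core dimensions)."""
    cut = len(shape) - n_core
    return shape[:cut], shape[cut:]


n_core = core_ndim('(n, m) -> (n, m)')
loop, core = split_dims((1000000, 1000, 1000), n_core)
print("core dims per call:", core)
print("loop dims (independent calls):", loop)
```

For the image stack in this post, each call sees one 1000 by 1000 image, and the million images along the first axis are independent, which is what lets Dask spread them across chunks.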

Finished result

Let’s see the full example one more time.

import numba
import dask.array as da

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9


@numba.guvectorize(
    [(numba.int8[:, :], numba.int8[:, :])],
    '(n, m) -> (n, m)'
)
def smooth(x, out):
    out[:] = _smooth(x)

x = da.ones((1000000, 1000, 1000), chunks=('auto', -1, -1), dtype='int8')
smooth(x)

This code is decently approachable by novice users. They may not understand the internal details of gufuncs or Dask arrays or JIT compilation, but they can probably copy-paste-and-modify the example above to suit their needs.

The parts that they do want to change are easy to change, like the stencil computation, and creating an array of their own data.

This workflow is efficient and scalable, using low-level compiled code and potentially clusters of thousands of computers.

What could be better

There are a few things that could make this workflow nicer.

It would be nice not to have to specify dtypes in guvectorize, but instead specialize to types as they arrive. numba/numba #2979
Support GPU accelerators for the stencil computations using numba.cuda.jit. Stencil computations are obvious candidates for GPU acceleration, and this is a good accessible point where novice users can specify what they want in a way that is sufficiently constrained for automated systems to rewrite it as CUDA somewhat easily. numba/numba #3915
It would have been nicer to be able to apply the @guvectorize decorator directly on top of the stencil function like this.
```
@numba.guvectorize(...)
@numba.stencil
def average(x):
    ...
```
Rather than have two functions. numba/numba #3914

You may have noticed that our guvectorize function had to assign its result into an out parameter.

@numba.guvectorize(
    [(numba.int8[:, :], numba.int8[:, :])],
    '(n, m) -> (n, m)'
)
def smooth(x, out):
    out[:] = _smooth(x)

It would have been nicer, perhaps, to just return the output

def smooth(x):
    return _smooth(x)

numba/numba #3916

The dask-image library could use an imsave function

dask/dask-image #110

Aspirational Result

With all of these, we might then be able to write the code above as follows

# This is aspirational

import numba
import dask_image

@numba.guvectorize(
    [(numba.int8[:, :], numba.int8[:, :])],
    signature='(n, m) -> (n, m)',
    target='gpu'
)
@numba.stencil
def smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9

x = dask_image.io.imread('/path/to/*.png')
y = smooth(x)
dask_image.io.imsave(y, '/path/to/out/*.png')

Update: Now with GPUs!

After writing this blogpost I did a small update where I used numba.cuda.jit to implement the same smooth function on a GPU to achieve a 200x speedup with only a modest increase to code complexity.

That notebook is here.

Building GPU Groupby-Aggregations for Dask

2019-03-04T00:00:00+00:00

We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations like the following to work well.

df.groupby('x').y.mean()

This post describes the kind of work we had to do as a model for future development.

Plan

As outlined in a previous post, Dask, Pandas, and GPUs: first steps, our plan to produce distributed GPU dataframes was to combine Dask DataFrame with cuDF. In particular, we had to

change Dask DataFrame so that it would parallelize not just around the Pandas DataFrames that it works with today, but around anything that looked enough like a Pandas DataFrame
change cuDF so that it would look enough like a Pandas DataFrame to fit within the algorithms in Dask DataFrame

Changes

On the Dask side this mostly meant:

Replacing isinstance(df, pd.DataFrame) checks with is_dataframe_like(df) checks (after defining suitable is_dataframe_like/is_series_like/is_index_like functions)
Avoiding some more exotic functionality in Pandas, and instead trying to use more common functionality that we can expect to be in most DataFrame implementations
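
The idea behind a check like is_dataframe_like can be sketched as follows. This is a simplified illustration of duck typing, not Dask’s actual implementation; the function name looks_like_dataframe, the list of required attributes, and the FakeGPUFrame class are all assumptions made for the example.

```python
def looks_like_dataframe(obj):
    """Hypothetical duck-type check; the attribute list is illustrative only."""
    required = ("columns", "dtypes", "groupby", "head", "merge")
    return all(hasattr(obj, name) for name in required)


class FakeGPUFrame:
    """Stand-in for a non-Pandas dataframe that exposes a Pandas-like API."""
    columns = ["x", "y"]
    dtypes = {"x": "int64", "y": "float64"}

    def groupby(self, key):
        raise NotImplementedError

    def head(self, n=5):
        raise NotImplementedError

    def merge(self, other):
        raise NotImplementedError


print(looks_like_dataframe(FakeGPUFrame()))  # has every required attribute
print(looks_like_dataframe([1, 2, 3]))       # a plain list does not
```

Checking for behavior rather than a concrete type is what lets the same Dask algorithms accept Pandas, cuDF, or any other sufficiently similar dataframe.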

On the cuDF side this means making dozens of tiny changes to align the cuDF API to the Pandas API, and to add in missing features.

Dask Changes:
- Remove explicit pandas checks and provide cudf lazy registration #4359
- Replace isinstance(…, pandas) with is_dataframe_like #4375
- Add has_parallel_type
- Lazily register more cudf functions and move to backends file #4396
- Avoid checking against types in is_dataframe_like #4418
- Replace cudf-specific code with dask-cudf import #4470
- Avoid groupby.agg(callable) in groupby-var #4482 – this one is notable in that by simplifying our Pandas usage we actually got a significant speedup on the Pandas side.
cuDF Changes:

I don’t really expect anyone to go through all of those issues, but my hope is that by skimming over the issue titles people will get a sense for the kinds of changes we’re making here. It’s a large number of small things.

Also, kudos to Thomson Comer who solved most of the cuDF issues above.

There are still some pending issues

Square Root #1055, needed for groupby-std
cuDF needs multi-index support for columns #483, needed for:
```
groupby.agg({'x': ['sum', 'mean'], 'y': ['min', 'max']})
```

But things mostly work

But generally things work pretty well today:

In [1]: import dask_cudf

In [2]: df = dask_cudf.read_csv('yellow_tripdata_2016-*.csv')

In [3]: df.groupby('passenger_count').trip_distance.mean().compute()
Out[3]: <cudf.Series nrows=10 >

In [4]: _.to_pandas()
Out[4]:
0    0.625424
1    4.976895
2    4.470014
3    5.955262
4    4.328076
5    3.079661
6    2.998077
7    3.147452
8    5.165570
9    5.916169
dtype: float64

Experience

First, most of this work was handled by the cuDF developers (which may be evident from the relative lengths of the issue lists above). When we started this process it felt like a never-ending stream of tiny issues. We weren’t able to see the next set of issues until we had finished the current set. Fortunately, most of them were pretty easy to fix. Additionally, as we went on, it seemed to get a bit easier over time.

Additionally, lots of things work other than groupby-aggregations as a result of the changes above. From the perspective of someone accustomed to Pandas, the cuDF library is starting to feel more reliable. We hit missing functionality less frequently when using cuDF on other operations.

What’s next?

More recently we’ve been working on the various join/merge operations in Dask DataFrame like indexed joins on a sorted column, joins between large and small dataframes (a common special case) and so on. Getting these algorithms from the mainline Dask DataFrame codebase to work with cuDF is resulting in a similar set of issues to what we saw above with groupby-aggregations, but so far the list is much smaller. We hope that this is a trend as we continue on to other sets of functionality into the future like I/O, time-series operations, rolling windows, and so on.

Running Dask and MPI programs together

2019-01-31T00:00:00+00:00

We present an experiment on how to pass data from a loosely coupled parallel computing system like Dask to a tightly coupled parallel computing system like MPI.

We give motivation and a complete digestible example.

Here is a gist of the code and results.

Motivation

Disclaimer: Nothing in this post is polished or production ready. This is an experiment designed to start conversation. No long-term support is offered.

We often get the following question:

How do I use Dask to pre-process my data, but then pass those results to a traditional MPI application?

You might want to do this because you’re supporting legacy code written in MPI, or because your computation requires tightly coupled parallelism of the sort that only MPI can deliver.

First solution: Write to disk

The simplest thing to do of course is to write your Dask results to disk and then load them back from disk with MPI. Given the relative cost of your computation to data loading, this might be a great choice.

For the rest of this blogpost we’re going to assume that it’s not.

Second solution

We have a trivial MPI library written in MPI4Py where each rank just prints out all the data that it was given. In principle though it could call into C++ code, and do arbitrary MPI things.

# my_mpi_lib.py
from mpi4py import MPI

comm = MPI.COMM_WORLD

def print_data_and_rank(chunks: list):
    """ Fake function that mocks out how an MPI function should operate

    -   It takes in a list of chunks of data that are present on this machine
    -   It does whatever it wants to with this data and MPI
        Here for simplicity we just print the data and print the rank
    -   Maybe it returns something
    """
    rank = comm.Get_rank()

    for chunk in chunks:
        print("on rank:", rank)
        print(chunk)

    return sum(chunk.sum() for chunk in chunks)

In our dask program we’re going to use Dask normally to load in data, do some preprocessing, and then hand off all of that data to each MPI rank, which will call the print_data_and_rank function above to initialize the MPI computation.

# my_dask_script.py

# Set up Dask workers from within an MPI job using the dask_mpi project
# See https://dask-mpi.readthedocs.io/en/latest/

from dask_mpi import initialize
initialize()

from dask.distributed import Client, wait, futures_of
client = Client()

# Use Dask Array to "load" data (actually just create random data here)

import dask.array as da
x = da.random.random(100000000, chunks=(1000000,))
x = x.persist()
wait(x)

# Find out where data is on each worker
# TODO: This could be improved on the Dask side to reduce boiler plate

from toolz import first
from collections import defaultdict
key_to_part_dict = {str(part.key): part for part in futures_of(x)}
who_has = client.who_has(x)
worker_map = defaultdict(list)
for key, workers in who_has.items():
    worker_map[first(workers)].append(key_to_part_dict[key])


# Call an MPI-enabled function on the list of data present on each worker

from my_mpi_lib import print_data_and_rank

futures = [client.submit(print_data_and_rank, list_of_parts, workers=worker)
           for worker, list_of_parts in worker_map.items()]

wait(futures)

client.close()

Then we can call this mix of Dask and an MPI program using normal mpirun or mpiexec commands.

mpirun -np 5 python my_dask_script.py

What just happened

So MPI started up and ran our script. The dask-mpi project sets up a Dask scheduler on rank 0, runs our client code on rank 1, and then runs a bunch of workers on ranks 2+.

Rank 0: Runs a Dask scheduler
Rank 1: Runs our script
Ranks 2+: Run Dask workers
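
As a rough illustration of the layout above, here is a toy sketch of the rank-to-role mapping. It is not dask-mpi's actual code, and the function name `role_for_rank` is made up for this example.

```python
from collections import Counter


def role_for_rank(rank: int) -> str:
    """Return the role described above for a given MPI rank (illustrative only)."""
    if rank == 0:
        return "scheduler"
    if rank == 1:
        return "client script"
    return "worker"


# `mpirun -np 5` starts five ranks, numbered 0 through 4
roles = Counter(role_for_rank(rank) for rank in range(5))
print(dict(roles))
```

Under this layout, the `mpirun -np 5` command used above leaves three ranks for Dask workers.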

Our script then created a Dask array, though presumably here it would read in data from some source and do more complex Dask manipulations before continuing on.

We then wait until all of the Dask work has finished and is in a quiet state. We then query the state in the scheduler to find out where all of that data lives. That’s this code here:

# Find out where data is on each worker
# TODO: This could be improved on the Dask side to reduce boiler plate

from toolz import first
from collections import defaultdict
key_to_part_dict = {str(part.key): part for part in futures_of(x)}
who_has = client.who_has(x)
worker_map = defaultdict(list)
for key, workers in who_has.items():
    worker_map[first(workers)].append(key_to_part_dict[key])

Admittedly, this code is gross, and not particularly friendly or obvious to non-Dask experts (or even to Dask experts themselves; I had to steal this from the Dask XGBoost project, which does the same trick).
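
The grouping step can be isolated in a small function. The sketch below is hypothetical (the name `group_parts_by_worker` is not part of Dask) and works on plain dictionaries, so it can be tried without a cluster. It assumes, as the code above does, that `who_has` maps each key to a collection of worker addresses; the keys and addresses in the demonstration are invented.

```python
from collections import defaultdict


def group_parts_by_worker(who_has, key_to_part):
    """Group data parts by the first worker that holds each one.

    who_has:     mapping of key -> collection of worker addresses
    key_to_part: mapping of key -> the part (for example a future) for that key
    """
    worker_map = defaultdict(list)
    for key, workers in who_has.items():
        workers = list(workers)
        if not workers:
            raise ValueError(f"no worker holds {key!r}")
        # Like `first(workers)` above: pick one holder, ignore any replicas
        worker_map[workers[0]].append(key_to_part[key])
    return dict(worker_map)


# Made-up keys and addresses standing in for a real cluster's state
who_has = {
    "chunk-0": ["tcp://10.0.0.2:4000"],
    "chunk-1": ["tcp://10.0.0.3:4000"],
    "chunk-2": ["tcp://10.0.0.2:4000", "tcp://10.0.0.3:4000"],
}
key_to_part = {key: f"<part for {key}>" for key in who_has}

worker_map = group_parts_by_worker(who_has, key_to_part)
for worker, parts in worker_map.items():
    print(worker, parts)
```

In the script above, the two arguments would be `client.who_has(x)` and `key_to_part_dict`.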

But after that we just call our MPI library’s initialize function, print_data_and_rank, on all of our data using Dask’s Futures interface. That function gets the data directly from local memory (the Dask workers and MPI ranks are in the same process), and does whatever the MPI application wants.

Future work

This could be improved in a few ways:

The “gross” code referred to above could probably be placed into some library code to make this pattern easier for people to use.
Ideally the Dask part of the computation wouldn’t also have to be managed by MPI, but could maybe start up MPI on its own.

You could imagine Dask running on something like Kubernetes doing highly dynamic work, scaling up and down as necessary. Then it would get to a point where it needed to run some MPI code so it would, itself, start up MPI on its worker processes and run the MPI application on its data.
We haven’t really said anything about resilience here. My guess is that this isn’t hard to do (Dask has plenty of mechanisms to build complex inter-task relationships) but I didn’t solve it above.

Here is a gist of the code and results.

Single-Node Multi-GPU Dataframe Joins

2019-01-29T00:00:00+00:00

We experiment with single-node multi-GPU joins using cuDF and Dask. We find that the in-GPU computation is faster than communication. We also present context and plans for near-future work, including improving high performance communication in Dask with UCX.

Here is a notebook of the experiment in this post

Introduction

In a recent post we showed how Dask + cuDF could accelerate reading CSV files using multiple GPUs in parallel. That operation quickly became bound by the speed of our disk after we added a few GPUs. Now we try a very different kind of operation, multi-GPU joins.

This workload can be communication-heavy, especially if the column on which we are joining is not sorted nicely, and so provides a good example on the other extreme from parsing CSV.
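
To get a feel for why sortedness matters, here is a toy model in plain Python. It is not how Dask or cuDF shuffle data; it only assumes that rows sit in equal-sized contiguous partitions and that each partition should end up owning one contiguous range of ids. The function name `fraction_moved` and the small sizes are made up for this sketch.

```python
import random


def fraction_moved(ids, n_parts, n_keys):
    """Fraction of rows whose id belongs to a different partition than
    the one the row currently sits in (toy range-partitioning model)."""
    n_rows = len(ids)
    moved = 0
    for position, key in enumerate(ids):
        current = position * n_parts // n_rows  # partition holding the row now
        target = key * n_parts // n_keys        # partition owning this id range
        if current != target:
            moved += 1
    return moved / n_rows


random.seed(0)
n_rows, n_keys, n_parts = 100_000, 5_000, 8
ids = [random.randrange(n_keys) for _ in range(n_rows)]

unsorted_fraction = fraction_moved(ids, n_parts, n_keys)
sorted_fraction = fraction_moved(sorted(ids), n_parts, n_keys)
print(f"unsorted ids: {unsorted_fraction:.1%} of rows must move")
print(f"sorted ids:   {sorted_fraction:.1%} of rows must move")
```

With uniformly random ids and eight partitions, roughly seven in eight rows sit in the wrong partition under this model, while sorted ids leave nearly every row in place.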

Benchmark

Construct random data using the CPU

Here we use Dask array and Dask dataframe to construct two random tables with a shared id column. We can play with the number of rows of each table and the number of keys to make the join challenging in a variety of ways.

import dask.array as da
import dask.dataframe as dd

n_rows = 1000000000
n_keys = 5000000

left = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1)

n_rows = 10000000

right = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='y'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1)

Send to the GPUs

We have two Dask dataframes composed of many Pandas dataframes of our random data. We now map the cudf.from_pandas function across these to make a Dask dataframe of cuDF dataframes.

import dask
import cudf

gleft = left.map_partitions(cudf.from_pandas)
gright = right.map_partitions(cudf.from_pandas)

gleft, gright = dask.persist(gleft, gright)  # persist data in device memory

What’s nice here is that there wasn’t any special dask_pandas_dataframe_to_dask_cudf_dataframe function. Dask composed nicely with cuDF. We didn’t need to do anything special to support it.

We also persisted the data in device memory.

After this, simple operations are easy and fast and use our eight GPUs.

>>> gleft.x.sum().compute()  # this takes 250ms
500004719.254711

Join

We’ll use standard Pandas syntax to merge the datasets, persist the result in RAM, and then wait.

out = gleft.merge(gright, on=['id'])  # this is lazy

Profile and analyze results

We now look at the Dask diagnostic plots for this computation.

Task stream and communication

When we look at Dask’s task stream plot we see that each of our eight threads (each of which manages a single GPU) spent most of its time in communication (red is communication time). The actual merge and concat tasks are quite fast relative to the data transfer time.

That’s not too surprising. For this computation I’ve turned off any attempt to communicate between devices (more on this below) so the data is being moved from the GPU to the CPU memory, then serialized and put onto a TCP socket. We’re moving tens of GB on a single machine, so we’re seeing about 1GB/s total throughput of the system, which is typical for TCP-on-localhost in Python.

Flamegraph of computation

We can also look more deeply at the computational costs in Dask’s flamegraph-style plot. This shows which lines of our functions were taking up the most time (down to the Python level at least).

This flame graph shows which lines of cudf code we spent time on while computing (excluding the main communication costs mentioned above). It may be interesting for those trying to further optimize performance. It shows that most of our costs are in memory allocation. Like communication, this has actually also been fixed, in RAPIDS’ optional memory management pool; it just isn’t the default yet, so I didn’t use it here.

Plans for efficient communication

The cuDF library actually has a decent approach to single-node multi-GPU communication that I’ve intentionally turned off for this experiment. That approach cleverly used Dask to communicate device pointer information using Dask’s normal channels (this is small and fast) and then used that information to initiate a side-channel communication for the bulk of the data. This approach was effective, but somewhat fragile. I’m inclined to move on from it in favor of …

UCX. The UCX project provides a single API that wraps around several transports like TCP, Infiniband, shared memory, and also GPU-specific transports. UCX claims to find the best way to communicate data between two points given the hardware available. If Dask were able to use this for communication then it would provide both efficient GPU-to-GPU communication on a single machine, and also efficient inter-machine communication when efficient networking hardware like Infiniband was present, even outside the context of GPUs.

There is some work we need to do here:

We need to make a Python wrapper around UCX
We need to make an optional Dask Comm around this ucx-py library that allows users to specify endpoints like ucx://path-to-scheduler
We need to make Python memoryview-like objects that refer to device memory
…

This work is already in progress by Akshay Venkatesh, who works on UCX, and Tom Augspurger, a core Dask/Pandas developer. I suspect that they’ll write about it soon. I’m looking forward to seeing what comes of it, both for Dask and for high performance Python generally.

It’s worth pointing out that this effort won’t just help GPU users. It should help anyone on advanced networking hardware, including the mainstream scientific HPC community.

Summary

Single-node Multi-GPU joins have a lot of promise. In fact, earlier RAPIDS developers got this running much faster than I was able to do above through the clever communication tricks I briefly mentioned. The main purpose of this post is to provide a benchmark for joins that we can use in the future, and to highlight when communication can be essential in parallel computing.

Now that GPUs have accelerated the computation time of each of our chunks of work we increasingly find that other systems become the bottleneck. We didn’t care as much about communication before because computational costs were comparable. Now that computation is an order of magnitude cheaper, other aspects of our stack become much more important.

I’m looking forward to seeing where this goes.

Come help!

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

If you’re interested in being paid to focus more on these topics, then consider applying for a job. NVIDIA’s RAPIDS team is looking to hire engineers for Dask development with GPUs and other data analytics library development projects.

Senior Library Software Engineer - RAPIDS

Dask, Pandas, and GPUs: first steps

2019-01-13T00:00:00+00:00

We’re building a distributed GPU Pandas dataframe out of cuDF and Dask Dataframe. This effort is young.

This post describes the current situation, our general approach, and gives examples of what does and doesn’t work today. We end with some notes on scaling performance.

You can also view the experiment in this post as a notebook.

And here is a table of results:

Architecture	Time	Bandwidth
Single CPU Core	3min 14s	50 MB/s
Eight CPU Cores	58s	170 MB/s
Forty CPU Cores	35s	285 MB/s
One GPU	11s	900 MB/s
Eight GPUs	5s	2000 MB/s

Building Blocks: cuDF and Dask

Building a distributed GPU-backed dataframe is a large endeavor. Fortunately we’re starting on a good foundation and can assemble much of this system from existing components:

The cuDF library aims to implement the Pandas API on the GPU. It gets good speedups on standard operations like reading CSV files, filtering and aggregating columns, joins, and so on.
```
import cudf  # looks and feels like Pandas, but runs on the GPU

df = cudf.read_csv('myfile.csv')
df = df[df.name == 'Alice']
df.groupby('id').value.mean()
```
cuDF is part of the growing RAPIDS initiative.
The Dask Dataframe library provides parallel algorithms around the Pandas API. It composes large operations like distributed groupbys or distributed joins from a task graph of many smaller single-node groupbys or joins accordingly (and many other operations).
```
import dask.dataframe as dd  # looks and feels like Pandas, but runs in parallel

df = dd.read_csv('myfile.*.csv')
df = df[df.name == 'Alice']
df.groupby('id').value.mean().compute()
```
The Dask distributed task scheduler provides general-purpose parallel execution given complex task graphs. It’s good for adding multi-node computing into an existing codebase.
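
To make the composition idea in the Dask Dataframe item concrete, here is a toy sketch in plain Python. It is not Dask’s implementation, and the function names are invented. It shows how a groupby-mean over the whole dataset can be assembled from small per-partition groupbys that each return a sum and a count per key.

```python
from collections import defaultdict


def partial_groupby(rows):
    """Per-partition step: map each key to [sum, count] for that partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def combine_means(partials):
    """Final step: add up the per-partition sums and counts, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (value_sum, count) in partial.items():
            total[key][0] += value_sum
            total[key][1] += count
    return {key: value_sum / count for key, (value_sum, count) in total.items()}


# Two small "partitions" of (id, value) rows
partitions = [
    [("a", 1.0), ("b", 2.0)],
    [("a", 3.0)],
]
means = combine_means(partial_groupby(rows) for rows in partitions)
print(means)
```

The per-partition step only ever sees one partition, which is why it can be handed to any single-node dataframe library that provides the same groupby operations.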

Given these building blocks, our approach is to make the cuDF API close enough to Pandas that we can reuse the Dask Dataframe algorithms.

Benefits and Challenges to this approach

This approach has a few benefits:

We get to reuse the parallel algorithms found in Dask Dataframe originally designed for Pandas.
It consolidates the development effort within a single codebase so that future effort spent on CPU Dataframes benefits GPU Dataframes and vice versa. Maintenance costs are shared.
By building code that works equally with two DataFrame implementations (CPU and GPU) we establish conventions and protocols that will make it easier for other projects to do the same, either with these two Pandas-like libraries, or with future Pandas-like libraries.

This approach also aims to demonstrate that the ecosystem should support Pandas-like libraries, rather than just Pandas. For example, if (when?) the Arrow library develops a computational system then we’ll be in a better place to roll that in as well.
When doing any refactor we tend to clean up existing code.

For example, to make dask dataframe ready for a new GPU Parquet reader we end up refactoring and simplifying our Parquet I/O logic.

The approach also has some drawbacks. Namely, it places API pressure on cuDF to match Pandas so:

Slight differences in API now cause larger problems, such as these:
- Join column ordering differs rapidsai/cudf #251
- Groupby aggregation column ordering differs rapidsai/cudf #483
cuDF has some pressure on it to repeat what some believe to be mistakes in the Pandas API.

For example, cuDF today supports missing values arguably more sensibly than Pandas. Should cuDF have to revert to the old way of doing things just to match Pandas semantics? Dask Dataframe will probably need to be more flexible in order to handle evolution and small differences in semantics.

Alternatives

We could also write a new dask-dataframe-style project around cuDF that deviates from the Pandas/Dask Dataframe API. Until recently this has actually been the approach, and the dask-cudf project did exactly this. This was probably a good choice early on to get started and prototype things. The project was able to implement a wide range of functionality including groupby-aggregations, joins, and so on using dask delayed.

We’re redoing this now on top of dask dataframe though, which means that we’re losing some functionality that dask-cudf already had, but hopefully the functionality that we add now will be more stable and established on a firmer base.

Status Today

Today very little works, but what does is decently smooth.

Here is a simple example that reads some data from many CSV files, picks out a column, and does some aggregations.

from dask_cuda import LocalCUDACluster
import dask_cudf
from dask.distributed import Client

cluster = LocalCUDACluster()  # runs on eight local GPUs
client = Client(cluster)

gdf = dask_cudf.read_csv('data/nyc/many/*.csv')  # wrap around many CSV files

>>> gdf.passenger_count.sum().compute()
184464740

Also note, NYC Taxi ridership is significantly less than it was a few years ago.

What I’m excited about in the example above

All of the infrastructure surrounding the cuDF code, like the cluster setup, diagnostics, JupyterLab environment, and so on, came for free, like any other new Dask project.

Here is an image of my JupyterLab setup
Our df object is actually just a normal Dask Dataframe. We didn’t have to write new __repr__, __add__, or .sum() implementations, and probably many functions we didn’t think about work well today (though also many don’t).
We’re tightly integrated and more connected to other systems. For example, if we wanted to convert our dask-cudf-dataframe to a dask-pandas-dataframe then we would just use the cuDF to_pandas function:
```
df = df.map_partitions(cudf.DataFrame.to_pandas)
```
We don’t have to write anything special like a separate .to_dask_dataframe method or handle other special cases.

Dask parallelism is orthogonal to the choice of CPU or GPU.
It’s easy to switch hardware. By avoiding separate dask-cudf code paths it’s easier to add cuDF to an existing Dask+Pandas codebase to run on GPUs, or to remove cuDF and use Pandas if we want our code to be runnable without GPUs.

There are more examples of this in the scaling section below.

What’s wrong with the example above

In general the answer is many small things.

The cudf.read_csv function doesn’t yet support reading chunks from a single CSV file, and so doesn’t work well with very large CSV files. We had to split our large CSV files into many smaller CSV files first with normal Dask+Pandas:
```
import dask.dataframe as dd
(dd.read_csv('few-large/*.csv')
   .repartition(npartitions=100)
   .to_csv('many-small/*.csv', index=False))
```
(See rapidsai/cudf #568)
Many operations that used to work in dask-cudf like groupby-aggregations and joins no longer work. We’re going to need to slightly modify many cuDF APIs over the next couple of months to more closely match their Pandas equivalents.
I ran the timing cell twice because it currently takes a few seconds to import cudf. rapidsai/cudf #627
We had to make Dask Dataframe a bit more flexible and assume less about its constituent dataframes being exactly Pandas dataframes. (see dask/dask #4359 and dask/dask #4375 for examples). I suspect that there will be many more small changes like these necessary in the future.
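
For the CSV splitting step in the first item, the same idea can also be sketched with only the standard library. This is a sequential alternative offered for illustration, not what was used for this post; the function name `split_csv`, the output file naming, and the tiny demonstration file are all made up.

```python
import csv
import os
import tempfile


def split_csv(path, out_dir, rows_per_file):
    """Split one CSV file into several smaller ones, repeating the header."""
    out_paths = []
    out_file, writer, rows_in_file = None, None, 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            # Start a new output file at the beginning and whenever one fills up
            if writer is None or rows_in_file == rows_per_file:
                if out_file is not None:
                    out_file.close()
                out_path = os.path.join(out_dir, f"part-{len(out_paths):04d}.csv")
                out_file = open(out_path, "w", newline="")
                writer = csv.writer(out_file)
                writer.writerow(header)
                out_paths.append(out_path)
                rows_in_file = 0
            writer.writerow(row)
            rows_in_file += 1
    if out_file is not None:
        out_file.close()
    return out_paths


# Tiny demonstration: ten data rows split into files of at most four rows
tmp_dir = tempfile.mkdtemp()
big = os.path.join(tmp_dir, "big.csv")
with open(big, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "value"])
    w.writerows([i, i * 2] for i in range(10))

parts = split_csv(big, tmp_dir, rows_per_file=4)
print(len(parts), "files written")
```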

These problems are representative of dozens more similar issues. They are all fixable and indeed, many are actively being fixed today by the good folks working on RAPIDS.

Near Term Schedule

The RAPIDS group is currently busy working to release 0.5, which includes some of the fixes necessary to run the example above, and also many unrelated stability improvements. This will probably keep them busy for a week or two during which I don’t expect to see much Dask + cuDF work going on other than planning.

After that, Dask parallelism support will be a top priority, so I look forward to seeing some rapid progress here.

Scaling Results

In my last post about combining Dask Array with CuPy, a GPU-accelerated Numpy, we saw impressive speedups from using many GPUs on a simple problem that manipulated some simple random data.

Dask Array + CuPy on Random Data

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1 min 37s
Eight GPUs	19s

That exercise was easy to scale because it was almost entirely bound by the computation of creating random data.

Dask DataFrame + cuDF on CSV data

We did a similar study on the read_csv example above, which is bound mostly by reading CSV data from disk and then parsing it. You can see a notebook available here. We have similar (though less impressive) numbers to present.

Architecture	Time	Bandwidth
Single CPU Core	3min 14s	50 MB/s
Eight CPU Cores	58s	170 MB/s
Forty CPU Cores	35s	285 MB/s
One GPU	11s	900 MB/s
Eight GPUs	5s	2000 MB/s

The bandwidth numbers were computed by noting that the data was around 10 GB on disk.
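
The arithmetic is easy to reproduce. The sketch below takes the timings from the table and assumes 10 GB means 10,000 MB; the post only says “around 10 GB”, so the results are approximate.

```python
DATA_SIZE_MB = 10_000  # "around 10 GB on disk", assuming 1 GB = 1000 MB

# Timings copied from the table above, converted to seconds
timings_seconds = {
    "Single CPU Core": 3 * 60 + 14,
    "Eight CPU Cores": 58,
    "Forty CPU Cores": 35,
    "One GPU": 11,
    "Eight GPUs": 5,
}

bandwidth_mb_per_s = {
    name: DATA_SIZE_MB / seconds for name, seconds in timings_seconds.items()
}

for name, bandwidth in bandwidth_mb_per_s.items():
    print(f"{name:16s} {bandwidth:7.0f} MB/s")
```

The values land close to the bandwidth column in the table, which appears to be rounded.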

Analysis

First, I want to emphasize again that it’s easy to test a wide variety of architectures using this setup because of the Pandas API compatibility between all of the different projects. We’re seeing a wide range of performance (40x span) across a variety of different hardware with a wide range of cost points.

Second, note that this problem scales less well than our previous example with CuPy, both on CPU and GPU. I suspect that this is because this example is also bound by I/O and not just number-crunching. While the jump from single-CPU to single-GPU is large, the jump from single-CPU to many-CPU or single-GPU to many-GPU is not as large as we would have liked. For GPUs for example we got around a 2x speedup when we added 8x as many GPUs.
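
One way to put numbers on this is to compare each speedup with the factor by which the hardware grew. The sketch below uses the timings from the table; the helper name `speedup_and_efficiency` is made up, and “efficiency” here simply means speedup divided by the resource factor.

```python
def speedup_and_efficiency(base_seconds, new_seconds, resource_factor):
    """Return (speedup, speedup / resource_factor) for a scaling step."""
    speedup = base_seconds / new_seconds
    return speedup, speedup / resource_factor


# Timings from the table above, in seconds
single_cpu, forty_cpu, one_gpu, eight_gpu = 3 * 60 + 14, 35, 11, 5

cpu_speedup, cpu_efficiency = speedup_and_efficiency(single_cpu, forty_cpu, 40)
gpu_speedup, gpu_efficiency = speedup_and_efficiency(one_gpu, eight_gpu, 8)

print(f"1 -> 40 CPU cores: {cpu_speedup:.1f}x speedup, {cpu_efficiency:.0%} efficiency")
print(f"1 -> 8 GPUs:       {gpu_speedup:.1f}x speedup, {gpu_efficiency:.0%} efficiency")
print(f"1 CPU core -> 1 GPU: {single_cpu / one_gpu:.1f}x speedup")
```

Both many-CPU and many-GPU runs come out well under 30% efficiency by this measure.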

At first one might think that this is because we’re saturating disk read speeds. However two pieces of evidence go against that guess:

NVIDIA folks familiar with my current hardware inform me that they’re able to get much higher I/O throughput when they’re careful
The CPU scaling is similarly poor, despite the fact that it’s obviously not reaching full I/O bandwidth

Instead, it’s likely that we’re just not treating our disks and IO pipelines carefully.

We might consider working to think more carefully about data locality within a single machine. Alternatively, we might just choose to use a smaller machine, or many smaller machines. My team has been asking me to start playing with some systems cheaper than a DGX, and I may experiment with those soon. It may be that for data-loading and pre-processing workloads the previous wisdom of “pack as much computation as you can into a single box” no longer holds (without us doing more work that is).

Come help

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

Senior Library Software Engineer - RAPIDS

GPU Dask Arrays, first steps

2019-01-03T00:00:00+00:00

The following code creates and manipulates 2 TB of randomly generated data.

import dask.array as da

rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

On a single CPU, this computation takes two hours.

On an eight-GPU single-node system this computation takes nineteen seconds.

Actually this computation isn’t that impressive. It’s a simple workload, for which most of the time is spent creating and destroying random data. The computation and communication patterns are simple, reflecting the simplicity commonly found in data processing workloads.

What is impressive is that we were able to create a distributed parallel GPU array quickly by composing these four existing libraries:

CuPy provides a partial implementation of Numpy on the GPU.
Dask Array provides chunked algorithms on top of Numpy-like libraries like Numpy and CuPy.

This enables us to operate on more data than we could fit in memory by operating on that data in chunks.
The Dask distributed task scheduler runs those algorithms in parallel, easily coordinating work across many CPU cores.
Dask CUDA extends Dask distributed with GPU support.

These tools already exist. We had to connect them together with a small amount of glue code and minor modifications. By mashing these tools together we can quickly build and switch between different architectures to explore what is best for our application.

For this example we relied on the following changes upstream:

Comparison among single/multi CPU/GPU

We can now easily run some experiments on different architectures. This is easy because …

We can switch between CPU and GPU by switching between Numpy and CuPy.
We can switch between single/multi-CPU-core and single/multi-GPU by switching between Dask’s different task schedulers.

These libraries allow us to quickly judge the costs of this computation for the following hardware choices:

Single-threaded CPU
Multi-threaded CPU with 40 cores (80 H/T)
Single-GPU
Multi-GPU on a single machine with 8 GPUs

We present code for these four choices below, but first, we present a table of results.

Results

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1 min 37s
Eight GPUs	19s

Setup

import cupy
import dask.array as da

# generate chunked dask arrays of many numpy random arrays
rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

print(x.nbytes / 1e9)  # 2 TB
# 2000.0
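
The 2 TB figure and the chunk layout can be checked by hand. The sketch below assumes 8-byte float64 elements, which is NumPy’s default floating point type; the variable names are only for illustration.

```python
shape = (500_000, 500_000)
chunks = (10_000, 10_000)
itemsize = 8  # bytes per element, assuming float64

total_bytes = shape[0] * shape[1] * itemsize
chunks_per_axis = [s // c for s, c in zip(shape, chunks)]
n_chunks = chunks_per_axis[0] * chunks_per_axis[1]
bytes_per_chunk = chunks[0] * chunks[1] * itemsize

print(total_bytes / 1e12, "TB in total")
print(n_chunks, "chunks of", bytes_per_chunk / 1e6, "MB each")
```

So the array is made of 2,500 chunks of 800 MB each, and no single chunk needs more than a fraction of one device’s memory.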

CPU timing

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

Single GPU timing

We switch from CPU to GPU by changing our data source to generate CuPy arrays rather than NumPy arrays. Everything else should more or less work the same without special handling for CuPy.

(This actually isn’t true yet, many things in dask.array will break for non-NumPy arrays, but we’re working on it actively within Dask, within NumPy, and within the GPU array libraries. Regardless, everything in this example works fine.)

# generate chunked dask arrays of many cupy random arrays
rs = da.random.RandomState(RandomState=cupy.random.RandomState)  # <-- we specify cupy here
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')

Multi GPU timing

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

(x + 1)[::2, ::2].sum().compute()

And again, here are the results:

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1 min 37s
Eight GPUs	19s

First, this is my first time playing with a 40-core system. I was surprised to see that many cores. I was also pleased to see that Dask’s normal threaded scheduler happily saturates many cores.

Although later on it did dive down to around 5000-6000%, and if you do the math you’ll see that we’re not getting a 40x speedup. My guess is that performance would improve if we were to play with some mixture of threads and processes, like having ten processes with eight threads each.

The jump from the biggest multi-core CPU to a single GPU is still an order of magnitude though. The jump to multi-GPU is another order of magnitude, and brings the computation down to 19s, which is short enough that I’m willing to wait for it to finish before walking away from my computer.
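
For anyone who wants to do the math, the sketch below computes the step-by-step speedups from the timings in the table. Nothing here is measured anew; it is only arithmetic on the reported numbers.

```python
# Timings copied from the results table, converted to seconds
timings_seconds = {
    "Single CPU Core": 2 * 3600 + 39 * 60,  # 2hr 39min
    "Forty CPU Cores": 11 * 60 + 30,        # 11min 30s
    "One GPU": 1 * 60 + 37,                 # 1 min 37s
    "Eight GPUs": 19,                       # 19s
}

names = list(timings_seconds)
step_speedups = {
    f"{slower} -> {faster}": timings_seconds[slower] / timings_seconds[faster]
    for slower, faster in zip(names, names[1:])
}

for step, speedup in step_speedups.items():
    print(f"{step}: {speedup:.1f}x")

overall = timings_seconds["Single CPU Core"] / timings_seconds["Eight GPUs"]
print(f"Single CPU Core -> Eight GPUs: {overall:.0f}x")
```

Forty cores come out at roughly a 14x speedup over one core, which is the gap from 40x mentioned above.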

Actually, it’s quite fun to watch on the dashboard (especially after you’ve been waiting for three hours for the sequential solution to run):

Conclusion

This computation was simple, but the range in architecture just explored was extensive. We swapped out the underlying architecture from CPU to GPU (which had an entirely different codebase) and tried both multi-core CPU parallelism as well as multi-GPU many-core parallelism.

We did this in less than twenty lines of code, making this experiment something that an undergraduate student or other novice could perform at home. We’re approaching a point where experimenting with multi-GPU systems is accessible to non-experts (at least for array computing).

Here is a notebook for the experiment above

Room for improvement

We can work to expand the computation above in a variety of directions. There is a ton of work we still have to do to make this reliable.

Use more complex array computing workloads

The Dask Array algorithms were designed first around Numpy. We’ve only recently started making them more generic to other kinds of arrays (like GPU arrays, sparse arrays, and so on). As a result there are still many bugs when exploring these non-Numpy workloads.

For example, if you were to switch sum for mean in the computation above you would get an error because our mean computation contains an easy-to-fix bug that assumes Numpy arrays exactly.
Use Pandas and cuDF instead of Numpy and CuPy

The cuDF library aims to reimplement the Pandas API on the GPU, much like how CuPy reimplements the NumPy API. Using Dask DataFrame with cuDF will require some work on both sides, but is quite doable.

I believe that there is plenty of low-hanging fruit here.
Improve and move LocalCUDACluster

The LocalCUDACluster class used above is an experimental Cluster type that creates as many workers locally as you have GPUs, and assigns each worker to prefer a different GPU. This makes it easy for people to load balance across GPUs on a single-node system without thinking too much about it. This appears to be a common pain-point in the ecosystem today.

However, the LocalCUDACluster probably shouldn’t live in the dask/distributed repository (it seems too CUDA specific) so will probably move to some dask-cuda repository. Additionally there are still many questions about how to handle concurrency on top of GPUs, balancing between CPU cores and GPU cores, and so on.
Multi-node computation

There’s no reason that we couldn’t accelerate computations like these further by using multiple multi-GPU nodes. This is doable today with manual setup, but we should also improve the existing deployment solutions dask-kubernetes, dask-yarn, and dask-jobqueue, to make this easier for non-experts who want to use a cluster of multi-GPU resources.
Expense

The machine I ran this on is expensive. Well, it’s nowhere close to as expensive to own and operate as a traditional cluster that you would need for these kinds of results, but it’s still well beyond the price point of a hobbyist or student.

It would be useful to run this on a more budget system to get a sense of the tradeoffs on more reasonably priced systems. I should probably also learn more about provisioning GPUs on the cloud.
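
Returning to the LocalCUDACluster item above, which says that each worker is assigned to prefer a different GPU: one way such a preference could be expressed is to give each worker its own ordering of the `CUDA_VISIBLE_DEVICES` environment variable, so that a different device comes first for each process. This is offered as an assumption for illustration, not as a description of the actual implementation, and the helper name `visible_devices_for_worker` is made up.

```python
def visible_devices_for_worker(worker_index: int, n_gpus: int) -> str:
    """Rotate the device list so that worker i sees GPU i first (illustrative)."""
    order = [(worker_index + offset) % n_gpus for offset in range(n_gpus)]
    return ",".join(str(device) for device in order)


# One value per worker, suitable for that worker's CUDA_VISIBLE_DEVICES
for worker_index in range(4):
    print(worker_index, visible_devices_for_worker(worker_index, 4))
```

Every worker can still see all devices under this scheme, but each one defaults to its own.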

Come help!

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high impact work to do.

If you’re interested in being paid to focus more on these topics, then consider applying for a job. The NVIDIA corporation is hiring around the use of Dask with GPUs.

Senior Library Software Engineer - RAPIDS

That’s a fairly generic posting. If you’re interested but the posting doesn’t seem to fit, then please apply anyway and we’ll tweak things.