Dask Working Notes - Posted in 2018

Dask Version 1.0

2018-11-29T00:00:00+00:00

We are pleased to announce the release of Dask version 1.0.0!

Usually in release blogposts we outline important features and changes since the last major version. Because of the 1.0 version number, this post will be a bit different. Instead we’ll talk about what this version number means to us, and discuss the broader context of Dask projects more generally.

Version 1.0 software means different things to different groups. In some communities it might mean …

The first version of a package
When a package is first ready for production use
When a package has reached API stability
…

It is common in the PyData ecosystem to wait a long time before releasing a version 1.0. For example, neither Pandas nor Scikit-Learn, arguably two of the most widely used PyData packages in production, has yet declared a 1.0 version number (today they are at versions 0.23 and 0.20 respectively). And yet each package is widely used in production by organizations that demand high degrees of stability.

Dask is not as API-stable as Pandas or Scikit-Learn, but it’s pretty close. The project rarely invents new APIs, instead preferring to implement pre-existing APIs (like the NumPy/Pandas/Scikit-Learn APIs) or standard language protocols (like async-await, concurrent.futures, Queues, Locks, and so on). Additionally, Dask is well used in production today across sectors ranging from risk-tolerant industries like startups and quantitative finance shops, to risk-averse institutions like banks, large enterprises, and governments.
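As a small illustration of what reusing a standard protocol looks like, here is the standard-library concurrent.futures pattern. This sketch runs on a local thread pool rather than a Dask cluster; Dask's distributed Client offers similarly named submit and map methods, and its futures also have a result method. The square function is a made-up toy task.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Toy task used only for illustration."""
    return x * x


# Standard-library executor; a Dask Client follows the same submit/result style.
with ThreadPoolExecutor(max_workers=4) as executor:
    future = executor.submit(square, 7)           # schedule one call
    single = future.result()                      # block until it finishes
    many = list(executor.map(square, range(5)))   # schedule many calls

print(single, many)
```

Code written against this interface tends to need few changes when the executor is swapped for a distributed one, which is part of why the project prefers existing protocols to new ones.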

When we say that Dask has reached 1.0 we mean that it is ready to be used in production. We are late in saying this. This happened a long time ago.

Development will continue as before

Dask is living software that exists in a rapidly evolving space. Nothing is changing about our internal stability practices. We will continue to add new features, deprecate old ones, and fix bugs with the same policies. We always try to minimize negative effects on users when making these internal changes while maximizing the speed at which we can deliver new bugfixes and features. This is hard and requires care, but we believe that we’ve done this decently in the past so hopefully you haven’t noticed much. We will continue to operate the same way into the future.

The 1.0 version change does not affect our development cycle. There are no LTS versions beyond what we already provide.

Different Dask packages move at different speeds

Dask is able to evolve and experiment rapidly while maintaining a stable core because it is split into sub-packages, each of which evolves independently, has its own maintainers, its own versions, and its own release cycle. Some Dask subprojects have had versions above 1.0 for a long time, while others are still unstable.

Dask’s version number is hard to define today because it is composed of so many independent efforts by different groups. This is similar to the situation in Jupyter, or in the Numeric Python ecosystem itself.

Thanks

Finally, we’re grateful to everyone who has contributed to the project over the years, whether through code, reviews, documentation, discussion, bug reports, well written questions and answers, visual designs, or well wishes. This means a lot to us.

Today there are dozens of dask-* packages on PyPI that support thousands of users and several more that incorporate Dask for parallelism. We’re thankful to play a role in such a vibrant community.

Dask-jobqueue

2018-10-08T00:00:00+00:00

This work was done in collaboration with Matthew Rocklin (Anaconda), Jim Edwards (NCAR), Guillaume Eynard-Bontemps (CNES), and Loïc Estève (INRIA), and is supported, in part, by the US National Science Foundation Earth Cube program. The dask-jobqueue package is a spinoff of the Pangeo Project. This blogpost was previously published here

TLDR; Dask-jobqueue allows you to seamlessly deploy dask on HPC clusters that use a variety of job queuing systems such as PBS, Slurm, SGE, or LSF. Dask-jobqueue provides a Pythonic user interface that manages dask workers/clusters through the submission, execution, and deletion of individual jobs on an HPC system. It gives users the ability to interactively scale workloads across large HPC systems, turning an interactive Jupyter Notebook into a powerful tool for scalable computation on very large datasets.

Install with:

conda install -c conda-forge dask-jobqueue

pip install dask-jobqueue

And check out the dask-jobqueue documentation: http://jobqueue.dask.org

Large high-performance computer (HPC) clusters are ubiquitous throughout the computational sciences. These HPC systems offer powerful hardware, including many large compute nodes, high-speed interconnects, and parallel file systems. One such system that we use at NCAR is Cheyenne, a fairly large machine with about 150k cores and over 300 TB of total memory.

Cheyenne is a 5.34-petaflops, high-performance computer operated by NCAR.

These systems frequently use a job queueing system, such as PBS, Slurm, or SGE, to manage the queueing and execution of many concurrent jobs from numerous users. A “job” is a single execution of a program that is to be run on some set of resources on the user’s HPC system. These jobs are often submitted via the command line:

qsub do_thing_a.sh

Where do_thing_a.sh is a shell script that might look something like this:

#!/bin/bash
#PBS -N thing_a
#PBS -q premium
#PBS -A 123456789
#PBS -l select=1:ncpus=36:mem=109G

echo "doing thing A"

In this example “-N” specifies the name of this job, “-q” specifies the queue where the job should be run, “-A” specifies a project code to bill for the CPU time used while the job is run, and “-l” specifies the hardware specifications for this job. Each job queueing system has slightly different syntax for configuring and submitting these jobs.
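To make the mapping between settings and directives concrete, here is a minimal sketch that renders the same script from Python. The function name render_pbs_script is hypothetical and exists only for this illustration; other queueing systems would need different directive prefixes and flags.

```python
def render_pbs_script(name, queue, project, resource_spec, command):
    """Build the text of a PBS job script from its settings."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",            # job name
        f"#PBS -q {queue}",           # destination queue
        f"#PBS -A {project}",         # project code billed for CPU time
        f"#PBS -l {resource_spec}",   # hardware request
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = render_pbs_script(
    name="thing_a",
    queue="premium",
    project="123456789",
    resource_spec="select=1:ncpus=36:mem=109G",
    command='echo "doing thing A"',
)
print(script)
```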

This interface has led to the development of a few common workflow patterns:

MPI if you want to scale. MPI stands for the Message Passing Interface. It is a widely adopted interface allowing parallel computation across traditional HPC clusters. Many large computational models are written in languages like C and Fortran and use MPI to manage their parallel execution. For the old-timers out there, this is the go-to solution when it comes time to scale complex computations.
Batch it. It is quite common for scientific processing pipelines to include a few steps that can be easily parallelized by submitting multiple jobs in parallel. Maybe you want to run “do_thing_a.sh” 500 times with slightly different inputs. Easy: just submit all the jobs separately (or in what some queueing systems refer to as an “array job”).
Serial is still okay. Computers are pretty fast these days, right? Maybe you don’t need to parallelize your program at all. Okay, so keep it serial and get some coffee while your job is running.
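The “batch it” pattern can be sketched as a loop that builds one submission command per input. This is only an illustration: it assumes PBS's -v option for passing an environment variable into a job, the variable name INPUT_FILE and the input file names are hypothetical, and the commands are built but never executed.

```python
import shlex


def build_batch_commands(script, inputs):
    """Return one qsub command line per input file (nothing is executed)."""
    commands = []
    for path in inputs:
        # -v passes an environment variable into the job; INPUT_FILE is a
        # hypothetical name that the job script would have to read.
        args = ["qsub", "-v", f"INPUT_FILE={path}", script]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical input names, one per job.
inputs = [f"input_{i:03d}.nc" for i in range(500)]
commands = build_batch_commands("do_thing_a.sh", inputs)
print(len(commands))
print(commands[0])
```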

The Problem

None of the workflow patterns listed above allow for interactive analysis of very large datasets. When I’m prototyping a new processing method, I often want to work interactively, say in a Jupyter Notebook. Writing MPI code on the fly is hard and expensive, batch jobs are inherently not interactive, and serial just won’t do when I start working on many TBs of data. Our experience is that these workflows tend to be fairly inelegant and difficult to transfer between applications, yielding lots of duplicated effort along the way.

One of the aims of the Pangeo project is to facilitate interactive data analysis on very large datasets. Pangeo leverages Jupyter and dask, along with a number of more domain specific packages like xarray, to make this possible. The problem is we didn’t have a particularly palatable method for deploying dask on our HPC clusters.

The System

Jupyter Notebooks are web applications that support interactive code execution, display of figures and animations, and in-line explanatory text and equations. They are quickly becoming the standard open-source format for interactive computing in Python.
Dask is a library for parallel computing that coordinates well with Python’s existing scientific software ecosystem, including libraries like NumPy, Pandas, Scikit-Learn, and xarray. In many cases, it offers users the ability to take existing workflows and quickly scale them to much larger applications. Dask-distributed is an extension of dask that facilitates parallel execution across many computers.
Dask-jobqueue is a new Python package that we’ve built to facilitate the deployment of dask on HPC clusters and interfacing with a number of job queuing systems. Its usage is concise and Pythonic.

from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(cores=36,
                     memory="108GB",
                     queue="premium")
cluster.scale(10)
client = Client(cluster)

What’s happening under the hood?

In the call to PBSCluster() we are telling dask-jobqueue how we want to configure each job. In this case, we set each job to have 1 Worker, each using 36 cores (threads) and 108 GB of memory. We also tell the PBS queueing system we’d like to submit this job to the “premium” queue. This step also starts a Scheduler to manage workers that we’ll add later.
It is not until we call the cluster.scale() method that we interact with the PBS system. Here we start 10 workers, or equivalently 10 PBS jobs. For each job, dask-jobqueue creates a shell command similar to the one above (except dask-worker is called instead of echo) and submits the job via a subprocess call.
Finally, we connect to the cluster by instantiating the Client class. From here, the rest of our code looks just as it would if we were using one of dask’s local schedulers.
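The three steps above can be sketched with a toy class. This is not dask-jobqueue's implementation: ToyPBSCluster, the scheduler address, and the exact worker command line are assumptions made for illustration, and the example swaps in a fake submit function so it runs without a queueing system.

```python
import subprocess
import tempfile


class ToyPBSCluster:
    """Illustrative stand-in for a job-queue cluster; not dask-jobqueue's code."""

    def __init__(self, cores, memory, queue, scheduler_address):
        self.cores = cores
        self.memory = memory
        self.queue = queue
        self.scheduler_address = scheduler_address
        self.job_ids = []

    def job_script(self):
        # One job runs one worker command; the flags shown are assumptions.
        worker = (
            f"dask-worker {self.scheduler_address} "
            f"--nthreads {self.cores} --memory-limit {self.memory}"
        )
        return "\n".join(["#!/bin/bash", f"#PBS -q {self.queue}", "", worker, ""])

    def scale(self, n, submit=None):
        """Submit jobs until n are tracked; `submit` can be swapped for testing."""
        submit = submit or self._qsub
        while len(self.job_ids) < n:
            self.job_ids.append(submit(self.job_script()))

    @staticmethod
    def _qsub(script_text):
        # Write the script to a temporary file and hand it to qsub.
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(script_text)
        out = subprocess.run(
            ["qsub", f.name], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()


# Dry run with a fake submit function so no queueing system is needed.
cluster = ToyPBSCluster(36, "108GB", "premium", "tcp://10.0.0.1:8786")
cluster.scale(10, submit=lambda script: f"fake-job-{len(cluster.job_ids)}")
print(len(cluster.job_ids))
print(cluster.job_script())
```

Nothing touches the queueing system until scale is called, which mirrors the behaviour described above: constructing the object only records how each job should be configured.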

Dask-jobqueue is easily customizable to help users capitalize on advanced HPC features. A more complicated example that would work on NCAR’s Cheyenne supercomputer is:

cluster = PBSCluster(cores=36,
                     processes=18,
                     memory="108GB",
                     project='P48500028',
                     queue='premium',
                     resource_spec='select=1:ncpus=36:mem=109G',
                     walltime='02:00:00',
                     interface='ib0',
                     local_directory='$TMPDIR')

In this example, we instruct the PBSCluster to 1) use up to 36 cores per job, 2) use 18 worker processes per job, 3) use the large memory nodes with 109 GB each, 4) use a longer walltime than is standard, 5) use the InfiniBand network interface (ib0), and 6) use the fast SSD disks as its local directory space.

Finally, Dask offers the ability to “autoscale” clusters based on a set of heuristics. When the cluster needs more CPU or memory, it will scale up. When the cluster has unused resources, it will scale down. Dask-jobqueue supports this with a simple interface:

cluster.adapt(minimum=18, maximum=360)

In this example, we tell our cluster to autoscale between 18 and 360 workers (or 1 and 20 jobs).
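The numbers in these two examples follow from simple arithmetic. The sketch below assumes that a job's cores and memory are divided evenly among its worker processes; the helper names are made up for this illustration.

```python
def per_process_resources(cores, processes, memory_gb):
    """Split one job's cores and memory evenly across its worker processes."""
    threads = cores // processes
    memory_each = memory_gb / processes
    return threads, memory_each


def workers_to_jobs(workers, processes_per_job):
    """Number of jobs needed to host a given number of worker processes."""
    return -(-workers // processes_per_job)  # ceiling division


# 36 cores and 108 GB shared by 18 processes in each job.
threads, memory_each = per_process_resources(cores=36, processes=18, memory_gb=108)
print(threads, memory_each)

# adapt(minimum=18, maximum=360) expressed in jobs of 18 workers each.
print(workers_to_jobs(18, 18), workers_to_jobs(360, 18))
```

With 18 processes per job, each worker process gets 2 threads and 6 GB of memory, and the 18 to 360 worker range corresponds to 1 to 20 PBS jobs.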

Demonstration

We have put together a fairly comprehensive screencast that walks users through all the steps of setting up Jupyter and Dask (and dask-jobqueue) on an HPC cluster:

Conclusions

Dask-jobqueue makes it much easier to deploy Dask on HPC clusters. The package provides a Pythonic interface to common job-queueing systems. It is also easily customizable.

The autoscaling functionality allows for a fundamentally different way to do science on HPC clusters. Start your Jupyter Notebook, instantiate your dask cluster, and then do science — let dask determine when to scale up and down depending on the computational demand. We think this bursting approach to interactive parallel computing offers many benefits.

Finally, in developing dask-jobqueue, we’ve run into a few challenges that are worth mentioning.

Queueing systems are highly customizable. System administrators seem to have a lot of control over their particular implementation of each queueing system. In practice, this means that it is often difficult to simultaneously cover all permutations of a particular queueing system. We’ve generally found that things seem to be flexible enough and welcome feedback in the cases where they are not.
CI testing has required a fair bit of work to set up. The target environment for using dask-jobqueue is on existing HPC clusters. In order to facilitate continuous integration testing of dask-jobqueue, we’ve had to configure multiple queueing systems (PBS, Slurm, SGE) to run in docker using Travis CI. This has been a laborious task and one we’re still working on.
We’ve built dask-jobqueue to operate in the dask-deploy framework. If you are familiar with dask-kubernetes or dask-yarn, you’ll recognize the basic syntax in dask-jobqueue as well. The coincident development of these dask deployment packages has recently brought up some important coordination discussions (e.g. dask/distributed#2235).

Refactor Documentation

2018-09-27T00:00:00+00:00

This work is supported by Anaconda Inc

We recently changed how we organize and connect Dask’s documentation. Our approach may prove useful for other umbrella projects that spread documentation across many different builds and sites.

Dask splits documentation into many pages

Dask’s documentation is split into several different websites, each managed by a different team for a different sub-project:

dask.pydata.org : Main site
distributed.readthedocs.org : Distributed scheduler
dask-ml.readthedocs.io : Dask for machine learning
dask-kubernetes.readthedocs.io : Dask on Kubernetes
dask-jobqueue.readthedocs.io : Dask on HPC systems
dask-yarn.readthedocs.io : Dask on Hadoop systems
dask-examples.readthedocs.io : Examples that use Dask
matthewrocklin.com/blog, jcrist.github.io, tomaugspurger.github.io, martindurant.github.io/blog : Developers’ personal blogs

This split in documentation matches the split in development teams. Each sub-project’s team manages its own docs in its own way. They release at their own pace and make their own decisions about technology. This makes it much more likely that developers maintain the documentation as they develop and change software libraries.

We make it easy to write documentation. This choice causes many different documentation systems to emerge.

This approach is common. A web search for Jupyter Documentation yields the following list:

Different teams developing semi-independently create different web pages. This is inevitable. Asking a large distributed team to coordinate on a single cohesive website adds substantial friction, which results in worse documentation coverage.

Problem

However, while using separate websites results in excellent coverage, it also fragments the documentation. This makes it harder for users to smoothly navigate between sites and discover appropriate content.

Monolithic documentation is good for readers, modular documentation is good for writers.

Our Solutions

Over the last month we took steps to connect our documentation and make it more cohesive, while still enabling independent development. This post outlines the following steps:

Organize under a single domain, dask.org
Develop a sphinx template project for uniform style
Include a cross-project navbar in addition to the within-project table-of-contents

We did some other things along the way that we find useful, but are probably more specific to just Dask.

We moved this blog to blog.dask.org
We improved our example notebooks to host both a static site and also a live Binder

1: Organize under a single domain, dask.org

Previously we had some documentation under readthedocs, some under the dask.pydata.org subdomain (thanks NumFOCUS!) and some pages on personal websites, like matthewrocklin.com/blog.

While looking for a new dask domain to host all of our content we noticed that dask.org redirected to anaconda.org, and were pleased to learn that someone at Anaconda Inc had the foresight to register the domain early on.

Anaconda was happy to transfer ownership of the domain to NumFOCUS, who helps us to maintain it now. Now all of our documentation is available under that single domain as subdomains:

This uniformity means that the thing you want is probably at that-thing.dask.org, which is a bit easier to guess than otherwise.

Many thanks to Andy Terrel and Tom Augspurger for managing this move, and to Anaconda for generously donating the domain.

2: Cross-project Navigation Bar

We wanted a way for readers to quickly discover the other sites that were available to them. All of our sites have side-navigation-bars to help readers navigate within a particular site, but now they also have a top-navigation-bar to help them navigate between projects.

This navigation bar is managed independently from all of the documentation projects, in our new Sphinx theme.

3: Dask Sphinx Theme

To give a uniform sense of style we developed our own Sphinx HTML theme. This inherits from ReadTheDocs’ theme, but with changed styling to match Dask color and visual style. We publish this theme as a package on PyPI that all of our projects’ Sphinx builds can import and use if they want. We can change style in this one package and publish to PyPI and all of the projects will pick up those changes on their next build without having to copy stylesheets around to different repositories.

This allows several different projects to evolve content (which they care about) and build process separately from style (which they typically don’t care as much about). We have a single style sheet that gets used everywhere easily.
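A project opts into a shared theme from its Sphinx conf.py. The fragment below is a sketch, assuming the installed theme package registers a Sphinx theme under the name dask_sphinx_theme; the theme's own README is the authority on the exact setup.

```python
# conf.py (fragment)
# Assumption: the theme package is installed and exposes a Sphinx theme
# named "dask_sphinx_theme"; older setups may also need html_theme_path.
html_theme = "dask_sphinx_theme"
```

Because the style lives in the package rather than in each repository, a project picks up style changes by rebuilding against the newest published version.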

4: Move Dask Blogging to blog.dask.org

Previously most announcements about Dask were written and published from one of the maintainers’ personal blogs. This split information about the project and made it hard for people to discover good content. There also wasn’t a good way for a community member to suggest a blog for distribution to the general community, other than by starting their own.

Now we have an official blog at blog.dask.org which serves files submitted to github.com/dask/dask-blog. These posts are simple markdown files that should be easy for people to generate. For example the source for this post is available at github.com/dask/dask-blog/blob/gh-pages/_posts/2018-09-27-docs-refactor.md

We encourage community members to share posts about work they’ve done with Dask by submitting pull requests to that repository.
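The path above suggests a date-prefixed naming convention for post files. The helper below is a small sketch that builds such a path; post_filename is a hypothetical name, and the slug rule (lowercase, hyphen-separated) is an assumption inferred from that single example.

```python
import datetime
import re


def post_filename(date, title):
    """Build a date-prefixed post path such as _posts/2018-09-27-docs-refactor.md."""
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"_posts/{date.isoformat()}-{slug}.md"


print(post_filename(datetime.date(2018, 9, 27), "Docs Refactor"))
```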

5: Host Examples as both static HTML and live Binder sessions

The Dask community maintains a set of example notebooks that show people how to use Dask in a variety of ways. These notebooks live at github.com/dask/dask-examples and are easy for users to download and run.

To get more value from these notebooks we now expose them in two additional ways:

As static HTML at examples.dask.org, rendered with the nbsphinx plugin.

Seeing them statically rendered and being able to quickly navigate between them really increases the pleasure of exploring them. We hope that this encourages users to explore more broadly.
As live-runnable notebooks on the cloud using mybinder.org. You can play with any of these notebooks by clicking the Binder launch button.

This allows people to explore more deeply. Also, because we’ve connected up the Dask JupyterLab extension to this environment, users get an immediate instinctual experience of what parallel computing feels like (if you haven’t used the dask dashboard during computation you really should give that link a try).

Now that these examples get much more exposure we hope that this encourages community members to submit new examples. We hope that by providing infrastructure more content creators will come as well.

We also encourage other projects to take a look at what we’ve done in github.com/dask/dask-examples. We think that this model might be broadly useful across other projects.

Conclusion

Thank you for reading. We hope that this post pushes readers to re-explore Dask’s documentation, and that it pushes developers to consider some of the approaches above for their own projects.

Dask Development Log

2018-09-17T00:00:00+00:00

This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Since the last update in the 0.19.0 release blogpost two weeks ago we’ve seen activity in the following areas:

Update Dask examples to use JupyterLab on Binder
Render Dask examples into static HTML pages for easier viewing
Consolidate and unify disparate documentation
Retire the hdfs3 library in favor of the solution in Apache Arrow.
Continue work on hyper-parameter selection for incrementally trained models
Publish two small bugfix releases
Blogpost from the Pangeo community about combining Binder with Dask
Skein/Yarn Update

1: Update Dask examples to use JupyterLab on Binder

The new dask-labextension embeds Dask’s dashboard plots into a JupyterLab session so that you can get easy access to information about your computations from Jupyter directly. This was released a few weeks ago as part of the previous release post.

However since then we’ve hooked this up to our live examples system that lets users try out Dask on a small cloud instance using mybinder.org. If you want to try out Dask and JupyterLab together then head here:

Thanks to Ian Rose for managing this.

2: Render Dask Examples as static documentation

Using the nbsphinx Sphinx extension to automatically run and render Jupyter Notebooks we’ve turned our live examples repository into static documentation for easy viewing.
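For projects that want to try the same approach, enabling nbsphinx is a small change to the Sphinx conf.py. This fragment is a sketch of a typical setup, not a copy of the configuration used for the Dask examples.

```python
# conf.py (fragment)
# Enable the nbsphinx extension so notebooks are rendered as pages.
extensions = ["nbsphinx"]

# Re-run every notebook during the build; other accepted values are
# "auto" (only notebooks without stored outputs) and "never".
nbsphinx_execute = "always"
```

Running the notebooks at build time means the rendered pages also act as a check that the examples still execute.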

These examples are currently available at https://dask.org/dask-examples/ but will soon be available at examples.dask.org and from the navbar at all dask pages.

Thanks to Tom Augspurger for putting this together.

3: Consolidate documentation under a single org and style

Dask documentation is currently spread out in many small hosted sites, each associated with a particular subpackage like dask-ml, dask-kubernetes, dask-distributed, etc. This eases development (developers are encouraged to modify documentation as they modify code) but results in a fragmented experience because users don’t know how to discover and efficiently explore our full documentation.

To resolve this we’re doing two things:

Moving all sites under the dask.org domain

Anaconda Inc, the company that employs several of the Dask developers (myself included) recently donated the domain dask.org to NumFOCUS. We’ve been slowly moving over all of our independent sites to use that location for our documentation.
Develop a uniform Sphinx theme dask-sphinx-theme

This has both uniform styling and also includes a navbar that gets automatically shared between the projects. The navbar makes it easy to discover and explore content and is something that we can keep up-to-date in a single repository.

You can see how this works by going to any of the Dask sites, like docs.dask.org.

Thanks to Tom Augspurger for managing this work and Andy Terrel for patiently handling things on the NumFOCUS side and domain name side.

4: Retire the hdfs3 library

For years the Dask community has maintained the hdfs3 library that allows for native access to the Hadoop file system from Python. This used Pivotal’s libhdfs3 library written in C++ and was, for a long while, the only mature and performant way to manipulate HDFS from Python.

Since then though PyArrow has developed efficient bindings to the standard libhdfs library and exposed it through their Pythonic file system interface, which is fortunately Dask-compatible.

We’ve been telling people to use the Arrow solution for a while now and thought we’d now do so officially (see dask/hdfs3 #170). As of the last bugfix release Dask will use Arrow by default and, while the hdfs3 library is still available, Dask maintainers probably won’t spend much time on it in the future.

Thanks to Martin Durant for building and maintaining hdfs3 over all this time.

5: Hyper-parameter selection for incrementally trained models

In Dask-ML we continue to work on hyper-parameter selection for models that implement the partial_fit API. We’ve built algorithms and infrastructure to handle this well, and are currently fine-tuning the API, parameter names, etc.

If you have any interest in this process, come on over to dask/dask-ml #356.
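To give a feel for the problem, here is a toy sketch of selecting a hyper-parameter for a model that is trained incrementally. It is not Dask-ML's algorithm or API: ToyModel, select, and the keep-the-better-half rule are assumptions chosen to keep the example short, and the only thing borrowed from the real interface is the partial_fit method name.

```python
class ToyModel:
    """Fits y = w * x by gradient steps; `lr` is the hyper-parameter."""

    def __init__(self, lr):
        self.lr = lr
        self.w = 0.0

    def partial_fit(self, xs, ys):
        # One gradient step on the mean squared error of this batch.
        grad = sum(2 * (self.w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        self.w -= self.lr * grad
        return self

    def loss(self, xs, ys):
        return sum((self.w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select(models, batches, validation):
    """After each batch, keep the better half of the surviving models."""
    alive = list(models)
    for xs, ys in batches:
        for m in alive:
            m.partial_fit(xs, ys)
        alive.sort(key=lambda m: m.loss(*validation))
        alive = alive[: max(1, len(alive) // 2)]
    return alive[0]


xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]  # generated with w = 3
batches = [(xs, ys)] * 4
best = select([ToyModel(lr) for lr in (0.001, 0.01, 0.1, 0.5)], batches, (xs, ys))
print(best.lr, round(best.w, 3))
```

Because poor candidates are dropped after seeing only part of the data, most of the training effort goes to the promising ones, which is the appeal of combining incremental training with hyper-parameter search.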

Thanks to Tom Augspurger and Scott Sievert for this work.

6: Two small bugfix releases

We’ve been trying to increase the frequency of bugfix releases while things are stable. Since our last writing there have been two minor bugfix releases. You can read more about them here:

7: Binder + Dask

The Pangeo community has done work to integrate Binder with Dask and has written about the process here: Pangeo meets Binder

Thanks to Joe Hamman for this work and the blogpost.

8: Skein/Yarn Update

The Dask-Yarn connection to deploy Dask on Hadoop clusters uses a library, Skein, to easily manage Yarn jobs from Python.

Skein has seen a lot of activity over the last few weeks, including the following:

A Web UI for the project. See jcrist/skein #68
A Tensorflow on Yarn project from Criteo that uses Skein. See github.com/criteo/tf-yarn

This work is mostly managed by Jim Crist and other Skein contributors.

Dask Release 0.19.0

2018-09-05T00:00:00+00:00

This work is supported by Anaconda Inc.

I’m pleased to announce the release of Dask version 0.19.0. This is a major release with bug fixes and new features. The last release was 0.18.2 on July 23rd. This blogpost outlines notable changes since the last release blogpost for 0.18.0 on June 14th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

A ton of work has happened over the past two months, but most of the changes are small and diffuse. Stability, feature parity with upstream libraries (like NumPy and Pandas), and performance have all significantly improved, but in ways that are difficult to condense into blogpost form.

That being said, here are a few of the more exciting changes in the new release.

Python Versions

We’ve dropped official support for Python 3.4 and added official support for Python 3.7.

Deploy on Hadoop Clusters

Over the past few months Jim Crist has built a suite of tools to deploy applications on YARN, the primary cluster manager used in Hadoop clusters.

Conda-pack: packs up Conda environments for redistribution to distributed clusters, especially when Python or Conda may not be present.
Skein: easily launches and manages YARN applications from non-JVM systems
Dask-Yarn: a thin library around Skein to launch and manage Dask clusters

Jim has written about Skein and Dask-Yarn in two recent blogposts:

Implement Actors

For workloads that need to directly manage and mutate state on workers, we’ve added an experimental Actors framework to Dask alongside the standard task-scheduling system. This provides reduced latencies, removes scheduling overhead, and provides the ability to directly mutate state on a worker, but loses niceties like resilience and diagnostics. The idea to adopt Actors was shamelessly stolen from the Ray Project :)

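from dask.distributed import Client

client = Client(processes=False)  # local in-process cluster for illustration

class Counter:
    """A stateful object that will live on a single worker."""

    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

# Create the actor on a worker and get back a proxy to it
counter = client.submit(Counter, actor=True).result()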

>>> future = counter.increment()
>>> future.result()
1

You can read more about actors in the Actors documentation.

Dashboard improvements

The Dask dashboard is a critical tool to understand distributed performance. There are a few accessibility issues that trip up beginning users that we’ve addressed in this release.

Save task stream plots

You can now save a task stream record by wrapping a computation in the get_task_stream context manager.

from dask.distributed import Client, get_task_stream
client = Client(processes=False)

import dask
df = dask.datasets.timeseries()

with get_task_stream(plot='save', filename='my-task-stream.html') as ts:
    df.x.std().compute()

>>> ts.data
[{'key': "('make-timeseries-edc372a35b317f328bf2bb5e636ae038', 0)",
  'nbytes': 8175440,
  'startstops': [('compute', 1535661384.2876947, 1535661384.3366017)],
  'status': 'OK',
  'thread': 139754603898624,
  'worker': 'inproc://192.168.50.100/15417/2'},

  ...

This gives you the start and stop time of every task on every worker done during that time. It also saves that data as an HTML file that you can share with others. This is very valuable for communicating performance issues within a team. I typically upload the HTML file as a gist and then share it with rawgit.com

$ gist my-task-stream.html
https://gist.github.com/f48a121bf03c869ec586a036296ece1a
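Once you have the records from ts.data, they are plain Python dictionaries and easy to summarize. The sketch below assumes the record shape shown in the output above, where each startstops entry is an (action, start, stop) tuple; the helper name seconds_per_worker is made up for this illustration.

```python
from collections import defaultdict

# One record in the shape shown above; values copied from the example output.
records = [
    {
        "key": "('make-timeseries-edc372a35b317f328bf2bb5e636ae038', 0)",
        "nbytes": 8175440,
        "startstops": [("compute", 1535661384.2876947, 1535661384.3366017)],
        "status": "OK",
        "thread": 139754603898624,
        "worker": "inproc://192.168.50.100/15417/2",
    },
]


def seconds_per_worker(records):
    """Sum (stop - start) over every startstops entry, grouped by worker."""
    totals = defaultdict(float)
    for record in records:
        for _action, start, stop in record["startstops"]:
            totals[record["worker"]] += stop - start
    return dict(totals)


totals = seconds_per_worker(records)
print(totals)
```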

Robust to different screen sizes

The Dashboard’s layout was designed to be used on a single screen, side-by-side with a Jupyter notebook. This is how many Dask developers operate when working on a laptop. However, it is not how many users operate, for one of two reasons:

They are working in an office setting where they have several screens
They are new to Dask and uncomfortable splitting their screen into two halves

In these cases the styling of the dashboard becomes odd. Fortunately, Luke Canavan and Derek Ludwig recently improved the CSS for the dashboard considerably, allowing it to switch between narrow and wide screens. Here is a snapshot.

JupyterLab Extension

You can now embed Dashboard panes directly within JupyterLab using the newly updated dask-labextension.

jupyter labextension install dask-labextension

This allows you to lay out your own dashboard directly within JupyterLab. You can combine plots from different pages, control their sizing, and so on. You will need to provide the address of the dashboard server (http://localhost:8787 by default on local machines) but after that everything should persist between sessions. Now when I open up JupyterLab and start up a Dask Client, I get this:

Thanks to Ian Rose for doing most of the work here.

Outreach

Dask Stories

People who use Dask have been writing about their experiences at Dask Stories. In the last couple months the following people have written about and contributed their experience:

These stories help people understand where Dask is and is not applicable, and provide useful context around how it gets used in practice. We welcome further contributions to this project. It’s very valuable to the broader community.

Dask Examples

The Dask-Examples repository maintains easy-to-run examples using Dask on a small machine, suitable for an entry-level laptop or for a small cloud instance. These are hosted on mybinder.org and are integrated into our documentation. A number of new examples have arisen recently, particularly in machine learning. We encourage people to try them out by clicking the link below.

Other Projects

The dask-image project was recently released. It includes a number of image processing routines around dask arrays.

This project is mostly maintained by John Kirkham.
Dask-ML saw a recent bugfix release
The TPOT library for automated machine learning recently published a new release that adds Dask support to parallelize its model training. More information is available in the TPOT documentation.

Acknowledgements

Since June 14th, the following people have contributed to the following repositories:

The core Dask repository for parallel algorithms:

Anderson Banihirwe
Andre Thrill
Aurélien Ponte
Christoph Moehl
Cloves Almeida
Daniel Rothenberg
Danilo Horta
Davis Bennett
Elliott Sales de Andrade
Eric Bonfadini
GPistre
George Sakkis
Guido Imperiale
Hans Moritz Günther
Henrique Ribeiro
Hugo
Irina Truong
Itamar Turner-Trauring
Jacob Tomlinson
James Bourbeau
Jan Margeta
Javad
Jeremy Chen
Jim Crist
Joe Hamman
John Kirkham
John Mrziglod
Julia Signell
Marco Rossi
Mark Harfouche
Martin Durant
Matt Lee
Matthew Rocklin
Mike Neish
Robert Sare
Scott Sievert
Stephan Hoyer
Tobias de Jong
Tom Augspurger
WZY
Yu Feng
Yuval Langer
minebogy
nmiles2718
rtobar

The dask/distributed repository for distributed computing:

Anderson Banihirwe
Aurélien Ponte
Bartosz Marcinkowski
Dave Hirschfeld
Derek Ludwig
Dror Birkman
Guillaume EB
Jacob Tomlinson
Joe Hamman
John Kirkham
Loïc Estève
Luke Canavan
Marius van Niekerk
Martin Durant
Matt Nicolls
Matthew Rocklin
Mike DePalatis
Olivier Grisel
Phil Tooley
Ray Bell
Tom Augspurger
Yu Feng

The dask/dask-examples repository for easy-to-run examples:

Albert DeFusco
Dan Vatterott
Guillaume EB
Matthew Rocklin
Scott Sievert
Tom Augspurger
mholtzscher

High level performance of Pandas, Dask, Spark, and Arrow

2018-08-28T00:00:00+00:00

This work is supported by Anaconda Inc

How does Dask dataframe performance compare to Pandas? Also, what about Spark dataframes and what about Arrow? How do they compare?

I get this question every few weeks. This post is to avoid repetition.

Caveats

This answer is likely to change over time. I’m writing this in August 2018.
This question and answer are very high level. More technical answers are possible, but not contained here.

Answers

Pandas

If you’re coming from Python and have smallish datasets then Pandas is the right choice. It’s usable, widely understood, efficient, and well maintained.

Benefits of Parallelism

The performance benefit (or drawback) of using a parallel dataframe like Dask dataframes or Spark dataframes over Pandas will differ based on the kinds of computations you do:

If you’re doing small computations then Pandas is always the right choice. The administrative costs of parallelizing will outweigh any benefit. You should not parallelize if your computations are taking less than, say, 100ms.
For simple operations like filtering, cleaning, and aggregating large data you should expect linear speedup by using parallel dataframes.

If you’re on a 20-core computer you might expect a 20x speedup. If you’re on a 1000-core cluster you might expect a 1000x speedup, assuming that you have a problem big enough to spread across 1000 cores. As you scale up administrative overhead will increase, so you should expect the speedup to decrease a bit.
For complex operations like distributed joins it’s more complicated. You might get linear speedups like above, or you might even get slowdowns. Someone experienced in database-like computations and parallel computing can probably predict pretty well which computations will do well.
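The expectations above can be made concrete with a toy cost model. This is only an illustrative sketch, not a measurement: it assumes work that divides evenly across workers plus a coordination cost that grows with the number of workers. The function name `estimated_speedup` and the overhead values are made up for the example.

```python
def estimated_speedup(serial_seconds, n_workers, overhead_per_worker=0.0):
    """Toy model: work divides evenly, plus a coordination cost per worker."""
    parallel_seconds = serial_seconds / n_workers + overhead_per_worker * n_workers
    return serial_seconds / parallel_seconds


# A one-hour job with no overhead scales linearly with the worker count.
ideal = estimated_speedup(3600, 20)

# The same job with a hypothetical 0.1 s of coordination per worker
# falls slightly short of linear.
realistic = estimated_speedup(3600, 20, overhead_per_worker=0.1)

# A 50 ms computation is slower in parallel once overhead is included.
tiny = estimated_speedup(0.05, 20, overhead_per_worker=0.01)

print(ideal, realistic, tiny)
```

Under these assumptions the large job stays close to a 20x speedup, while the 50 ms job ends up with a "speedup" below one, which is the reason not to parallelize very small computations.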

However, configuration may be required. Often people find that parallel solutions don’t meet expectations when they first try them out. Unfortunately most distributed systems require some configuration to perform optimally.

There are other options to speed up Pandas

Many people looking to speed up Pandas don’t need parallelism. There are often several other tricks like encoding text data, using efficient file formats, avoiding groupby.apply, and so on that are more effective at speeding up Pandas than switching to parallelism.

Comparing Apache Spark and Dask

Assuming that yes, I do want parallelism, should I choose Apache Spark, or Dask dataframes?

This is often decided more by cultural preferences (JVM vs Python, all-in-one-tool vs integration with other tools) than performance differences, but I’ll try to outline a few things here:

Spark dataframes will be much better when you have large SQL-style queries (think 100+ line queries) where their query optimizer can kick in.
Dask dataframes will be much better when queries go beyond typical database queries. This happens most often in time series, random access, and other complex computations.
Spark will integrate better with JVM and data engineering technology. Spark will also come with everything pre-packaged. Spark is its own ecosystem.
Dask will integrate better with Python code. Dask is designed to integrate with other libraries and pre-existing systems. If you’re coming from an existing Pandas-based workflow then it’s usually much easier to evolve to Dask.

Generally speaking for most operations you’ll be fine using either one. People often choose between Pandas/Dask and Spark based on cultural preference. Either they have people that really like the Python ecosystem, or they have people that really like the Spark ecosystem.

Dataframes are also only a small part of each project. Spark and Dask both do many other things that aren’t dataframes. For example Spark has a graph analysis library, Dask doesn’t. Dask supports multi-dimensional arrays, Spark doesn’t. Spark is generally higher level and all-in-one while Dask is lower-level and focuses on integrating into other tools.

For more information, see Dask’s “Comparison to Spark documentation” or this interview with Steppingblocks, a data analytics company, on why they switched from Spark to Dask.

Apache Arrow

What about Arrow? Is Arrow faster than Pandas?

This question doesn’t quite make sense… yet.

Arrow is not a replacement for Pandas. Today Arrow is useful to people building systems and not to analysts directly like Pandas. Arrow is used to move data between different computational systems and file formats. Arrow does not do computation today, but is commonly used as a component in other libraries that do do computation. For example, if you use Pandas or Spark or Dask today you may be using Arrow without knowing it. Today Arrow is more useful for other libraries than it is to end-users.

However, this is likely to change in the future. Arrow developers plan to write computational code around Arrow that we would expect to be faster than the code in either Pandas or Spark. This is probably a year or two away though. There will probably be some effort to make this semi-compatible with Pandas, but it’s much too early to tell.

Building SAGA optimization for Dask arrays

2018-08-07T00:00:00+00:00

This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science

At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, Fabian Pedregosa (a machine learning researcher and Scikit-learn developer) and Matthew Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm SAGA on parallel Dask datasets. The result is a sequential algorithm that can be run on any dask array, and so allows the data to be stored on disk or even distributed among different machines.

It was interesting both to see how the algorithm performed and to see what was easy and what was challenging about running a research algorithm on a Dask distributed dataset.

We started with an initial implementation that Fabian had written for Numpy arrays using Numba. The following code solves an optimization problem of the form

\[ \min_x \sum_{i=1}^n f(a_i^t x, b_i) \]

import numpy as np
from numba import njit
from sklearn.linear_model.sag import get_auto_step_size
from sklearn.utils.extmath import row_norms

@njit
def deriv_logistic(p, y):
    # derivative of logistic loss
    # same as in lightning (with minus sign)
    p *= y
    if p > 0:
        phi = 1. / (1 + np.exp(-p))
    else:
        exp_t = np.exp(p)
        phi = exp_t / (1. + exp_t)
    return (phi - 1) * y

@njit
def SAGA(A, b, step_size, max_iter=100):
    """
    SAGA algorithm

    A : n_samples x n_features numpy array
    b : n_samples numpy array with values -1 or 1
    """

    n_samples, n_features = A.shape
    memory_gradient = np.zeros(n_samples)
    gradient_average = np.zeros(n_features)
    x = np.zeros(n_features)  # vector of coefficients
    step_size = 0.3 * get_auto_step_size(row_norms(A, squared=True).max(), 0, 'log', False)

    for _ in range(max_iter):
        # sample randomly
        idx = np.arange(memory_gradient.size)
        np.random.shuffle(idx)

        # .. inner iteration ..
        for i in idx:
            grad_i = deriv_logistic(np.dot(x, A[i]), b[i])

            # .. update coefficients ..
            delta = (grad_i - memory_gradient[i]) * A[i]
            x -= step_size * (delta + gradient_average)

            # .. update memory terms ..
            gradient_average += (grad_i - memory_gradient[i]) * A[i] / n_samples
            memory_gradient[i] = grad_i

        # monitor convergence
        print('gradient norm:', np.linalg.norm(gradient_average))

    return x

This implementation is a simplified version of the SAGA implementation that Fabian uses regularly as part of his research. It assumes that \(f\) is the logistic loss, i.e., \(f(z) = \log(1 + \exp(-z))\). It can be used to solve problems with other choices of \(f\) by replacing the function deriv_logistic.
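For reference, the inner loop of the code above can be written out as update equations. The notation is read off from the code rather than taken from a paper: \(\gamma\) stands for step_size, \(m_i\) for memory_gradient[i], \(\bar{g}\) for gradient_average, \(n\) for n_samples, and \(f'\) for the derivative of \(f\) with respect to its first argument. For each sampled index \(i\):

\[ g_i = f'(a_i^t x, b_i) \]

\[ x \leftarrow x - \gamma \left( (g_i - m_i) \, a_i + \bar{g} \right) \]

\[ \bar{g} \leftarrow \bar{g} + \frac{1}{n} (g_i - m_i) \, a_i \]

\[ m_i \leftarrow g_i \]

With labels \(b \in \{-1, 1\}\) and the loss written as \(\log(1 + \exp(-b z))\), deriv_logistic computes \(f'(z, b) = b \, (\sigma(b z) - 1)\), where \(\sigma(t) = 1 / (1 + \exp(-t))\).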

We wanted to apply it across a parallel Dask array by applying it to each chunk of the Dask array, a smaller Numpy array, one at a time, carrying a set of parameters along the way.

Development Process

In order to better understand the challenges of writing Dask algorithms, Fabian did most of the actual coding to start. Fabian is a good example of a researcher who knows how to program well and how to design ML algorithms, but has no direct exposure to the Dask library. This was an educational opportunity both for Fabian and for Matt. Fabian learned how to use Dask, and Matt learned how to introduce Dask to researchers like Fabian.

Step 1: Build a sequential algorithm with pure functions

To start we actually didn’t use Dask at all. Instead, Fabian modified his implementation in a few ways:

It should operate over a list of Numpy arrays. A list of Numpy arrays is similar to a Dask array, but simpler.
It should separate blocks of logic into separate functions. These will eventually become tasks, so they should be sizable chunks of work. In this case, this led to the creation of the function _chunk_saga that performs an iteration of the SAGA algorithm on a subset of the data.
These functions should not modify their inputs, nor should they depend on global state. All information that those functions require (like the parameters that we’re learning in our algorithm) should be explicitly provided as inputs.
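The three requirements above can be illustrated on a much simpler problem than SAGA. The sketch below is not part of Fabian's code: it computes column means over a list of chunks using plain Python lists, and the function names are made up for the example. The point is only the shape of the code: an update function that copies its inputs, receives all state as arguments, and returns new state.

```python
def update_column_sums(chunk, sums, n_rows):
    """Pure update: returns new state and leaves its inputs untouched."""
    sums = list(sums)  # explicit copy, like x.copy() in a NumPy version
    for row in chunk:
        for j, value in enumerate(row):
            sums[j] += value
    return sums, n_rows + len(chunk)


def column_means(chunks, n_columns):
    # All state is passed in and handed back explicitly, chunk by chunk.
    sums, n_rows = [0.0] * n_columns, 0
    for chunk in chunks:
        sums, n_rows = update_column_sums(chunk, sums, n_rows)
    return [s / n_rows for s in sums]


# A list of chunks stands in for a chunked array.
chunks = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0]],
]
means = column_means(chunks, n_columns=2)
print(means)
```

The explicit copy costs a little performance, but it means each call can later become an independent task without any hidden coupling between calls.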

These requested modifications affect performance a bit: we end up making more copies of the parameters and more copies of intermediate state. In terms of programming difficulty this took a bit of time (around a couple of hours) but is a straightforward task that Fabian didn’t seem to find challenging or foreign.

These changes resulted in the following code:

import numpy as np
from numba import njit
from sklearn.utils.extmath import row_norms
from sklearn.linear_model.sag import get_auto_step_size


@njit
def _chunk_saga(A, b, n_samples, f_deriv, x, memory_gradient, gradient_average, step_size):
    # Make explicit copies of inputs
    x = x.copy()
    gradient_average = gradient_average.copy()
    memory_gradient = memory_gradient.copy()

    # Sample randomly
    idx = np.arange(memory_gradient.size)
    np.random.shuffle(idx)

    # .. inner iteration ..
    for i in idx:
        grad_i = f_deriv(np.dot(x, A[i]), b[i])

        # .. update coefficients ..
        delta = (grad_i - memory_gradient[i]) * A[i]
        x -= step_size * (delta + gradient_average)

        # .. update memory terms ..
        gradient_average += (grad_i - memory_gradient[i]) * A[i] / n_samples
        memory_gradient[i] = grad_i

    return x, memory_gradient, gradient_average


def full_saga(data, max_iter=100, callback=None):
    """
    data: list of (A, b), where A is a n_samples x n_features
    numpy array and b is a n_samples numpy array
    """
    n_samples = 0
    for A, b in data:
        n_samples += A.shape[0]
    n_features = data[0][0].shape[1]
    memory_gradients = [np.zeros(A.shape[0]) for (A, b) in data]
    gradient_average = np.zeros(n_features)
    x = np.zeros(n_features)

    steps = [get_auto_step_size(row_norms(A, squared=True).max(), 0, 'log', False) for (A, b) in data]
    step_size = 0.3 * np.min(steps)

    for _ in range(max_iter):
        for i, (A, b) in enumerate(data):
            x, memory_gradients[i], gradient_average = _chunk_saga(
                    A, b, n_samples, deriv_logistic, x, memory_gradients[i],
                    gradient_average, step_size)
        if callback is not None:
            print(callback(x, data))

    return x

Step 2: Apply dask.delayed

Once functions neither modified their inputs nor relied on global state we went over a dask.delayed example, and then applied the @dask.delayed decorator to the functions that Fabian had written. Fabian did this at first in about five minutes and, to our mutual surprise, things actually worked.

@dask.delayed(nout=3)                               # <<<---- New
@njit
def _chunk_saga(A, b, n_samples, f_deriv, x, memory_gradient, gradient_average, step_size):
    ...

def full_saga(data, max_iter=100, callback=None):
    n_samples = 0
    for A, b in data:
        n_samples += A.shape[0]
    data = dask.persist(*data)                      # <<<---- New

    ...

    for _ in range(max_iter):
        for i, (A, b) in enumerate(data):
            x, memory_gradients[i], gradient_average = _chunk_saga(
                    A, b, n_samples, deriv_logistic, x, memory_gradients[i],
                    gradient_average, step_size)
        cb = dask.delayed(callback)(x, data)        # <<<---- Changed

        x, cb = dask.persist(x, cb)                 # <<<---- New
        print(cb.compute())

However, things didn’t work that well. When we took a look at the dask dashboard we found that there was a lot of dead space, a sign that we were still doing a lot of computation on the client side.

Step 3: Diagnose and add more dask.delayed calls

While things worked, they were also fairly slow. If you look at the dashboard plot above you’ll see that there is plenty of white in between colored rectangles. This shows that there are long periods where none of the workers is doing any work.

This is a common sign that we’re mixing work between the workers (which shows up on the dashboard) and the client. The solution to this is usually more targeted use of dask.delayed. Dask delayed is trivial to start using, but does require some experience to use well. It’s important to keep track of which operations and variables are delayed and which aren’t. There is some cost to mixing between them.
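To make the delayed versus not-delayed distinction concrete, here is a minimal sketch on a toy problem rather than on SAGA. The `update` function and the data are made up for the example. The running state starts out as delayed values and stays delayed through the whole loop, so nothing has to come back to the client until the final compute call.

```python
import dask


@dask.delayed(nout=2)
def update(chunk, total, count):
    # Pure function: returns new state instead of mutating its inputs
    return total + sum(chunk), count + len(chunk)


chunks = [[1, 2, 3], [4, 5], [6]]

# Start with delayed state so every step of the loop stays lazy.
total = dask.delayed(0)
count = dask.delayed(0)
for chunk in chunks:
    # nout=2 lets us unpack the delayed result into two delayed values
    total, count = update(chunk, total, count)

mean = total / count  # arithmetic on delayed values is lazy as well

# Only here does any work happen. The synchronous scheduler is used
# just to keep the example self-contained.
result = mean.compute(scheduler="synchronous")
print(result)
```

Calling compute inside the loop instead would pull each intermediate value back to the client, which is the kind of mixing that shows up as white space on the dashboard.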

At this point Matt stepped in and added delayed in a few more places and the dashboard plot started looking cleaner.

@dask.delayed(nout=3)                               # <<<---- New
@njit
def _chunk_saga(A, b, n_samples, f_deriv, x, memory_gradient, gradient_average, step_size):
    ...

def full_saga(data, max_iter=100, callback=None):
    n_samples = 0
    for A, b in data:
        n_samples += A.shape[0]
    n_features = data[0][0].shape[1]
    data = dask.persist(*data)                      # <<<---- New
    memory_gradients = [dask.delayed(np.zeros)(A.shape[0])
                        for (A, b) in data]         # <<<---- Changed
    gradient_average = dask.delayed(np.zeros)(n_features)  #  Changed
    x = dask.delayed(np.zeros)(n_features)          # <<<---- Changed

    steps = [dask.delayed(get_auto_step_size)(
                dask.delayed(row_norms)(A, squared=True).max(),
                0, 'log', False)
             for (A, b) in data]                    # <<<---- Changed
    step_size = 0.3 * dask.delayed(np.min)(steps)   # <<<---- Changed

    for _ in range(max_iter):
        for i, (A, b) in enumerate(data):
            x, memory_gradients[i], gradient_average = _chunk_saga(
                    A, b, n_samples, deriv_logistic, x, memory_gradients[i],
                    gradient_average, step_size)
        cb = dask.delayed(callback)(x, data)        # <<<---- Changed
        x, memory_gradients, gradient_average, step_size, cb = \
            dask.persist(x, memory_gradients, gradient_average, step_size, cb)  # New
        print(cb.compute())                         # <<<---- changed

    return x

From a dask perspective this now looks good. We see that one partial_fit call is active at any given time with no large horizontal gaps between partial_fit calls. We’re not getting any parallelism (this is just a sequential algorithm) but we don’t have much dead space. The model seems to jump between the various workers, processing on a chunk of data before moving on to new data.

Step 4: Profile

The dashboard image above gives confidence that our algorithm is operating as it should. The block-sequential nature of the algorithm comes out cleanly, and the gaps between tasks are very short.

However, when we look at the profile plot of the computation across all of our cores (Dask constantly runs a profiler on all threads on all workers to get this information) we see that most of our time is spent compiling Numba code.

We started a conversation about this on the numba issue tracker which has since been resolved. That same computation over the same time now looks like this:

The tasks, which used to take seconds, now take tens of milliseconds, so we can process through many more chunks in the same amount of time.

Future Work

This was a useful experience to build an interesting algorithm. Most of the work above took place in an afternoon. We came away from this activity with a few tasks of our own:

Build a normal Scikit-Learn style estimator class for this algorithm so that people can use it without thinking too much about delayed objects, and can instead just use dask arrays or dataframes
Integrate some of Fabian’s research on this algorithm that improves performance with sparse data and in multi-threaded environments.
Think about how to improve the learning experience so that dask.delayed can teach new users how to use it correctly

Dask Development Log

2018-08-02T00:00:00+00:00

This work is supported by Anaconda Inc

Over the last two weeks we’ve seen activity in the following areas:

An experimental Actor solution for stateful processing
Machine learning experiments with hyper-parameter selection and parameter servers.
Development of more preprocessing transformers
Statistical profiling of the distributed scheduler’s internal event loop thread and internal optimizations
A new release of dask-yarn
A new narrative on dask-stories about modelling mobile networks
Support for LSF clusters in dask-jobqueue
Test suite cleanup for intermittent failures

Some advanced workloads want to directly manage and mutate state on workers. A task-based framework like Dask can be forced into this kind of workload using long-running tasks, but it’s an uncomfortable experience. To address this we’ve been adding an experimental Actors framework to Dask alongside the standard task-scheduling system. This provides reduced latencies, removes scheduling overhead, and provides the ability to directly mutate state on a worker, but loses niceties like resilience and diagnostics.

The idea to adopt Actors was shamelessly stolen from the Ray Project :)

Work for Actors is happening in dask/distributed #2133.

class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counter = client.submit(Counter, actor=True).result()

>>> future = counter.increment()
>>> future.result()
1

Machine learning experiments

Hyper parameter optimization on incrementally trained models

Many Scikit-Learn-style estimators feature a partial_fit method that enables incremental training on batches of data. This is particularly well suited for systems like Dask array or Dask dataframe, that are built from many batches of Numpy arrays or Pandas dataframes. It’s a nice fit because all of the computational algorithm work is already done in Scikit-Learn; Dask just has to administratively move models around to data and call scikit-learn (or other machine learning models that follow the fit/transform/predict/score API). This approach provides a nice community interface between parallelism and machine learning developers.

However, this training is inherently sequential because the model only trains on one batch of data at a time. We’re leaving a lot of processing power on the table.

To address this we can combine incremental training with hyper-parameter selection and train several models on the same data at the same time. This is often required anyway, and lets us be more efficient with our computation.

However there are many ways to do incremental training with hyper-parameter selection, and the right algorithm likely depends on the problem at hand. This is an active field of research and so it’s hard for a general project like Dask to pick and implement a single method that works well for everyone. There is probably a handful of methods that will be necessary with various options on them.

To help experimentation here we’ve been building some lower-level tooling that we think will be helpful in a variety of cases. It accepts a policy from the user as a Python function that receives scores from recent evaluations and returns how much further to progress on each set of hyper-parameters before checking in again. This allows us to model a few common situations like random search with early stopping conditions, successive halving, and variations of those easily without having to write any Dask code:
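The exact interface of that tooling is not shown here, so the following is only a hypothetical sketch of what such a policy function could look like. The function name, the shape of the `scores` argument, and the return format are all assumptions made for illustration. It implements a simple form of successive halving: keep the better half of the models and ask for a few more training steps on each.

```python
def successive_halving_policy(scores, steps_per_round=5):
    """Hypothetical policy: keep the better half and train it further.

    scores: dict mapping a model identifier to its list of scores so far
            (higher is better)
    returns: dict mapping each surviving identifier to the number of
             additional training steps before the next check-in
    """
    if len(scores) <= 1:
        return {}  # one model left, so there is nothing more to ask for
    # Rank models by their most recent score, best first.
    ranked = sorted(scores, key=lambda ident: scores[ident][-1], reverse=True)
    survivors = ranked[: max(1, len(ranked) // 2)]
    return {ident: steps_per_round for ident in survivors}


scores = {
    "a": [0.60, 0.71],
    "b": [0.55, 0.58],
    "c": [0.40, 0.80],
    "d": [0.30, 0.35],
}
next_round = successive_halving_policy(scores)
print(next_round)
```

Because the policy is an ordinary function over plain dictionaries, variations such as early stopping on a score plateau can be written and tested without touching any parallel code.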

This work is done by Scott Sievert and myself

Parameter Servers

To improve the speed of training large models Scott Sievert has been using Actors (mentioned above) to develop simple examples for parameter servers. These are helping to identify and motivate performance and diagnostic improvements within Dask itself:

These parameter servers manage the communication of models produced by different workers, and leave the computation to the underlying deep learning library. This is ongoing work.

Dataframe Preprocessing Transformers

We’ve started to orient some of the Dask-ML work around case studies. Our first, written by Scott Sievert, uses the Criteo dataset for ads. It’s a good example of a combined dense/sparse dataset that can be somewhat large (around 1TB). The first challenge we’re running into is preprocessing. This has led to a few preprocessing improvements:

Some of these are also based off of improved dataframe handling features in the upcoming 0.20 release for Scikit-Learn.

This work is done by Roman Yurchak, James Bourbeau, Daniel Severo, and Tom Augspurger.

Profiling the main thread

Profiling concurrent code is hard. Traditional profilers like cProfile become confused by control passing between all of the different coroutines. This means that we haven’t done a very comprehensive job of profiling and tuning the distributed scheduler and workers. Statistical profilers on the other hand tend to do a bit better. We’ve taken the statistical profiler that we usually use on Dask worker threads (available in the dashboard on the “Profile” tab) and have applied it to the central administrative threads running the Tornado event loop as well. This has highlighted a few issues that we weren’t able to spot before, and should hopefully result in reduced overhead in future releases.
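For readers unfamiliar with the idea, a statistical profiler periodically looks at what a thread is doing instead of instrumenting every function call. The sketch below is not Dask's profiler. It is a minimal standard-library illustration that samples one thread through `sys._current_frames` and counts the function it finds running; the names `busy_work` and `sample_thread` are made up for the example.

```python
import collections
import sys
import threading
import time


def busy_work(stop):
    # Stand-in for the thread we want to profile
    while not stop.is_set():
        sum(i * i for i in range(1000))


def sample_thread(thread_id, n_samples=50, interval=0.002):
    """Count which function the target thread is in at each sample."""
    counts = collections.Counter()
    for _ in range(n_samples):
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts


stop = threading.Event()
worker = threading.Thread(target=busy_work, args=(stop,))
worker.start()
counts = sample_thread(worker.ident)
stop.set()
worker.join()

# Function names seen most often are where the thread spent its time.
print(counts.most_common())
```

Because the sampler only reads the current frame from another thread, it does not care how control moves between coroutines on the profiled thread, which is why this style copes better with event-loop code.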

New release of Dask-Yarn

There is a new release of Dask-Yarn and the underlying library for managing Yarn jobs, Skein. These include a number of bug-fixes and improved concurrency primitives for YARN applications. The new features are documented here, and were implemented in jcrist/skein #40.

This work was done by Jim Crist

Support for LSF clusters in Dask-Jobqueue

Dask-jobqueue supports Dask use on traditional HPC cluster managers like SGE, SLURM, PBS, and others. We’ve recently added support for LSF clusters.

Work was done in dask/dask-jobqueue #78 by Ray Bell.

New Dask Story on mobile networks

The Dask Stories repository holds narratives about how people use Dask. Sameer Lalwani recently added a story about using Dask to model mobile communication networks. It’s worth a read.

Test suite cleanup

The dask.distributed test suite has been suffering from intermittent failures recently. These are tests that fail very infrequently, and so are hard to catch when writing them, but show up when future unrelated PRs run the test suite on continuous integration and get failures. They add friction to the development process, but are expensive to track down (testing distributed systems is hard).

We’re taking a bit of time this week to track these down. Progress here:

Pickle isn't slow, it's a protocol

2018-07-23T00:00:00+00:00

This work is supported by Anaconda Inc

tl;dr: Pickle isn’t slow, it’s a protocol. Protocols are important for ecosystems.

A recent Dask issue showed that using Dask with PyTorch was slow because sending PyTorch models between Dask workers took a long time (Dask GitHub issue).

This turned out to be because serializing PyTorch models with pickle was very slow (1 MB/s for GPU based models, 50 MB/s for CPU based models). There is no architectural reason why this needs to be this slow. Every part of the hardware pipeline is much faster than this.

We could have fixed this in Dask by special-casing PyTorch models (Dask has its own optional serialization system for performance), but being good ecosystem citizens, we decided to raise the performance problem in an issue upstream (PyTorch GitHub issue). This resulted in a five-line fix to PyTorch that turned a 1-50 MB/s serialization bandwidth into a 1 GB/s bandwidth, which is more than fast enough for many use cases (PR to PyTorch).

     def __reduce__(self):
-        return type(self), (self.tolist(),)
+        b = io.BytesIO()
+        torch.save(self, b)
+        return (_load_from_bytes, (b.getvalue(),))


+def _load_from_bytes(b):
+    return torch.load(io.BytesIO(b))

Thanks to the PyTorch maintainers this problem was solved pretty easily. PyTorch tensors and models now serialize efficiently in Dask or in any other Python library that might want to use them in distributed systems like PySpark, IPython parallel, Ray, or anything else without having to add special-case code or do anything special. We didn’t solve a Dask problem, we solved an ecosystem problem.

However before we solved this problem we discussed things a bit. This comment stuck with me:

This comment contains two beliefs that are both very common, and that I find somewhat counter-productive:

Pickle is slow
You should use our specialized methods instead

I’m sort of picking on the PyTorch maintainers here a bit (sorry!) but I’ve found that these beliefs are quite widespread, so I’d like to address them here.

Pickle is not slow. Pickle is a protocol. We implement pickle. If it’s slow then it is our fault, not Pickle’s.

To be clear, there are many reasons not to use Pickle.

It’s not cross-language
It’s not very easy to parse
It doesn’t provide random access
It’s insecure
etc.

So you shouldn’t store your data or create public services using Pickle, but for things like moving data on a wire it’s a great default choice if you’re moving strictly from Python processes to Python processes in a trusted and uniform environment.

It’s great because it’s as fast as you can make it (up to a memory copy) and other libraries in the ecosystem can use it without needing to special case your code into theirs.

This is the change we made to PyTorch.

     def __reduce__(self):
-        return type(self), (self.tolist(),)
+        b = io.BytesIO()
+        torch.save(self, b)
+        return (_load_from_bytes, (b.getvalue(),))


+def _load_from_bytes(b):
+    return torch.load(io.BytesIO(b))

The slow part wasn’t Pickle, it was the .tolist() call within __reduce__ that converted a PyTorch tensor into a list of Python ints and floats. I suspect that the common belief of “Pickle is just slow” stopped anyone else from investigating the poor performance here. I was surprised to learn that a project as active and well maintained as PyTorch hadn’t fixed this already.

As a reminder, you can implement the pickle protocol by providing the __reduce__ method on your class. The __reduce__ function returns a loading function and sufficient arguments to reconstitute your object. Here we used torch’s existing save/load functions to create a bytestring that we could pass around.
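The same pattern works outside of PyTorch. The sketch below uses only the standard library. The `Samples` class is a hypothetical container invented for this example, and `_load_from_bytes` simply borrows its name from the diff above. Its `__reduce__` hands pickle a single bytestring rather than a Python list of floats, which is the analogue of avoiding the .tolist() call.

```python
import array
import pickle


def _load_from_bytes(typecode, payload):
    # Loading function: rebuilds the object from the arguments that
    # __reduce__ returned alongside it
    return Samples(array.array(typecode, payload))


class Samples:
    """Hypothetical container wrapping a typed array of numbers."""

    def __init__(self, values):
        self.values = array.array("d", values)

    def __reduce__(self):
        # Return a loading function plus enough arguments to
        # reconstitute the object: here, a typecode and raw bytes.
        return (_load_from_bytes, (self.values.typecode, self.values.tobytes()))


original = Samples([1.0, 2.5, -3.0])
restored = pickle.loads(pickle.dumps(original))
print(restored.values)
```

Any library that calls pickle on a `Samples` object now gets the compact representation without knowing anything about the class, which is the ecosystem benefit argued for above.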

Just use our specialized option

Specialized options can be great. They can have nice APIs with many options, they can tune themselves to specialized communication hardware if it exists (like RDMA or NVLink), and so on. But people need to learn about them first, and learning about them can be hard in two ways.

Hard for users

Today we use a large and rapidly changing set of libraries. It’s hard for users to become experts in all of them. Increasingly we rely on new libraries making it easy for us by adhering to standard APIs, providing informative error messages that lead to good behavior, and so on.

Hard for other libraries

Other libraries that need to interact definitely won’t read the documentation, and even if they did it’s not sensible for every library to special case every other library’s favorite method to turn their objects into bytes. Ecosystems of libraries depend strongly on the presence of protocols and a strong consensus around implementing them consistently and efficiently.

Sometimes Specialized Options are Appropriate

There are good reasons to support specialized options. Sometimes you need more than 1GB/s bandwidth. While this is rare in general (very few pipelines process faster than 1GB/s/node), it is true in the particular case of PyTorch when they are doing parallel training on a single machine with multiple processes. Soumith (PyTorch maintainer) writes the following:

When sending Tensors over multiprocessing, our custom serializer actually shortcuts them through shared memory, i.e. it moves the underlying Storages to shared memory and restores the Tensor in the other process to point to the shared memory. We did this for the following reasons:

Speed: we save on memory copies, especially if we amortize the cost of moving a Tensor to shared memory before sending it into the multiprocessing Queue. The total cost of actually moving a Tensor from one process to another ends up being O(1), and independent of the Tensor’s size
Sharing: If Tensor A and Tensor B are views of each other, once we serialize and send them, we want to preserve this property of them being views. This is critical for neural-nets where it’s common to re-view the weights / biases and use them for another. With the default pickle solution, this property is actually lost.
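The sharing point can be reproduced in miniature with the standard library. This is only an analogy, not PyTorch code: a bytearray stands in for a tensor's underlying storage and two plain dicts stand in for views onto it. When the two "views" are pickled separately, as they would be if sent in separate messages, each copy gets its own storage. When they are pickled together in one call, pickle's internal memo keeps the shared reference.

```python
import pickle

# Stand-in for a tensor's underlying memory, shared by two "views"
storage = bytearray(4)
view_a = {"storage": storage, "offset": 0}
view_b = {"storage": storage, "offset": 2}

# Pickled separately, each copy ends up with its own private storage.
copy_a = pickle.loads(pickle.dumps(view_a))
copy_b = pickle.loads(pickle.dumps(view_b))
copy_a["storage"][0] = 255
separate_still_shared = copy_b["storage"][0] == 255

# Pickled together, the shared storage is serialized once and both
# restored views point at the same object.
pair_a, pair_b = pickle.loads(pickle.dumps((view_a, view_b)))
pair_a["storage"][0] = 255
together_still_shared = pair_b["storage"][0] == 255

print(separate_still_shared, together_still_shared)
```

In this toy setting the view relationship survives only within a single pickle call, which gives some intuition for why a multiprocessing queue that sends tensors one at a time benefits from a shared-memory path.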

Dask Development Log, Scipy 2018

2018-07-17T00:00:00+00:00

This work is supported by Anaconda Inc

Last week many Dask developers gathered for the annual SciPy 2018 conference. As a result, very little work was completed, but many projects were started or discussed. To reflect this change in activity this blogpost will highlight possible changes and opportunities for readers to further engage in development.

The dask-jobqueue project was a hit at the conference. Dask-jobqueue helps people launch Dask on traditional job schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly found on high performance computers. These are very common among scientific, research, and high performance machine learning groups but are commonly a bit hard to use with anything other than MPI.

This project came up in the Pangeo talk, lightning talks, and the Dask Birds of a Feather session.

During sprints a number of people came up and we went through the process of configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This process usually takes around fifteen minutes and will likely be the subject of a future blogpost. We published known-good configurations for these clusters in our configuration documentation.

Additionally, there is a JupyterHub issue to improve documentation on best practices to deploy JupyterHub on these machines. The community has done this well a few times now, and it might be time to write up something for everyone else.

Get involved

If you have access to a supercomputer then please try things out. There is a 30-minute YouTube video screencast on the dask-jobqueue documentation that should help you get started.

If you are an administrator on a supercomputer you might consider helping to build a configuration file and place it in /etc/dask for your users. You might also want to get involved in the JupyterHub on HPC conversation.

Dask / Scikit-learn talk

Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the current state of the new Dask-ML project.

MyBinder and Bokeh Servers

Not a Dask change, but Min Ragan-Kelley showed how to run services through mybinder.org that are not only Jupyter. As an example, here is a repository that deploys a Bokeh server application with a single click.

I think that by composing with Binder, Min effectively just created a free-to-use hosted Bokeh server service. Presumably this same model could be adapted to other applications just as easily.

Dask and Automated Machine Learning with TPOT

Dask and TPOT developers are discussing parallelizing the automatic-machine-learning tool TPOT.

TPOT uses genetic algorithms to search over a space of scikit-learn style pipelines to automatically find a decently performing pipeline and model. This involves a fair amount of computation which Dask can help to parallelize out to multiple machines.

Get involved

Trivial things work now, but to make this efficient we’ll need to dive in a bit more deeply. Extending that pull request to dive within pipelines would be a good task if anyone wants to get involved. This would help to share intermediate results between pipelines.

Dask and Scikit-Optimize

Among various features, Scikit-optimize offers a BayesSearchCV object that is like Scikit-Learn’s GridSearchCV and RandomizedSearchCV, but is a bit smarter about how to choose new parameters to test given previous results. Hyper-parameter optimization is low-hanging fruit for Dask-ML workloads today, so we investigated how the project might help here.

So far we’re just experimenting using Scikit-Learn/Dask integration through joblib to see what opportunities there are. Discussion among Dask and Scikit-Optimize developers is happening here:

Issue: dask/dask-ml #300

Centralize PyData/Scipy tutorials on Binder

We’re putting a bunch of the PyData/Scipy tutorials on Binder, and hope to embed snippets of Youtube videos into the notebooks themselves.

This effort lives here:

pydata-tutorials.readthedocs.io

Motivation

The PyData and SciPy community delivers tutorials as part of most conferences. This activity generates both educational Jupyter notebooks and explanatory videos that teach people how to use the ecosystem.

However, this content isn’t very discoverable after the conference. People can search on Youtube for their topic of choice and hopefully find a link to the notebooks to download locally, but this is a somewhat noisy process. It’s not clear which tutorial to choose and it’s difficult to match up the video with the notebooks during exercises. We’re probably not getting as much value out of these resources as we could be.

To help increase access we’re going to try a few things:

Produce a centralized website with links to recent tutorials delivered for each topic
Ensure that those notebooks run easily on Binder
Embed sections of the talk on Youtube within each notebook so that the explanation of the section is tied to the exercises

Get involved

This only really works long-term under a community maintenance model. So far we’ve only done a few hours of work and there is still plenty to do in the following tasks:

Find good tutorials for inclusion
Ensure that they work well on mybinder.org
- are self-contained and don’t rely on external scripts to run
- have an environment.yml or requirements.txt
- don’t require a lot of resources
Find video for the tutorial
Submit a pull request to the tutorial repository that embeds a link to the youtube talk at the top cell of the notebook at the proper time for each notebook

Dask, Actors, and Ray

I really enjoyed the talk on Ray, another distributed task scheduler for Python. I suspect that Dask will steal ideas for actors for stateful operation. I hope that Ray takes on ideas for using standard Python interfaces so that more of the community can adopt it more quickly. I encourage people to check out the talk and give Ray a try. It’s pretty slick.

Planning conversations for Dask-ML

Dask and Scikit-learn developers had the opportunity to sit down again and raise a number of issues to help plan near-term development. This focused mostly around building important case studies to motivate future development, and identifying algorithms and other projects to target for near-term integration.

Case Studies

Algorithms

Get involved

We could use help in building out case studies to drive future development in the project. There are also several algorithmic places to get involved. Dask-ML is a young and fast-moving project with many opportunities for new developers to get involved.

Dask and UMAP for low-dimensional embeddings

Leland McInnes gave a great talk, Uniform Manifold Approximation and Projection for Dimensionality Reduction, in which he lays out a well-founded algorithm for dimensionality reduction, similar to PCA or t-SNE, but with some nice properties. He worked together with some Dask developers, and we identified some challenges due to Dask array slicing with random-ish slices.

A proposal to fix this problem lives here, if anyone wants a fun problem to work on:

dask/dask #3409 (comment)

Dask stories

We soft-launched Dask Stories, a webpage and project to collect and share user stories about how people use Dask in practice. We’re also delivering a separate blogpost about this today.

See blogpost: Who uses Dask?

If you use Dask and want to share your story we would absolutely welcome your experience. Having people like yourself share how they use Dask is incredibly important for the project.

Who uses Dask?

2018-07-16T00:00:00+00:00

This work is supported by Anaconda Inc

People often ask general questions like “Who uses Dask?” or more specific questions like the following:

For what applications do people use Dask dataframe?
How many machines do people often use with Dask?
How far does Dask scale?
Does dask get used on imaging data?
Does anyone use Dask with Kubernetes/Yarn/SGE/Mesos/… ?
Does anyone in the insurance industry use Dask?
…

This yields interesting and productive conversations where new users can dive into historical use cases, which informs their choices about whether and how to use the project in the future.

New users can learn a lot from existing users.

To further enable this conversation we’ve made a new tiny project, dask-stories. This is a small documentation page where people can submit how they use Dask and have that published for others to see.

To seed this site six generous users have written down how their group uses Dask. You can read about them here:

We’ve focused on a few questions, available in our template, that emphasize problems over technology and include negative as well as positive feedback to get a complete picture.

Who am I?
What problem am I trying to solve?
How Dask helps?
What pain points did I run into with Dask?
What technology do I use around Dask?

Contributions to this site are simple Markdown documents submitted as pull requests to github.com/dask/dask-stories. The site is then built with ReadTheDocs and updated immediately. We tried to make this as smooth and familiar to our existing userbase as possible.

This is important. Sharing real-world experiences like this is probably more valuable than code contributions to the Dask project at this stage. Dask is more technically mature than it is well-known. Users look to other users to help them understand a project (think of every time you’ve Googled for “some tool in some topic”).

If you use Dask today in an interesting way then please share your story. The world would love to hear your voice.

If you maintain another project you might consider implementing the same model. I hope that this proves successful enough for other projects in the ecosystem to reuse.

Dask Development Log

2018-07-08T00:00:00+00:00

This work is supported by Anaconda Inc

Current efforts for June 2018 in Dask and Dask-related projects include the following:

Yarn Deployment
More examples for machine learning
Incremental machine learning
HPC Deployment configuration

Dask developers often get asked “How do I deploy Dask on my Hadoop/Spark/Hive cluster?” We haven’t had a very good answer until recently.

Most Hadoop/Spark/Hive clusters are actually Yarn clusters. Yarn is the most common cluster manager on clusters that run Hadoop/Spark/Hive jobs, including any cluster purchased from a vendor like Cloudera or Hortonworks. If your application can run on Yarn then it can be a first class citizen here.

Unfortunately Yarn has really only been accessible through a Java API, and so has been difficult for Dask to interact with. That’s changing now with a few projects, including:

dask-yarn: an easy way to launch Dask on Yarn clusters
skein: an easy way to launch generic services on Yarn clusters (this is primarily what backs dask-yarn)
conda-pack: an easy way to bundle a conda environment into a redeployable archive, which is useful when launching Python applications on Yarn

This work is all being done by Jim Crist who is, I believe, currently writing up a blogpost about the topic at large. Dask-yarn was soft-released last week though, so people should give it a try and report feedback on the dask-yarn issue tracker. If you ever wanted direct help on your cluster, now is the right time because Jim is working on this actively and is not yet drowned in user requests so generally has a fair bit of time to investigate particular cases.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)

More examples for machine learning

Dask maintains a Binder of simple examples that show off various ways to use the project. This allows people to click a link on the web and quickly be taken to a Jupyter notebook running on the cloud. It’s a fun way to quickly experience and learn about a new project.

Previously we had a single example for arrays, dataframes, delayed, machine learning, etc.

Now Scott Sievert is expanding the examples within the machine learning section. He has submitted the following two so far:

I believe he’s planning on more. If you use dask-ml and have recommendations or want to help, you might want to engage in the dask-ml issue tracker or dask-examples issue tracker.

Incremental training

The incremental training mentioned as an example above is also new-ish. This is a Scikit-Learn style meta-estimator that wraps around other estimators that support the partial_fit method. It enables training on large datasets in an incremental or batchwise fashion.

Before

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)

import pandas as pd

for filename in filenames:
    df = pd.read_csv(filename)
    X, y = ...

    sgd.partial_fit(X, y)

After

from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

sgd = SGDClassifier(...)
inc = Incremental(sgd)

import dask.dataframe as dd

df = dd.read_csv(filenames)
X, y = ...
inc.fit(X, y)

Analysis

From a parallel computing perspective this is a very simple and un-sexy way of doing things. However my understanding is that it’s also quite pragmatic. In a distributed context we leave a lot of possible computation on the table (the solution is inherently sequential) but it’s fun to see the model jump around the cluster as it absorbs various chunks of data and then moves on.
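To make the "inherently sequential" point concrete, here is a minimal pure-Python sketch of the pattern. It is not how dask_ml.wrappers.Incremental is implemented; the RunningMean estimator and the fit_incrementally helper are hypothetical names invented for this illustration. The only thing they share with the real wrapper is the idea of feeding one block of data at a time to an estimator's partial_fit method.

```python
class RunningMean:
    """Toy estimator (hypothetical) with a scikit-learn style partial_fit."""

    def __init__(self):
        self.count = 0
        self.mean_ = 0.0

    def partial_fit(self, values):
        # Update the running mean one value at a time
        for v in values:
            self.count += 1
            self.mean_ += (v - self.mean_) / self.count
        return self


def fit_incrementally(estimator, blocks):
    """Feed blocks to the estimator one after another.

    Each call depends on the state left by the previous call,
    which is why this loop cannot be run in parallel as written.
    """
    for block in blocks:
        estimator.partial_fit(block)
    return estimator


# Three "chunks" standing in for the partitions of a larger dataset
blocks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
model = fit_incrementally(RunningMean(), blocks)
print(model.count, model.mean_)
```

In a distributed setting the blocks would live on different workers, so the model state has to travel to wherever the next block is.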

There’s ongoing work on how best to combine this with other work like pipelines and hyper-parameter searches to fill in the extra computation.

This work was primarily done by Tom Augspurger with help from Scott Sievert.

Dask User Stories

Dask developers are often asked “Who uses Dask?”. This is a hard question to answer because, even though we’re inundated with thousands of requests for help from various companies and research groups, it’s never fully clear who minds having their information shared with others.

We’re now trying to crowdsource this information in a more explicit way by having users tell their own stories. Hopefully this helps other users in their field understand how Dask can help and when it might (or might not) be useful to them.

We originally collected this information in a Google Form but have since then moved it to a Github repository. Eventually we’ll publish this as a proper web site and include it in our documentation.

If you use Dask and want to share your story this is a great way to contribute to the project. Arguably Dask needs more help with spreading the word than it does with technical solutions.

HPC Deployments

The Dask Jobqueue package for deploying Dask on traditional HPC machines is nearing another release. We’ve changed around a lot of the parameters and configuration options in order to improve the onboarding experience for new users. It has been going very smoothly in recent engagements with new groups, but will mean a breaking change for existing users of the sub-project.

Dask Scaling Limits

2018-06-26T00:00:00+00:00

This work is supported by Anaconda Inc.

For the first year of Dask’s life it focused exclusively on single node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.

After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also scale up to thousand-node clusters and 100+TB datasets when needed with the same API.

Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking Dask scales about the same as a system like Apache Spark, but less well than a high-performance system like MPI.

An Example

Most Dask examples in blogposts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask’s history with medium-data on single-nodes may have given people a more humble impression of Dask than is appropriate.

As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.

This is a common size for a typical modestly sized Dask cluster. We usually see Dask deployment sizes either in the tens of machines (usually with Hadoop style or ad-hoc enterprise clusters), or in the few-thousand range (usually with high performance computers or cloud deployments). We’re showing the modest case here just due to lack of resources. Everything in that example should work fine scaling out a couple extra orders of magnitude.

Challenges to Scaling Out

For the rest of the article we’ll talk about common causes that we see today that get in the way of scaling out. These are collected from experience working both with people in the open source community, as well as private contracts.

Simple Map-Reduce style

If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:

The scheduler will have at least one, and possibly a few connections open to each worker. You’ll want to ensure that your machines can have many open file handles at once. Some Linux distributions cap this at 1024 by default, but it is easy to change.
The scheduler has an overhead of around 200 microseconds per task. So if each task takes one second then your scheduler can saturate 5000 cores, but if each task takes only 100ms then your scheduler can only saturate around 500 cores, and so on. Task duration imposes an inversely proportional constraint on scaling.

If you want to scale larger than this then each task will need to do more work in order to amortize the overhead. Often this involves moving inner for loops into tasks rather than spreading them out across many tasks.
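The overhead arithmetic above can be written down as a back-of-the-envelope sketch. The function name saturable_cores is made up for this illustration, and the estimate assumes the roughly 200 microsecond per-task overhead quoted above is the only limit, which will not hold exactly in practice.

```python
def saturable_cores(task_seconds, overhead_seconds=200e-6):
    """Rough estimate of how many cores one scheduler can keep busy.

    While a single task runs for ``task_seconds`` on one core, the
    scheduler has time to hand out about
    ``task_seconds / overhead_seconds`` other tasks.
    """
    return round(task_seconds / overhead_seconds)


# Shorter tasks mean proportionally fewer cores can be kept busy
for duration in (1.0, 0.1, 0.01):
    print(f"{duration} s tasks: roughly {saturable_cores(duration)} cores")
```

Batching ten 100ms pieces of work into a single one-second task moves you from the second case to the first.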

More complex algorithms

If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well, it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:

Dask collection algorithms may be suboptimal.

The parallel algorithms in Dask-array/bag/dataframe/ml are pretty good, but as Dask scales out to larger clusters and its algorithms are used by more domains we invariably find that small corners of the API fail beyond a certain point. Luckily these are usually pretty easy to fix after they are reported.
The graph size may grow too large for the scheduler

The metadata describing your computation has to all fit on a single machine, the Dask scheduler. This metadata, the task graph, can grow big if you’re not careful. It’s nice to have a scheduler process with at least a few gigabytes of memory if you’re going to be processing million-node task graphs. A task takes up around 1kB of memory if you’re careful to avoid closing over any unnecessary local data.
The graph serialization time may become annoying for interactive use

Again, if you have million-node task graphs you’re going to be serializing them and passing them from the client to the scheduler. This is fine, assuming they fit at both ends, but can take up some time and limit interactivity. If you press compute and nothing shows up on the dashboard for a minute or two, this is what’s happening.
The interactive dashboard plots stop being as useful

Those beautiful plots on the dashboard were mostly designed for deployments with 1-100 nodes, but not 1000s. Seeing the start and stop time of every task of a million-task computation just isn’t something that our brains can fully understand.

This is something that we would like to improve. If anyone out there is interested in scalable performance diagnostics, please get involved.
Other components that you rely on, like distributed storage, may also start to break

Dask provides users more power than they’re accustomed to. It’s easy for them to accidentally clobber some other component of their systems, like distributed storage, a local database, the network, and so on, with too many requests.

Many of these systems provide abstractions that are very well tested and stable for normal single-machine use, but that quickly become brittle when you have a thousand machines acting on them with the full creativity of a novice user. Dask provides some primitives like distributed locks and queues to help control access to these resources, but it’s on the user to use them well and not break things.
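For the graph-size point in the list above, a quick estimate shows why a few gigabytes of scheduler memory is a sensible floor. This is only a sketch: the helper name graph_memory_gb is hypothetical, and it assumes the roughly 1kB per task figure quoted above, which grows if tasks close over extra local data.

```python
def graph_memory_gb(n_tasks, bytes_per_task=1_000):
    """Approximate scheduler memory needed to hold a task graph, in GB."""
    return n_tasks * bytes_per_task / 1e9


# A million-task graph at about 1 kB per task
estimate = graph_memory_gb(1_000_000)
print(f"about {estimate} GB for the graph metadata alone")
```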

Conclusion

Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.

Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that has defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It’s much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they’ll probably be ok, but when operating with full creativity at the thousand-node scale some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part due to some expert users out there.

A Call for Examples

Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.

Dask Release 0.18.0

2018-06-14T00:00:00+00:00

This work is supported by Anaconda Inc.

I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for 0.17.2 on March 21st.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

We list some breaking changes below, followed up by changes that are less important, but still fun.

The Dask core library is nearing a 1.0 release. Before that happens, we need to do some housecleaning. This release starts that process, replaces some existing interfaces, and builds up some needed infrastructure. Almost all of the changes in this release include clean deprecation warnings, but future releases will remove the old functionality, so now would be a good time to check in.

As happens with any release that starts breaking things, many other smaller breaks get added on as well. I’m personally very happy with this release because many aspects of using Dask now feel a lot cleaner, however heavy users of Dask will likely experience mild friction. Hopefully this post helps explain some of the larger changes.

Notable Breaking changes

Centralized configuration

Taking full advantage of Dask sometimes requires user configuration, especially in a distributed setting. This might be to control logging verbosity, specify cluster configuration, provide credentials for security, or any of several other options that arise in production.

We’ve found that different computing cultures like to specify configuration in several different ways:

Configuration files
Environment variables
Directly within Python code

Previously this was handled with a variety of different solutions among the different dask subprojects. The dask-distributed project had one system, dask-kubernetes had another, and so on.

Now we centralize configuration in the dask.config module, which collects configuration from config files, environment variables, and runtime code, and makes it centrally available to all Dask subprojects. A number of Dask subprojects (dask.distributed, dask-kubernetes, and dask-jobqueue) are being co-released at the same time to take advantage of this.

If you were actively using Dask.distributed’s configuration files some things have changed:

The configuration is now namespaced and more heavily nested. Here is an example from the dask.distributed default config file today:

distributed:
  version: 2
  scheduler:
    allowed-failures: 3 # number of retries before a task is considered bad
    work-stealing: True # workers should steal tasks from each other
    worker-ttl: null # like '60s'. Workers must heartbeat faster than this

  worker:
    multiprocessing-method: forkserver
    use-file-locking: True

The default configuration location has moved from ~/.dask/config.yaml to ~/.config/dask/distributed.yaml, where it will live alongside several other files like kubernetes.yaml, jobqueue.yaml, and so on.

However, your old configuration files will still be found and their values will be used appropriately. We don’t make any attempt to migrate your old config values to the new location though. You may want to delete the auto-generated ~/.dask/config.yaml file at some point, if you feel like being particularly clean.

You can learn more about Dask’s configuration in Dask’s configuration documentation.
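As a rough sketch of the runtime-code side of this, assuming a Dask release that includes the dask.config module (0.18.0 or later), values can be set temporarily and read back with dotted keys. The key example.retries below is a made-up name used only for illustration; it is not a real Dask setting.

```python
import dask

# "example.retries" is a hypothetical key, not a real Dask option
with dask.config.set({"example.retries": 3}):
    # Dotted keys address nested values, mirroring the YAML nesting
    inside = dask.config.get("example.retries")

# Outside the context manager the temporary value is gone,
# so we pass a default rather than risk a KeyError
outside = dask.config.get("example.retries", default=None)

print(inside, outside)
```

The same dotted lookup applies to values that come from YAML files or environment variables, which is what lets subprojects share one configuration system.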

Replaced the common get= keyword with scheduler=

Dask can execute code with a variety of scheduler backends based on threads, processes, single-threaded execution, or distributed clusters.

Previously, users selected between these backends using the somewhat generically named get= keyword:

x.compute(get=dask.threaded.get)
x.compute(get=dask.multiprocessing.get)
x.compute(get=dask.local.get_sync)

We’ve replaced this with a newer, and hopefully more clear, scheduler= keyword:

x.compute(scheduler='threads')
x.compute(scheduler='processes')
x.compute(scheduler='single-threaded')

The get= keyword has been deprecated and will raise a warning. It will be removed entirely in the next major release.

For more information, see documentation on selecting different schedulers.

Replaced dask.set_options with dask.config.set

Related to the configuration changes, we now include runtime state in the configuration. Previously people used to set runtime state with the dask.set_options context manager. Now we recommend using dask.config.set:

with dask.set_options(scheduler='threads'):  # Before
    ...

with dask.config.set(scheduler='threads'):  # After
    ...

The dask.set_options function is now an alias to dask.config.set.

Removed the dask.array.learn subpackage

This was unadvertised and saw very little use. All functionality (and much more) is now available in Dask-ML.

Other

We’ve removed the token= keyword from map_blocks and moved the functionality to the name= keyword.
The dask.distributed.worker_client automatically rejoins the threadpool when you close the context manager.
The Dask.distributed protocol now interprets msgpack arrays as tuples rather than lists.

Fun new features

Arrays

Generalized Universal Functions

Dask.array now supports Numpy-style Generalized Universal Functions (gufuncs) transparently. This means that you can apply normal Numpy GUFuncs, like eig in the example below, directly onto Dask arrays:

import dask.array as da
import numpy as np

# Apply a Numpy GUFunc, eig, directly onto a Dask array
x = da.random.normal(size=(10, 10, 10), chunks=(2, 10, 10))
w, v = np.linalg._umath_linalg.eig(x, output_dtypes=(float, float))
# w and v are dask arrays with eig applied along the latter two axes

Numpy has gufuncs of many of its internal functions, but they haven’t yet decided to switch these out to the public API. Additionally we can define GUFuncs with other projects, like Numba:

import numba
from numba import float64  # type used in the signature below

@numba.vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

z = f(x, y)  # if x and y are dask arrays, then z will be too

What I like about this is that Dask and Numba developers didn’t coordinate at all on this feature, it’s just that they both support the Numpy GUFunc protocol, so you get interactions like this for free.

For more information see Dask’s GUFunc documentation. This work was done by Markus Gonser (@magonser).

New “auto” value for rechunking

Dask arrays now accept a value, “auto”, wherever a chunk value would previously be accepted. This asks Dask to rechunk those dimensions to achieve a good default chunk size.

x = x.rechunk({
    0: x.shape[0], # single chunk in this dimension
  # 1: 100e6 / x.dtype.itemsize / x.shape[0],  # before we had to calculate manually
    1: 'auto'      # Now we allow this dimension to respond to get ideal chunk size
})

# or
x = da.from_array(img, chunks='auto')

This also checks the array.chunk-size config value for optimal chunk sizes

>>> dask.config.get('array.chunk-size')
'128MiB'

To be clear, this doesn’t support “automatic chunking”, which is a very hard problem in general. Users still need to be aware of their computations and how they want to chunk, this just makes it marginally easier to make good decisions.
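To show what the "auto" value saves you from, here is a small sketch of the manual calculation from the commented-out line in the rechunk example above. The function name manual_chunk_length and the example array shape are assumptions made for illustration, and the 100e6 byte target is taken from that comment. Dask's own "auto" logic consults the array.chunk-size setting and may choose differently.

```python
def manual_chunk_length(fixed_dim_length, itemsize, target_bytes=100e6):
    """Chunk length along the free dimension for a target chunk size.

    One dimension is kept whole (``fixed_dim_length`` elements), so the
    other dimension gets whatever length keeps a chunk near
    ``target_bytes``.
    """
    return int(target_bytes / itemsize / fixed_dim_length)


# Hypothetical float64 array (8 bytes per element), 10,000 rows kept whole
chunk_len = manual_chunk_length(fixed_dim_length=10_000, itemsize=8)
print(chunk_len)
```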

Algorithmic improvements

Dask.array gained a full einsum implementation thanks to Simon Perkins.

Also, Dask.array’s QR decompositions have become nicer in two ways:

They support short-and-fat arrays
The tall-and-skinny variant now operates more robustly in less memory. Here is a friendly GIF of execution:

This work is greatly appreciated and was done by Jeremy Chen.

Native support for the Zarr format for chunked n-dimensional arrays landed thanks to Martin Durant and John A Kirkham. Zarr has been especially useful due to its speed, simple spec, support of the full NetCDF style conventions, and amenability to cloud storage.

Dataframes and Pandas 0.23

As usual, Dask Dataframes had many small improvements. Of note is continued compatibility with the just-released Pandas 0.23, and some new data ingestion formats.

Dask.dataframe is consistent with changes in the recent Pandas 0.23 release thanks to Tom Augspurger.

Orc support

Dask.dataframe has grown a reader for the Apache ORC format.

Orc is a format for tabular data storage that is common in the Hadoop ecosystem. The new dd.read_orc function parallelizes around similarly new ORC functionality within PyArrow. Thanks to Jim Crist for the work on the Arrow side and Martin Durant for parallelizing it with Dask.

Read_json support

Dask.dataframe has also grown a reader for JSON files.

The dd.read_json function matches most of the pandas.read_json API.

This came about shortly after a recent PyCon 2018 talk comparing Spark and Dask dataframe where Irina Truong mentioned that it was missing. Thanks to Martin Durant and Irina Truong for this contribution.

See the dataframe data ingestion documentation for more information about JSON, ORC, or any of the other formats supported by Dask.dataframe.

Joblib

The Joblib library for parallel computing within Scikit-Learn has had a Dask backend for a while now. While it has always been pretty easy to use, it’s now becoming much easier to use well without much expertise. After using this in practice for a while together with the Scikit-Learn developers, we’ve identified and smoothed over a number of usability issues. These changes will only be fully available after the next Scikit-Learn release (hopefully soon) at which point we’ll probably release a new blogpost dedicated to the topic.

Acknowledgements

Since March 21st, the following people have contributed to the following repositories:

The core Dask repository for parallel algorithms:

Andrethrill
Beomi
Brendan Martin
Christopher Ren
Guido Imperiale
Diane Trout
fjetter
Frederick
Henry Doupe
James Bourbeau
Jeremy Chen
Jim Crist
John A Kirkham
Jon Mease
Jörg Dietrich
Kevin Mader
Ksenia Bobrova
Larsr
Marc Pfister
Markus Gonser
Martin Durant
Matt Lee
Matthew Rocklin
Pierre-Bartet
Scott Sievert
Simon Perkins
Stefan van der Walt
Stephan Hoyer
Tom Augspurger
Uwe L. Korn
Yu Feng

The dask/distributed repository for distributed computing:

Bmaisonn
Grant Jenks
Henry Doupe
Irene Rodriguez
Irina Truong
John A Kirkham
Joseph Atkins-Turkish
Kenneth Koski
Loïc Estève
Marius van Niekerk
Martin Durant
Matthew Rocklin
Olivier Grisel
Russ Bubley
Tom Augspurger
Tony Lorenzo

The dask-kubernetes repository for deploying Dask on Kubernetes

Brendan Martin
J Gerard
Matthew Rocklin
Olivier Grisel
Yuvi Panda

The dask-jobqueue repository for deploying Dask on HPC job schedulers

Guillaume Eynard-Bontemps
jgerardsimcock
Joseph Hamman
Loïc Estève
Matthew Rocklin
Ray Bell
Rich Signell
Shawn Taylor
Spencer Clark

The dask-ml repository for scalable machine learning:

Christopher Ren
Jeremy Chen
Matthew Rocklin
Scott Sievert
Tom Augspurger

Acknowledgements

Thanks to Scott Sievert and James Bourbeau for their help editing this article.

Beyond Numpy Arrays in Python

2018-05-27T00:00:00+00:00

In recent years Python’s array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.

However to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model where packages need to know about each other to an ecosystem model where packages can negotiate by developing and adhering to community-standard protocols.

With moderate effort we can define a subset of the Numpy API that works well across all of them, allowing the ecosystem to more smoothly transition between hardware. This post describes the opportunities and challenges to accomplish this.

We start by discussing two kinds of libraries:

Libraries that implement the Numpy API
Libraries that consume the Numpy API and build new functionality on top of it

Libraries that Implement the Numpy API

The Numpy array is one of the foundations of the numeric Python ecosystem, and serves as the standard model for similar libraries in other languages. Today it is used to analyze satellite and biomedical imagery, financial models, genomes, oceans and the atmosphere, super-computer simulations, and data from thousands of other domains.

However, Numpy was designed several years ago, and its implementation is no longer optimal for some modern hardware, particularly multi-core workstations, many-core GPUs, and distributed clusters.

Fortunately other libraries implement the Numpy array API on these other architectures:

CuPy: implements the Numpy API on GPUs with CUDA
Sparse: implements the Numpy API for sparse arrays that are mostly zeros
Dask array: implements the Numpy API in parallel for multi-core workstations or distributed clusters

So even when the Numpy implementation is no longer ideal, the Numpy API lives on in successor projects.

Note: the Numpy implementation remains ideal most of the time. Dense in-memory arrays are still the common case. This blogpost is about the minority of cases where Numpy is not ideal.

So today we can write similar code across all of Numpy, GPU, sparse, and parallel arrays:

import numpy as np
x = np.random.random(...)  # Runs on a single CPU
y = x.T.dot(np.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5])

import cupy as cp
x = cp.random.random(...)  # Runs on a GPU
y = x.T.dot(cp.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].get())

import dask.array as da
x = da.random.random(...)  # Runs on many CPUs
y = x.T.dot(da.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].compute())

...

Additionally, each of the deep learning frameworks (TensorFlow, PyTorch, MXNet) has a Numpy-like thing that is similar-ish to Numpy’s API, but definitely not trying to be an exact match.

Libraries that consume and extend the Numpy API

At the same time as the development of Numpy APIs for different hardware, many libraries today build algorithmic functionality on top of the Numpy API:

XArray for labeled and indexed collections of arrays
Autograd and Tangent: for automatic differentiation
TensorLy for higher order array factorizations
Dask array which coordinates many Numpy-like arrays into a logical parallel array (Dask array both consumes and implements the Numpy API)
Opt Einsum for more efficient einstein summation operations
…

These projects and more enhance array computing in Python, building on new features beyond what Numpy itself provides.

There are also projects like Pandas, Scikit-Learn, and SciPy, that use Numpy’s in-memory internal representation. We’re going to ignore these libraries for this blogpost and focus on those libraries that only use the high-level Numpy API and not the low-level representation.

Opportunities and Challenges

Given the two groups of projects:

New libraries that implement the Numpy API (CuPy, Sparse, Dask array)
New libraries that consume and extend the Numpy API (XArray, Autograd/tangent, TensorLy, Einsum)

We want to use them together, applying Autograd to CuPy, TensorLy to Sparse, and so on, including all future implementations that might follow. This is challenging.

Unfortunately, while all of the array implementations' APIs are very similar to Numpy's API, each library provides its own distinct functions.

>>> numpy.sin is cupy.sin
False

This creates problems for the consumer libraries, because now they need to switch out which functions they use depending on which array-like objects they’ve been given.

def f(x):
    if isinstance(x, numpy.ndarray):
        return numpy.sin(x)
    elif isinstance(x, cupy.ndarray):
        return cupy.sin(x)
    elif ...

Today each array project implements a custom plugin system that they use to switch between some of the array options. Links to these plugin mechanisms are below if you’re interested:

For example XArray can use either Numpy arrays or Dask arrays. This has been hugely beneficial to users of that project, which today seamlessly transition from small in-memory datasets on their laptops to 100TB datasets on clusters, all using the same programming model. However when considering adding sparse or GPU arrays to XArray’s plugin system, it quickly became clear that this would be expensive today.

Building, maintaining, and extending these plugin mechanisms is costly. The plugin systems in each project are not alike, so any new array implementation has to go to each library and build the same code several times. Similarly, any new algorithmic library must build plugins to every ndarray implementation. Each library has to explicitly import and understand each other library, and has to adapt as those libraries change over time. This coverage is not complete, and so users lack confidence that their applications are portable between hardware.

Pair-wise plugin mechanisms make sense for a single project, but are not an efficient choice for the full ecosystem.

Solutions

I see two solutions today:

Build a new library that holds dispatch-able versions of all of the relevant Numpy functions and convince everyone to use it instead of Numpy internally
Build this dispatch mechanism into Numpy itself

Each has challenges.

Build a new centralized plugin library

We can build a new library, here called arrayish, that holds dispatch-able versions of all of the relevant Numpy functions. We then convince everyone to use it instead of Numpy internally.

So in each array-like library’s codebase we write code like the following:

# inside numpy's codebase
import arrayish
import numpy
arrayish.sin.register(numpy.ndarray, numpy.sin)
arrayish.cos.register(numpy.ndarray, numpy.cos)
arrayish.dot.register(numpy.ndarray, numpy.ndarray, numpy.dot)
...

# inside cupy's codebase
import arrayish
import cupy
arrayish.sin.register(cupy.ndarray, cupy.sin)
arrayish.cos.register(cupy.ndarray, cupy.cos)
arrayish.dot.register(cupy.ndarray, cupy.ndarray, cupy.dot)
...

and so on for Dask, Sparse, and any other Numpy-like libraries.
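To make the registration idea concrete, here is a minimal sketch built on the standard library's functools.singledispatch. It is not the arrayish prototype: the ListArray and TupleArray classes and the free-standing sin function are hypothetical stand-ins for real array libraries and for an arrayish.sin entry point.

```python
import math
from functools import singledispatch


class ListArray:
    """Hypothetical toy array backed by a Python list."""

    def __init__(self, data):
        self.data = list(data)


class TupleArray:
    """Hypothetical second implementation backed by a tuple."""

    def __init__(self, data):
        self.data = tuple(data)


@singledispatch
def sin(x):
    # Fallback when no implementation has been registered for type(x)
    raise TypeError(f"no sin implementation registered for {type(x).__name__}")


def _list_sin(x):
    return ListArray(math.sin(v) for v in x.data)


def _tuple_sin(x):
    return TupleArray(math.sin(v) for v in x.data)


# Each array library registers its own implementation once
sin.register(ListArray, _list_sin)
sin.register(TupleArray, _tuple_sin)

# Consumer code calls the single dispatching function
a = sin(ListArray([0.0, math.pi / 2]))
b = sin(TupleArray([0.0]))
```

Note that singledispatch only looks at the type of the first argument, so a two-argument function like dot would need a multiple-dispatch mechanism instead.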

In all of the algorithm libraries (like XArray, autograd, TensorLy, …) we use arrayish instead of Numpy

# inside XArray's codebase
# import numpy
import arrayish as numpy

This is the same plugin solution as before, but now we build a community standard plugin system that hopefully all of the projects can agree to use.

This reduces the big n by m cost of maintaining several plugin systems, to a more manageable n plus m cost of using a single plugin system in each library. This centralized project would also, perhaps, be better maintained than any individual project's plugin system is likely to be on its own.

However this has costs:

Getting many different projects to agree on a new standard is hard
Algorithmic projects will need to start using arrayish internally, adding new imports like the following:
```
import arrayish as numpy
```
And this will certainly cause some complications internally
Someone needs to build and maintain the central infrastructure

Hameer Abbasi put together a rudimentary prototype for arrayish here: github.com/hameerabbasi/arrayish. There has been some discussion about this topic, using XArray+Sparse as an example, in pydata/sparse #1.

Dispatch from within Numpy

Alternatively, the central dispatching mechanism could live within Numpy itself.

Numpy functions could learn to hand control over to their arguments, allowing the array implementations to take over when possible. This would allow existing Numpy code to work on externally developed array implementations.

There is precedent for this. The __array_ufunc__ protocol allows any class that defines the __array_ufunc__ method to take control of any Numpy ufunc like np.sin or np.exp. Numpy reductions like np.sum already look for .sum methods on their arguments and defer to them if possible.

Some array projects, like Dask and Sparse, already implement the __array_ufunc__ protocol. There is also an open PR for CuPy. Here is an example showing Numpy functions working cleanly on Dask arrays.

>>> import numpy as np
>>> import dask.array as da

>>> x = da.ones(10, chunks=(5,))  # A Dask array

>>> np.sum(np.exp(x))             # Apply Numpy function to a Dask array
dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>  # get a Dask array

I recommend that all Numpy-API compatible array projects implement the __array_ufunc__ protocol.
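For array authors, implementing the protocol might look roughly like the sketch below. LoggedArray is a hypothetical wrapper invented for illustration (it is not part of Numpy, Dask, or Sparse); it only handles plain ufunc calls and ignores keyword arguments such as out, which a real implementation would need to treat with more care.

```python
import numpy as np


class LoggedArray:
    """Hypothetical wrapper that records which ufuncs were applied to it."""

    def __init__(self, data, log=None):
        self.data = np.asarray(data)
        self.log = list(log or [])

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only handle direct calls like np.sin(x); decline reduce, accumulate, etc.
        if method != "__call__":
            return NotImplemented

        # Unwrap our own type and collect any history carried by the inputs
        raw = []
        log = []
        for x in inputs:
            if isinstance(x, LoggedArray):
                raw.append(x.data)
                log.extend(x.log)
            else:
                raw.append(x)

        # Run the ufunc on plain Numpy arrays, then re-wrap the result
        result = ufunc(*raw, **kwargs)
        return LoggedArray(result, log + [ufunc.__name__])


x = LoggedArray([0.0, 1.0])
y = np.exp(np.sin(x))  # Numpy hands control to LoggedArray.__array_ufunc__
```

Here y is again a LoggedArray, so unmodified Numpy-calling code keeps working with the new array type.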

This works for many functions, but not all. Other operations like tensordot, concatenate, and stack occur frequently in algorithmic code but are not covered here.

This solution avoids the community challenges of the arrayish solution above. Everyone is accustomed to aligning themselves to Numpy’s decisions, and relatively little code would need to be rewritten.

The challenge with this approach is that historically Numpy has moved more slowly than the rest of the ecosystem. For example the __array_ufunc__ protocol mentioned above was discussed for several years before it was merged. Fortunately Numpy has recently received funding to help it make changes like this more rapidly. The full time developers hired under this funding have just started though, and it’s not clear how much of a priority this work is for them at first.

For what it’s worth I’d prefer to see this Numpy protocol solution take hold.

Final Thoughts

The community has done this transition before (Numeric + Numarray -> Numpy, the Scikit-Learn fit/predict API, etc.), usually with surprisingly positive results.

The open questions I have today are the following:

How quickly can Numpy adapt to this demand for protocols while still remaining stable for its existing role as foundation of the ecosystem?
What algorithmic domains can be written in a cross-hardware way that depends only on the high-level Numpy API and doesn’t require specialization at the data structure level? Clearly some domains exist (XArray, automatic differentiation), but how common are these?
Once a standard protocol is in place, what other array-like implementations might arise? In-memory compression? Probabilistic? Symbolic?

Update

After discussing this topic at the May NumPy Developer Sprint at BIDS a few of us have drafted a Numpy Enhancement Proposal (NEP) available here.

Dask Release 0.17.2

2018-03-21T00:00:00+00:00

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.2. This is a minor release with new features and stability improvements. This blogpost outlines notable changes since the 0.17.0 release on February 12th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Some notable changes follow:

Tornado is a popular framework for concurrent network programming that Dask relies on heavily. Tornado recently released a major version update that included both some major features for Dask as well as a couple of bugs.

The new IOStream.read_into method allows Dask communications (or anyone using this API) to move large datasets more efficiently over the network with fewer copies. This enables Dask to take advantage of high performance networking available on modern super-computers. On the Cheyenne system, where we tested this, we were able to get the full 3GB/s bandwidth available through the Infiniband network with this change (when using a few worker processes).
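The benefit of reading into an existing buffer can be illustrated with the standard library alone. The sketch below is not Tornado's API: it uses io.BytesIO as a stand-in for a network stream and an arbitrary chunk size of 256 bytes, and it fills one pre-allocated bytearray in place rather than building many intermediate bytes objects and joining them.

```python
import io

# Stand-in for a network stream that delivers 1000 bytes
source = io.BytesIO(b"x" * 1000)

# Allocate the destination once; a memoryview lets us slice it without copying
buf = bytearray(1000)
view = memoryview(buf)

n_read = 0
while n_read < len(buf):
    # readinto writes directly into the given slice of the destination buffer
    n = source.readinto(view[n_read:n_read + 256])
    if n == 0:
        break  # stream ended early
    n_read += n
```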

Many thanks to Antoine Pitrou and Ben Darnell for their efforts on this.

At the same time there were some unforeseen issues in the update to Tornado 5.0. More pervasive use of bytearrays over bytes caused issues with compression libraries like Snappy and Python 2 that were not expecting these types. There is a brief window in distributed.__version__ == 1.21.3 that enables this functionality if Tornado 5.0 is present but will misbehave if Snappy is also present.

HTTP File System

Dask leverages a file-system-like protocol for access to remote data. This is what makes commands like the following work:

import dask.dataframe as dd

df = dd.read_parquet('s3://...')
df = dd.read_parquet('hdfs://...')
df = dd.read_parquet('gcs://...')

We have now added http and https file systems for reading data directly from web servers. These also support random access if the web server supports range queries.

df = dd.read_parquet('https://...')

As with S3, HDFS, GCS, … you can also use these tools outside of Dask development. Here we read the first twenty bytes of the Pandas license:

from dask.bytes.http import HTTPFileSystem
http = HTTPFileSystem()
with http.open('https://raw.githubusercontent.com/pandas-dev/pandas/master/LICENSE') as f:
    print(f.read(20))

b'BSD 3-Clause License'
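The random access mentioned above relies on HTTP range requests. As a rough illustration using only the standard library (this is not how HTTPFileSystem is implemented), the following builds a request for the same first twenty bytes without actually sending it:

```python
from urllib.request import Request

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/LICENSE"

# Byte ranges are inclusive, so the first twenty bytes are 0 through 19
request = Request(url, headers={"Range": "bytes=0-19"})
range_header = request.get_header("Range")
```

A server that supports range requests answers with status 206 (Partial Content) and only the requested bytes; one that does not will typically return the whole file with status 200.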

Thanks to Martin Durant who did this work and manages Dask’s byte handling generally. See remote data documentation for more information.

Fixed a correctness bug in Dask dataframe’s shuffle

We identified and resolved a correctness bug in dask.dataframe’s shuffle that resulted in some rows being dropped during complex operations like joins and groupby-applies with many partitions.

See dask/dask #3201 for more information.

Cluster super-class and intelligent adaptive deployments

There are many Python subprojects that help you deploy Dask on different cluster resource managers like Yarn, SGE, Kubernetes, PBS, and more. These have all converged to have more-or-less the same API that we have now combined into a consistent interface that downstream projects can inherit from in distributed.deploy.Cluster.

Now that we have a consistent interface we have started to invest more in improving the interface and intelligence of these systems as a group. This includes both pleasant IPython widgets like the following:

as well as improved logic around adaptive deployments. Adaptive deployments allow clusters to scale themselves automatically based on current workload. If you have recently submitted a lot of work the scheduler will estimate its duration and ask for an appropriate number of workers to finish the computation quickly. When the computation has finished the scheduler will release the workers back to the system to free up resources.

The logic here has improved substantially including the following:

You can specify minimum and maximum limits on your adaptivity
The scheduler estimates computation duration and asks for workers appropriately
There is some additional delay in giving back workers to avoid hysteresis, or cases where we repeatedly ask for and return workers
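The three points above can be sketched in plain Python. This is an illustration of the idea only, not the logic in dask/distributed: the function and class names, the target duration, and the rule of waiting for a fixed number of consecutive checks before scaling down are all assumptions made for the example.

```python
import math


def desired_workers(pending_seconds, target_seconds, minimum, maximum):
    """Workers needed to finish the estimated work within the target time,
    clipped to the user-specified minimum and maximum."""
    wanted = math.ceil(pending_seconds / target_seconds)
    return max(minimum, min(maximum, wanted))


class ScaleDownDelay:
    """Scale up immediately, but release workers only after the lower
    target has persisted for `wait_count` consecutive checks."""

    def __init__(self, wait_count=3):
        self.wait_count = wait_count
        self.low_streak = 0

    def decide(self, current, wanted):
        if wanted >= current:
            self.low_streak = 0
            return wanted
        self.low_streak += 1
        if self.low_streak >= self.wait_count:
            self.low_streak = 0
            return wanted
        return current  # keep workers for now to avoid flapping


policy = ScaleDownDelay(wait_count=3)
decisions = [policy.decide(current=10, wanted=2) for _ in range(3)]
```

With these settings the first two checks keep all ten workers and only the third releases them, which avoids repeatedly requesting and returning workers when the load fluctuates.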

Acknowledgements

The following people contributed to the dask/dask repository since the 0.17.0 release on February 12th:

Anderson Banihirwe
Dan Collins
Dieter Weber
Gabriele Lanaro
John Kirkham
James Bourbeau
Julien Lhermitte
Matthew Rocklin
Martin Durant
Max Epstein
nkhadka
okkez
Pangeran Bottor
Rich Postelnik
Scott M. Edenbaum
Simon Perkins
Thrasibule
Tom Augspurger
Tor E Hagemann
Uwe L. Korn
Wes Roach

The following people contributed to the dask/distributed repository since the 1.21.0 release on February 12th:

Alexander Ford
Andy Jones
Antoine Pitrou
Brett Naul
Joe Hamman
John Kirkham
Loïc Estève
Matthew Rocklin
Matti Lyra
Sven Kreiss
Thrasibule
Tom Augspurger

Craft Minimal Bug Reports

2018-02-28T00:00:00+00:00

Following up on a post on supporting users in open source, this post lists some suggestions on how to ask a maintainer to help you with a problem.

You don’t have to follow these suggestions. They are optional. They make it more likely that a project maintainer will spend time helping you. It’s important to remember that their willingness to support you for free is optional too.

Crafting minimal bug reports is essential for the life and maintenance of community-driven open source projects. Doing this well is an incredible service to the community.

I strongly recommend following Stack Overflow’s guidelines on Minimal Complete Verifiable Examples. I’ll include brief highlights here:

… code should be …

Minimal – Use as little code as possible that still produces the same problem

Complete – Provide all parts needed to reproduce the problem

Verifiable – Test the code you’re about to provide to make sure it reproduces the problem

Let’s be clear, this is hard and takes time.

As a question-asker I find that creating an MCVE often takes 10-30 minutes for a simple problem. Fortunately this work is usually straightforward, even if I don’t know very much about the package I’m having trouble with. Most of the work to create a minimal example is about removing all of the code that was specific to my application, and as the question-asker I am probably the most qualified person to do that.

When answering questions I often point people to StackOverflow’s MCVE document. They sometimes come back with a better-but-not-yet-minimal example. This post clarifies a few common issues.

As a running example I’m going to use Pandas dataframe problems.

Don’t post data

You shouldn’t post the file that you’re working with. Instead, try to see if you can reproduce the problem with just a few lines of data rather than the whole thing.

Having to download a file, unzip it, etc. makes it much less likely that someone will actually run your example in their free time.

Don’t

I’ve uploaded my data to Dropbox and you can get it here: my-data.csv.gz

import pandas as pd
df = pd.read_csv('my-data.csv.gz')

Do

You should be able to copy-paste the following to get enough of my data to cause the problem:

import pandas as pd
df = pd.DataFrame({'account-start': ['2017-02-03', '2017-03-03', '2017-01-01'],
                   'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
                   'balance': [-1432.32, 10.43, 30000.00],
                   'db-id': [1234, 2424, 251],
                   'proxy-id': [525, 1525, 2542],
                   'rank': [52, 525, 32],
                   ...
                   })

Actually don’t include your data at all

Actually, your data probably has lots of information that is very specific to your application. Your eyes gloss over it but a maintainer doesn’t know what is relevant and what isn’t, so it will take them time to digest it if you include it. Instead see if you can reproduce your same failure with artificial or random data.

Don’t

Here is enough of my data to reproduce the problem

import pandas as pd
df = pd.DataFrame({'account-start': ['2017-02-03', '2017-03-03', '2017-01-01'],
                   'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
                   'balance': [-1432.32, 10.43, 30000.00],
                   'db-id': [1234, 2424, 251],
                   'proxy-id': [525, 1525, 2542],
                   'rank': [52, 525, 32],
                   ...
                   })

Do

My actual problem is about finding the best ranked employee over a certain time period, but we can reproduce the problem with this simpler dataset. Notice that the dates are out of order in this data (2000-01-02 comes after 2000-01-03). I found that this was critical to reproducing the error.

import pandas as pd
df = pd.DataFrame({'account-start': ['2000-01-01', '2000-01-03', '2000-01-02'],
                   'db-id': [1, 2, 3],
                   'name': ['Alice', 'Bob', 'Charlie'})

As we shrink down our example problem we often discover a lot about what causes the problem. This discovery is valuable and something that only the question-asker is capable of doing efficiently.

See how small you can make things

To make it even easier, see how small you can make your data. For example if working with tabular data (like Pandas), then how many columns do you actually need to reproduce the failure? How many rows do you actually need to reproduce the failure? Do the columns need to be named as you have them now or could they be just “A” and “B” or descriptive of the types within?

Do

import pandas as pd
df = pd.DataFrame({'datetime': ['2000-01-03', '2000-01-02'],
                   'id': [1, 2]})

Remove unnecessary steps

Is every line in your example absolutely necessary to reproduce the error? If you’re able to delete a line of code then please do. Because you already understand your problem you are much more efficient at doing this than the maintainer is. They probably know more about the tool, but you know more about your code.

Don’t

The groupby step below is raising a warning that I don’t understand

df = pd.DataFrame(...)

df = df[df.value > 0]
df = df.fillna(0)

df.groupby(df.x).y.mean()  # <-- this produces the error

Do

The groupby step below is raising a warning that I don’t understand

df = pd.DataFrame(...)

df.groupby(df.x).y.mean()  # <-- this produces the error

Use Syntax Highlighting

When using Github you can enclose code blocks in triple-backticks (the character on the top-left of your keyboard on US-standard QWERTY keyboards). It looks like this:

```python
x = 1
```

Provide complete tracebacks

You know all of that stuff between your code and the exception that is hard to make sense of? You should include it.

Don’t

I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

Do

I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

```python-traceback
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-4-7b96263abbfa> in <module>()
----> 1 div(1, 0)

<ipython-input-3-7685f97b4ce5> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
      3

ZeroDivisionError: division by zero
```

If the traceback is long that’s ok. If you really want to be clean you can put it in <details> brackets.

I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

### Traceback

<details>

```python-traceback
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-4-7b96263abbfa> in <module>()
----> 1 div(1, 0)

<ipython-input-3-7685f97b4ce5> in div(x, y)
      1 def div(x, y):
----> 2     return x / y
      3

ZeroDivisionError: division by zero
```

</details>
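If the failure happens inside a script rather than an interactive session, one possible way to capture the complete traceback as text for pasting into a report is the standard library's traceback module. This is only a sketch; the report variable name is arbitrary.

```python
import traceback


def div(x, y):
    return x / y


try:
    div(1, 0)
except ZeroDivisionError:
    # format_exc returns the full traceback of the exception being handled
    report = traceback.format_exc()

print(report)
```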

Ask Questions in Public Places

When raising issues you often have a few possible locations:

GitHub issue tracker
Stack Overflow
Project mailing list
Project Chat room
E-mail maintainers directly (never do this)

Different projects handle this differently, but they usually have a page on their documentation about where to go for help. This is often labeled “Community”, “Support” or “Where to ask for help”. Here are the recommendations from the Pandas community.

Generally it’s good to ask questions where many maintainers can see your question and help, and where other users can find your question and answer if they encounter a similar bug in the future.

While your goal may be to solve your problem, the maintainer’s goal is likely to create a record of how to solve problems like yours. This helps many more users who will have a similar problem in the future, see your well-crafted bug report, and learn from the resulting conversation.

My personal preferences

For user questions like “What is the right way to do X?” I prefer Stack Overflow.
For bug reports like “I did X, I’m pretty confident that it should work, but I get this error” I prefer GitHub issues.
For general chit-chat I prefer Gitter, though actually, I personally spend almost no time in Gitter because it isn’t easily searchable by future users. If you’ve asked me a question in Gitter I will almost certainly not respond to it, except to direct you to GitHub, Stack Overflow, or this blogpost.
I only like personal e-mail if someone is proposing to fund or seriously support the project in some way.

But again, different projects do this differently and have different policies. You should check the documentation of the project you’re dealing with to learn how they like to support users.

Dask Release 0.17.0

2018-02-12T00:00:00+00:00

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.17.0. This is a significant major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.16.0 release on November 21st.

You can conda install Dask:

conda install dask -c conda-forge

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Some notable changes follow.

Removed dask.dataframe.rolling_* methods, which were previously deprecated both in dask.dataframe and in pandas. These are replaced with the rolling.* namespace.
We’ve generally stopped maintenance of the dask-ec2 project to launch Dask clusters on Amazon’s EC2 using Salt. We generally recommend Kubernetes instead, both for Amazon’s EC2 and for Google and Azure as well.

dask.pydata.org/en/latest/setup/kubernetes.html
Internal state of the distributed scheduler has changed significantly. This may affect advanced users who were inspecting this state for debugging or diagnostics.

Task Ordering

As Dask encounters more complex problems from more domains we continually run into problems where its current heuristics do not perform optimally. This release includes a rewrite of our static task prioritization heuristics. This will improve Dask’s ability to traverse complex computations in a way that keeps memory use low.

To aid debugging we also integrated these heuristics into the GraphViz-style plots that come from the visualize method.

x = da.random.random(...)
...
x.visualize(color='order', cmap='RdBu')

Nested Joblib

Dask supports parallelizing Scikit-Learn by extending Scikit-Learn’s underlying library for parallelism, Joblib. This allows Dask to distribute some SKLearn algorithms across a cluster just by wrapping them with a context manager.

This relationship has been strengthened, with particular attention paid to nesting one parallel computation within another, as occurs when you train a parallel estimator, like RandomForest, within another parallel computation, like GridSearchCV. Previously this would result in spawning too many threads/processes and generally oversubscribing hardware.

Due to recent combined development within both Joblib and Dask, these sorts of situations can now be resolved efficiently by handing them off to Dask, providing speedups even in single-machine cases:

from sklearn.externals import joblib
import distributed.joblib  # register the dask joblib backend

from dask.distributed import Client
client = Client()

est = ParallelEstimator()
gs = GridSearchCV(est)

with joblib.parallel_backend('dask'):
    gs.fit()

See Tom Augspurger’s recent post with more details about this work:

Thanks to Tom Augspurger, Jim Crist, and Olivier Grisel who did most of this work.

Scheduler Internal Refactor

The distributed scheduler has been significantly refactored to change it from a forest of dictionaries:

priority = {'a': 1, 'b': 2, 'c': 3}
dependencies = {'a': {'b'}, 'b': {'c'}, 'c': []}
nbytes = {'a': 1000, 'b': 1000, 'c': 28}

To a bunch of objects:

tasks = {'a': Task('a', priority=1, nbytes=1000, dependencies=...),
         'b': Task('b', priority=2, nbytes=1000, dependencies=...),
         'c': Task('c', priority=3, nbytes=28, dependencies=[])}

(there is much more state than what is listed above, but hopefully the examples above are clear.)
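As a rough picture of what such an object could look like, here is a hypothetical Task class. It is not the class used in dask/distributed; the attribute set is limited to the three fields shown above, and the use of __slots__ is an assumption made to keep per-object overhead small.

```python
class Task:
    """Hypothetical task record: one object holds what used to be
    spread across several dictionaries keyed by task name."""

    __slots__ = ("key", "priority", "nbytes", "dependencies")

    def __init__(self, key, priority=0, nbytes=0, dependencies=()):
        self.key = key
        self.priority = priority
        self.nbytes = nbytes
        # Dependencies point at other Task objects rather than at their keys
        self.dependencies = set(dependencies)


c = Task('c', priority=3, nbytes=28)
b = Task('b', priority=2, nbytes=1000, dependencies=[c])
a = Task('a', priority=1, nbytes=1000, dependencies=[b])

tasks = {t.key: t for t in (a, b, c)}

# Related state is reached by attribute access instead of parallel lookups
bytes_needed_by_a = sum(dep.nbytes for dep in tasks['a'].dependencies)
```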

There were a few motivations for this:

We wanted to try out Cython and PyPy, for which objects like this might be more effective than dictionaries.
We believe that this is probably a bit easier for developers new to the schedulers to understand. The proliferation of state dictionaries was not highly discoverable.

Goal one ended up not working out. We have not yet been able to make the scheduler significantly faster under Cython or PyPy with this new layout. There is even a slight memory increase with these changes. However we have been happy with the results in code readability, and we hope that others find this useful as well.

Thanks to Antoine Pitrou, who did most of the work here.

User Priorities

You can now submit tasks with different priorities.

x = client.submit(f, 1, priority=10)   # Higher priority preferred
y = client.submit(f, 1, priority=-10)  # Lower priority happens later

To be clear, Dask has always had priorities; they just weren’t easily user-settable. Higher priorities are given precedence. The default priority for all tasks is zero. You can also submit priorities for collections (like arrays and dataframes).

df = df.persist(priority=5)  # give this computation higher priority.
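The ordering rule (higher priority first, default of zero) can be pictured with a small priority queue. This is a toy sketch using the standard library's heapq, not the scheduler's actual data structure; the class name and the tie-breaking by submission order are assumptions.

```python
import heapq
import itertools


class PriorityQueueSketch:
    """Toy queue: higher priority first, ties broken by submission order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name, priority=0):
        # heapq is a min-heap, so negate the priority to pop the largest first
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = PriorityQueueSketch()
queue.submit("low", priority=-10)
queue.submit("default")
queue.submit("high", priority=10)

order = [queue.pop() for _ in range(3)]
```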

dask/distributed #1651

Acknowledgements

The following people contributed to the dask/dask repository since the 0.16.0 release on November 14th:

Albert DeFusco
Apostolos Vlachopoulos
castalheiro
James Bourbeau
Jon Mease
Ian Hopkinson
Jakub Nowacki
Jim Crist
John A Kirkham
Joseph Lin
Keisuke Fujii
Martijn Arts
Martin Durant
Matthew Rocklin
Markus Gonser
Nir
Rich Signell
Roman Yurchak
S. Andrew Sheppard
sephib
Stephan Hoyer
Tom Augspurger
Uwe L. Korn
Wei Ji
Xander Johnson

The following people contributed to the dask/distributed repository since the 1.20.0 release on November 14th:

Alexander Ford
Antoine Pitrou
Brett Naul
Brian Broll
Bruce Merry
Cornelius Riemenschneider
Daniel Li
Jim Crist
Kelvin Yang
Matthew Rocklin
Min RK
rqx
Russ Bubley
Scott Sievert
Tom Augspurger
Xander Johnson

Credit Modeling with Dask

2018-02-09T00:00:00+00:00

This post explores a real-world use case calculating complex credit models in Python using Dask. It is an example of a complex parallel system that is well outside of the traditional “big data” workloads.

Hi All,

This is a guest post from Rich Postelnik, an Anaconda employee who works with a large retail bank on their credit modeling system. They’re doing interesting work with Dask to manage complex computations (see task graph below). This is a nice example of using Dask for complex problems that are neither a big dataframe nor a big array, but are still highly parallel. Rich was kind enough to write up this description of their problem and share it here.

Thanks Rich!

This is cross-posted at Anaconda’s Developer Blog.

P.S. If others have similar solutions and would like to share them I’d love to host those on this blog as well.

The Problem

When someone applies for a loan, like a credit card, mortgage, auto loan, etc., we want to estimate the likelihood of default and the profit (or loss) to be gained. Those models are composed of a complex set of equations that depend on each other. There can be hundreds of equations each of which could have up to 20 inputs and yield 20 outputs. That is a lot of information to keep track of! We want to avoid manually keeping track of the dependencies, as well as messy code like the following Python function:

def final_equation(inputs):
    out1 = equation1(inputs)
    out2_1, out2_2, out2_3 = equation2(inputs, out1)
    out3_1, out3_2 = equation3(out2_3, out1)
    ...
    out_final = equation_n(inputs, out,...)
    return out_final

This boils down to a dependency and ordering problem known as task scheduling.
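To see the ordering problem in isolation, the dependencies of the first three equations above can be written down as a mapping and sorted with the standard library's graphlib module (Python 3.9 or later). The node names simply mirror the function names in the snippet; this is an illustration of the problem, not of how Dask schedules work.

```python
from graphlib import TopologicalSorter

# Each key depends on the names in its set, mirroring the function above
dependencies = {
    "equation1": {"inputs"},
    "equation2": {"inputs", "equation1"},
    "equation3": {"equation1", "equation2"},
}

# static_order yields every node after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

With hundreds of equations, maintaining such an ordering by hand is exactly the bookkeeping we would like a library to do for us.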

DAGs to the rescue

A directed acyclic graph (DAG) is commonly used to solve task scheduling problems. Dask is a library for delayed task computation that makes use of directed graphs at its core. dask.delayed is a simple decorator that turns a Python function into a graph vertex. If I pass the output from one delayed function as a parameter to another delayed function, Dask creates a directed edge between them. Let’s look at an example:

def add(x, y):
    return x + y

>>> add(2, 2)
4

So here we have a function to add two numbers together. Let’s see what happens when we wrap it with dask.delayed:

>>> import dask
>>> add = dask.delayed(add)
>>> left = add(1, 1)
>>> left
Delayed('add-f6204fac-b067-40aa-9d6a-639fc719c3ce')

add now returns a Delayed object. We can pass this as an argument back into our dask.delayed function to start building out a chain of computation.

>>> right = add(1, 1)
>>> four = add(left, right)
>>> four.compute()
4

>>> four.visualize()

Below we can see how the DAG starts to come together.
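
To make the vertex-and-edge idea concrete, here is a toy sketch in plain Python. It is not how Dask is implemented: the Node class and its edges and compute methods are names I made up for illustration, and it does no caching or parallel scheduling. It only shows that passing one deferred call into another is enough to record a graph that can be evaluated later.

```python
import operator


class Node:
    """A toy stand-in for a delayed value: a function plus its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def edges(self):
        # Every argument that is itself a Node is an incoming edge.
        return [a for a in self.args if isinstance(a, Node)]

    def compute(self):
        # Evaluate dependencies first, then apply this vertex's function.
        values = [a.compute() if isinstance(a, Node) else a for a in self.args]
        return self.func(*values)


# Nothing is added yet; we are only recording what to do.
left = Node(operator.add, 1, 1)
right = Node(operator.add, 1, 1)
four = Node(operator.add, left, right)

print(len(four.edges()))  # incoming edges of the final vertex
print(four.compute())     # work happens only here
```

A real scheduler like Dask's also deduplicates shared work and runs independent vertices in parallel, which this sketch deliberately leaves out.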

Mock credit example

Let’s assume I’m a mortgage bank and have 10 people applying for a mortgage. I want to estimate the group’s average likelihood to default based on years of credit history and income.

hist_yrs = range(10)
incomes = range(10)

Let’s also assume that default is a function of the square of the incremented years of credit history plus half the income. While this could be written like:

def default(hist, income):
    return (hist + 1) ** 2 + (income / 2)

I know that in the future I will need the incremented history for another calculation, and I want to be able to reuse the code as well as avoid doing the computation twice. Instead, I can break those functions out:

from dask import delayed

@delayed
def increment(x):
    return x + 1

@delayed
def halve(y):
    return y / 2

@delayed
def default(hist, income):
    return hist**2 + income

Note how I wrapped the functions with delayed. Now instead of returning a number these functions will return a Delayed object. Even better is that these functions can also take Delayed objects as inputs. It is this passing of Delayed objects as inputs to other delayed functions that allows Dask to construct the task graph. I can now call these functions on my data in the style of normal Python code:

inc_hist = [increment(n) for n in hist_yrs]
halved_income = [halve(n) for n in incomes]
estimated_defaults = [default(hist, income) for hist, income in zip(inc_hist, halved_income)]

If you look at these variables, you will see that nothing has actually been calculated yet. They are all lists of Delayed objects.

Now, to get the average, I could just take the sum of estimated_defaults, but I want this to scale (and make a more interesting graph), so let’s do a merge-style reduction.

@delayed
def agg(x, y):
    return x + y

def merge(seq):
    if len(seq) < 2:
        return seq
    middle = len(seq)//2
    left = merge(seq[:middle])
    right = merge(seq[middle:])
    if not right:
        return left
    return [agg(left[0], right[0])]

default_sum = merge(estimated_defaults)

At this point default_sum is a list of length 1, and its first element is the sum of the estimated defaults for all applicants. To get the average, we divide by the number of applicants and call compute:

avg_default = default_sum[0] / 10
avg_default.compute()  # 40.75
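
As a sanity check on that number, here is the same arithmetic in plain Python without Dask. The tree_sum helper is a name I introduced for this sketch; it mirrors the pairwise splitting that merge does above, but adds numbers immediately instead of building delayed tasks.

```python
def tree_sum(seq):
    """Pairwise (merge-style) sum of a non-empty list of numbers."""
    if len(seq) == 1:
        return seq[0]
    middle = len(seq) // 2
    # Sum each half independently, then combine the two partial sums.
    return tree_sum(seq[:middle]) + tree_sum(seq[middle:])


hist_yrs = range(10)
incomes = range(10)

# Same formula as the delayed functions: (hist + 1) ** 2 + income / 2
defaults = [(h + 1) ** 2 + i / 2 for h, i in zip(hist_yrs, incomes)]

total = tree_sum(defaults)
print(total)       # 407.5
print(total / 10)  # 40.75
```

The squares 1 through 100 sum to 385 and the halved incomes sum to 22.5, which gives 407.5 in total and 40.75 on average.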

To see the computation graph that Dask will use, we call visualize:

avg_default.visualize()

And that is how Dask can be used to construct a complex system of equations with reusable intermediary calculations.

How we used Dask in practice

For our credit modeling problem, we used Dask to make a custom data structure to represent the individual equations. Using the default example above, this looked something like the following:

class Default(Equation):
    inputs = ['inc_hist', 'halved_income']
    outputs = ['defaults']

    @delayed
    def equation(self, inc_hist, halved_income, **kwargs):
        return inc_hist**2 + halved_income

This allows us to write each equation as its own isolated function and mark its inputs and outputs. With this set of equation objects, we can determine the order of computation (with a topological sort) and let Dask handle the graph generation and computation. This eliminates the onerous task of manually passing around the arguments in the code base. Below is an example task graph for one particular model that the bank actually runs.
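
The ordering step can be sketched with the standard library alone. This is only an illustration under assumptions, not our production code: the equations dictionary and the Increment and Halve entries are hypothetical stand-ins for Equation subclasses, and I use graphlib.TopologicalSorter (Python 3.9+) for the sort.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical equation declarations: name -> (inputs, outputs).
equations = {
    "Default": (["inc_hist", "halved_income"], ["defaults"]),
    "Increment": (["hist_yrs"], ["inc_hist"]),
    "Halve": (["incomes"], ["halved_income"]),
}

# Map every output name to the equation that produces it.
producer = {out: name for name, (_, outs) in equations.items() for out in outs}

# An equation depends on whichever equations produce its inputs.
# Inputs with no producer (raw data) add no dependency.
dependencies = {
    name: {producer[i] for i in ins if i in producer}
    for name, (ins, _) in equations.items()
}

# static_order() yields each equation after everything it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because each equation only names its inputs and outputs, adding a new equation means adding one declaration; the order is recomputed rather than maintained by hand.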

This graph was a bit too large to render with the normal my_task.visualize() method, so instead we rendered it with Gephi to make the pretty colored graph above. The chaotic upper region of this graph is the individual equation calculations. Zooming in we can see the entry point, our input pandas DataFrame, as the large orange circle at the top and how it gets fed into many of the equations.

The output of the model is about 100 times the size of the input so we do some aggregation at the end via tree reduction. This accounts for the more structured bottom half of the graph. The large green node at the bottom is our final output.

Final Thoughts

With our Dask-based data structure, we spend more of our time writing model code and less maintaining the engine itself. This allows a clean separation between the analysts who design and write our models and the computational system that runs them. Dask also offers a number of advantages not covered above. For example, with Dask you also get access to diagnostics such as time spent running each task and resources used. You can also distribute your computation with dask distributed with relative ease. Now if we want to run our model across larger-than-memory data or on a distributed cluster, we don’t have to worry about rewriting our code to incorporate something like Spark. Finally, Dask allows you to give pandas-capable business analysts or less technical folks access to large datasets with the dask dataframe.

Full Example

from dask import delayed


@delayed
def increment(x):
    return x + 1


@delayed
def halve(y):
    return y / 2


@delayed
def default(hist, income):
    return hist**2 + income


@delayed
def agg(x, y):
    return x + y


def merge(seq):
    if len(seq) < 2:
        return seq
    middle = len(seq)//2
    left = merge(seq[:middle])
    right = merge(seq[middle:])
    if not right:
        return left
    return [agg(left[0], right[0])]


hist_yrs = range(10)
incomes = range(10)
inc_hist = [increment(n) for n in hist_yrs]
halved_income = [halve(n) for n in incomes]
estimated_defaults = [default(hist, income) for hist, income in zip(inc_hist, halved_income)]
default_sum = merge(estimated_defaults)
avg_default = default_sum[0] / 10
avg_default.compute()
avg_default.visualize()  # requires graphviz and python-graphviz to be installed

Acknowledgements

Special thanks to Matt Rocklin, Michael Grant, Gus Cavanagh, and Rory Merritt for their feedback when writing this article.

Pangeo: JupyterHub, Dask, and XArray on the Cloud

2018-01-22T00:00:00+00:00

This work is supported by Anaconda Inc, the NSF EarthCube program, and UC Berkeley BIDS

A few weeks ago a few of us stood up pangeo.pydata.org, an experimental deployment of JupyterHub, Dask, and XArray on Google Container Engine (GKE) to support atmospheric and oceanographic data analysis on large datasets. This follows on recent work to deploy Dask and XArray for the same workloads on supercomputers. This system is a proof of concept that has taught us a great deal about how to move forward. This blogpost briefly describes the problem and the system, then the collaboration, and finally discusses a number of challenges that we’ll be working on in coming months.

Atmospheric and oceanographic sciences collect (with satellites) and generate (with simulations) large datasets that they would like to analyze with distributed systems. Libraries like Dask and XArray already solve this problem computationally if scientists have their own clusters, but we seek to expand access by deploying on cloud-based systems. We build a system to which people can log in, get Jupyter Notebooks, and launch Dask clusters without much hassle. We hope that this increases access, and connects more scientists with more cloud-based datasets.

The System

We integrate several pre-existing technologies to build a system where people can log in, get access to a Jupyter notebook, launch distributed compute clusters using Dask, and analyze large datasets stored in the cloud. They have a full user environment available to them through a website, can leverage thousands of cores for computation, and use existing APIs and workflows that look familiar to how they work on their laptop.

A video walk-through follows below:

We assembled this system from a number of pieces and technologies:

JupyterHub: Provides both the ability to launch single-user notebook servers and handles user management for us. In particular we use the KubeSpawner and the excellent documentation at Zero to JupyterHub, which we recommend to anyone interested in this area.
KubeSpawner: A JupyterHub spawner that makes it easy to launch single-user notebook servers on Kubernetes systems
JupyterLab: The newer version of the classic notebook, which we use to provide a richer remote user interface, complete with terminals, file management, and more.
XArray: Provides computation on NetCDF-style data. XArray extends NumPy and Pandas to enable scientists to express complex computations on complex datasets in ways that they find intuitive.
Dask: Provides the parallel computation behind XArray
Daskernetes: Makes it easy to launch Dask clusters on Kubernetes
Kubernetes: In case it’s not already clear, all of this is based on Kubernetes, which manages launching programs (like Jupyter notebook servers or Dask workers) on different machines, while handling load balancing, permissions, and so on
Google Container Engine: Google’s managed Kubernetes service. Every major cloud provider now has such a system, which makes us happy about not relying too heavily on one system
GCSFS: A Python library providing intuitive access to Google Cloud Storage, either through Python file interfaces or through a FUSE file system
Zarr: A chunked array storage format that is suitable for the cloud

Collaboration

We were able to build, deploy, and use this system to answer real science questions in a couple of weeks. We feel that this result is significant in its own right, and is largely because we collaborated widely. This project required the expertise of several individuals across several projects, institutions, and funding sources. Here are a few examples of who did what from which organization. We list institutions and positions mostly to show the roles involved.

Alistair Miles, Professor, Oxford: Helped to optimize Zarr for XArray on GCS
Jacob Tomlinson, Staff, UK Met Informatics Lab: Developed original JADE deployment and early Dask-Kubernetes work.
Joe Hamman, Postdoc, National Center for Atmospheric Research: Provided scientific use case, data, and workflow. Tuned XArray and Zarr for efficient data storing and saving.
Martin Durant, Software developer, Anaconda Inc.: Tuned GCSFS for many-access workloads. Also provided FUSE system for NetCDF support
Matt Pryor, Staff, Centre for Environmental Data Analysis: Extended original JADE deployment and early Dask-Kubernetes work.
Matthew Rocklin, Software Developer, Anaconda Inc.: Integration. Also performance testing.
Ryan Abernathey, Assistant Professor, Columbia University: XArray + Zarr support, scientific use cases, coordination
Stephan Hoyer, Software engineer, Google: XArray support
Yuvi Panda, Staff, UC Berkeley BIDS and Data Science Education Program: Provided assistance configuring JupyterHub with KubeSpawner. Also prototyped the Daskernetes Dask + Kubernetes tool.

Notice the mix of academic and for-profit institutions. Also notice the mix of scientists, staff, and professional software developers. We believe that this mixture helps ensure the efficient construction of useful solutions.

Lessons

This experiment has taught us a few things that we hope to explore further:

Users can launch Kubernetes deployments from Kubernetes pods, such as launching Dask clusters from their JupyterHub single-user notebooks.

To do this well we need to start defining user roles more explicitly within JupyterHub. We need to give users a safe and isolated space on the cluster to use without affecting their neighbors.
HDF5 and NetCDF on cloud storage is an open question

The file formats used for this sort of data are pervasive, but not particularly convenient or efficient on cloud storage. In particular the libraries used to read them make many small reads, each of which is costly when operating on cloud object storage.

I see a few options:
1. Use FUSE file systems, but tune them with tricks like read-ahead and caching in order to compensate for HDF’s access patterns
2. Use the HDF group’s proposed HSDS service, which promises to resolve these issues
3. Adopt new file formats that are more cloud friendly. Zarr is one such example that has so far performed admirably, but certainly doesn’t have the long history of trust that HDF and NetCDF have earned.
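
To see why many small reads hurt, here is a back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (round-trip latency, dataset size, and read sizes), not a measurement from our deployment, and it only counts time spent waiting on sequential round trips.

```python
import math

# Illustrative assumptions only, not measurements.
LATENCY_S = 0.05        # assumed round-trip time per request to object storage
DATASET_BYTES = 10**9   # assumed size of one variable (1 GB)


def requests_needed(read_size_bytes):
    """Number of reads needed to fetch the whole dataset."""
    return math.ceil(DATASET_BYTES / read_size_bytes)


def latency_cost_seconds(read_size_bytes):
    """Time spent waiting on round trips if reads are issued one at a time."""
    return requests_needed(read_size_bytes) * LATENCY_S


small = latency_cost_seconds(4 * 1024)        # many small reads
chunked = latency_cost_seconds(10 * 1024**2)  # fewer, larger chunk reads

print(f"small reads:   {requests_needed(4 * 1024)} requests, {small:.0f} s waiting")
print(f"chunked reads: {requests_needed(10 * 1024**2)} requests, {chunked:.1f} s waiting")
```

Read-ahead and caching (option 1) and chunked formats (option 3) both attack the same term: they reduce the number of round trips rather than the number of bytes moved.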
Environment customization is important and tricky, especially when adding distributed computing.

Immediately after we show this to science groups, they want to try it out with their own software environments. They can do this easily in their notebook session with tools like pip or conda, but applying those same changes to their dask workers is a bit more challenging, especially when those workers come and go dynamically.

We have solutions for this. They can build and publish docker images. They can add environment variables to specify extra pip or conda packages. They can deploy their own pangeo deployment for their own group.

However, these have all taken some work to do well so far. We hope that some combination of Binder-like publishing and small modification tricks like environment variables resolves this problem.
Our docker images are very large. This means that users sometimes need to wait a minute or more for their session or their dask workers to start up (less after things have warmed up a bit).

It is surprising how much of this comes from conda and node packages. We hope to resolve this both by improving our Docker hygiene and by engaging packaging communities to audit package size.
Explore other clouds

We started with Google just because their Kubernetes support has been around the longest, but all major cloud providers (Google, AWS, Azure) now provide some level of managed Kubernetes support. Everything we’ve done has been cloud-vendor agnostic, and various groups with data already on other clouds have reached out and are starting deployment on those systems.
Combine efforts with other groups

We’re actually not the first group to do this. The UK Met Informatics Lab quietly built a similar prototype, JADE (Jupyter and Dask Environment) many months ago. We’re now collaborating to merge efforts.

It’s also worth mentioning that they prototyped the first iteration of Daskernetes.
Reach out to other communities

While we started our collaboration with atmospheric and oceanographic scientists, these same solutions apply to many other disciplines. We should investigate other fields and start collaborations with those communities.
Improve Dask + XArray algorithms

When we try new problems in new environments we often uncover new opportunities to improve Dask’s internal scheduling algorithms. This case is no different :)

Much of this upcoming work is happening in the upstream projects, so this experimentation is both of concrete use to ongoing scientific research and of broader use to the open source communities that these projects serve.

Community uptake

We presented this at a couple of conferences over the past week.

American Meteorological Society, Python Symposium, Keynote. Slides: http://matthewrocklin.com/slides/ams-2018.html#/
Earth Science Information Partners Winter Meeting. Video: https://www.youtube.com/watch?v=mDrjGxaXQT4

We found that this project aligns well with current efforts from many government agencies to publish large datasets on cloud stores (mostly S3). Many of these data publication endeavors seek a computational system to enable access for the scientific public. Our project seems to complement these needs without significant coordination.

Disclaimers

While we encourage people to try out pangeo.pydata.org we also warn you that this system is immature. In particular it has the following issues:

it is insecure, please do not host sensitive data
it is unstable, and may be taken down at any time
it is small, we only have a handful of cores deployed at any time, mostly for experimentation purposes

However it is also open, and instructions to deploy your own live here.

Come help

We are a growing group comprised of many institutions including technologists, scientists, and open source projects. There is plenty to do and plenty to discuss. Please engage with us at github.com/pangeo-data/pangeo/issues/new