Dask Working Notes - Posts tagged distributed

Shuffling large data at constant memory in Dask

2023-03-15T00:00:00+00:00

This work was engineered and supported by Coiled. In particular, thanks to Florian Jetter, Gabe Joseph, Hendrik Makait, and Matt Rocklin. Original version of this post appears on blog.coiled.io

With release 2023.2.1, dask.dataframe introduces a new shuffling method called P2P, making sorts, merges, and joins faster and using constant memory. Benchmarks show impressive improvements:

P2P shuffling (blue) uses constant memory while task-based shuffling (orange) scales linearly.

This article describes the problem, the new solution, and the impact on performance.

Shuffling is a key primitive in data processing systems. It is used whenever we move a dataset around in an all-to-all fashion, such as occurs in sorting, dataframe joins, or array rechunking. Shuffling is a challenging computation to run efficiently, with lots of small tasks sharding the data.

Task-based shuffling scales poorly

While systems like distributed databases and Apache Spark use a dedicated shuffle service to move data around the cluster, Dask builds on task-based scheduling. Task-based systems are more flexible for general-purpose parallel computing but less suitable for all-to-all shuffling. This results in three main issues:

Scheduler strain: The scheduler hangs due to the sheer amount of tasks required for shuffling.
Full dataset materialization: Workers materialize the entire dataset causing the cluster to run out of memory.
Many small operations: Intermediate tasks are too small and bring down hardware performance.

Together, these issues make large-scale shuffles inefficient, causing users to over-provision clusters. Fortunately, we can design a system to address these concerns. Early work on this started back in 2021 and has now matured into the P2P shuffling system:

P2P shuffling

With release 2023.2.1, Dask introduces a new shuffle method called P2P, making sorts, merges, and joins run faster and in constant memory.

This system is designed with three aspects, mirroring the problems listed above:

Peer-to-Peer communication: Reduce the involvement of the scheduler

Shuffling becomes an O(n) operation from the scheduler’s perspective, removing a key bottleneck.

Disk by default: Store data as it arrives on disk, efficiently.

Dask can now shuffle datasets of arbitrary size in small memory, reducing the need to fiddle with right-sizing clusters.
Buffer network and disk with memory: Avoid many small writes by buffering sensitive hardware with in-memory stores.

Shuffling involves CPU, network, and disk, each of which brings its own bottlenecks when dealing with many small pieces of data. We use memory judiciously to trade off between these bottlenecks, balancing network and disk I/O to maximize the overall throughput.

In addition to these three aspects, P2P shuffling implements many minor optimizations to improve performance, and it relies on pyarrow>=7.0 for efficient data handling and (de)serialization.

Results

To evaluate P2P shuffling, we benchmark against task-based shuffling on common workloads from our benchmark suite. For more information on this benchmark suite, see the GitHub repository or the latest test results.

Memory stability

The biggest benefit of P2P shuffling is constant memory usage. Memory usage drops and stays constant across all workloads:

P2P shuffling (blue) uses constant memory while task-based shuffling (orange) scales linearly.

For the tested workloads, we saw up to 10x lower memory. For even larger workloads, this gap only increases.

Performance and Speed

In the above plot, we can two performance improvements:

Faster execution: Workloads run up to 45% faster.
Quicker startup: Smaller graphs mean P2P shuffling starts sooner.

Comparing scheduler dashboards for P2P and task-based shuffling side-by-side helps to understand what causes these performance gains:

Task-based shuffling (left) shows bigger gaps in the task stream and spilling compared to P2P shuffling (right).

Graph size: 10x fewer tasks are faster to deploy and easier on the scheduler.
Controlled I/O: P2P shuffling handles I/O explicitly, which avoids less performant spilling by Dask.
Fewer interruptions: I/O is well balanced, so we see fewer gaps in work in the task stream.

Overall, P2P shuffling leverages our hardware far more effectively.

Changing defaults

These results benefit most users, so P2P shuffling is now the default starting from release 2023.2.1, as long as pyarrow>=7.0.0 is installed. Dataframe operations like the following will benefit:

df.set_index(...)
df.merge(...)
df.groupby(...).apply(...)

Keep old behavior

If P2P shuffling does not work for you, you can deactivate it by setting the dataframe.shuffle.method config value to "tasks" or explicitly setting a keyword argument, for example:

with the yaml config

dataframe:
  shuffle:
    method: tasks

or when using a cluster manager

import dask
from dask.distributed import LocalCluster

# The dataframe.shuffle.method config is available since 2023.3.1
with dask.config.set({"dataframe.shuffle.method": "tasks"}):
    cluster = LocalCluster(...)  # many cluster managers send current dask config automatically

For more information on deactivating P2P shuffle, see the discussion #7509.

What about arrays?

While the original motivation was to optimize large-scale dataframe joins, the P2P system is useful for all problems requiring lots of communication between tasks. For array workloads, this often occurs when rechunking data, such as when organizing a matrix by rows when it was stored by columns. Similar to dataframe joins, array rechunking has been inefficient in the past, which has become such a problem that the array community built specialized tools like rechunker to avoid it entirely.

There is a naive implementation of array rechunking using the P2P system available for experimental use. Benchmarking this implementation shows mixed results:

👍 Constant memory use: As with dataframe operations, memory use is constant.
❓ Variable runtime: The runtime of workloads may increase with P2P.
👎 Memory overhead: There is a large memory overhead for many small partitions.

The constant memory use is a very promising result. There are several ways to tackle the current limitations. We expect this to improve as we work with collaborators from the array computing community.

Next steps

Development on P2P shuffling is not done yet. For the future we plan the following:

dask.array:

While the early prototype of array rechunking is promising, it’s not there yet. We plan to do the following:
- Intelligently select which algorithm to use (task-based rechunking is better sometimes)
- Work with collaborators on rechunking to improve performance
- Seek out other use cases, like map_overlap, where this might be helpful
Failure recovery:

Make P2P shuffling resilient to worker loss; currently, it has to restart entirely.
Performance tuning:

Performance today is good, but not yet at peak hardware speeds. We can improve this in a few ways:
- Hiding disk I/O
- Using more memory when appropriate for smaller shuffles
- Improved batching of small operations
- More efficient serialization

To follow along with the development, subscribe to the tracking issue on GitHub.

Summary

Shuffling data is a common operation in dataframe workloads. Since 2023.2.1, Dask defaults to P2P shuffling for distributed clusters and shuffles data faster and at constant memory. This improvement unlocks previously un-runnable workload sizes and efficiently uses your clusters. Finally, P2P shuffling demonstrates extending Dask to add new paradigms while leveraging the foundations of its distributed engine.

Share results in this discussion thread or follow development at this tracking issue.

Data Proximate Computation on a Dask Cluster Distributed Between Data Centres

2022-07-19T00:00:00+00:00

This work is a joint venture between the Met Office and the European Weather Cloud, which is a partnership of ECMWF and EUMETSAT.

We have devised a technique for creating a Dask cluster where worker nodes are hosted in different data centres, connected by a mesh VPN that allows the scheduler and workers to communicate and exchange results.

A novel (ab)use of Dask resources allows us to run data processing tasks on the workers in the cluster closest to the source data, so that communication between data centres is minimised. If combined with zarr to give access to huge hyper-cube datasets in object storage, we believe that the technique could realise the potential of data-proximate distributed computing in the Cloud.

Introduction

The UK Met Office has been carrying out a study into data-proximate computing in collaboration with the European Weather Cloud. We identified Dask as a key technology, but whilst existing Dask techniques focus on parallel computation in a single data centre, we looked to extend this to computation across data centres. This way, when the data required is hosted in multiple locations, tasks can be run where the data is rather than copying it.

Dask worker nodes exchange data chunks over a network, coordinated by a scheduler. There is an assumption that all nodes are freely able to communicate, which is not generally true across data centres due to firewalls, NAT, etc, so a truly distributed approach has to solve this problem. In addition, it has to manage data transfer efficiently, because moving data chunks between data centres is much more costly than between workers in the same cloud.

This notebook documents a running proof-of-concept that addresses these problems. It runs a computation in 3 locations:

This computer, where the client and scheduler are running. This was run on AWS during development.
The ECMWF data centre. This has compute resources, and hosts data containing predictions.
The EUMETSAT data centre, with compute resources and data on observations

from IPython.display import Image
Image(filename="images/datacentres.png") # this because GitHub doesn't render markup images in private repos

The idea is that tasks accessing data available in a location should be run there. Meanwhile the computation can be defined, invoked, and the results rendered, elsewhere. All this with minimal hinting to the computation as to how this should be done.

Setup

First some imports and conveniences

import os
from time import sleep
import dask
from dask.distributed import Client
from dask.distributed import performance_report, get_task_stream
from dask_worker_pools import pool, propagate_pools
import pytest
import ipytest
import xarray
import matplotlib.pyplot as plt
from orgs import my_org
from tree import tree

ipytest.autoconfig()

In this case we are using a control plane IPv4 WireGuard network on 10.8.0.0/24 to set up the cluster - this is not a necessity, but simplifies this proof of concept. WireGuard peers are running on ECMWF and EUMETSAT machines already, but we have to start one here:

!./start-wg.sh

4: mo-aws-ec2: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 8921 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none
    inet 10.8.0.3/24 scope global mo-aws-ec2
       valid_lft forever preferred_lft forever

We have worker machines configured in both ECMWF and EUMETSAT, one in each. They are accessible on the control plane network as

ecmwf_host='10.8.0.4'
%env ECMWF_HOST=$ecmwf_host
eumetsat_host='10.8.0.2'
%env EUMETSAT_HOST=$eumetsat_host

env: ECMWF_HOST=10.8.0.4
env: EUMETSAT_HOST=10.8.0.2

Mount the Data

This machine needs access to the data files over the network in order to read NetCDF metadata. The workers are sharing their data with NFS, so we mount them here. (In this proof of concept, the control plane network is used for NFS, but the data plane network could equally be used, or a more appropriate technology such as zarr accessing object storage.)

%%bash
sudo ./data-reset.sh

mkdir -p /data/ecmwf
mkdir -p /data/eumetsat
sudo mount $ECMWF_HOST:/data/ecmwf /data/ecmwf
sudo mount $EUMETSAT_HOST:/eumetsatdata/ /data/eumetsat

Image(filename="images/datacentres-data.png")

Access to Data

For this demonstration, we have two large data files that we want to process. On ECMWF we have predictions in /data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc. Workers running in ECMWF can see the file

!ssh -i ~/.ssh/id_rsa_rcar_infra  rcar@$ECMWF_HOST 'tree /data/'

/data/
└── ecmwf
    └── 000490262cdd067721a34112963bcaa2b44860ab.nc

1 directory, 1 file

and because that directory is mounted here over NFS, so can this computer

!tree /data/ecmwf

/data/ecmwf
└── 000490262cdd067721a34112963bcaa2b44860ab.nc

0 directories, 1 file

This is a big file

!ls -lh /data/ecmwf

total 2.8G
-rw-rw-r-- 1 ec2-user ec2-user 2.8G Mar 25 13:09 000490262cdd067721a34112963bcaa2b44860ab.nc

On EUMETSAT we have observations.nc

!ssh -i ~/.ssh/id_rsa_rcar_infra  rcar@$EUMETSAT_HOST 'tree /data/eumetsat/ad-hoc'

/data/eumetsat/ad-hoc
└── observations.nc

0 directories, 1 file

similarly visible on this computer

!ls -lh /data/eumetsat/ad-hoc/observations.nc

-rw-rw-r-- 1 613600004 613600004 4.8M May 20 10:57 /data/eumetsat/ad-hoc/observations.nc

Crucially, ECMWF data is not visible in the EUMETSAT data centre, and vice versa.

Our Calculation

We want to compare the predictions against the observations.

We can open the predictions file with xarray

predictions = xarray.open_dataset('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc').chunk('auto')
predictions

<xarray.Dataset>
Dimensions:                  (realization: 18, height: 33, latitude: 960,
longitude: 1280, bnds: 2)
Coordinates:
* realization              (realization) int32 0 18 19 20 21 ... 31 32 33 34
* height                   (height) float32 5.0 10.0 20.0 ... 5.5e+03 6e+03
* latitude                 (latitude) float32 -89.91 -89.72 ... 89.72 89.91
* longitude                (longitude) float32 -179.9 -179.6 ... 179.6 179.9
  forecast_period          timedelta64[ns] 1 days 18:00:00
  forecast_reference_time  datetime64[ns] 2021-11-07T06:00:00
  time                     datetime64[ns] 2021-11-09
  Dimensions without coordinates: bnds
  Data variables:
  air_pressure             (realization, height, latitude, longitude) float32 dask.array<chunksize=(18, 33, 192, 160), meta=np.ndarray>
  latitude_longitude       int32 -2147483647
  latitude_bnds            (latitude, bnds) float32 dask.array<chunksize=(960, 2), meta=np.ndarray>
  longitude_bnds           (longitude, bnds) float32 dask.array<chunksize=(1280, 2), meta=np.ndarray>
  Attributes:
  history:                      2021-11-07T10:27:38Z: StaGE Decoupler
  institution:                  Met Office
  least_significant_digit:      1
  mosg__forecast_run_duration:  PT198H
  mosg__grid_domain:            global
  mosg__grid_type:              standard
  mosg__grid_version:           1.6.0
  mosg__model_configuration:    gl_ens
  source:                       Met Office Unified Model
  title:                        MOGREPS-G Model Forecast on Global 20 km St...
  um_version:                   11.5
  Conventions:                  CF-1.7

Dask code running on this machine has read the metadata for the file via NFS, but has not yet read in the data arrays themselves.

Likewise we can see the observations, so we can perform a calculation locally. Here we average the predictions over the realisations and then compare them with the observations at a particular height. (This is a deliberately inefficient calculation, as we could average at only the required height, but you get the point.)

%%time
def scope():
    client = Client()
    predictions = xarray.open_dataset('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc').chunk('auto')
    observations = xarray.open_dataset('/data/eumetsat/ad-hoc/observations.nc').chunk('auto')

    averages = predictions.mean('realization')
    diff = averages.isel(height=10) - observations
    diff.compute()

#scope()

CPU times: user 10 µs, sys: 2 µs, total: 12 µs
Wall time: 13.8 µs

When we uncomment scope() and actually run this, it takes over 14 minutes to complete! Accessing the data over NFS between data centres (we run this notebook in AWS) is just too slow.

In fact just copying the data files onto the computer running this notebook takes the same sort of time. At least 2.8 GiB + 4.8 MiB of data must pass from the data centres to this machine to perform the calculation.

Instead we should obviously run the Dask tasks where the data is. We can do that on a Dask cluster.

Running Up a Cluster

The cluster is run up with a single command. It takes a while though

import subprocess

scheduler_process = subprocess.Popen([
        '../dask_multicloud/dask-boot.sh',
        f"rcar@{ecmwf_host}",
        f"rcar@{eumetsat_host}"
    ])

[#] ip link add dasklocal type wireguard
[#] wg setconf dasklocal /dev/fd/63
[#] ip -6 address add fda5:c0ff:eeee:0::1/64 dev dasklocal
[#] ip link set mtu 1420 up dev dasklocal
[#] ip -6 route add fda5:c0ff:eeee:2::/64 dev dasklocal
[#] ip -6 route add fda5:c0ff:eeee:1::/64 dev dasklocal
2022-06-29 14:46:57,237 - distributed.scheduler - INFO - -----------------------------------------------
2022-06-29 14:46:58,602 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-06-29 14:46:58,643 - distributed.scheduler - INFO - -----------------------------------------------
2022-06-29 14:46:58,644 - distributed.scheduler - INFO - Clear task state
2022-06-29 14:46:58,646 - distributed.scheduler - INFO -   Scheduler at:     tcp://172.17.0.2:8786
2022-06-29 14:46:58,646 - distributed.scheduler - INFO -   dashboard at:                     :8787
2022-06-29 14:47:16,104 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:1::11]:37977', name: ecmwf-1-2, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:16,107 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:1::11]:37977
2022-06-29 14:47:16,108 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:16,108 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:1::11]:44575', name: ecmwf-1-3, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:16,109 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:1::11]:44575
2022-06-29 14:47:16,109 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:16,113 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:1::11]:40121', name: ecmwf-1-1, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:16,114 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:1::11]:40121
2022-06-29 14:47:16,114 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:16,119 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:1::11]:40989', name: ecmwf-1-0, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:16,121 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:1::11]:40989
2022-06-29 14:47:16,121 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:23,342 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:2::11]:33423', name: eumetsat-2-0, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:23,343 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:2::11]:33423
2022-06-29 14:47:23,343 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:23,346 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:2::11]:43953', name: eumetsat-2-1, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:23,348 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:2::11]:43953
2022-06-29 14:47:23,348 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:23,350 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:2::11]:46089', name: eumetsat-2-3, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:23,352 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:2::11]:46089
2022-06-29 14:47:23,352 - distributed.core - INFO - Starting established connection
2022-06-29 14:47:23,357 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://[fda5:c0ff:eeee:2::11]:43727', name: eumetsat-2-2, status: undefined, memory: 0, processing: 0>
2022-06-29 14:47:23,358 - distributed.scheduler - INFO - Starting worker compute stream, tcp://[fda5:c0ff:eeee:2::11]:43727
2022-06-29 14:47:23,358 - distributed.core - INFO - Starting established connection

We need to wait for 8 distributed.core - INFO - Starting established connection lines - one from each of 4 worker processes on each of 2 worker machines.

What has happened here is:

start-scheduler.sh runs up a Docker container on this computer.
The container creates a WireGuard IPv6 data plane VPN. This involves generating shared keys for all the nodes and a network interface inside itself. This data plane VPN is transient and unique to this cluster.
The container runs a Dask scheduler, hosted on the data plane network.
It then asks each data centre to provision workers and routing.

Each data centre hosts a control process, accessible over the control plane network. On invocation:

The control process creates a WireGuard network interface on the data plane network. This acts as a router between the workers inside the data centres and the scheduler.
It starts Docker containers on compute instances. These containers have their own WireGuard network interface on the data plane network, routing via the control process instance.
The Docker containers spawn (4) Dask worker processes, each of which connects via the data plane network back to the scheduler created at the beginning.

The result is one container on this computer running the scheduler, talking to a container on each worker machine, over a throw-away data plane WireGuard IPv6 network which allows each of the (in this case 8) Dask worker processes to communicate with each other and the scheduler, even though they are partitioned over 3 data centres.

Something like this

Image(filename="images/datacentres-dask.png")

Key

Data plane network
Dask
NetCDF data

Connecting to the Cluster

The scheduler for the cluster is now running in a Docker container on this machine and is exposed on localhost, so we can create a client talking to it

client = Client("localhost:8786")

2022-06-29 14:47:35,535 - distributed.scheduler - INFO - Receive client connection: Client-69f22f41-f7ba-11ec-a0a2-0acd18a5c05a
2022-06-29 14:47:35,536 - distributed.core - INFO - Starting established connection
/home/ec2-user/miniconda3/envs/jupyter/lib/python3.10/site-packages/distributed/client.py:1287: VersionMismatchWarning: Mismatched versions found

+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| msgpack | 1.0.4  | 1.0.3     | 1.0.3   |
| numpy   | 1.23.0 | 1.22.3    | 1.22.3  |
| pandas  | 1.4.3  | 1.4.2     | 1.4.2   |
+---------+--------+-----------+---------+
Notes:
-  msgpack: Variation is ok, as long as everything is above 0.6
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

If you click through the client you should see the workers under the Scheduler Info node

# client

You can also click through to the Dashboard on http://localhost:8787/status. There we can show the workers on the task stream

def show_all_workers():
    my_org().compute(workers='ecmwf-1-0')
    my_org().compute(workers='ecmwf-1-1')
    my_org().compute(workers='ecmwf-1-2')
    my_org().compute(workers='ecmwf-1-3')
    my_org().compute(workers='eumetsat-2-0')
    my_org().compute(workers='eumetsat-2-1')
    my_org().compute(workers='eumetsat-2-2')
    my_org().compute(workers='eumetsat-2-3')
    sleep(0.5)

show_all_workers()

Running on the Cluster

Now that there is a Dask client in scope, calculations will be run on the cluster. We can define the tasks to be run

predictions = xarray.open_dataset('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc').chunk('auto')
observations = xarray.open_dataset('/data/eumetsat/ad-hoc/observations.nc').chunk('auto')

averages = predictions.mean('realization')
diff = averages.isel(height=10) - observations

But when we try to perform the calculation it fails

with pytest.raises(FileNotFoundError) as excinfo:
    show_all_workers()
    diff.compute()

str(excinfo.value)

"[Errno 2] No such file or directory: b'/data/eumetsat/ad-hoc/observations.nc'"

It fails because the Dask scheduler has sent some of the tasks to read the data to workers running in EUMETSAT. They cannot see the data in ECMWF, and nor do we want them too, because reading all that data between data centres would be too slow.

Data-Proximate Computation

Dask has the concept of resources. Tasks can be scheduled to run only where a resource (such as a GPU or amount of RAM) is available. We can abuse this mechanism to pin tasks to a data centre, by treating the data centre as a resource.

To do this, when we create the workers we mark them as having a pool-ecmwf or pool-eumetsat resource. Then when we want to create tasks that can only run in one data centre, we annotate them as requiring the appropriate resource

with (dask.annotate(resources={'pool-ecmwf': 1})):
    predictions.mean('realization').isel(height=10).compute()

We can hide that boilerplate inside a Python context manager - pool - and write

with pool('ecmwf'):
    predictions.mean('realization').isel(height=10).compute()

The pool context manager is a collaboration with the Dask developers, and is published on GitHub. You can read more on the evolution of the concept on the Dask Discourse.

We can do better than annotating the computation tasks though. If we load the data inside the context manager block, the data loading tasks will carry the annotation with them

with pool('ecmwf'):
    predictions = xarray.open_dataset('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc').chunk('auto')

In this case we need another context manager in the library, propagate_pools, to ensure that the annotation is not lost when the task graph is processed and executed

with propagate_pools():
    predictions.mean('realization').isel(height=10).compute()

The two context managers allow us to annotate data with its pool, and hence where the loading tasks will run

with pool('ecmwf'):
    predictions = xarray.open_dataset('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc').chunk('auto')

with pool('eumetsat'):
    observations = xarray.open_dataset('/data/eumetsat/ad-hoc/observations.nc').chunk('auto')

define some deferred calculations oblivious to the data’s provenance

averaged_predictions = predictions.mean('realization')
diff = averaged_predictions.isel(height=10) - observations

and then perform the final calculation

%%time
with propagate_pools():
    show_all_workers()
    diff.compute()

CPU times: user 127 ms, sys: 6.34 ms, total: 133 ms
Wall time: 4.88 s

Remember, our aim was to distribute a calculation across data centres, whilst preventing workers reading foreign bulk data.

Here we know that data is only being read by workers in the appropriate location, because neither data centre can read the other’s data. Once data is in memory, Dask prefers to schedule tasks on the workers that have it, so that the local workers will tend to perform follow-on calcuations, and data chunks will tend to stay in the data centre that they were read from.

Ordinarily though, if workers are idle, Dask would use them to perform calculations even if they don’t have the data. If allowed, this work stealing would result in data being moved between data centres unnecessarly, a potentially expensive operation. To prevent this, the propagate_pools context manager installs a scheduler optimisation that disallows work-stealing between workers in different pools.

Once data loaded in one pool needs to be combined with data from another (the substraction in averaged_predictions.isel(height=10) - observations above), this is no longer classified as work stealing, and Dask will move data between data centres as required.

That calculation in one go looks like this

%%time
with pool('ecmwf'):
    predictions = xarray.open_dataset('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc').chunk('auto')

with pool('eumetsat'):
    observations = xarray.open_dataset('/data/eumetsat/ad-hoc/observations.nc').chunk('auto')

averages = predictions.mean('realization')
diff = averages.isel(height=10) - observations

with propagate_pools():
    show_all_workers()
    plt.figure(figsize=(6, 6))
    plt.imshow(diff.to_array()[0,...,0], origin='lower')

CPU times: user 234 ms, sys: 27.6 ms, total: 261 ms
Wall time: 6.04 s

In terms of code, compared with the local version above, this has only added the use of with blocks to label data and manage execution, and executes some 100 times faster.

This is a best case, because in the demonstrator the files are actually hosted on the worker machines, so the speed difference between reading the files locally and reading them over NFS is maximized. Perhaps more persuasive is the measured volume of network traffic.

Method	Time Taken	Measured Network Traffic
Calculation over NFS	> 14 minutes	2.8 GiB
Distributed Calculation	~ 10 seconds	8 MiB

Apart from some control and status messages, only the data required to paint the picture is sent over the network to this computer.

Looking at the task stream we see the ECMWF workers (on the bottom) doing the bulk of the reading and computation, with the red transfer tasks joining this with data on EUMETSAT.

Image(filename="images/task-graph.png")

Catalogs

We can simplify this code even more. Because the data-loading tasks are labelled with their resource pool, this can be opaque to the scientist. So we can write

def load_from_catalog(path):
    with pool(path.split('/')[2]):
        return xarray.open_dataset(path).chunk('auto')

allowing us to ignore where the data came from

predictions = load_from_catalog('/data/ecmwf/000490262cdd067721a34112963bcaa2b44860ab.nc')
observations = load_from_catalog('/data/eumetsat/ad-hoc/observations.nc')

averages = predictions.mean('realization')
diff = averages.isel(height=10) - observations

with propagate_pools():
    show_all_workers()
    diff.compute()

Of course the cluster would have to be provisioned with compute resources in the appropriate data centres, although with some work this could be made dynamic as part of the catalog code.

More Information

This notebook, and the code behind it, are published in a GitHub repository.

For details of the prototype implementation, and ideas for enhancements, see dask-multi-cloud-details.ipynb.

Acknowledgements

Thank you to Armagan Karatosun (EUMETSAT) and Vasileios Baousis (ECMWF) for their help and support with the infrastructure to support this proof of concept. Gabe Joseph (Coiled) wrote the clever pool context managers, and Jacob Tomlinson (NVIDIA) reviewed this document.

Measuring Dask memory usage with dask-memusage

2021-03-11T00:00:00+00:00

Using too much computing resources can get expensive when you’re scaling up in the cloud.

To give a real example, I was working on the image processing pipeline for a spatial gene sequencing device, which could report not just which genes were being expressed but also where they were in a 3D volume of cells. In order to get this information, a specialized microscope took snapshots of the cell culture or tissue, and the resulting data was run through a Dask pipeline.

The pipeline was fairly slow, so I did some back-of-the-envelope math to figure out our computing costs would be once we started running more data for customers. It turned out that we’d be using 70% of our revenue just paying for cloud computing!

Clearly I needed to optimize this code.

When we think about the bottlenecks in large-scale computation, we often focus on CPU: we want to use more CPU cores in order to get faster results. Paying for all that CPU can be expensive, as in this case, and I did successfully reduce CPU usage by quite a lot.

But high memory usage was also a problem, and fixing that problem led me to build a series of tools, tools that can also help you optimize and reduce your Dask memory usage.

In the rest of this article you will learn:

How high memory usage can drive up your computing costs.
How a tool called dask-memusage can help you find peak memory usage of the tasks in your Dask execution graph.
How to further pinpoint high memory usage using the Fil memory profiler, so you can reduce memory usage.

As a reminder, I was working on a Dask pipeline that processed data from a specialized microscope. The resulting data volume was quite large, and certain subsets of images had to be processed together as a unit. From a computational standpoint, we effectively had a series of inputs X0, X1, X2, … that could be independently processed by a function f().

The internal processing of f() could not easily be parallelized further. From a CPU scheduling perspective, this was fine, it was still an embarrassingly parallel problem given the large of number of X inputs.

For example, if I provisioned a virtual machine with 4 CPU cores, to process the data I could start four processes, and each would max out a single core. If I had 12 inputs and each processing step took about the same time, they might run as follows:

CPU0: f(X0), f(X4), f(X8)
CPU1: f(X1), f(X5), f(X9)
CPU2: f(X2), f(X6), f(X10)
CPU3: f(X3), f(X7), f(X11)

If I could make f() faster, the pipeline as a whole would also run faster.

CPU is not the only resource used in computation, however: RAM can also be a bottleneck. For example, let’s say each call to f(Xi) took 12GB of RAM. That means to fully utilize 4 CPUs, I would need 48GB of RAM—but what if my computer only has 16GB of RAM?

Even though my computer has 4 CPUs, I can only utilize one CPU on a computer with 16GB RAM, because I don’t have enough RAM to run more than one task in parallel. In practice, these tasks ran in the cloud, where I could ensure the necessary RAM/core ratio was preserved by choosing the right pre-configured VM instances. And on some clouds you can freely set the amount of RAM and number of CPU cores for each virtual machine you spin up.

However, I didn’t quite know how much memory was used at peak, so I’d had to limit parallelism to reduce out-of-memory errors. As a result, the default virtual machines we were using had half their CPUs resting idle, resources were paying for but not using.

In order to provision hardware appropriately and max out all the CPUs, I needed to know how much peak memory each task was using. And to do that, I created a new tool.

Measuring peak task memory usage with `dask-memusage` {#dask-memusage}

dask-memusage is a tool for measuring peak memory usage for each task in the Dask execution graph.

Per task because Dask executes code as a graph of tasks, and the graph determines how much parallelism can be used.
Peak memory is important, because that is the bottleneck. It doesn’t matter if average memory usage per task is 4GB, if two parallel tasks in the graph need 12GB each at the same time, you’re going to need 24GB of RAM if you want to to run both tasks on the same computer.

Using `dask-memusage`

Since the gene sequencing code is proprietary and quite complex, let’s use a different example. We’re going to count the occurrence of words in some text files, and then report the top-10 most common words in each file. You can imagine combining the data later on, but we won’t bother with that in this simple example.

import sys
import gc
from time import sleep
from pathlib import Path
from dask.bag import from_sequence
from collections import Counter
from dask.distributed import Client, LocalCluster
import dask_memusage


def calculate_top_10(file_path: Path):
    gc.collect()  # See notes below

    # Load the file
    with open(file_path) as f:
        data = f.read()

    # Count the words
    counts = Counter()
    for word in data.split():
        counts[word.strip(".,'\"").lower()] += 1

    # Choose the top 10:
    by_count = sorted(counts.items(), key=lambda x: x[1])
    sleep(0.1)  # See notes below
    return (file_path.name, by_count[-10:])


def main(directory):
    # Setup the calculation:

    # Create a 4-process cluster (running locally). Note only one thread
    # per-worker: because polling is per-process, you can't run multiple
    # threads per worker, otherwise you'll get results that combine memory
    # usage of multiple tasks.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1,
                           memory_limit=None)
    # Install dask-memusage:
    dask_memusage.install(cluster.scheduler, "memusage.csv")
    client = Client(cluster)

    # Create the task graph:
    files = from_sequence(Path(directory).iterdir())
    graph = files.map(calculate_top_10)
    graph.visualize(filename="example2.png", rankdir="TD")

    # Run the calculations:
    for result in graph.compute():
        print(result)
    # ... do something with results ...


if __name__ == '__main__':
    main(sys.argv[1])

Here’s what the task graph looks like:

Plenty of parallelism!

We can run the program on some files:

$ pip install dask[bag] dask_memusage
$ python example2.py files/
('frankenstein.txt', [('that', 1016), ('was', 1021), ('in', 1180), ('a', 1438), ('my', 1751), ('to', 2164), ('i', 2754), ('of', 2761), ('and', 3025), ('the', 4339)])
('pride_and_prejudice.txt', [('she', 1660), ('i', 1730), ('was', 1832), ('in', 1904), ('a', 1981), ('her', 2142), ('and', 3503), ('of', 3705), ('to', 4188), ('the', 4492)])
('greatgatsby.txt', [('that', 564), ('was', 760), ('he', 770), ('in', 849), ('i', 999), ('to', 1197), ('of', 1224), ('a', 1440), ('and', 1565), ('the', 2543)])
('big.txt', [('his', 40032), ('was', 45356), ('that', 47924), ('he', 48276), ('a', 83228), ('in', 86832), ('to', 114184), ('and', 152284), ('of', 159888), ('the', 314908)])

As one would expect, the most common words are stem words, but there is still some variation in order.

Next, let’s look at the results from dask-memusage.

`dask-memusage` output, and how it works

You’ll notice that the actual use of dask-memusage involves just one extra line, other than the import:

dask_memusage.install(cluster.scheduler, "memusage.csv")

What this will do is poll the process at 10ms intervals for peak memory usage, broken down by task. In this case, here’s what memusage.csv looks like:

task_key,min_memory_mb,max_memory_mb
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 3)",51.2421875,51.2421875
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 0)",51.70703125,51.70703125
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 1)",51.28125,51.78515625
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 2)",51.30859375,51.30859375
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 2)",56.19140625,56.19140625
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 0)",51.70703125,54.26953125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 1)",52.30078125,52.30078125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 3)",51.48046875,384.00390625

For each task in the graph we are told minimum memory usage and peak memory usage, in MB.

In more readable form:

task_key	min_memory_mb	max_memory_mb
“(‘from_sequence-3637e6ff937ef8488894df60a80f62ed’, 3)”	51.2421875	51.2421875
“(‘from_sequence-3637e6ff937ef8488894df60a80f62ed’, 0)”	51.70703125	51.70703125
“(‘from_sequence-3637e6ff937ef8488894df60a80f62ed’, 1)”	51.28125	51.78515625
“(‘from_sequence-3637e6ff937ef8488894df60a80f62ed’, 2)”	51.30859375	51.30859375
“(‘calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca’, 2)”	56.19140625	56.19140625
“(‘calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca’, 0)”	51.70703125	54.26953125
“(‘calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca’, 1)”	52.30078125	52.30078125
“(‘calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca’, 3)”	51.48046875	384.00390625

The bottom four lines are the interesting ones; all four start with a minimum memory usage of ~50MB RAM, and then memory may or may not increase as the code runs. How much it increases presumably depends on the size of the files; most of them are quite small, so memory usage doesn’t change much. One file uses much more maximum memory than the others, 384MB of RAM; presumably it’s big.txt which is 25MB, since the other files are all smaller than 1MB.

The mechanism used, polling peak process memory, has some limitations:

You’ll notice there’s a gc.collect() at the top of the calculate_top_10(); this ensures we don’t count memory from previous code that hasn’t been cleaned up yet.
There’s also a sleep() at the bottom of calculate_top_10(). Because polling is used, tasks that run too quickly won’t get accurate information—the polling happens every 10ms or so, so you want to sleep at least 20ms.
Finally, because polling is per-process, you can’t run multiple threads per worker, otherwise you’ll get results that combine memory usage of multiple tasks.

Interpreting the data

What we’ve learned is that memory usage of calculate_top_10() grows with file size; this can be used to characterize the memory requirements for the workload. That is, we can create a model that links data input sizes and required RAM, and then we can calculate the required RAM for any given level of parallelism. And that can guide our choice of hardware, if we assume one task per CPU core.

Going back to my original motivating problem, the gene sequencing pipeline: using the data from dask-memusage, I was able to come up with a formula saying “for this size input, this much memory is necessary”. Whenever we ran a batch job we could therefore set the parallelism as high as possible given the number of CPUs and RAM on the machine.

While this allowed for more parallelism, it still wasn’t sufficient—processing was still using a huge amount of RAM, RAM that we had to pay for either with time (by using less CPUs) or money (by paying for more expensive virtual machines that more RAM). So the next step was to reduce memory usage.

Reducing memory usage with Fil {#fil}

If we look at the dask-memusage output for our word-counting example, the memory usage seems rather high: for a 25MB file, we’re using 330MB of RAM to count words. Thinking through how an ideal version of this code might work, we ought to be able to process the file with much less memory (for example we could redesign our code to process the file line by line, reducing memory).

And that’s another way in which dask-memusage can be helpful: it can point us at specific code that needs memory usage optimized, at the granularity of a task. A task can be a rather large chunk of code, though, so the next step is to use a memory profiler that can point to specific lines of code.

When working on the gene sequencing tool I used the memory_profiler package, and while that worked, and I managed to reduce memory usage quite a bit, I found it quite difficult to use. It turns out that for batch data processing, the typical use case for Dask, you want a different kind of memory profiler.

So after I’d left that job, I created a memory profiler called Fil that is expressly designed for finding peak memory usage. Unlike dask-memusage, which can be run on production workloads, Fil slows down your execution and has other limitations I’m currently working on (it doesn’t support multiple processes, as of March 2021), so for now it’s better used for manual profiling.

We can write a little script that only runs on big.txt:

from pathlib import Path
from example2 import calculate_top_10

calculate_top_10(Path("files/big.txt"))

Run it under Fil:

pip install filprofiler
fil-profile run example3.py

And the result shows us where the bulk of the memory is being allocated:

Reading in the file takes 8% of memory, but data.split() is responsible for 84% of memory. Perhaps we shouldn’t be loading the whole file into memory and splitting the whole file into words, and instead we should be processing the file line by line. A good next step if this were real code would be to fix the way calculate_top_10() is implemented.

Next steps

What should you do if your Dask workload is using too much memory?

If you’re running Dask workloads with the Distributed backend, and you’re fine with only having one thread per worker, running with dask-memusage will give you real-world per-task memory usage on production workloads. You can then use the resulting information in a variety of ways:

As a starting point for optimizing memory usage. Once you know which tasks use the most memory, you can then use Fil to figure out which lines of code are responsible and then use a variety of techniques to reduce memory usage.
When possible, you can fine tune your chunking size; smaller chunks will use less memory. If you’re using Dask Arrays you can set the chunk size; with Dask Dataframes you can ensure good partition sizes.
You can fine tune your hardware configuration, so you’re not wasting RAM or CPU cores. For example, on AWS you can choose a variety of instance sizes with different RAM/CPU ratios, one of which may match your workload characteristics.

In my original use case, the gene sequencing pipeline, I was able to use a combination of lower memory use and lower CPU use to reduce costs to a much more modest level. And when doing R&D, I was able to get faster results with the same hardware costs.

You can learn more about dask-memusage here, and learn more about the Fil memory profiler here.

Configuring a Distributed Dask Cluster

2020-07-30T00:00:00+00:00

Configuring a Dask cluster can seem daunting at first, but the good news is that the Dask project has a lot of built in heuristics that try its best to anticipate and adapt to your workload based on the machine it is deployed on and the work it receives. Possibly for a long time you can get away with not configuring anything special at all. That being said, if you are looking for some tips to move on from using Dask locally, or have a Dask cluster that you are ready to optimize with some more in-depth configuration, these tips and tricks will help guide you and link you to the best Dask docs on the topic!

The biggest jump for me was from running a local version of Dask for just an hour or so at a time during development, to standing up a production-ready version of Dask. Broadly there are two styles:

a static dask cluster – one that is always on, always awake, always ready to accept work
an ephemeral dask cluster – one that is spun up or down easily with a Python API, and, when on, starts a minimal dask master node that itself only spins up dask workers when work is actually submitted

Though those are the two broad main categories, there are tons of choices of how to actually achieve that. It depends on a number of factors including what cloud provider products you want to use and if those resources are pre-provisioned for you and whether you want to use a python API or a different deployment tool to actually start the Dask processes. A very exhaustive list of all the different ways you could provision a dask cluster is in the dask docs under Setup. As just a taste of what is described in those docs, you could:

Install and start up the dask processes manually from the CLI on cloud instances you provision, such as AWS EC2 or GCP GCE
Use popular deployment interfaces such as helm for kubernetes to deploy dask in cloud container clusters you provision, such as AWS Fargate or GCP GKE
Use ‘native’ deployment python APIs, provided by the dask developers, to create (and interactively configure) dask on deployment infrastructure they support, either through the general-purpose Dask Gateway which supports multiple backends, or directly against cluster managers such as kubernetes with dask-kubernetes or YARN with dask-yarn, as long as you’ve already provisioned the kubernetes cluster or hadoop cluster already
Use a nearly full-service deployment python API called Dask Cloud Provider, that will go one step farther and provision the cluster for you too, as long as you give it AWS credentials (and as of time of writing, it only supports AWS)

As you can see, there are a ton of options. On top of all of those, you might contract a managed service provider to provision and configure your dask cluster for you according to your specs, such as Saturn Cloud (Disclaimer: one of the authors (Julia Signell) works for Saturn Cloud).

Whatever you choose, the whole point is to unlock the power of parallelism in Python that Dask provides, in as scalable a manner as possible which is what getting it running on distributed infrastructure is all about. Once you know where and with what API you are going to deploy your dask cluster, the real configuration process for your Dask cluster and its workload begins.

How to choose instance type for your cluster

When you are ready to set up your dask cluster for production, you will need to make some decisions about the infrastructure your scheduler and your workers will be running on, especially if you are using one of the options from How to host a distributed dask cluster that requires pre-provisioned infrastructure. Whether your infrastructure is on-prem or in the cloud, the classic decision points need to be made:

Memory requirements
CPU requirements
Storage requirements

If you have tested your workload locally, a simple heuristic is to multiply the CPU, storage, and memory usage of your work by some multiplier that is related to how scaled down your local experiments are from your expected production usage. For example, if you test your workload locally with a 10% sample of data, multiplying any observed resource usage by at least 10 may get you close to your minimum instance size. Though in reality Dask’s many underlying optimizations means that it shouldn’t regularly require linear growth of resources to work on more data, this simple heuristic may give you a starting point as a good first pass technique.

In the same vein, choosing the smallest instance and running with a predetermined subset of data and scaling up until it runs effectively gives you a hint towards the minimum instance size. If your local environment is too underpowered to run your flows locally with 10%+ of your source data, if it is a highly divergent environment (for example a different OS, or with many competing applications running in the background), or if it is difficult or annoying to monitor CPU, memory, and storage of your flow’s execution using your local machine, isolating the test case on the smallest workable node is a better option.

On the flip side, choosing the biggest instance you can afford and observing the discrepancy between max CPU/memory/storage metrics and scaling back based on the ratio of unused resources can be a quicker way to find your ideal size.

Wherever you land on node size might be heavily influenced by what you want to pay for, but as long as your node size is big enough that you are avoiding strict out of memory errors, the flip side of what you pay for with nodes closest to your minimum run specs is time. Since the point of your Dask cluster is to run distributed, parallel computations, you can get significant time savings if you scale up your instance to allow for more parallelism. If you have long running models that take hours to train that you can reduce to minutes, and get back some of your time or your employee’s time to see the feedback loop quickly, then scaling up over your minimum specs is worth it.

Should your scheduler node and worker nodes be the same size? It may certainly be tempting to provision them at separate instance sizes to optimize resources. It’s worth a quick dive into the general resource requirements of each to get a good sense.

For the scheduler, a serialized version of each task is submitted to it is held in memory for as long as it needs to determine which worker should take the work. This is not necessarily the same amount of memory needed to actually execute the task, but skimping too much on memory here may prevent work from being scheduled. From a CPU perspective, the needs of the scheduler are likely much lower than your workers, but starving the scheduler of CPU will cause deadlock, and when the scheduler is stuck or dies your workers also cannot get any work. Storage wise, the Dask scheduler does not persist much to disk, even temporarily, so it’s storage needs are quite low.

For the workers, the specific resource needs of your task code may overtake any generalizations we can make. If nothing else, they need enough memory and CPU to deserialize each task payload, and serialize it up again to return as a Future to the Dask scheduler. Dask workers may persist the results of computations in memory, including distributed across the memory of the cluster, which you can read more about here. Regarding storage needs, fundamentally tasks submitted to Dask workers should not write to local storage - the scheduler does not guarantee work will be run on a given worker - so the storage costs should be directly related to the installation footprint of your worker’s dependencies and any ephemeral storage of the dask workers. Temporary files the workers create may include spilling in-memory data to local disk if they run out of memory as long as that behavior isn’t disabled, which means that reducing memory can have an effect on your ephemeral storage needs.

Generally we would recommend simplifying your life and keeping your scheduler and worker nodes the same node size, but if you wanted to optimize them, use the above CPU, memory and storage patterns to give you a starting point for configuring them separately.

How to choose number of workers

Every dask cluster has one scheduler and any number of workers. The scheduler keeps track of what work needs to be done and what has already been completed. The workers do work, share results between themselves and report back to the scheduler. More background on what this entails is available in the dask.distributed documentation.

When setting up a dask cluster you have to decide how many workers to use. It can be tempting to use many workers, but that isn’t always a good idea. If you use too many workers some may not have enough to do and spend much of their time idle. Even if they have enough to do, they might need to share data with each other which can be slow. Additionally if your machine has finite resources (rather than one node per worker), then each worker will be weaker - they might run out of memory, or take a long time to finish a task.

On the other hand if you use too few workers you don’t get to take full advantage of the parallelism of dask and your work might take longer to complete overall.

Before you decide how many workers to use, try using the default. In many cases dask can choose a default that makes use of the size and shape of your machine. If that doesn’t work, then you’ll need some information about the size and shape of your work. In particular you’ll want to know:

What size is your computer or what types of compute nodes do you have access to?
How big is your data?
What is the structure of the computation that you are trying to do?

If you are working on your local machine, then the size of the computer is fixed and knowable. If you are working on HPC or cloud instances then you can choose the resources allotted to each worker. You make the decision about the size of your cluster based on factors we discussed in How to choose instance type for your cluster.

Dask is often used in situations where the data are too big to fit in memory. In these cases the data are split into chunks or partitions. Each task is computed on the chunk and then the results are aggregated. You will learn about how to change the shape of your data below.

The structure of the computation might be the hardest to reason about. If possible, it can be helpful to try out the computation on a very small subset of the data. You can see the task graph for a particular computation by calling .visualize(). If the graph is too large to comfortably view inline, then take a look at the Dask dashboard graph tab. This shows the task graph as it runs and lights up each section. To make dask most efficient, you want a task graph that isn’t too big or too interconnected. The dask docs discuss several techniques for optimizing your task graph.

To pick the number of workers to use, think about how many concurrent tasks are happening at any given part of the graph. If each task contains a non-trivial amount of work, then the fastest way to run dask is to have a worker for each concurrent task. For chunked data, if each worker is able to comfortably hold one data chunk in memory and do some computation on that data, then the number of chunks should be a multiple of the number of workers. This ensures that there is always enough work for a worker to do.

If you have a highly variable number of tasks, then you can also consider using an adaptive cluster. In an adaptive cluster, you set the minimum and maximum number of workers and let the cluster add and remove workers as needed. When the scheduler determines that some workers aren’t needed anymore it asks the cluster to shut them down, and when more are needed, the scheduler asks the cluster to spin more up. This can work nicely for task graphs that start out with few input tasks then have more tasks in the middle, and then some aggregation or reductions at the end.

Once you have started up your cluster with some workers, you can monitor their progress in the dask dashboard. There you can check on their memory consumption, watch their progress through the task graph, and access worker-level logs. Watching your computation in this way, provides insight into potential speedups and builds intuition about the number of workers to use in the future.

The tricky bit about choosing the number of workers to use is that in practice the size and shape of your machine, data, and task graph can change. Figuring out how many workers to use can end up feeling like an endless fiddling of knobs. If this is starting to drive you crazy then remember that you can always change the number or workers, even while the cluster is running.

How to choose nthreads to utilize multithreading

When starting dask workers themselves, there are two very important configuration options to play against each other: how many workers and how many threads per worker. You can actually manipulate both on the same worker process with flags, such as in the form dask-worker --nprocs 2 --nthreads 2, though --nprocs simply spins up another worker in the background so it is cleaner configuration to avoid setting --nprocs and instead manipulate that configuration with whatever you use to specify total number of workers. We already talked about how to choose number of workers, but you may modify your decision about that if you change a workers’ --nthreads to increase the amount of work an individual worker can do.

When deciding the best number of nthreads for your workers, it all boils down to the type of work you expect those workers to do. The fundamental principle is that multiple threads are best to share data between tasks, but worse if running code that doesn’t release Python’s GIL (“Global Interpreter Lock”). Increasing the nthreads for work that does not release the Python’s GIL has no effect; the worker cannot use threading to optimize the speed of computation if the GIL is locked. This is a possible point of confusion for new Dask users who want to increase their parallelism, but don’t see any gains from increasing the threading limit of their workers.

As discussed in the Dask docs on workers, there are some rules of thumb when to worry about GIL lockages, and thus prefer more workers over heavier individual workers with high nthreads:

If your code is mostly pure Python (in non-optimized Python libraries) on non-numerical data
If your code causes computations external to Python that are long running and don’t release the GIL explicitly

Conveniently, a lot of dask users are running exclusively numerical computations using Python libraries optimized for multithreading, namely NumPy, Pandas, SciPy, etc in the PyData stack. If you do mostly numerical computations using those or similarly optimized libraries, you should emphasize a higher thread count. If you truly are doing mostly numerical computations, you can specify as many total threads as you have cores; if you are doing any work that would cause a thread to pause, for example any I/O (to write results to disk, perhaps), you can specify more threads than you have cores, since some will be occasionally sitting idle. The ideal number regarding how many more threads than cores to set in that situation is complex to estimate and dependent on your workload, but taking some advice from concurrent.futures, 5 times the processors on your machine is a historical upper bound to limit your total thread count to for heavily I/O dependent workloads.

How to chunk arrays and partition DataFrames

There are many different methods of triggering work in dask. For instance: you can wrap functions with delayed or submit work directly to the client (for a comparison of the options see User Interfaces). If you are loading structured data into dask objects, then you are likely using dask.array or dask.dataframe. These modules mimic numpy and pandas respectively - making it easier to interact with large arrays and large tabular datasets.

When using dask.dataframe and dask.array, computations are divided among workers by splitting the data into pieces. In dask.dataframe these pieces are called partitions and in dask.array they are called chunks, but the principle is the same. In the case of dask.array each chunk holds a numpy array and in the case of dask.dataframe each partition holds a pandas dataframe. Either way, each one contains a small part of the data, but is representative of the whole and must be small enough to comfortably fit in worker memory.

Often when loading in data, the partitions/chunks will be determined automatically. For instance, when reading from a directory containing many csv files, each file will become a partition. If your data are not split up by default, then it can be done manually using df.set_index or array.rechunk. If they are split up by default and you want to change the shape of the chunks, the file-level chunks should be a multiple of the dask level chunks (read more about this here).

As the user, you know how the data are going to be used, so you can often partition it in ways that lead to more efficient computations. For instance if you are going to be aggregating to a monthly step, it can make sense to chunk along the time axis. If instead you are going to be looking at a particular feature at different altitudes, it might make sense to chunk along the altitude. More tips for chunking dask.arrays are described in Best Practices. Another scenario in which it might be helpful to repartition is if you have filtered the data down to a subset of the original. In that case your partitions will likely be too small. See the dask.dataframe Best Practices for more details on how to handle that case.

When choosing the size of chunks it is best to make them neither too small, nor too big (around 100MB is often reasonable). Each chunk needs to be able to fit into the worker memory and operations on that chunk should take some non-trivial amount of time (more than 100ms). For many more recommendations take a look at the docs on chunks and on partitions.

We hope this helps you make decisions about whether to configure your Dask deployment differently and give you the confidence to try it out. We found all of this great information in the Dask docs, so if you are feeling inspired please follow the links we’ve sprinkled throughout and learn even more about Dask!

Dask-jobqueue

2018-10-08T00:00:00+00:00

This work was done in collaboration with Matthew Rocklin (Anaconda), Jim Edwards (NCAR), Guillaume Eynard-Bontemps (CNES), and Loïc Estève (INRIA), and is supported, in part, by the US National Science Foundation Earth Cube program. The dask-jobqueue package is a spinoff of the Pangeo Project. This blogpost was previously published here

TLDR; Dask-jobqueue allows you to seamlessly deploy dask on HPC clusters that use a variety of job queuing systems such as PBS, Slurm, SGE, or LSF. Dask-jobqueue provides a Pythonic user interface that manages dask workers/clusters through the submission, execution, and deletion of individual jobs on a HPC system. It gives users the ability to interactively scale workloads across large HPC systems; turning an interactive Jupyter Notebook into a powerful tool for scalable computation on very large datasets.

Install with:

conda install -c conda-forge dask-jobqueue

pip install dask-jobqueue

And checkout the dask-jobqueue documentation: http://jobqueue.dask.org

Large high-performance computer (HPC) clusters are ubiquitous throughout the computational sciences. These HPC systems include powerful hardware, including many large compute nodes, high-speed interconnects, and parallel file systems. An example of such systems that we use at NCAR is named Cheyenne. Cheyenne is a fairly large machine, with about 150k cores and over 300 TB of total memory.

Cheyenne is a 5.34-petaflops, high-performance computer operated by NCAR.

These systems frequently use a job queueing system, such as PBS, Slurm, or SGE, to manage the queueing and execution of many concurrent jobs from numerous users. A “job” is a single execution of a program that is to be run on some set of resources on the user’s HPC system. These jobs are often submitted via the command line:

qsub do_thing_a.sh

Where do_thing_a.sh is a shell script that might look something like this:

#!/bin/bash
#PBS -N thing_a
#PBS -q premium
#PBS -A 123456789
#PBS -l select=1:ncpus=36:mem=109G

echo “doing thing A”

In this example “-N” specifies the name of this job, “-q” specifies the queue where the job should be run, “-A” specifies a project code to bill for the CPU time used while the job is run, and “-l” specifies the hardware specifications for this job. Each job queueing system has slightly different syntax for configuring and submitting these jobs.

This interface has led to the development of a few common workflow patterns:

MPI if you want to scale. MPI stands for the Message Passing Interface. It is a widely adopted interface allowing parallel computation across traditional HPC clusters. Many large computational models are written in languages like C and Fortran and use MPI to manage their parallel execution. For the old-timers out there, this is the go-to solution when it comes time to scale complex computations.
Batch it. It is quite common for scientific processing pipelines to include a few steps that can be easily parallelized by submitting multiple jobs in parallel. Maybe you want to “do_thing_a.sh” 500 times with slightly different inputs — easy, just submit all the jobs separately (or in what some queueing systems refer to as “array-job”).
Serial is still okay. Computers are pretty fast these days, right? Maybe you don’t need to parallelize your programing at all. Okay, so keep it serial and get some coffee while your job is running.

The Problem

None of the workflow patterns listed above allow for interactive analysis on very large data analysis. When I’m prototyping new processing method, I often want to work interactively, say in a Jupyter Notebook. Writing MPI code on the fly is hard and expensive, batch jobs are inherently not interactive, and serial just won’t do when I start working on many TBs of data. Our experience is that these workflows tend to be fairly inelegant and difficult to transfer between applications, yielding lots of duplicated effort along the way.

One of the aims of the Pangeo project is to facilitate interactive data on very large datasets. Pangeo leverages Jupyter and dask, along with a number of more domain specific packages like xarray to make this possible. The problem is we didn’t have a particularly palatable method for deploying dask on our HPC clusters.

The System

Jupyter Notebooks are web applications that support interactive code execution, display of figures and animations, and in-line explanatory text and equations. They are quickly becoming the standard open-source format for interactive computing in Python.
Dask is a library for parallel computing that coordinates well with Python’s existing scientific software ecosystem, including libraries like NumPy, Pandas, Scikit-Learn, and xarray. In many cases, it offers users the ability to take existing workflows and quickly scale them to much larger applications. *Dask-distributed* is an extension of dask that facilitates parallel execution across many computers.
Dask-jobqueue is a new Python package that we’ve built to facilitate the deployment of dask on HPC clusters and interfacing with a number of job queuing systems. Its usage is concise and Pythonic.

from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(cores=36,
                     memory="108GB",
                     queue="premium")
cluster.scale(10)
client = Client(cluster)

What’s happening under the hood?

In the call to PBSCluster() we are telling dask-jobqueue how we want to configure each job. In this case, we set each job to have 1 Worker, each using 36 cores (threads) and 108 GB of memory. We also tell the PBS queueing system we’d like to submit this job to the “premium” queue. This step also starts a Scheduler to manage workers that we’ll add later.
It is not until we call the cluster.scale() method that we interact with the PBS system. Here we start 10 workers, or equivalently 10 PBS jobs. For each job, dask-jobqueue creates a shell command similar to the one above (except dask-worker is called instead of echo) and submits the job via a subprocess call.
Finally, we connect to the cluster by instantiating the Client class. From here, the rest of our code looks just as it would if we were using one of dask’s local schedulers.

Dask-jobqueue is easily customizable to help users capitalize on advanced HPC features. A more complicated example that would work on NCAR’s Cheyenne super computer is:

cluster = PBSCluster(cores=36,
                    processes=18,
                    memory="108GB",
                    project='P48500028',
                    queue='premium',
                    resource_spec='select=1:ncpus=36:mem=109G',
                    walltime='02:00:00',
                    interface='ib0',
                    local_directory='$TMPDIR')

In this example, we instruct the PBSCluster to 1) use up to 36 cores per job, 2) use 18 worker processes per job, 3) use the large memory nodes with 109 GB each, 4) use a longer walltime than is standard, 5) use the InfiniBand network interface (ib0), and 6) use the fast SSD disks as its local directory space.

Finally, Dask offers the ability to “autoscale” clusters based on a set of heuristics. When the cluster needs more CPU or memory, it will scale up. When the cluster has unused resources, it will scale down. Dask-jobqueue supports this with a simple interface:

cluster.adapt(minimum=18, maximum=360)

In this example, we tell our cluster to autoscale between 18 and 360 workers (or 1 and 20 jobs).

Demonstration

We have put together a fairly comprehensive screen cast that walks users through all the steps of setting up Jupyter and Dask (and dask-jobqueue) on an HPC cluster:

Conclusions

Dask jobqueue makes it much easier to deploy Dask on HPC clusters. The package provides a Pythonic interface to common job-queueing systems. It is also easily customizable.

The autoscaling functionality allows for a fundamentally different way to do science on HPC clusters. Start your Jupyter Notebook, instantiate your dask cluster, and then do science — let dask determine when to scale up and down depending on the computational demand. We think this bursting approach to interactive parallel computing offers many benefits.

Finally, in developing dask-jobqueue, we’ve run into a few challenges that are worth mentioning.

Queueing systems are highly customizable. System administrators seem to have a lot of control over their particularly implementation of each queueing system. In practice, this means that it is often difficult to simultaneously cover all permutations of a particular queueing system. We’ve generally found that things seem to be flexible enough and welcome feedback in the cases where they are not.
CI testing has required a fair bit of work to setup. The target environment for using dask-jobqueue is on existing HPC clusters. In order to facilitate continuous integration testing of dask-jobqueue, we’ve had to configure multiple queueing systems (PBS, Slurm, SGE) to run in docker using Travis CI. This has been a laborious task and one we’re still working on.
We’ve built dask-jobqueue to operate in the dask-deploy framework. If you are familiar with dask-kubernetes or dask-yarn, you’ll recognize the basic syntax in dask-jobqueue as well. The coincident development of these dask deployment packages has recently brought up some important coordination discussions (e.g. dask/distributed#2235).

Dask Working Notes - Posts tagged distributed

Shuffling large data at constant memory in Dask

Task-based shuffling scales poorly

P2P shuffling

Results

Memory stability

Performance and Speed

Changing defaults

Share your experience

Keep old behavior

What about arrays?

Next steps

Summary

Data Proximate Computation on a Dask Cluster Distributed Between Data Centres

Introduction

Setup

Mount the Data

Access to Data

Our Calculation

Running Up a Cluster

Connecting to the Cluster

Running on the Cluster

Data-Proximate Computation

Catalogs

More Information

Acknowledgements

Measuring Dask memory usage with dask-memusage

Measuring peak task memory usage with dask-memusage {#dask-memusage}

Using dask-memusage

dask-memusage output, and how it works

Interpreting the data

Reducing memory usage with Fil {#fil}

Next steps

Configuring a Distributed Dask Cluster

How to choose instance type for your cluster

How to choose number of workers

How to choose nthreads to utilize multithreading

How to chunk arrays and partition DataFrames

Dask-jobqueue

The Problem

The System

What’s happening under the hood?

Demonstration

Conclusions

Measuring peak task memory usage with `dask-memusage` {#dask-memusage}

Using `dask-memusage`

`dask-memusage` output, and how it works