<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posted in 2020</title>
  <updated>2026-03-05T15:05:21.504990+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/2020/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2020/11/12/deconvolution/</id>
    <title>Image Analysis Redux</title>
    <updated>2020-11-12T00:00:00+00:00</updated>
    <author>
      <name>John Kirkham (NVIDIA) and Ben Zaitlen (NVIDIA)</name>
    </author>
    <content type="html">
&lt;section id="summary"&gt;

&lt;p&gt;&lt;a class="reference external" href="https://blog.dask.org/2019/08/09/image-itk"&gt;Last year&lt;/a&gt; we experimented with
Dask/ITK/Scikit-Image to perform large scale image analysis on a stack of 3D
images. Specifically, we looked at deconvolution, a common method to &lt;em&gt;deblur&lt;/em&gt;
images. Now, a year later, we return to these experiments with a better
understanding of how Dask and CuPy can interact, enhanced serialization
methods, and support from the open-source community. This post looks at the
following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Implementing a common deconvolution method for CPU + GPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leveraging Dask to perform deconvolution on a larger dataset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exploring the results with the Napari image viewer&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="image-analysis-redux"&gt;
&lt;h1&gt;Image Analysis Redux&lt;/h1&gt;
&lt;p&gt;Previously we used the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Richardson%E2%80%93Lucy_deconvolution"&gt;Richardson Lucy
(RL)&lt;/a&gt;
deconvolution algorithm from ITK and
&lt;a class="reference external" href="https://github.com/scikit-image/scikit-image/blob/master/skimage/restoration/deconvolution.py#L329"&gt;Scikit-Image&lt;/a&gt;.
We left off theorizing about how GPUs could help accelerate these
workflows. Starting with Scikit-Image’s implementation, we naively tried
replacing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scipy.signal.convolve&lt;/span&gt;&lt;/code&gt; calls with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupyx.scipy.ndimage.convolve&lt;/span&gt;&lt;/code&gt;,
and while performance improved, it did not improve &lt;em&gt;significantly&lt;/em&gt; – that is,
we did not get the 100x speedup we were looking for.&lt;/p&gt;
&lt;p&gt;As often happens in mathematics, a problem that is inefficient to solve
in one representation can become much more efficient after transforming the
data beforehand. In the new representation we can solve the same problem
(convolution, in this case) more easily, then transform the result back into
the familiar representation. For convolution, the transformation we apply is
the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Fast_Fourier_transform"&gt;Fast Fourier Transform
(FFT)&lt;/a&gt;. Once the data is in Fourier space, we can convolve it with a
simple elementwise multiplication.&lt;/p&gt;
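&lt;p&gt;As a quick illustration of this equivalence (a minimal sketch on small random arrays, not part of the original workflow), SciPy&#8217;s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fftconvolve&lt;/span&gt;&lt;/code&gt; computes the same result as direct convolution:&lt;/p&gt;

```python
import numpy as np
from scipy.signal import convolve, fftconvolve

# Minimal sketch: convolving in real space and via FFT multiplication
# give the same answer, up to floating point error.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
kernel = rng.random((7, 7))

direct = convolve(img, kernel, mode="same")      # real-space convolution
via_fft = fftconvolve(img, kernel, mode="same")  # FFT-based convolution

print(np.allclose(direct, via_fft))  # True
```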
&lt;p&gt;The FFT itself is extremely fast on both CPUs and GPUs, and the algorithm
rewritten in terms of FFTs is accelerated as well. This is a commonly used
technique in the image processing field to speed up convolutions: despite the
added step of performing FFTs, the cost of the transformation plus the cost of
the Fourier-space algorithm is still lower than performing the original
algorithm in real space. We (and others before us) found this was the case for
Richardson Lucy (on both CPUs and GPUs), and performance continued to increase
when we parallelized with Dask over multiple GPUs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="help-from-open-source"&gt;
&lt;h1&gt;Help from Open-Source&lt;/h1&gt;
&lt;p&gt;An FFT RL equivalent has been around for some time and the good folks at the
&lt;a class="reference external" href="https://sdo.gsfc.nasa.gov/mission/instruments.php"&gt;Solar Dynamics Observatory&lt;/a&gt;
built and shared a NumPy/CuPy implementation as part of the &lt;a class="reference external" href="https://aiapy.readthedocs.io/en/v0.2.0/_modules/aiapy/psf/deconvolve.html"&gt;Atmospheric Imaging
Assembly&lt;/a&gt;
Python package (aiapy). We slightly modified their implementation to handle 3D
as well as 2D &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Point_spread_function"&gt;Point Spread
Functions&lt;/a&gt; and to take
advantage of
&lt;a class="reference external" href="https://numpy.org/neps/nep-0018-array-function-protocol.html"&gt;NEP-18&lt;/a&gt; for
convenient dispatching of NumPy and CuPy arrays to the corresponding NumPy and CuPy functions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;deconvolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pad PSF with zeros to match image shape&lt;/span&gt;
    &lt;span class="n"&gt;pad_l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pad_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;divmod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pad_r&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;pad_l&lt;/span&gt;
    &lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pad_l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pad_r&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;constant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constant_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Recenter PSF at the origin&lt;/span&gt;
    &lt;span class="c1"&gt;# Needed to ensure PSF doesn&amp;#39;t introduce an offset when&lt;/span&gt;
    &lt;span class="c1"&gt;# convolving with image&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;roll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Convolution requires FFT of the PSF&lt;/span&gt;
    &lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rfftn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform deconvolution in-place on a copy of the image&lt;/span&gt;
    &lt;span class="c1"&gt;# (avoids changing the original)&lt;/span&gt;
    &lt;span class="n"&gt;img_decon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;irfftn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rfftn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_decon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;img_decon&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;irfftn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rfftn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conj&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conj&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;img_decon&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For a 1.3 GB image we measured the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;CuPy ~3 seconds for 20 iterations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NumPy ~36 seconds for 2 iterations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We see a 10x increase in speed while running 10 times the number of
iterations – very close to our desired 100x speedup! Let’s explore how this
implementation performs with real biological data and Dask…&lt;/p&gt;
&lt;/section&gt;
&lt;section id="define-a-dask-cluster-and-load-the-data"&gt;
&lt;h1&gt;Define a Dask Cluster and Load the Data&lt;/h1&gt;
&lt;p&gt;We were provided sample data from &lt;a class="reference external" href="https://www.nibib.nih.gov/about-nibib/staff/hari-shroff"&gt;Prof.
Shroff’s&lt;/a&gt; lab at the
NIH. The data originally was provided as a 3D TIFF file which we subsequently
converted to Zarr with a shape of (950, 2048, 2048).&lt;/p&gt;
&lt;p&gt;We start by creating a Dask cluster on a DGX2 (16 GPUs in a single node) and
loading the image stored in Zarr:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/quasiben/3a638bb9a4f075ac9041bf66974ebb45"&gt;Example Notebook&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;rmm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cp&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/tmp/bzaitlen&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_nvlink&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rmm_pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;26GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_allocator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rmm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rmm_cupy_allocator&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;imgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/public/NVMICROSCOPY/y1z1_C1_A.zarr/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 7.97 GB &lt;/td&gt; &lt;td&gt; 8.39 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (950, 2048, 2048) &lt;/td&gt; &lt;td&gt; (1, 2048, 2048) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 951 Tasks &lt;/td&gt;&lt;td&gt; 950 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="212" height="202" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="42" y2="32" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="120" x2="42" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="12" y1="2" x2="12" y2="122" /&gt;
  &lt;line x1="14" y1="4" x2="14" y2="124" /&gt;
  &lt;line x1="16" y1="6" x2="16" y2="126" /&gt;
  &lt;line x1="18" y1="8" x2="18" y2="128" /&gt;
  &lt;line x1="20" y1="10" x2="20" y2="130" /&gt;
  &lt;line x1="22" y1="12" x2="22" y2="132" /&gt;
  &lt;line x1="24" y1="14" x2="24" y2="134" /&gt;
  &lt;line x1="26" y1="16" x2="26" y2="136" /&gt;
  &lt;line x1="28" y1="18" x2="28" y2="138" /&gt;
  &lt;line x1="30" y1="20" x2="30" y2="140" /&gt;
  &lt;line x1="32" y1="22" x2="32" y2="142" /&gt;
  &lt;line x1="34" y1="24" x2="34" y2="144" /&gt;
  &lt;line x1="36" y1="26" x2="36" y2="146" /&gt;
  &lt;line x1="38" y1="28" x2="38" y2="148" /&gt;
  &lt;line x1="41" y1="31" x2="41" y2="151" /&gt;
  &lt;line x1="42" y1="32" x2="42" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 42.743566,32.743566 42.743566,152.743566 10.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="130" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="12" y1="2" x2="132" y2="2" /&gt;
  &lt;line x1="14" y1="4" x2="134" y2="4" /&gt;
  &lt;line x1="16" y1="6" x2="136" y2="6" /&gt;
  &lt;line x1="18" y1="8" x2="138" y2="8" /&gt;
  &lt;line x1="20" y1="10" x2="140" y2="10" /&gt;
  &lt;line x1="22" y1="12" x2="142" y2="12" /&gt;
  &lt;line x1="24" y1="14" x2="144" y2="14" /&gt;
  &lt;line x1="26" y1="16" x2="146" y2="16" /&gt;
  &lt;line x1="28" y1="18" x2="148" y2="18" /&gt;
  &lt;line x1="30" y1="20" x2="150" y2="20" /&gt;
  &lt;line x1="32" y1="22" x2="152" y2="22" /&gt;
  &lt;line x1="34" y1="24" x2="154" y2="24" /&gt;
  &lt;line x1="36" y1="26" x2="156" y2="26" /&gt;
  &lt;line x1="38" y1="28" x2="158" y2="28" /&gt;
  &lt;line x1="41" y1="31" x2="161" y2="31" /&gt;
  &lt;line x1="42" y1="32" x2="162" y2="32" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="42" y2="32" style="stroke-width:2" /&gt;
  &lt;line x1="130" y1="0" x2="162" y2="32" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 130.000000,0.000000 162.743566,32.743566 42.743566,32.743566" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="42" y1="32" x2="162" y2="32" style="stroke-width:2" /&gt;
  &lt;line x1="42" y1="152" x2="162" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="42" y1="32" x2="42" y2="152" style="stroke-width:2" /&gt;
  &lt;line x1="162" y1="32" x2="162" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="42.743566,32.743566 162.743566,32.743566 162.743566,152.743566 42.743566,152.743566" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="102.743566" y="172.743566" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;2048&lt;/text&gt;
&lt;text x="182.743566" y="92.743566" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,182.743566,92.743566)"&gt;2048&lt;/text&gt;
&lt;text x="16.371783" y="156.371783" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,16.371783,156.371783)"&gt;950&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;From the Dask output above you can see the data is a z-stack of 950 images,
where each slice is 2048x2048. For this data set, we can improve GPU
performance if we operate on larger chunks. Additionally, we need to ensure
the chunks are at least as big as the PSF, which in this case is (128, 128,
128). As we did our work on a DGX2, which has 16 GPUs, we can comfortably fit
the data and perform deconvolution on each GPU if we &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;rechunk&lt;/span&gt;&lt;/code&gt; the data
accordingly:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# chunk with respect to PSF shape (128, 128, 128)&lt;/span&gt;
&lt;span class="n"&gt;imgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;190&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;imgs&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 7.97 GB &lt;/td&gt; &lt;td&gt; 99.61 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (950, 2048, 2048) &lt;/td&gt; &lt;td&gt; (190, 512, 512) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 967 Tasks &lt;/td&gt;&lt;td&gt; 80 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
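&lt;p&gt;As a quick check of the chunk arithmetic in the table above (a small illustrative computation; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;uint16&lt;/span&gt;&lt;/code&gt; is 2 bytes per element):&lt;/p&gt;

```python
import numpy as np

# Verify the numbers Dask reports for the rechunked array
shape = (950, 2048, 2048)
chunks = (190, 512, 512)
itemsize = np.dtype("uint16").itemsize  # 2 bytes

n_chunks = int(np.prod([s // c for s, c in zip(shape, chunks)]))
chunk_mb = int(np.prod(chunks)) * itemsize / 1e6
total_gb = int(np.prod(shape)) * itemsize / 1e9

print(n_chunks)            # 80
print(round(chunk_mb, 2))  # 99.61
print(round(total_gb, 2))  # 7.97
```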
&lt;p&gt;Next, we convert to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;float32&lt;/span&gt;&lt;/code&gt;, since the data may not already be of
floating-point type, and 32-bit computation is faster than 64-bit while using
half the memory. Below we map &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy.asarray&lt;/span&gt;&lt;/code&gt; onto each block of data; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy.asarray&lt;/span&gt;&lt;/code&gt;
moves the data from host memory (NumPy) to the device/GPU (CuPy).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;imgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;c_imgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 15.94 GB &lt;/td&gt; &lt;td&gt; 199.23 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (950, 2048, 2048) &lt;/td&gt; &lt;td&gt; (190, 512, 512) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 80 Tasks &lt;/td&gt;&lt;td&gt; 80 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; float32 &lt;/td&gt;&lt;td&gt; cupy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="212" height="202" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="42" y2="32" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="30" x2="42" y2="62" /&gt;
  &lt;line x1="10" y1="60" x2="42" y2="92" /&gt;
  &lt;line x1="10" y1="90" x2="42" y2="122" /&gt;
  &lt;line x1="10" y1="120" x2="42" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="16" y1="6" x2="16" y2="126" /&gt;
  &lt;line x1="23" y1="13" x2="23" y2="133" /&gt;
  &lt;line x1="29" y1="19" x2="29" y2="139" /&gt;
  &lt;line x1="36" y1="26" x2="36" y2="146" /&gt;
  &lt;line x1="42" y1="32" x2="42" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.0,0.0 42.74356617647059,32.74356617647059 42.74356617647059,152.74356617647058 10.0,120.0" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="130" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="16" y1="6" x2="136" y2="6" /&gt;
  &lt;line x1="23" y1="13" x2="143" y2="13" /&gt;
  &lt;line x1="29" y1="19" x2="149" y2="19" /&gt;
  &lt;line x1="36" y1="26" x2="156" y2="26" /&gt;
  &lt;line x1="42" y1="32" x2="162" y2="32" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="42" y2="32" style="stroke-width:2" /&gt;
  &lt;line x1="40" y1="0" x2="72" y2="32" /&gt;
  &lt;line x1="70" y1="0" x2="102" y2="32" /&gt;
  &lt;line x1="100" y1="0" x2="132" y2="32" /&gt;
  &lt;line x1="130" y1="0" x2="162" y2="32" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.0,0.0 130.0,0.0 162.74356617647058,32.74356617647059 42.74356617647059,32.74356617647059" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="42" y1="32" x2="162" y2="32" style="stroke-width:2" /&gt;
  &lt;line x1="42" y1="62" x2="162" y2="62" /&gt;
  &lt;line x1="42" y1="92" x2="162" y2="92" /&gt;
  &lt;line x1="42" y1="122" x2="162" y2="122" /&gt;
  &lt;line x1="42" y1="152" x2="162" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="42" y1="32" x2="42" y2="152" style="stroke-width:2" /&gt;
  &lt;line x1="72" y1="32" x2="72" y2="152" /&gt;
  &lt;line x1="102" y1="32" x2="102" y2="152" /&gt;
  &lt;line x1="132" y1="32" x2="132" y2="152" /&gt;
  &lt;line x1="162" y1="32" x2="162" y2="152" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="42.74356617647059,32.74356617647059 162.74356617647058,32.74356617647059 162.74356617647058,152.74356617647058 42.74356617647059,152.74356617647058" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="102.743566" y="172.743566" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;2048&lt;/text&gt;
&lt;text x="182.743566" y="92.743566" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,182.743566,92.743566)"&gt;2048&lt;/text&gt;
&lt;text x="16.371783" y="156.371783" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,16.371783,156.371783)"&gt;950&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;What we now have is a Dask array composed of 80 CuPy blocks of data spread
across the 16 GPUs. Notice how Dask provides nice typing information in the
SVG output: when we moved from NumPy to CuPy, the block diagram above switched
to displaying &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Type:&lt;/span&gt; &lt;span class="pre"&gt;cupy.ndarray&lt;/span&gt;&lt;/code&gt; –
this is a nice sanity check.&lt;/p&gt;
&lt;p&gt;The last piece we need before running the deconvolution is the PSF which should
also be loaded onto the GPU:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;skimage.io&lt;/span&gt;

&lt;span class="n"&gt;psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/public/NVMICROSCOPY/PSF.tif&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;c_psf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Lastly, we call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_overlap&lt;/span&gt;&lt;/code&gt; with the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;deconvolve&lt;/span&gt;&lt;/code&gt; function across the Dask
array:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;deconvolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;c_imgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;psf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;c_psf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;c_imgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_psf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;boundary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;periodic&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
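&lt;p&gt;For intuition, the overlap depth above is half the PSF extent along each axis,
so neighboring blocks share enough halo for the convolutions near block edges to
be correct. A quick sketch with a hypothetical PSF shape (the real shape comes
from the PSF file loaded above):&lt;/p&gt;

```python
import numpy as np

# hypothetical PSF shape standing in for the real PSF.tif dimensions
psf_shape = (17, 33, 33)

# half the PSF extent per axis: the halo each block borrows from its neighbors
depth = tuple(np.array(psf_shape) // 2)
print(depth)  # (8, 16, 16)
```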
&lt;a href="/images/deconvolve.png"&gt;
    &lt;img src="/images/deconvolve.png" width="100%"&gt;&lt;/a&gt;
&lt;p&gt;The image above is taken from a mouse intestine.&lt;/p&gt;
&lt;p&gt;With Dask and multiple GPUs, we measured deconvolution of a 16GB image in roughly 30
seconds! But this is just the first step toward accelerated image science.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/11/12/deconvolution.md&lt;/span&gt;, line 386)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="napari"&gt;
&lt;h1&gt;Napari&lt;/h1&gt;
&lt;p&gt;Deconvolution is just one of the many operations and tools an image scientist
or microscopist will need as they study the underlying biology. Before getting
to those next steps, they will need tools to visualize the data.
&lt;a class="reference external" href="https://napari.org/"&gt;Napari&lt;/a&gt;, a multi-dimensional image
viewer used in the PyData Bio ecosystem, is a good tool for visualizing it. As
an experiment, we ran the same workflow on a local workstation with 2 Quadro
RTX 8000 GPUs connected with NVLink (&lt;a class="reference external" href="https://gist.github.com/quasiben/02b3dabba8fb3415e40e685b3cb2ca4a"&gt;example
notebook&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;By adding a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_blocks&lt;/span&gt;&lt;/code&gt; call to our array, we can move our data &lt;em&gt;back&lt;/em&gt; from
GPU to CPU (device to host).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;cupy_to_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cp&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asnumpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;np_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cupy_to_numpy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/napari-deconv.png"&gt;
    &lt;img src="/images/napari-deconv.png" width="100%"&gt;&lt;/a&gt;
&lt;p&gt;When the user moves the slider in the Napari UI, we are instructing Dask to do the
following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Load the data from disk onto the GPU (CuPy)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compute the deconvolution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Move back to the host (NumPy)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Render with Napari&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
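&lt;p&gt;The lazy pipeline behind those steps can be sketched with Dask alone. This is
a minimal sketch, not the notebook's code: the deblur function is a hypothetical
stand-in for the GPU deconvolution, and the ones array stands in for the on-disk
image data.&lt;/p&gt;

```python
import numpy as np
import dask.array as da

def deblur(block):
    # hypothetical stand-in for the GPU deconvolution step
    return np.clip(block * 1.5, 0, 1)

# stand-in for the on-disk image stack: 4 planes, one chunk per plane
imgs = da.ones((4, 64, 64), chunks=(1, 64, 64))
lazy = imgs.map_blocks(deblur, dtype=imgs.dtype)

# napari.view_image(lazy) would render this lazily; moving the slider
# computes only the single plane being displayed
plane = lazy[0].compute()
print(plane.shape)  # (64, 64)
```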
&lt;p&gt;This has a latency of about one second, which is great for a naive implementation! We
can improve it by adding caching, improving communication in
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_overlap&lt;/span&gt;&lt;/code&gt;, and optimizing the deconvolution kernel.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/11/12/deconvolution.md&lt;/span&gt;, line 423)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;We have now shown how one can perform Richardson-Lucy deconvolution with
Dask + CuPy using a minimal amount of code. Combining this with an image viewer
(Napari), we were able to inspect the data and our results. All of this
performed reasonably well by assembling PyData libraries: Dask, CuPy, Zarr, and
Napari, together with a new deconvolution kernel. Hopefully this provides you
with a good template for analyzing your own data and demonstrates the richness
and easy expression of custom workflows. If you run into any challenges, please
reach out on &lt;a class="reference external" href="https://github.com/dask/dask/issues"&gt;the Dask issue
tracker&lt;/a&gt; and we would be happy to engage
with you :)&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/11/12/deconvolution/"/>
    <published>2020-11-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/09/22/user_survey/</id>
    <title>2020 Dask User Survey</title>
    <updated>2020-09-22T00:00:00+00:00</updated>
    <author>
      <name>Tom Augspurger</name>
    </author>
    <content type="html">&lt;style type="text/css"&gt;
table td {
    background: none;
}

table tr.even td {
    background: none;
}

table {
    text-shadow: none;
}

table tr:hover td {
    background: none;
}

&lt;/style&gt;
&lt;p&gt;This post presents the results of the 2020 Dask User Survey,
which ran earlier this summer. Thanks to everyone who took the time to fill out the survey!
These results help us better understand the Dask community and will guide future development efforts.&lt;/p&gt;
&lt;p&gt;The raw data, as well as the start of an analysis, can be found in this binder:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fsurveys%2F2020.ipynb"&gt;&lt;img alt="Binder" src="https://mybinder.org/badge_logo.svg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let us know if you find anything in the data.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 38)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="highlights"&gt;

&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We had 240 responses to the survey (slightly fewer than last year, which had about 260).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overall, results look mostly similar to last year’s.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Our documentation has probably improved relative to last year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Respondents care more about performance relative to last year.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 45)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="new-questions"&gt;
&lt;h1&gt;New Questions&lt;/h1&gt;
&lt;p&gt;Most of the questions are the same as in 2019. We added a couple questions about deployment and dashboard usage. Let’s look at those first.&lt;/p&gt;
&lt;p&gt;Among respondents who use a Dask package to deploy a cluster (about 53% of respondents), there’s a wide spread of methods.&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_3_0.png"&gt;
&lt;p&gt;Most people access the dashboard through a web browser. Those not using the dashboard are likely (hopefully) just using Dask on a single machine with the threaded scheduler (though the dashboard works fine on a single machine as well).&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_5_0.png"&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 57)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="learning-resources"&gt;
&lt;h1&gt;Learning Resources&lt;/h1&gt;
&lt;p&gt;Respondents’ learning material usage is fairly similar to last year’s. The most notable differences stem from
our survey form providing more options (our &lt;a class="reference external" href="https://www.youtube.com/channel/UCj9eavqmvwaCyKhIlu2GaoA"&gt;YouTube channel&lt;/a&gt; and “Gitter chat”). Other than that, &lt;a class="reference external" href="https://examples.dask.org"&gt;examples.dask.org&lt;/a&gt; might be relatively more popular.&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_7_0.png"&gt;
&lt;p&gt;Just like last year, we’ll look at resource usage grouped by how often they use Dask.&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_10_0.png"&gt;
&lt;p&gt;A few observations:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;GitHub issues are becoming relatively less popular, which perhaps reflects better documentation or stability (assuming people go to the issue tracker when they can’t find the answer in the docs or they hit a bug).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://examples.dask.org"&gt;https://examples.dask.org&lt;/a&gt; is notably now more popular among occasinal users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In response to last year’s survey, we invested time in making &lt;a class="reference external" href="https://tutorial.dask.org"&gt;https://tutorial.dask.org&lt;/a&gt; better, which we previously felt was lacking. Its usage is still about the same as last year’s (pretty popular), so it’s unclear whether we should dedicate additional focus there.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 74)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-do-you-use-dask"&gt;
&lt;h1&gt;How do you use Dask?&lt;/h1&gt;
&lt;p&gt;API usage remains about the same as last year (recall that about 20 fewer people took the survey and people can select multiple options, so relative differences are most interesting). We added new choices for RAPIDS, Prefect, and XGBoost, each of which is somewhat popular (in the neighborhood of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.Bag&lt;/span&gt;&lt;/code&gt;).&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_12_0.png"&gt;
&lt;p&gt;About 65% of our users are using Dask on a cluster at least some of the time, which is similar to last year.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 82)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-can-dask-improve"&gt;
&lt;h1&gt;How can Dask improve?&lt;/h1&gt;
&lt;p&gt;Respondents continue to say that more documentation and examples would be the most valuable improvements to the project.&lt;/p&gt;
&lt;p&gt;One interesting change comes from looking at “Which would help you most right now?” split by API group (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.array&lt;/span&gt;&lt;/code&gt;, etc.). Last year, “More examples in my field” was the most important for all API groups (first table below). But in 2020 there are some differences (second table below).&lt;/p&gt;
&lt;style  type="text/css" &gt;
    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col1 {
            background-color:  #cacee5;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col3 {
            background-color:  #f1ebf4;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col4 {
            background-color:  #c4cbe3;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col1 {
            background-color:  #3b92c1;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col3 {
            background-color:  #62a2cb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col4 {
            background-color:  #bdc8e1;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col1 {
            background-color:  #c2cbe2;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col3 {
            background-color:  #94b6d7;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col4 {
            background-color:  #e0dded;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col1 {
            background-color:  #e6e2ef;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col3 {
            background-color:  #ced0e6;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col4 {
            background-color:  #c5cce3;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col0 {
            background-color:  #dedcec;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col3 {
            background-color:  #1c7fb8;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col4 {
            background-color:  #73a9cf;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col1 {
            background-color:  #b4c4df;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col3 {
            background-color:  #b4c4df;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col4 {
            background-color:  #eee9f3;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col0 {
            background-color:  #faf2f8;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col1 {
            background-color:  #e7e3f0;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col3 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col4 {
            background-color:  #f4eef6;
            color:  #000000;
        }&lt;/style&gt;&lt;table id="T_0a8701b8_e96b_11ea_9e95_186590cd1c87" &gt;&lt;caption&gt;2019 normalized by row. Darker means that a higher proportion of users of that API prefer that priority.&lt;/caption&gt;&lt;thead&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Which would help you most right now?&lt;/th&gt;        &lt;th class="col_heading level0 col0" &gt;Bug fixes&lt;/th&gt;        &lt;th class="col_heading level0 col1" &gt;More documentation&lt;/th&gt;        &lt;th class="col_heading level0 col2" &gt;More examples in my field&lt;/th&gt;        &lt;th class="col_heading level0 col3" &gt;New features&lt;/th&gt;        &lt;th class="col_heading level0 col4" &gt;Performance improvements&lt;/th&gt;    &lt;/tr&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Dask APIs&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;    &lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;            &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row0&amp;quot; class=&amp;quot;row_heading level0 row0&amp;quot; &amp;gt;Array&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col0&amp;quot; class=&amp;quot;data row0 col0&amp;quot; &amp;gt;10&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col1&amp;quot; class=&amp;quot;data row0 col1&amp;quot; &amp;gt;24&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col2&amp;quot; class=&amp;quot;data row0 col2&amp;quot; &amp;gt;62&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col3&amp;quot; class=&amp;quot;data row0 col3&amp;quot; &amp;gt;15&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row0_col4&amp;quot; class=&amp;quot;data row0 col4&amp;quot; &amp;gt;25&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row1&amp;quot; class=&amp;quot;row_heading level0 row1&amp;quot; &amp;gt;Bag&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col0&amp;quot; class=&amp;quot;data row1 col0&amp;quot; &amp;gt;3&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col1&amp;quot; class=&amp;quot;data row1 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col2&amp;quot; class=&amp;quot;data row1 col2&amp;quot; &amp;gt;16&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col3&amp;quot; class=&amp;quot;data row1 col3&amp;quot; &amp;gt;10&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row1_col4&amp;quot; class=&amp;quot;data row1 col4&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row2&amp;quot; class=&amp;quot;row_heading level0 row2&amp;quot; &amp;gt;DataFrame&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col0&amp;quot; class=&amp;quot;data row2 col0&amp;quot; &amp;gt;16&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col1&amp;quot; class=&amp;quot;data row2 col1&amp;quot; &amp;gt;32&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col2&amp;quot; class=&amp;quot;data row2 col2&amp;quot; &amp;gt;71&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col3&amp;quot; class=&amp;quot;data row2 col3&amp;quot; &amp;gt;39&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row2_col4&amp;quot; class=&amp;quot;data row2 col4&amp;quot; &amp;gt;26&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row3&amp;quot; class=&amp;quot;row_heading level0 row3&amp;quot; &amp;gt;Delayed&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col0&amp;quot; class=&amp;quot;data row3 col0&amp;quot; &amp;gt;16&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col1&amp;quot; class=&amp;quot;data row3 col1&amp;quot; &amp;gt;22&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col2&amp;quot; class=&amp;quot;data row3 col2&amp;quot; &amp;gt;55&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col3&amp;quot; class=&amp;quot;data row3 col3&amp;quot; &amp;gt;26&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row3_col4&amp;quot; class=&amp;quot;data row3 col4&amp;quot; &amp;gt;27&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row4&amp;quot; class=&amp;quot;row_heading level0 row4&amp;quot; &amp;gt;Futures&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col0&amp;quot; class=&amp;quot;data row4 col0&amp;quot; &amp;gt;12&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col1&amp;quot; class=&amp;quot;data row4 col1&amp;quot; &amp;gt;9&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col2&amp;quot; class=&amp;quot;data row4 col2&amp;quot; &amp;gt;25&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col3&amp;quot; class=&amp;quot;data row4 col3&amp;quot; &amp;gt;20&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row4_col4&amp;quot; class=&amp;quot;data row4 col4&amp;quot; &amp;gt;17&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row5&amp;quot; class=&amp;quot;row_heading level0 row5&amp;quot; &amp;gt;ML&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col0&amp;quot; class=&amp;quot;data row5 col0&amp;quot; &amp;gt;5&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col1&amp;quot; class=&amp;quot;data row5 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col2&amp;quot; class=&amp;quot;data row5 col2&amp;quot; &amp;gt;23&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col3&amp;quot; class=&amp;quot;data row5 col3&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row5_col4&amp;quot; class=&amp;quot;data row5 col4&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
        &amp;lt;tr&amp;gt;
                    &amp;lt;th id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87level0_row6&amp;quot; class=&amp;quot;row_heading level0 row6&amp;quot; &amp;gt;Xarray&amp;lt;/th&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col0&amp;quot; class=&amp;quot;data row6 col0&amp;quot; &amp;gt;8&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col1&amp;quot; class=&amp;quot;data row6 col1&amp;quot; &amp;gt;11&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col2&amp;quot; class=&amp;quot;data row6 col2&amp;quot; &amp;gt;34&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col3&amp;quot; class=&amp;quot;data row6 col3&amp;quot; &amp;gt;7&amp;lt;/td&amp;gt;
                    &amp;lt;td id=&amp;quot;T_0a8701b8_e96b_11ea_9e95_186590cd1c87row6_col4&amp;quot; class=&amp;quot;data row6 col4&amp;quot; &amp;gt;9&amp;lt;/td&amp;gt;
        &amp;lt;/tr&amp;gt;
&amp;lt;/tbody&amp;gt;&amp;lt;/table&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;style  type="text/css" &gt;
    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col1 {
            background-color:  #f1ebf5;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col3 {
            background-color:  #f5eef6;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col4 {
            background-color:  #d0d1e6;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col0 {
            background-color:  #f0eaf4;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col3 {
            background-color:  #f0eaf4;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col4 {
            background-color:  #4c99c5;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col0 {
            background-color:  #f5eff6;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col3 {
            background-color:  #fcf4fa;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col4 {
            background-color:  #8eb3d5;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col1 {
            background-color:  #ebe6f2;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col3 {
            background-color:  #f5eff6;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col4 {
            background-color:  #3d93c2;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col1 {
            background-color:  #f5eef6;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col2 {
            background-color:  #0567a2;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col3 {
            background-color:  #cacee5;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col4 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col0 {
            background-color:  #ede8f3;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col1 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col3 {
            background-color:  #c1cae2;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col4 {
            background-color:  #80aed2;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col0 {
            background-color:  #fff7fb;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col1 {
            background-color:  #f8f1f8;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col2 {
            background-color:  #023858;
            color:  #f1f1f1;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col3 {
            background-color:  #c9cee4;
            color:  #000000;
        }    #T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col4 {
            background-color:  #86b0d3;
            color:  #000000;
        }&lt;/style&gt;&lt;table id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87" &gt;&lt;caption&gt;2020 normalized by row. Darker means that a higher proportion of users of that API prefer that priority.&lt;/caption&gt;&lt;thead&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Which would help you most right now?&lt;/th&gt;        &lt;th class="col_heading level0 col0" &gt;Bug fixes&lt;/th&gt;        &lt;th class="col_heading level0 col1" &gt;More documentation&lt;/th&gt;        &lt;th class="col_heading level0 col2" &gt;More examples in my field&lt;/th&gt;        &lt;th class="col_heading level0 col3" &gt;New features&lt;/th&gt;        &lt;th class="col_heading level0 col4" &gt;Performance improvements&lt;/th&gt;    &lt;/tr&gt;    &lt;tr&gt;        &lt;th class="index_name level0" &gt;Dask APIs&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;        &lt;th class="blank" &gt;&lt;/th&gt;    &lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;
            &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row0" class="row_heading level0 row0" &gt;Array&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col0" class="data row0 col0" &gt;12&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col1" class="data row0 col1" &gt;16&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col2" class="data row0 col2" &gt;56&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col3" class="data row0 col3" &gt;15&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row0_col4" class="data row0 col4" &gt;23&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row1" class="row_heading level0 row1" &gt;Bag&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col0" class="data row1 col0" &gt;7&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col1" class="data row1 col1" &gt;5&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col2" class="data row1 col2" &gt;24&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col3" class="data row1 col3" &gt;7&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row1_col4" class="data row1 col4" &gt;16&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row2" class="row_heading level0 row2" &gt;DataFrame&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col0" class="data row2 col0" &gt;24&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col1" class="data row2 col1" &gt;21&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col2" class="data row2 col2" &gt;67&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col3" class="data row2 col3" &gt;22&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row2_col4" class="data row2 col4" &gt;41&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row3" class="row_heading level0 row3" &gt;Delayed&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col0" class="data row3 col0" &gt;15&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col1" class="data row3 col1" &gt;19&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col2" class="data row3 col2" &gt;46&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col3" class="data row3 col3" &gt;17&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row3_col4" class="data row3 col4" &gt;34&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row4" class="row_heading level0 row4" &gt;Futures&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col0" class="data row4 col0" &gt;9&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col1" class="data row4 col1" &gt;10&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col2" class="data row4 col2" &gt;21&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col3" class="data row4 col3" &gt;13&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row4_col4" class="data row4 col4" &gt;24&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row5" class="row_heading level0 row5" &gt;ML&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col0" class="data row5 col0" &gt;6&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col1" class="data row5 col1" &gt;4&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col2" class="data row5 col2" &gt;21&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col3" class="data row5 col3" &gt;9&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row5_col4" class="data row5 col4" &gt;12&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
                    &lt;th id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87level0_row6" class="row_heading level0 row6" &gt;Xarray&lt;/th&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col0" class="data row6 col0" &gt;3&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col1" class="data row6 col1" &gt;4&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col2" class="data row6 col2" &gt;25&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col3" class="data row6 col3" &gt;9&lt;/td&gt;
                    &lt;td id="T_0a8d3eac_e96b_11ea_9e95_186590cd1c87row6_col4" class="data row6 col4" &gt;13&lt;/td&gt;
        &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Examples are again the most important (for all API groups except &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Futures&lt;/span&gt;&lt;/code&gt;). But “Performance improvements” is now the second-most important improvement (except for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Futures&lt;/span&gt;&lt;/code&gt; where it’s most important). How should we interpret this? A charitable interpretation is that Dask’s users are scaling to larger problems and are running into new scaling challenges. A less charitable interpretation is that our users’ workflows are the same but Dask is getting slower!&lt;/p&gt;
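For reference, the row-normalization behind the table's shading is just each count divided by its row total; a minimal sketch using the Array row's counts from the table above:

```python
# Row-normalize the raw survey counts so each row sums to 1; the table's
# shading maps higher shares to darker cells. Counts are the "Array" row.
row = {
    "Bug fixes": 12,
    "More documentation": 16,
    "More examples in my field": 56,
    "New features": 15,
    "Performance improvements": 23,
}
total = sum(row.values())
shares = {k: round(v / total, 2) for k, v in row.items()}
print(max(shares, key=shares.get))  # "More examples in my field"
```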
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 422)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-other-systems-do-you-use"&gt;
&lt;h1&gt;What other systems do you use?&lt;/h1&gt;
&lt;p&gt;SSH continues to be the most popular “cluster resource manager”. This was the big surprise last year, so we put in some work to make it nicer. Aside from that, not much has changed.&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_25_0.png"&gt;
&lt;p&gt;And Dask users are about as happy with its stability as last year.&lt;/p&gt;
&lt;img src="/images/2020_survey/2020_27_0.png"&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/09/22/user_survey.md&lt;/span&gt;, line 432)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="takeaways"&gt;
&lt;h1&gt;Takeaways&lt;/h1&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Overall, most things are similar to last year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation, especially domain-specific examples, continues to be important. That said, our documentation is probably better than it was last year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More users are pushing Dask further. Investing in performance is likely to be valuable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Thanks again to all the respondents!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/09/22/user_survey/"/>
    <summary>This post presents the results of the 2020 Dask User Survey,
which ran earlier this summer. Thanks to everyone who took the time to fill out the survey!
These results help us better understand the Dask community and will guide future development efforts.</summary>
    <category term="UserSurvey" label="User Survey"/>
    <published>2020-09-22T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/08/31/helm_daskhub/</id>
    <title>Announcing the DaskHub Helm Chart</title>
    <updated>2020-08-31T00:00:00+00:00</updated>
    <author>
      <name>Tom Augspurger</name>
    </author>
    <content type="html">&lt;p&gt;Today we’re announcing the release of the
&lt;a class="reference external" href="https://github.com/dask/helm-chart/blob/master/daskhub/README.md"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;daskhub&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;
helm chart. This is a &lt;a class="reference external" href="https://helm.sh/"&gt;Helm&lt;/a&gt; chart to easily install
&lt;a class="reference external" href="https://jupyter.org/hub"&gt;JupyterHub&lt;/a&gt; and Dask for multiple users on a
Kubernetes Cluster. If you’re managing a deployment for many people who need
interactive, scalable computing (say for a class of students, a data science
team, or a research lab) then &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask/daskhub&lt;/span&gt;&lt;/code&gt; might be right for you.&lt;/p&gt;
&lt;p&gt;You can install &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask/daskhub&lt;/span&gt;&lt;/code&gt; on a Kubernetes cluster today with&lt;/p&gt;
&lt;div class="highlight-console notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="go"&gt;helm repo add dask https://helm.dask.org/&lt;/span&gt;
&lt;span class="go"&gt;helm repo update&lt;/span&gt;
&lt;span class="go"&gt;helm upgrade --install dhub dask/daskhub&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/31/helm_daskhub.md&lt;/span&gt;, line 26)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="history"&gt;

&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask/daskhub&lt;/span&gt;&lt;/code&gt; helm chart is an evolution of the &lt;a class="reference external" href="http://pangeo.io/"&gt;Pangeo&lt;/a&gt;
helm chart, which came out of that community’s attempts to do big data
geoscience on the cloud. We’re very grateful to have years of experience using
Dask and JupyterHub together. Pangeo was always aware that there wasn’t anything
geoscience-specific to their Helm chart and so were eager to contribute it to
Dask to share the maintenance burden. In the process of moving it over to Dask’s
chart repository we took the opportunity to clean up a few rough edges.&lt;/p&gt;
&lt;p&gt;It’s interesting to read the &lt;a class="reference external" href="https://blog.dask.org/2018/01/22/pangeo-2"&gt;original
announcement&lt;/a&gt; of Pangeo’s JupyterHub
deployment. A lot has improved, and we hope that this helm chart assists more
groups in deploying JupyterHubs capable of scalable computations with Dask.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/31/helm_daskhub.md&lt;/span&gt;, line 41)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="details"&gt;
&lt;h1&gt;Details&lt;/h1&gt;
&lt;p&gt;Internally, the DaskHub helm chart is a relatively simple combination of the
&lt;a class="reference external" href="https://github.com/jupyterhub/zero-to-jupyterhub-k8s"&gt;JupyterHub&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dask/dask-gateway/"&gt;Dask
Gateway&lt;/a&gt; helm charts. The only additional
magic is some configuration to&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Register Dask Gateway as a &lt;a class="reference external" href="https://jupyterhub.readthedocs.io/en/stable/reference/services.html"&gt;JupyterHub
service&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set environment variables to make using Dask Gateway easy for your users.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With the default configuration, your users will be able to create and connect to
Dask Clusters, including their dashboards, with a simple&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_gateway&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GatewayCluster&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GatewayCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Check out the
&lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/kubernetes-helm.html"&gt;documentation&lt;/a&gt; for
details and let us know if you run into any difficulties.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/08/31/helm_daskhub/"/>
    <summary>Today we’re announcing the release of the
daskhub
helm chart. This is a Helm chart to easily install
JupyterHub and Dask for multiple users on a
Kubernetes Cluster. If you’re managing a deployment for many people who need
interactive, scalable computing (say for a class of students, a data science
team, or a research lab) then dask/daskhub might be right for you.</summary>
    <category term="DaskGateway" label="Dask Gateway"/>
    <category term="Deployment" label="Deployment"/>
    <category term="Helm" label="Helm"/>
    <published>2020-08-31T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/08/21/running-tutorials/</id>
    <title>Running tutorials</title>
    <updated>2020-08-21T00:00:00+00:00</updated>
    <author>
      <name>Jacob Tomlinson (NVIDIA)</name>
    </author>
    <content type="html">&lt;p&gt;For the last couple of months we’ve been running community tutorials every three weeks or so. The response from the community has been great and we’ve had 50-100 people at each 90 minute session.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/21/running-tutorials.md&lt;/span&gt;, line 12)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="why-should-open-source-projects-run-tutorials"&gt;

&lt;p&gt;The Dask team has historically run tutorials at conferences such as SciPy. With 2020 turning out the way that it has, much of this content is being presented virtually this year. As more people become accustomed to participating in virtual tutorials, we felt it would be a good service to our community to start running regular virtual tutorials independent of any conferences we may be attending or speaking at.&lt;/p&gt;
&lt;p&gt;Tutorials are great for open source projects as they appeal to multiple types of learner.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The tutorial material provides a great foundation for &lt;em&gt;written and visual learners&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using an interactive tool like Jupyter Notebooks allows &lt;em&gt;kinesthetic learners&lt;/em&gt; to follow along and take their own paths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Having an instructor run through the material in real time provides a spoken source for &lt;em&gt;auditory learners&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s also just fun to have a bunch of people from around the world participate in a live event. There is a greater sense of community.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many open source projects provide documentation, some also make instructional videos on YouTube, but you really can’t beat a tutorial for producing a single set of content that is valuable to many users.&lt;/p&gt;
&lt;p&gt;The more users can share knowledge, information and skills with each other, the more they are going to use and engage with the project. Having a great source of learning material is critical for converting interested newcomers to users, and users to contributors.&lt;/p&gt;
&lt;p&gt;It is great for the maintainers too. Dask is a large project made up of many open source repositories, all with different functions. Each maintainer tends to participate in their specialist areas, but does not engage with everything on a day-to-day basis. Having maintainers run tutorials encourages them to increase their knowledge of areas they rarely touch in order to deliver the material, and this benefits the project as a whole.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/21/running-tutorials.md&lt;/span&gt;, line 29)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how"&gt;
&lt;h1&gt;How&lt;/h1&gt;
&lt;p&gt;For the rest of this post we will discuss the preparation and logistics we have undertaken to provide our tutorials. Hopefully this will provide a blueprint for others wanting to run similar activities.&lt;/p&gt;
&lt;section id="writing-the-material"&gt;
&lt;h2&gt;Writing the material&lt;/h2&gt;
&lt;p&gt;When starting to compile material it is important to consider a few questions: “Who is this for?”, “How long should it be?” and “What already exists today?”.&lt;/p&gt;
&lt;p&gt;For the Dask tutorial we were targeting users who were either new to Dask, or had been using it for a while but wanted to learn more about the wider project. Dask is a large project after all and there are many features that you may not discover when trying to solve your specific challenges with it.&lt;/p&gt;
&lt;p&gt;At large conferences it is quite normal to run a three-hour tutorial; however, when trying to schedule a tutorial as part of a person’s normal working day, that is probably too much to ask. Folks are accustomed to scheduling work meetings of 30-60 minutes, but that may not be enough to run a tutorial. So we settled on 90 minutes: enough to get through a good amount of content, but not so long that folks will be put off.&lt;/p&gt;
&lt;p&gt;We already have an &lt;a class="reference external" href="https://github.com/dask/dask-tutorial"&gt;“official” tutorial&lt;/a&gt; which is designed to fill the three hours of a SciPy tutorial. That tutorial is designed in a “Dask from first principles” style, where we explore how Dask works and eventually scale up to how Dask implements familiar APIs like Numpy and Pandas. This is great for giving folks a thorough understanding of Dask, but given that we decided on 90 minutes we may not want to start with low-level code, as we may run out of time before getting to general usage.&lt;/p&gt;
&lt;p&gt;While researching what already exists I was pointed to the &lt;a class="reference external" href="https://github.com/adbreind/dask-mini-2019"&gt;Mini Dask 2019 tutorial&lt;/a&gt; which was created for an &lt;a class="reference external" href="https://www.oreilly.com/live-training/courses/scale-your-python-processing-with-dask/0636920319573/"&gt;O’Reilly event&lt;/a&gt;. This tutorial starts with familiar APIs such as dataframes and arrays and eventually digs down into Dask fundamentals. As tutorial content like this is often licensed as open source and made available on GitHub it’s great to be able to build upon the work of others.&lt;/p&gt;
&lt;p&gt;The result of combining the two tutorials was the &lt;a class="reference external" href="https://github.com/jacobtomlinson/dask-video-tutorial-2020"&gt;Dask Video Tutorial 2020&lt;/a&gt;. It follows the same structure as the mini tutorial starting with high level APIs and digging further down. It also includes some new content on deployment and distributed methods.&lt;/p&gt;
&lt;section id="structuring-content"&gt;
&lt;h3&gt;Structuring content&lt;/h3&gt;
&lt;p&gt;To ensure this content targets the different learner types that we discussed earlier we need to ensure our content has a few things.&lt;/p&gt;
&lt;p&gt;As a foundation we should put together a series of pages/documents with a written version of the information we are trying to communicate for &lt;em&gt;written learners&lt;/em&gt;. We should also endeavor to include diagrams and pictures to illustrate this information for &lt;em&gt;visual learners&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;As we are sharing knowledge on an open source software project we should also make things as interactive as possible. Using Jupyter Notebooks as our document format means we can include many code examples which both provide written examples but are also editable and executable to empower &lt;em&gt;kinesthetic learners&lt;/em&gt; to feel how things work in practice.&lt;/p&gt;
&lt;p&gt;When the content is being delivered the instructor will be running through the content at the same time and narrating what they are doing for &lt;em&gt;auditory learners&lt;/em&gt;. It is important to try and structure things in a way where you explain each section of the content out loud, but without directly reading the text from the screen as that can be off-putting.&lt;/p&gt;
&lt;p&gt;We also want to ensure folks are taking things in, and labs are a great way to include small tests in the content. Having an incomplete section at the end of an example means that you can give the audience some time to try and figure things out for themselves. Some folks will be able to fill things in with no problems. Others will hit errors or make mistakes, which is good for teaching how to debug and troubleshoot. And those who are having awful flashbacks to pop quizzes can simply skip it without worrying that someone will check up on them.&lt;/p&gt;
&lt;p&gt;For each section of content you want to include in your tutorial, I recommend you create a notebook with an explanation, an example and some things for the audience to figure out. If you do this for each section (in the Dask tutorial we had nine sections), the audience will quickly become familiar with the process and be able to anticipate what is coming next. This will make them feel comfortable.&lt;/p&gt;
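To illustrate the explanation–example–exercise shape of a section, here is a generic sketch of two notebook cells (a plain Python toy problem, not taken from the tutorial material itself):

```python
# Worked example the instructor runs through live: sum the squares of a list.
def sum_of_squares(xs):
    return sum(x * x for x in xs)

print(sum_of_squares([1, 2, 3]))  # 14

# Lab cell for the audience: deliberately incomplete, so attendees get a
# few minutes to fill it in (and hit, then fix, their own errors).
def mean_of_squares(xs):
    ...  # your code here: sum_of_squares(xs) divided by the length of xs
```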
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="hosting-the-material"&gt;
&lt;h2&gt;Hosting the material&lt;/h2&gt;
&lt;p&gt;Once you have put your material together you need to share it with your attendees.&lt;/p&gt;
&lt;p&gt;GitHub is a great place to put things, especially if you include an open license with it. For narrative tutorial content a Creative Commons license is often used, which requires modifications to also be shared.&lt;/p&gt;
&lt;p&gt;As we have put our content together as Jupyter Notebooks we can use &lt;a class="reference external" href="https://mybinder.org/"&gt;Binder&lt;/a&gt; to make it possible for folks to run the material without having to download it locally or ensure their Python environment is set up correctly.&lt;/p&gt;
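Binder launch links for GitHub repositories follow mybinder.org's predictable v2/gh/owner/repo/ref URL scheme, so generating one for your material is a one-liner; `binder_url` below is an illustrative helper, not part of any library:

```python
# Build a mybinder.org launch link for a GitHub-hosted tutorial repository
# (mybinder.org's v2/gh/owner/repo/ref URL scheme).
def binder_url(owner: str, repo: str, ref: str = "HEAD") -> str:
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"

print(binder_url("jacobtomlinson", "dask-video-tutorial-2020"))
```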
&lt;/section&gt;
&lt;section id="choosing-a-video-platform"&gt;
&lt;h2&gt;Choosing a video platform&lt;/h2&gt;
&lt;p&gt;Next we have to decide how we will present the material. As this is a virtual tutorial we will want to use some kind of video conferencing or streaming software.&lt;/p&gt;
&lt;p&gt;These tools tend to fall into two categories: private meetings with a tool like Zoom, Hangouts or Teams, and public broadcasts on websites like YouTube or Twitch.&lt;/p&gt;
&lt;p&gt;Any of these options will likely be a good choice: they allow the presenter to share their video, audio and screen with participants, and participants can communicate back with a range of tools.&lt;/p&gt;
&lt;p&gt;The main decision you will have to make is around whether you want to restrict numbers or not. The more interactivity you want to have in the tutorial the more you will need a handle on numbers. For our initial tutorials we wanted to enable participants to ask questions at any time and get a quick response, so we opted to use Zoom and limit our numbers to allow us to not get overwhelmed with questions. However if you want to present to as many people as possible and accept that you may not be able to address them all individually you may want to use a streaming platform instead.&lt;/p&gt;
&lt;p&gt;It is also possible to do both at the same time. Zoom can stream directly to YouTube for example. This can be useful if you want to open things to as many folks as possible, but also limit the interactivity to a select group (probably on a first-come-first-served basis). For the Dask tutorials we decided to not livestream and instead run multiple tutorials so that everyone gets an interactive experience, but we are fortunate to have the resources to do that.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="registering-attendees"&gt;
&lt;h2&gt;Registering attendees&lt;/h2&gt;
&lt;p&gt;There are a couple of reasons why you may wish to register attendees ahead of time.&lt;/p&gt;
&lt;p&gt;If you want to limit numbers you will certainly need some way to register people and put a cap on that number. But even if you are streaming publicly you may want to get folks to register ahead of time, as that allows you to send them reminder emails in the run-up to the event, which will likely add more certainty to the attendance numbers.&lt;/p&gt;
&lt;p&gt;As our event was private we registered folks with &lt;a class="reference external" href="https://www.eventbrite.com/"&gt;Eventbrite&lt;/a&gt;. This allowed us to cap numbers and also schedule automated emails to act as a reminder but also share the details of the private Zoom meeting.&lt;/p&gt;
&lt;p&gt;When running the Dask tutorials we found about 50% of the folks who registered actually turned up, so we accounted for this and set our limit to around double the number we wanted.&lt;/p&gt;
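That rule of thumb (cap registrations at roughly the target turnout divided by the expected show-up rate) can be sketched as follows; `registration_cap` is a hypothetical helper, not an Eventbrite feature:

```python
# Capacity planning for tutorial registration: given a target turnout and
# an expected show-up rate (~50% in our experience), compute how many
# registrations to allow.
def registration_cap(target_attendance: int, show_rate: float = 0.5) -> int:
    return round(target_attendance / show_rate)

print(registration_cap(75))  # 150 -- double the target at a 50% show rate
```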
&lt;p&gt;Here’s an example of the event details that we created:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;&lt;strong&gt;Event Title&lt;/strong&gt;: Dask Tutorial&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Organizer&lt;/strong&gt;: Presenter’s name&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Event Type&lt;/strong&gt;: Seminar or talk, Science and Technology, Online event&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: dask, pydata, python, tutorial&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Location&lt;/strong&gt;: Online event&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date and time&lt;/strong&gt;: Single Event, add times&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Details&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Come learn about Dask at this online free tutorial provided by the Dask maintainers.&lt;/p&gt;
&lt;p&gt;This ninety minute course will mix overview discussion and demonstration by a leader in the Dask community, as well as interactive exercises in live notebook sessions for attendees. The computing environment will be provided.&lt;/p&gt;
&lt;p&gt;If you want to get a sample of similar content, take a look at https://tutorial.dask.org (although this tutorial will cover different material appropriate for this shorter session).&lt;/p&gt;
&lt;p&gt;We look forward to seeing you there!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Image&lt;/strong&gt;: https://i.imgur.com/2i1tMNG.png&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Live video content&lt;/strong&gt;: NA&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text and media&lt;/strong&gt;: NA&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Links to resources&lt;/strong&gt;:
Tutorial Content (Online Jupyter Notebooks)
https://github.com/jacobtomlinson/dask-video-tutorial-2020&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ticket Cost&lt;/strong&gt;: Free&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ticket Attendee limit&lt;/strong&gt;: 150 people&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;/section&gt;
&lt;section id="count-down-to-the-tutorial"&gt;
&lt;h2&gt;Count down to the tutorial&lt;/h2&gt;
&lt;p&gt;We also set up a series of automated emails. You can find this under &lt;strong&gt;Manage Attendees &amp;gt; Emails to Attendees&lt;/strong&gt; in the event management page.&lt;/p&gt;
&lt;p&gt;We scheduled emails for two days before, two hours before and ten minutes before to let folks know where to go, and another for a few hours after to gather feedback. &lt;em&gt;We will discuss the feedback email shortly&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;You’ll need to ensure you have links to the materials and meeting location ready for this. In our case we pushed the content to GitHub and scheduled the Zoom call ahead of time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Two days and two hours before&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Hi Everyone!&lt;/p&gt;
&lt;p&gt;We look forward to seeing you &amp;lt;tomorrow|soon&amp;gt;. We wanted to share some important links with you to help you connect to the meeting.&lt;/p&gt;
&lt;p&gt;The materials for the course are available on GitHub here at the link below:&lt;/p&gt;
&lt;p&gt;&amp;lt;Link to materials&amp;gt;&lt;/p&gt;
&lt;p&gt;This repository contains Jupyter notebooks that we’ll go through together as a group. You do not need to install anything before the tutorial. We will run the notebooks on the online service mybinder.org. All you need is a web connection.&lt;/p&gt;
&lt;p&gt;The meeting itself will be held by video call at the following Zoom link:&lt;/p&gt;
&lt;p&gt;&amp;lt;Zoom link and pin&amp;gt;&lt;/p&gt;
&lt;p&gt;We look forward to seeing you soon!&lt;/p&gt;
&lt;p&gt;&amp;lt;Organisers names&amp;gt;&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ten minutes before&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Hi Everyone!&lt;/p&gt;
&lt;p&gt;We are about to get started. Here’s a final reminder of the meeting details.&lt;/p&gt;
&lt;p&gt;&amp;lt;Zoom link and pin&amp;gt;&lt;/p&gt;
&lt;p&gt;See you in a minute!&lt;/p&gt;
&lt;p&gt;&amp;lt;Organisers names&amp;gt;&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Few hours after&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Hi Everyone!&lt;/p&gt;
&lt;p&gt;Thank you so much for attending the Dask tutorial. We really hope you found it valuable.&lt;/p&gt;
&lt;p&gt;We would really appreciate it if you could answer a couple of quick feedback questions to help us improve things for next time.&lt;/p&gt;
&lt;p&gt;&amp;lt;Google form link &amp;gt;&lt;/p&gt;
&lt;p&gt;Also we want to remind you that the tutorial materials are always available on GitHub and you can run through them any time or share them with others.&lt;/p&gt;
&lt;p&gt;&amp;lt;Link to materials&amp;gt;&lt;/p&gt;
&lt;p&gt;Thanks,&lt;/p&gt;
&lt;p&gt;&amp;lt;Organisers names&amp;gt;&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;/section&gt;
&lt;section id="getting-the-word-out"&gt;
&lt;h2&gt;Getting the word out&lt;/h2&gt;
&lt;p&gt;Now that we have an Eventbrite page we need to tell people about it.&lt;/p&gt;
&lt;p&gt;You may already have existing channels where you can contact your community. For Dask we have an active Twitter account with a good number of followers, so tweeting out the link to the event a couple of times in the week running up to the tutorial was enough to fill the spaces.&lt;/p&gt;
&lt;p&gt;If you have a mailing list, or any other platform you will probably want to share it there.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="setting-up-the-call"&gt;
&lt;h2&gt;Setting up the call&lt;/h2&gt;
&lt;p&gt;Be sure to join the call ahead of the attendees, at the latest before the final reminder email goes out. Personally I join 20 minutes or so beforehand. This allows you to ensure the call is being recorded and that attendees are muted when they join.&lt;/p&gt;
&lt;p&gt;Consider the experience of the users here. They will have signed up for an event online, received a few emails with Zoom call details and then they will join the call. If there is no indication that they are in the right place within a few seconds they may become anxious.&lt;/p&gt;
&lt;p&gt;To combat this I tend to show some graphic which lets people know they are in the right place. You could either use a tool like &lt;a class="reference external" href="https://jacobtomlinson.dev/posts/2020/how-to-use-obs-studio-with-zoom-hangouts-teams-and-more-on-macos/"&gt;OBS with Zoom&lt;/a&gt; to create a custom scene or just share your screen with a simple slide saying something like “The Dask tutorial will start soon”.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The only downside to sharing your screen is you can’t continue to use your computer in the run up to the tutorial.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When we ran our first few tutorials we were also running our Dask user survey, so we also included a link to it on the waiting screen to give folks something to do.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="greeting-and-getting-folks-set-up"&gt;
&lt;h2&gt;Greeting and getting folks set up&lt;/h2&gt;
&lt;p&gt;Say hi on the hour and welcome everyone to the tutorial. But as the event is virtual folks will be late, so don’t kick off until around five minutes in, otherwise you’ll just get a flood of questions asking what’s going on.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="interactivity"&gt;
&lt;h2&gt;Interactivity&lt;/h2&gt;
&lt;p&gt;A fun thing to do during this waiting period is to get everyone to introduce themselves in the chat. Say something like “Please say hi in the chat and give your name and where you are joining from”.&lt;/p&gt;
&lt;p&gt;This is nice feedback for you as the instructor to see where folks are joining from, but it also gives the attendees a sense of being in a room full of people. One of the benefits of an event like this is that it is interactive, so be sure to say hi back to people.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I’m awful at pronouncing names correctly so I tend to list the places they said they are from instead. It still makes them feel like their message has been seen.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once you’re ready to start, introduce yourself and give a general overview of the tutorial content. Then make use of any interaction tools you may have in your chat application. In Zoom there are buttons that participants can click with labels like “go faster”, “go slower”, “yes” and “no”. These are great for getting feedback from the audience when running the tutorial, but it’s good to make sure everyone knows where they are and has a go at using them. I tend to explain where the buttons are and then ask questions like “have you managed to launch the binder?”, “have you used Dask before?” or “are you a Pandas user?”. You learn a little about your audience and they get familiar with the controls.&lt;/p&gt;
&lt;p&gt;Being interactive means you can also respond to user questions. In Dask tutorials we mute everyone by default and encourage folks to type in the text chat. We also have an additional instructor who is not delivering the material who is able to watch the chat and answer questions in real time. If they feel like a question/answer would be beneficial to the whole group they can unmute and interrupt the presenter in order to bubble it up. Be prepared for a wide range of questions from the chat, including topics that are not being actively covered in the tutorial. This is often the only time that attendees have real-time access to core maintainers.&lt;/p&gt;
&lt;p&gt;You may not have the resources to have two instructors for every tutorial; Dask is fortunate to have a strong maintainer team. If you don’t, you may want to allocate breaks at the end of each section to answer questions. The labs can also be a good time to go back and review any questions.&lt;/p&gt;
&lt;p&gt;Interactivity is one of the big benefits a live tutorial has over a video.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="run-through-the-material"&gt;
&lt;h2&gt;Run through the material&lt;/h2&gt;
&lt;p&gt;Once you’re all set up and everyone is in, it’s time to run through the material. Given the amount of preparation we did beforehand to construct the material this is relatively straightforward. Everything is laid out in front of us and we just need to go through the motions of talking through it.&lt;/p&gt;
&lt;p&gt;I find it very helpful to have a list of the sections with timings written down that I can refer to in order to pace things.&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Overview of Dask with Dask Dataframe (10 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Introductory Lab (10 mins) and results (5 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask GUI and dashboards (10 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask Array (10 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask ML with lab (10 mins) and results (5 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bags and Futures (10 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Distributed (10 mins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrapup and close (5 mins)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;As we have another instructor answering questions I tend to ignore the chat and run through each section as slowly as I can without going over time. Personally my default is to go too fast, so forcing myself to be slow but having some timings to keep me on track seems to work well. But you should do whatever works for you.&lt;/p&gt;
&lt;p&gt;During the labs I tend to mute my microphone and join in with answering questions on the chat.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="wrapping-things-up"&gt;
&lt;h2&gt;Wrapping things up&lt;/h2&gt;
&lt;p&gt;When you’re nearing the end it’s good to have some time for any final questions. People may want to ask things that they didn’t get a chance to earlier, or have questions which didn’t fit in with any particular area.&lt;/p&gt;
&lt;p&gt;If you get complex questions or want to go into depth you may want to offer to stay after and continue talking, but your attendees will appreciate you finishing at the scheduled time as they may have other things booked immediately after.&lt;/p&gt;
&lt;p&gt;It’s always good to leave folks with some extra resources, whether that is links to the documentation, community places they can learn more like a Gitter chat, etc.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="sharing-the-content-later"&gt;
&lt;h2&gt;Sharing the content later&lt;/h2&gt;
&lt;p&gt;Once you’re done it is also beneficial to upload a recording of the tutorial to YouTube. If you’ve livestreamed then this may happen automatically. If you used a tool like Zoom you’ll need to upload it yourself.&lt;/p&gt;
&lt;p&gt;Anyone watching in the future won’t get the benefit of the interactivity, but should still be able to get much of the benefit from following through the material.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="gathering-feedback-and-planning-for-next-time"&gt;
&lt;h2&gt;Gathering feedback and planning for next time&lt;/h2&gt;
&lt;p&gt;The last thing for you to do is plan for next time. The Dask team have decided to run tutorials every month or so but rotate around timezones to try and cover as many users as possible. We’ve also discussed having special deep dive tutorials which follow the same length and format but dive into one topic in particular.&lt;/p&gt;
&lt;p&gt;To help you plan for future events you will likely want feedback from your participants. You can use tools like Google Forms to create a short questionnaire which you can send out to participants afterwards. In our experience about 20% of participants will fill in a survey that is 10 questions long.&lt;/p&gt;
&lt;p&gt;This feedback can be very helpful for making changes to the content or format. For example in our first tutorial we used OBS for both the intro screen and screen sharing throughout. However Zoom limits webcams to 720p and adds heavy compression, so the quality for participants was not good and 50% of the surveys mentioned poor video. In later tutorials we only used OBS for the intro screen and then used the built in screen sharing utility in Zoom, which provided a better experience, and no user reported any audio/video issues in the survey.&lt;/p&gt;
&lt;p&gt;Here are some examples of questions we asked and how they were answered for our tutorial.&lt;/p&gt;
&lt;section id="have-you-used-dask-before"&gt;
&lt;h3&gt;Have you used Dask before?&lt;/h3&gt;
&lt;p&gt;When writing our material we said we were “targeting users who were either new to Dask, or had been using it for a while but wanted to learn more about the wider project”. Our feedback results confirm that we are hitting these groups.&lt;/p&gt;
&lt;p&gt;We could’ve been more specific and asked folks to rank their ability. But the more complex the questions, the less likely folks are to fill them out, so it’s a balancing act.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Forms response chart. Question title: Have you used Dask before? 39% no, 61% yes." src="https://i.imgur.com/T1loyeb.png" /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="did-we-cover-all-the-topics-you-were-expecting-and-if-not-what-was-missing"&gt;
&lt;h3&gt;Did we cover all the topics you were expecting? And if not, what was missing?&lt;/h3&gt;
&lt;p&gt;Depending on the complexity of your project you may have to make compromises on what you can cover in the time you have. Dask is a large project and so we couldn’t cover everything, so we wanted to check we had covered the basics.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Forms response chart. Question title: Did we cover all the topics you expected? 22% no, 78% yes." src="https://i.imgur.com/la3dqrA.png" /&gt;&lt;/p&gt;
&lt;p&gt;Most of the feedback we had from folks who answered no was asking about advanced topics like Kubernetes, Google Cloud deployments, deep dives into internal workings, etc. I’m satisfied that these topics didn’t belong in this tutorial, but it adds weight to our plans to run deep dives in the future.&lt;/p&gt;
&lt;p&gt;One useful bit of feedback we had here was “When should I use Dask and when should I stick with Pandas?”. This is something which definitely should be covered by an intro tutorial, so our material is clearly lacking here. As a result we can go back, make modifications and improve the content.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-was-the-pace"&gt;
&lt;h3&gt;How was the pace?&lt;/h3&gt;
&lt;p&gt;Setting the pace is hard. If you’re targeting a range of abilities then it’s easy to go too fast or slow for a big chunk of the attendees.&lt;/p&gt;
&lt;p&gt;Our feedback shows that folks were generally happy, but we are erring on the side of going too fast. Given that we are filling our allocated time, this probably indicates that we should cut a little content in order to slow things down.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Forms response chart. Question title: How was the pace? 70% Just right, 26% Too fast, 4% Too slow." src="https://i.imgur.com/mHPNmwp.png" /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="which-sections-did-you-find-more-informative"&gt;
&lt;h3&gt;Which sections did you find more informative?&lt;/h3&gt;
&lt;p&gt;By asking what sections were most informative we can identify things to cut in future if we do need to slow things down. It also shows areas where we may want to spend more time and add more content.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Forms response chart. Question title: Which sections did you find more informative?" src="https://i.imgur.com/XLzSEw4.png" /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-would-be-your-preferred-platform-for-a-tutorial-like-this"&gt;
&lt;h3&gt;What would be your preferred platform for a tutorial like this?&lt;/h3&gt;
&lt;p&gt;We had to make a decision on which video platform to use based on the criteria we discussed earlier. For our tutorials we chose Zoom. By doing a user survey we were able to check that this worked for people and also see if there is an alternative that folks prefer.&lt;/p&gt;
&lt;p&gt;Our results confirmed that folks were happy with Zoom. These results may be a little biased given that we used Zoom, but I’m confident that we can keep using it and folks will have a good experience.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Forms response chart. Question title: What would be your preferred platform for a tutorial like this? 70% Zoom, &amp;lt;5% for options including YouTube, Twitch, Jitsi, and No preference" src="https://i.imgur.com/fMxTZOK.png" /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="would-you-recommend-the-tutorial-to-a-colleague"&gt;
&lt;h3&gt;Would you recommend the tutorial to a colleague?&lt;/h3&gt;
&lt;p&gt;The last thing to check is that folks had a good time. It gives you great pleasure as an instructor to see 100% of folks say they would recommend the tutorial to a colleague.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;These results may be biased because if folks wouldn’t recommend it they probably wouldn’t bother to fill out a survey. But hey, I’ll take it!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Forms response chart. Question title: Would you recommend the tutorial to a colleague? 100% Yes." src="https://i.imgur.com/RzrXvfn.png" /&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/21/running-tutorials.md&lt;/span&gt;, line 318)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="wrap-up"&gt;
&lt;h1&gt;Wrap up&lt;/h1&gt;
&lt;p&gt;In this post we have covered why and how you can run community tutorials for open source projects.&lt;/p&gt;
&lt;p&gt;In summary you should run tutorials because:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;You can share knowledge with a range of people with different learning styles&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can give back to your community&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can grow your community&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can improve maintainers’ knowledge of the whole project&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And you can run a tutorial by following these steps:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Break your project into sections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write up interactive documents on each section with tools like Jupyter notebooks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Give people access to this content with services like Binder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manage attendees with services like Eventbrite&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advertise your tutorial on social media&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Get everyone in a video meeting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make use of the interactive tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deliver your material&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gather feedback&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/08/21/running-tutorials/"/>
    <summary>For the last couple of months we’ve been running community tutorials every three weeks or so. The response from the community has been great and we’ve had 50-100 people at each 90 minute session.</summary>
    <category term="Community" label="Community"/>
    <category term="Tutorials" label="Tutorials"/>
    <published>2020-08-21T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/08/06/ray-tune/</id>
    <title>Comparing Dask-ML and Ray Tune's Model Selection Algorithms</title>
    <updated>2020-08-06T00:00:00+00:00</updated>
    <author>
      <name>&lt;a href="https://stsievert.com"&gt;Scott Sievert&lt;/a&gt; (University of Wisconsin–Madison)</name>
    </author>
    <content type="html">&lt;p&gt;Hyperparameter optimization is the process of deducing model parameters that
can’t be learned from data. This process is often time- and resource-consuming,
especially in the context of deep learning. A good description of this process
can be found at “&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/grid_search.html"&gt;Tuning the hyper-parameters of an estimator&lt;/a&gt;,” and
the issues that arise are concisely summarized in Dask-ML’s documentation of
“&lt;a class="reference external" href="https://ml.dask.org/hyper-parameter-search.html"&gt;Hyper Parameter Searches&lt;/a&gt;.”&lt;/p&gt;
&lt;p&gt;There’s a host of libraries and frameworks out there to address this problem.
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/grid_search.html"&gt;Scikit-Learn’s module&lt;/a&gt; has been mirrored &lt;a class="reference external" href="https://ml.dask.org/hyper-parameter-search.html"&gt;in Dask-ML&lt;/a&gt; and
&lt;a class="reference external" href="https://automl.github.io/auto-sklearn/master/"&gt;auto-sklearn&lt;/a&gt;, both of which offer advanced hyperparameter optimization
techniques. Other implementations that don’t follow the Scikit-Learn interface
include &lt;a class="reference external" href="https://docs.ray.io/en/master/tune.html"&gt;Ray Tune&lt;/a&gt;, &lt;a class="reference external" href="https://www.automl.org/"&gt;AutoML&lt;/a&gt; and &lt;a class="reference external" href="https://medium.com/optuna/optuna-supports-hyperband-93b0cae1a137"&gt;Optuna&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://docs.ray.io"&gt;Ray&lt;/a&gt; recently provided a wrapper to &lt;a class="reference external" href="https://docs.ray.io/en/master/tune.html"&gt;Ray Tune&lt;/a&gt; that mirrors the Scikit-Learn
API called tune-sklearn (&lt;a class="reference external" href="https://docs.ray.io/en/master/tune/api_docs/sklearn.html"&gt;docs&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/ray-project/tune-sklearn"&gt;source&lt;/a&gt;). &lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;The introduction&lt;/a&gt; of this library
states the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Cutting edge hyperparameter tuning techniques (Bayesian optimization, early
stopping, distributed execution) can provide significant speedups over grid
search and random search.&lt;/p&gt;
&lt;p&gt;However, the machine learning ecosystem is missing a solution that provides
users with the ability to leverage these new algorithms while allowing users
to stay within the Scikit-Learn API. In this blog post, we introduce
tune-sklearn [Ray’s tuning library] to bridge this gap. Tune-sklearn is a
drop-in replacement for Scikit-Learn’s model selection module with
state-of-the-art optimization features.&lt;/p&gt;
&lt;p&gt;—&lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;GridSearchCV 2.0 — New and Improved&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;This claim is inaccurate: for over a year Dask-ML has provided access to
“cutting edge hyperparameter tuning techniques” with a Scikit-Learn compatible
API. To correct their statement, let’s look at each of the features that Ray’s
tune-sklearn provides, and compare them to Dask-ML:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Here’s what [Ray’s] tune-sklearn has to offer:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency with Scikit-Learn API&lt;/strong&gt; …&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern hyperparameter tuning techniques&lt;/strong&gt; …&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework support&lt;/strong&gt; …&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale up&lt;/strong&gt; … [to] multiple cores and even multiple machines.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;[Ray’s] Tune-sklearn is also &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Dask-ML’s model selection module has every one of the features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency with Scikit-Learn API:&lt;/strong&gt; Dask-ML’s model selection API
mirrors the Scikit-Learn model selection API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern hyperparameter tuning techniques:&lt;/strong&gt; Dask-ML offers state-of-the-art
hyperparameter tuning techniques.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework support:&lt;/strong&gt; Dask-ML model selection supports many libraries
including Scikit-Learn, PyTorch, Keras, LightGBM and XGBoost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale up:&lt;/strong&gt; Dask-ML supports distributed tuning (how could it not?) and
larger-than-memory datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dask-ML is also &lt;strong&gt;fast.&lt;/strong&gt; In “&lt;a class="reference internal" href="#speed"&gt;&lt;span class="xref myst"&gt;Speed&lt;/span&gt;&lt;/a&gt;” we show a benchmark between
Dask-ML, Ray and Scikit-Learn:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2020-model-selection/n_workers=8.png" width="450px"
 /&gt;&lt;/p&gt;
&lt;p&gt;Only time-to-solution is relevant; all of these methods produce similar model
scores. See “&lt;a class="reference internal" href="#speed"&gt;&lt;span class="xref myst"&gt;Speed&lt;/span&gt;&lt;/a&gt;” for details.&lt;/p&gt;
&lt;p&gt;Now, let’s walk through the details on how to use Dask-ML to obtain the 5
features above.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 95)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="consistency-with-the-scikit-learn-api"&gt;

&lt;p&gt;&lt;em&gt;Dask-ML is consistent with the Scikit-Learn API.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here’s how to use Scikit-Learn’s, Dask-ML’s and Ray’s tune-sklearn
hyperparameter optimization:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;## Trimmed example; see appendix for more detail&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tune_sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The definitions of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;model&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;params&lt;/span&gt;&lt;/code&gt; follow the normal Scikit-Learn
definitions as detailed in the &lt;a class="reference internal" href="#full-example-usage"&gt;&lt;span class="xref myst"&gt;appendix&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
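&lt;p&gt;As a concrete sketch of what such definitions look like (illustrative choices only; the estimator and search space here are not the ones from the appendix), a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;model&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;params&lt;/span&gt;&lt;/code&gt; pair might be defined as follows, shown with Scikit-Learn’s own search for self-containedness:&lt;/p&gt;

```python
# Illustrative model/params definitions in the usual Scikit-Learn style.
# Note: Dask-ML's HyperbandSearchCV additionally requires the estimator
# to implement `partial_fit`; SGDClassifier does.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = SGDClassifier(random_state=0)

# Hyperparameters are sampled from distributions or lists of values.
params = {"alpha": loguniform(1e-5, 1e-1), "average": [True, False]}

search = RandomizedSearchCV(model, params, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(sorted(search.best_params_))
```

&lt;p&gt;The same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;model&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;params&lt;/span&gt;&lt;/code&gt; objects can be passed unchanged to Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; or Ray’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;TuneSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;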
&lt;p&gt;Clearly, both Dask-ML and Ray’s tune-sklearn are Scikit-Learn compatible. Now
let’s focus on how each search performs and how it’s configured.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 126)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="modern-hyperparameter-tuning-techniques"&gt;
&lt;h1&gt;Modern hyperparameter tuning techniques&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Dask-ML offers state-of-the-art hyperparameter tuning techniques
in a Scikit-Learn interface.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;The introduction&lt;/a&gt; of Ray’s tune-sklearn made this claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;tune-sklearn is the only
Scikit-Learn interface that allows you to easily leverage Bayesian
Optimization, HyperBand and other optimization techniques by simply toggling a few parameters.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;The state-of-the-art in hyperparameter optimization is currently
“&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband&lt;/a&gt;.” Hyperband reduces the amount of computation
required with a &lt;em&gt;principled&lt;/em&gt; early stopping scheme; past that, it’s the same as
Scikit-Learn’s popular &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
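&lt;p&gt;The early stopping scheme at Hyperband’s core can be illustrated with successive halving: train many randomly sampled configurations briefly, keep the best fraction, and give the survivors more training. The sketch below is a toy illustration of that idea, not Dask-ML’s or Ray’s implementation:&lt;/p&gt;

```python
def successive_halving(configs, score, n_iters=1, eta=3):
    """Toy successive halving, the building block of Hyperband.

    Train all surviving configurations for a growing number of
    iterations, then keep only the best 1/eta fraction each round.
    """
    iters = n_iters
    while len(configs) > 1:
        # "Train" each configuration and rank by its current score
        ranked = sorted(configs, key=lambda c: score(c, iters), reverse=True)
        # Keep the top 1/eta fraction (always at least one survivor)
        configs = ranked[: max(1, len(ranked) // eta)]
        iters *= eta  # survivors earn more training next round
    return configs[0]

# Dummy "training curve": the score approaches `value` as iterations grow
best = successive_halving(
    [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6],
    score=lambda value, iters: value * (1 - 1 / (1 + iters)),
)
print(best)
```

Most of the computation goes to poor configurations only briefly; the promising ones receive the bulk of the training budget.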
&lt;p&gt;Hyperband &lt;em&gt;works.&lt;/em&gt; As such, it’s very popular. After the introduction of
Hyperband in 2016 by Li et al., &lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;the paper&lt;/a&gt; has been cited
&lt;a class="reference external" href="https://scholar.google.com/scholar?cites=10473284631669296057&amp;amp;amp;as_sdt=5,39&amp;amp;amp;sciodt=0,39&amp;amp;amp;hl=en"&gt;over 470 times&lt;/a&gt; and has been implemented in many different libraries
including &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html#dask_ml.model_selection.HyperbandSearchCV"&gt;Dask-ML&lt;/a&gt;, &lt;a class="reference external" href="https://docs.ray.io/en/master/tune/api_docs/schedulers.html#asha-tune-schedulers-ashascheduler"&gt;Ray Tune&lt;/a&gt;, &lt;a class="reference external" href="https://keras-team.github.io/keras-tuner/documentation/tuners/#hyperband-class"&gt;keras-tune&lt;/a&gt;, &lt;a class="reference external" href="https://medium.com/optuna/optuna-supports-hyperband-93b0cae1a137"&gt;Optuna&lt;/a&gt;,
&lt;a class="reference external" href="https://www.automl.org/"&gt;AutoML&lt;/a&gt;,&lt;a class="footnote-reference brackets" href="#automl" id="id1" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;1&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and &lt;a class="reference external" href="https://nni.readthedocs.io/en/latest/Tuner/HyperbandAdvisor.html"&gt;Microsoft’s NNI&lt;/a&gt;. The original paper shows a
rather drastic improvement over all the relevant
implementations,&lt;a class="footnote-reference brackets" href="#hyperband-figs" id="id2" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;2&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and this drastic improvement persists in
follow-up works.&lt;a class="footnote-reference brackets" href="#follow-up" id="id3" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;3&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Some illustrative results from Hyperband are
below:&lt;/p&gt;
&lt;p&gt;&lt;img width="80%" src="/images/2020-model-selection/hyperband-fig-7-8.png"
 style="display: block; margin-left: auto; margin-right: auto;" /&gt;&lt;/p&gt;
&lt;div style="max-width: 80%; word-wrap: break-word;" style="text-align: center;"&gt;
&lt;p&gt;&lt;sup&gt;All algorithms are configured to do the same amount of work except “random
2x”, which does twice as much work. “hyperband (finite)” is similar to Dask-ML’s
default implementation, and “bracket s=4” is similar to Ray’s default
implementation. “random” is a random search. SMAC,&lt;a class="footnote-reference brackets" href="#smac" id="id4" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;4&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;
spearmint,&lt;a class="footnote-reference brackets" href="#spearmint" id="id5" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;5&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and TPE&lt;a class="footnote-reference brackets" href="#tpe" id="id6" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;6&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; are popular Bayesian algorithms. &lt;/sup&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Hyperband is undoubtedly a “cutting edge” hyperparameter optimization
technique. Dask-ML and Ray offer Scikit-Learn-compatible implementations of this
algorithm that rely on similar internals, and Dask-ML’s implementation also has a
&lt;a class="reference external" href="https://ml.dask.org/hyper-parameter-search.html#hyperband-parameters-rule-of-thumb"&gt;rule of thumb&lt;/a&gt; for configuration. Both Dask-ML’s and Ray’s documentation
encourages use of Hyperband.&lt;/p&gt;
&lt;p&gt;Ray does support running its Hyperband implementation on top of a technique
called Bayesian sampling, which changes how hyperparameters are sampled at
model initialization and can be used in conjunction with Hyperband’s early
stopping scheme. Adding this option to Dask-ML’s Hyperband implementation is
future work.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 222)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="framework-support"&gt;
&lt;h1&gt;Framework support&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Dask-ML model selection supports many libraries including Scikit-Learn, PyTorch, Keras, LightGBM and XGBoost.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Ray’s tune-sklearn supports these frameworks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;tune-sklearn is used primarily for tuning
Scikit-Learn models, but it also supports and provides examples for many
other frameworks with Scikit-Learn wrappers such as Skorch (Pytorch),
KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost).&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Clearly, Dask-ML and Ray support many of the same libraries.&lt;/p&gt;
&lt;p&gt;However, both Dask-ML and Ray have some qualifications. Certain libraries don’t
offer an implementation of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;,&lt;a class="footnote-reference brackets" href="#ray-pf" id="id7" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;7&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; so not all of the modern
hyperparameter optimization techniques are available for them. Here’s a table comparing
different libraries and their support in Dask-ML’s model selection and Ray’s
tune-sklearn:&lt;/p&gt;
&lt;div class="pst-scrollable-table-container"&gt;&lt;table class="table"&gt;
&lt;thead&gt;
&lt;tr class="row-odd"&gt;&lt;th class="head text-center"&gt;&lt;p&gt;Model Library&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Dask-ML support&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Ray support&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Dask-ML: early stopping?&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Ray: early stopping?&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class="row-even"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://scikit-learn.org/"&gt;Scikit-Learn&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔*&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔*&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-odd"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; (via &lt;a class="reference external" href="https://skorch.readthedocs.io/"&gt;Skorch&lt;/a&gt;)&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-even"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://keras.io/"&gt;Keras&lt;/a&gt; (via &lt;a class="reference external" href="https://github.com/adriangb/scikeras"&gt;SciKeras&lt;/a&gt;)&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔**&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔**&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-odd"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://lightgbm.readthedocs.io/"&gt;LightGBM&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-even"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://xgboost.ai/"&gt;XGBoost&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;sup&gt;* Only for &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/computing.html#incremental-learning"&gt;the models that implement &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;.&lt;/sup&gt;&lt;br&gt;
&lt;sup&gt;** Thanks to work by the Dask developers around &lt;a class="reference external" href="https://github.com/adriangb/scikeras/issues/24"&gt;scikeras#24&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;By this measure, Dask-ML and Ray model selection have the same level of
framework support. Of course, Dask has tangential integration with LightGBM and
XGBoost through &lt;a class="reference external" href="https://ml.dask.org/xgboost.html"&gt;Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;xgboost&lt;/span&gt;&lt;/code&gt; module&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dask/dask-lightgbm"&gt;dask-lightgbm&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 272)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="scale-up"&gt;
&lt;h1&gt;Scale up&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Dask-ML supports distributed tuning (how could it not?), aka parallelization
across multiple machines/cores. It also supports
larger-than-memory data.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;[Ray’s] Tune-sklearn leverages Ray Tune, a library for distributed
hyperparameter tuning, to efficiently and transparently parallelize cross
validation on multiple cores and even multiple machines.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Naturally, Dask-ML also scales to multiple cores/machines because it relies on
Dask. Dask has wide support for &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup.html"&gt;different deployment options&lt;/a&gt; that span
from your personal machine to supercomputers. Dask will very likely work on top
of any computing system you have available, including Kubernetes, SLURM, YARN
and Hadoop clusters as well as your personal machine.&lt;/p&gt;
&lt;p&gt;Dask-ML’s model selection also scales to larger-than-memory datasets, and is
thoroughly tested. Support for larger-than-memory data is untested in Ray, and
there are no examples detailing how to use Ray Tune with the distributed
dataset implementations in PyTorch/Keras.&lt;/p&gt;
&lt;p&gt;In addition, I have benchmarked Dask-ML’s model selection module to see how the
time-to-solution is affected by the number of Dask workers in “&lt;a class="reference external" href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt"&gt;Better and
faster hyperparameter optimization with Dask&lt;/a&gt;.” That is, how does the
time to reach a particular accuracy scale with the number of workers &lt;span class="math notranslate nohighlight"&gt;\(P\)&lt;/span&gt;? At
first, it’ll scale like &lt;span class="math notranslate nohighlight"&gt;\(1/P\)&lt;/span&gt;, but with a large number of workers the serial
portion will dictate the time to solution according to &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Amdahl%27s_law"&gt;Amdahl’s Law&lt;/a&gt;. Briefly, I
found Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; speedup started to saturate around 24
workers for a particular search.&lt;/p&gt;
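&lt;p&gt;Amdahl’s Law makes this saturation concrete: if a fraction &lt;em&gt;s&lt;/em&gt; of the search is inherently serial, the best possible speedup with &lt;em&gt;P&lt;/em&gt; workers is 1 / (s + (1 - s) / P). The serial fraction below is an assumed value chosen for illustration, not a measurement from the benchmark:&lt;/p&gt;

```python
def amdahl_speedup(p, serial_fraction):
    """Best possible speedup with p workers under Amdahl's Law."""
    return 1 / (serial_fraction + (1 - serial_fraction) / p)

# Assume 4% of the search is serial (illustrative, not measured).
# Speedup grows quickly at first, then saturates: no matter how many
# workers are added, it can never exceed 1 / 0.04 = 25x.
for p in [1, 8, 24, 96]:
    print(f"{p:3d} workers: {amdahl_speedup(p, 0.04):.1f}x speedup")
```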
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 311)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="speed"&gt;
&lt;h1&gt;Speed&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Both Dask-ML and Ray are much faster than Scikit-Learn.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Ray’s tune-sklearn runs some benchmarks in &lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;the introduction&lt;/a&gt; with the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; class found in Scikit-Learn and Dask-ML. A more fair benchmark
would be use Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; because it is almost the same as the
algorithm in Ray’s tune-sklearn. To be specific, I’m interested in comparing
these methods:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;. This is a popular implementation, one
that I’ve bootstrapped myself with a custom model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. This is an early stopping technique for
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ray tune-sklearn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;TuneSearchCV&lt;/span&gt;&lt;/code&gt;. This is a slightly different early
stopping technique than &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;’s.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each search is configured to perform the same task: sample 100 parameters and
train for no longer than 100 “epochs” or passes through the
data.&lt;a class="footnote-reference brackets" href="#random-search" id="id8" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;8&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Each estimator is configured as their respective
documentation suggests. Each search uses 8 workers with a single cross
validation split, and a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call takes one second with 50,000
examples. The complete setup can be found in &lt;a class="reference internal" href="#appendix"&gt;&lt;span class="xref myst"&gt;the appendix&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here’s how long each library takes to complete the same search:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2020-model-selection/n_workers=8.png" width="450px"
 /&gt;&lt;/p&gt;
&lt;p&gt;Notably, we didn’t improve the Dask-ML codebase for this benchmark, and ran the
code as it’s been for the last year.&lt;a class="footnote-reference brackets" href="#priority-impl" id="id9" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;9&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Regardless, it’s possible that
other artifacts from &lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks"&gt;biased benchmarks&lt;/a&gt; crept into this benchmark.&lt;/p&gt;
&lt;p&gt;Clearly, Ray and Dask-ML offer similar performance for 8 workers when compared
with Scikit-Learn. To Ray’s credit, their implementation is ~15% faster than
Dask-ML’s with 8 workers. We suspect that this performance boost comes from the
fact that Ray implements an asynchronous variant of Hyperband. We should
investigate this difference between Dask and Ray, and how each balances the
tradeoff between the number of FLOPs and time-to-solution. This will vary with the number
of workers: the asynchronous variant of Hyperband provides no benefit if used
with a single worker.&lt;/p&gt;
&lt;p&gt;Dask-ML reaches scores quickly in serial environments, or when the number of
workers is small. Dask-ML prioritizes fitting high scoring models: if there are
100 models to fit and only 4 workers available, Dask-ML selects the models that
have the highest score. This is most relevant in serial
environments;&lt;a class="footnote-reference brackets" href="#priority" id="id10" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;10&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; see “&lt;a class="reference external" href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt"&gt;Better and faster hyperparameter optimization
with Dask&lt;/a&gt;” for benchmarks. This feature is omitted from this
benchmark, which only focuses on time to solution.&lt;/p&gt;
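&lt;p&gt;This prioritization can be illustrated with a toy scheduler: when there are more partially trained models than workers, train the highest-scoring models first. The sketch below is a simplified illustration with hypothetical model names, not Dask-ML’s actual scheduler:&lt;/p&gt;

```python
import heapq

def next_models_to_fit(current_scores, n_workers):
    """Pick which models to train next: highest current score first.

    `current_scores` maps a model identifier to its latest
    validation score (names here are hypothetical).
    """
    # heapq.nlargest selects the n_workers best-scoring models
    return heapq.nlargest(n_workers, current_scores, key=current_scores.get)

scores = {"model-a": 0.61, "model-b": 0.92, "model-c": 0.55, "model-d": 0.87}
print(next_models_to_fit(scores, 2))  # the two highest-scoring models first
```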
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 377)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Dask-ML and Ray offer the same features for model selection: state-of-the-art
features with a Scikit-Learn compatible API, and both implementations have
fairly wide support for different frameworks and rely on backends that can
scale to many machines.&lt;/p&gt;
&lt;p&gt;In addition, the Ray implementation has provided motivation for further
development, specifically on the following items:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adding support for more libraries, including Keras&lt;/strong&gt; (&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/696"&gt;dask-ml#696&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/713"&gt;dask-ml#713&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/adriangb/scikeras/issues/24"&gt;scikeras#24&lt;/a&gt;). SciKeras is a Scikit-Learn wrapper for
Keras that (now) works with Dask-ML model selection because SciKeras models
implement the Scikit-Learn model API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better documenting the models that Dask-ML supports&lt;/strong&gt;
(&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/699"&gt;dask-ml#699&lt;/a&gt;). Dask-ML supports any model that implement the
Scikit-Learn interface, and there are wrappers for Keras, PyTorch, LightGBM
and XGBoost. Now, &lt;a class="reference external" href="https://ml.dask.org"&gt;Dask-ML’s documentation&lt;/a&gt; prominently highlights this
fact.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Ray implementation has also helped motivate and clarify future work.
Dask-ML should include the following implementations:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Bayesian sampling scheme for the Hyperband implementation&lt;/strong&gt; that’s
similar to Ray’s and BOHB’s (&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/697"&gt;dask-ml#697&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A configuration of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; that’s well-suited for
exploratory hyperparameter searches.&lt;/strong&gt; An initial implementation is in
&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/532"&gt;dask-ml#532&lt;/a&gt;, which should be benchmarked against Ray.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Luckily, all of these pieces of development are straightforward modifications
because the Dask-ML model selection framework is pretty flexible.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Thank you &lt;a class="reference external" href="https://github.com/TomAugspurger"&gt;Tom Augspurger&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/mrocklin"&gt;Matthew Rocklin&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/jsignell"&gt;Julia Signell&lt;/a&gt;, and &lt;a class="reference external" href="https://github.com/quasiben"&gt;Benjamin
Zaitlen&lt;/a&gt; for your feedback, suggestions and edits.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 427)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="appendix"&gt;
&lt;h1&gt;Appendix&lt;/h1&gt;
&lt;section id="benchmark-setup"&gt;
&lt;h2&gt;Benchmark setup&lt;/h2&gt;
&lt;p&gt;This is the complete setup for the benchmark between Dask-ML, Scikit-Learn and
Ray. Complete details can be found at
&lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s create a dummy model that takes 1 second for a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call with
50,000 examples. This is appropriate for this benchmark; we’re only interested
in the time required to finish the search, not how well the models do.
Scikit-Learn, Ray and Dask-ML have very similar methods of choosing
hyperparameters to evaluate; they differ in their early stopping techniques.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;scipy.stats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;benchmark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConstantFunction&lt;/span&gt;  &lt;span class="c1"&gt;# custom module&lt;/span&gt;

&lt;span class="c1"&gt;# This model sleeps for `latency * len(X)` seconds before&lt;/span&gt;
&lt;span class="c1"&gt;# reporting a score of `value`.&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConstantFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;50e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="c1"&gt;# This dummy dataset mirrors the MNIST dataset&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;60e3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This model will take 2 minutes to train for 100 epochs (aka passes through the
data). Details can be found at &lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;.&lt;/p&gt;
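&lt;p&gt;A quick back-of-the-envelope check of that figure: one pass over the 60,000 examples takes 60,000 / 50,000 = 1.2 seconds, so 100 epochs take about two minutes:&lt;/p&gt;

```python
latency = 1 / 50e3      # seconds per example, as configured above
n_examples = int(60e3)  # size of the dummy dataset
epochs = 100

seconds = latency * n_examples * epochs
print(seconds / 60)  # -> about 2 minutes
```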
&lt;p&gt;Let’s configure our searches to use 8 workers with a single cross-validation
split:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShuffleSplit&lt;/span&gt;
&lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ShuffleSplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 20.88 minutes&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="n"&gt;dask_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggressiveness&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tune_sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;
&lt;span class="n"&gt;ray_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;early_stopping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dask_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2.93 minutes&lt;/span&gt;
&lt;span class="n"&gt;ray_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2.49 minutes&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="full-example-usage"&gt;
&lt;h2&gt;Full example usage&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;scipy.stats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loguniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alpha&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;loguniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;l1_ratio&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tune_sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;hr class="docutils" /&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 40)&lt;/p&gt;
&lt;p&gt;Duplicate reference definition: TSNE [myst.duplicate_def]&lt;/p&gt;
&lt;/aside&gt;
&lt;hr class="footnotes docutils" /&gt;
&lt;aside class="footnote-list brackets"&gt;
&lt;aside class="footnote brackets" id="automl" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id1"&gt;1&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Their implementation of Hyperband in &lt;a class="reference external" href="https://github.com/automl/HpBandSter"&gt;HpBandSter&lt;/a&gt; is included in &lt;a class="reference external" href="https://www.automl.org/wp-content/uploads/2018/09/chapter7-autonet.pdf"&gt;Auto-PyTorch&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/automl/BOAH"&gt;BOAH&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="hyperband-figs" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id2"&gt;2&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Figures 4, 7 and 8 in “&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization&lt;/a&gt;.”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="follow-up" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id3"&gt;3&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Figure 1 of &lt;a class="reference external" href="http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf"&gt;the BOHB paper&lt;/a&gt; and &lt;a class="reference external" href="https://arxiv.org/pdf/1801.01596.pdf"&gt;a paper&lt;/a&gt; from an augmented reality company.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="smac" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id4"&gt;4&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;SMAC is described in “&lt;a class="reference external" href="https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf"&gt;Sequential Model-Based Optimization for General Algorithm Configuration&lt;/a&gt;,” and is available &lt;a class="reference external" href="https://www.automl.org/automated-algorithm-design/algorithm-configuration/smac/"&gt;in AutoML&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="spearmint" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id5"&gt;5&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Spearmint is described in “&lt;a class="reference external" href="https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf"&gt;Practical Bayesian Optimization of Machine Learning Algorithms&lt;/a&gt;,” and is available in &lt;a class="reference external" href="https://github.com/HIPS/Spearmint"&gt;HIPS/spearmint&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="tpe" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id6"&gt;6&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;TPE is described in Section 4 of “&lt;a class="reference external" href="http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf"&gt;Algorithms for Hyperparameter Optimization&lt;/a&gt;,” and is available &lt;a class="reference external" href="http://hyperopt.github.io/hyperopt/"&gt;through Hyperopt&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="ray-pf" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id7"&gt;7&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;From &lt;a class="reference external" href="https://github.com/ray-project/tune-sklearn/blob/31f228e21ef632a89a74947252d8ad5323cbd043/README.md"&gt;Ray’s README.md&lt;/a&gt;: “If the estimator does not support &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;, a warning will be shown saying early stopping cannot be done and it will simply run the cross-validation on Ray’s parallel back-end.”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="random-search" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id8"&gt;8&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;I chose to benchmark random searches instead of grid searches because random searches produce better results: grid searches require estimating how important each parameter is. For more detail see “&lt;a class="reference external" href="http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf"&gt;Random Search for Hyperparameter Optimization&lt;/a&gt;” by Bergstra and Bengio.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="priority-impl" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id9"&gt;9&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Despite a relevant implementation in &lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/527"&gt;dask-ml#527&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="priority" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id10"&gt;10&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Because priority is meaningless if there are an infinite number of workers.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="bohb-exps" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;11&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Details are in “&lt;a class="reference external" href="http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf"&gt;BOHB: Robust and Efficient Hyperparameter Optimization at Scale&lt;/a&gt;.”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="nlp-future" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;12&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Future work is combining this with Dask-ML’s Hyperband implementation.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="openai" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;13&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Computing &lt;a class="reference external" href="https://en.wikipedia.org/wiki/N-gram"&gt;n-grams&lt;/a&gt; requires a ton of memory and computation. For OpenAI, NLP preprocessing took 8 GPU-months! (&lt;a class="reference external" href="https://openai.com/blog/language-unsupervised/#drawbacks"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="stopping" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;14&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Hyperband’s theory answers “how many models should be stopped?” and “when should they be stopped?”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="bohb-parallel" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;15&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;In Section 4.2 of &lt;a class="reference external" href="http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf"&gt;their paper&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;/aside&gt;
</content>
    <link href="https://blog.dask.org/2020/08/06/ray-tune/"/>
    <summary>Hyperparameter optimization is the process of deducing model parameters that
can’t be learned from data. This process is often time- and resource-consuming,
especially in the context of deep learning. A good description of this process
can be found at “Tuning the hyper-parameters of an estimator,” and
the issues that arise are concisely summarized in Dask-ML’s documentation of
“Hyper Parameter Searches.”</summary>
    <category term="dask" label="dask"/>
    <category term="dask-ml" label="dask-ml"/>
    <category term="machine-learning" label="machine-learning"/>
    <category term="ray" label="ray"/>
    <published>2020-08-06T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/07/30/beginners-config/</id>
    <title>Configuring a Distributed Dask Cluster</title>
    <updated>2020-07-30T00:00:00+00:00</updated>
    <author>
      <name>Julia Signell (Saturn Cloud)</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;Configuring a Dask cluster can seem daunting at first, but the good news is that the Dask project has a lot of built in heuristics that try its best to anticipate and adapt to your workload based on the machine it is deployed on and the work it receives. Possibly for a long time you can get away with not configuring anything special at all. That being said, if you are looking for some tips to move on from using Dask locally, or have a Dask cluster that you are ready to optimize with some more in-depth configuration, these tips and tricks will help guide you and link you to the best Dask docs on the topic!&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/30/beginners-config.md&lt;/span&gt;, line 12)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="how-to-host-a-distributed-dask-cluster"&gt;
&lt;h1&gt;How to host a distributed dask cluster&lt;/h1&gt;
&lt;p&gt;The biggest jump for me was from running a local version of Dask for just an hour or so at a time during development, to standing up a production-ready version of Dask. Broadly there are two styles:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;a static dask cluster – one that is always on, always awake, always ready to accept work&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;an ephemeral dask cluster – one that is spun up or torn down easily with a Python API and, when on, starts a minimal dask scheduler process that only spins up dask workers when work is actually submitted&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Though those are the two broad categories, there are many ways to actually achieve either. It depends on a number of factors, including which cloud provider products you want to use, whether those resources are pre-provisioned for you, and whether you want to use a Python API or a different deployment tool to actually start the Dask processes. An exhaustive list of the different ways you could provision a dask cluster is in the dask docs under &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup.html"&gt;Setup&lt;/a&gt;. As just a taste of what is described in those docs, you could:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Install and start up the dask processes &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/cli.html"&gt;manually from the CLI&lt;/a&gt; on cloud instances you provision, such as AWS EC2 or GCP GCE&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use popular deployment interfaces such as &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/kubernetes-helm.html"&gt;helm for kubernetes&lt;/a&gt; to deploy dask in cloud container clusters you provision, such as AWS Fargate or GCP GKE&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use ‘native’ deployment python APIs, provided by the dask developers, to create (and interactively configure) dask on deployment infrastructure they support, either through the general-purpose &lt;a class="reference external" href="https://gateway.dask.org/"&gt;Dask Gateway&lt;/a&gt; which supports multiple backends, or directly against cluster managers such as kubernetes with &lt;a class="reference external" href="https://kubernetes.dask.org/en/latest/"&gt;dask-kubernetes&lt;/a&gt; or YARN with &lt;a class="reference external" href="https://yarn.dask.org/en/latest/"&gt;dask-yarn&lt;/a&gt;, as long as you’ve already provisioned the kubernetes cluster or hadoop cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use a nearly full-service deployment python API called &lt;a class="reference external" href="https://cloudprovider.dask.org/en/latest/"&gt;Dask Cloud Provider&lt;/a&gt;, which goes one step further and provisions the cluster for you too, as long as you give it AWS credentials (and as of the time of writing, it only supports AWS)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As you can see, there are a ton of options. On top of all of those, you might contract a managed service provider to provision and configure your dask cluster for you according to your specs, such as &lt;a class="reference external" href="https://www.saturncloud.io"&gt;Saturn Cloud&lt;/a&gt; (&lt;em&gt;Disclaimer: one of the authors (Julia Signell) works for Saturn Cloud&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Whatever you choose, the whole point is to unlock the power of parallelism in Python that Dask provides, in as scalable a manner as possible, which is what getting it running on distributed infrastructure is all about. Once you know where and with what API you are going to deploy your dask cluster, the real configuration process for your Dask cluster and its workload begins.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/30/beginners-config.md&lt;/span&gt;, line 30)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-to-choose-instance-type-for-your-cluster"&gt;
&lt;h1&gt;How to choose instance type for your cluster&lt;/h1&gt;
&lt;p&gt;When you are ready to set up your dask cluster for production, you will need to make some decisions about the infrastructure your scheduler and your workers will be running on, especially if you are using one of the options from &lt;a class="reference internal" href="#how-to-host-a-distributed-dask-cluster"&gt;&lt;span class="xref myst"&gt;How to host a distributed dask cluster&lt;/span&gt;&lt;/a&gt; that requires pre-provisioned infrastructure. Whether your infrastructure is on-prem or in the cloud, the classic decision points need to be made:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Memory requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPU requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage requirements&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have tested your workload locally, a simple heuristic is to multiply the CPU, storage, and memory usage of your work by a multiplier related to how scaled down your local experiments are from your expected production usage. For example, if you test your workload locally with a 10% sample of data, multiplying any observed resource usage by at least 10 may get you close to your minimum instance size. Though in reality Dask’s many underlying optimizations mean that it shouldn’t regularly require linear growth of resources to work on more data, this simple heuristic may give you a starting point as a good first pass.&lt;/p&gt;
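&lt;p&gt;As a toy sketch of that multiplier heuristic (all of the observed figures below are hypothetical):&lt;/p&gt;

```python
# Toy sketch of the scale-up heuristic described above.
# The observed figures are hypothetical, from a local run on a 10% sample.
observed = {"memory_gb": 3.2, "cpu_cores": 2, "storage_gb": 1.5}
sample_fraction = 0.10

# Scale each observed resource by the inverse of the sample fraction
# to get a rough minimum instance size for the full workload.
multiplier = 1 / sample_fraction
estimate = {name: usage * multiplier for name, usage in observed.items()}
print(estimate)
```

&lt;p&gt;Treat the result as a starting point only; as noted above, Dask’s optimizations mean resource needs rarely grow perfectly linearly with data size.&lt;/p&gt;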
&lt;p&gt;In the same vein, choosing the smallest available instance, running with a predetermined subset of data, and scaling up until the workload runs effectively gives you a hint towards the minimum instance size. If your local environment is too underpowered to run your flows locally with 10%+ of your source data, if it is a highly divergent environment (for example a different OS, or with many competing applications running in the background), or if it is difficult or annoying to monitor CPU, memory, and storage of your flow’s execution on your local machine, isolating the test case on the smallest workable node is a better option.&lt;/p&gt;
&lt;p&gt;On the flip side, choosing the biggest instance you can afford and observing the discrepancy between max CPU/memory/storage metrics and scaling back based on the ratio of unused resources can be a quicker way to find your ideal size.&lt;/p&gt;
&lt;p&gt;Wherever you land on node size may be heavily influenced by what you want to pay for, but as long as your nodes are big enough to avoid outright out-of-memory errors, what you trade for nodes closest to your minimum run specs is time. Since the point of your Dask cluster is to run distributed, parallel computations, you can get significant time savings if you scale up your instances to allow for more parallelism. If you have long-running models whose hours of training you can reduce to minutes, getting back some of your time or your employees’ time and shortening the feedback loop, then scaling up over your minimum specs is worth it.&lt;/p&gt;
&lt;p&gt;Should your scheduler node and worker nodes be the same size? It may certainly be tempting to provision them at separate instance sizes to optimize resources. It’s worth a quick dive into the general resource requirements of each to get a good sense.&lt;/p&gt;
&lt;p&gt;For the scheduler, a serialized version of each task submitted to it is held in memory for as long as it needs to determine which worker should take the work. This is not necessarily the same amount of memory needed to actually execute the task, but skimping too much on memory here may prevent work from being scheduled. From a CPU perspective, the needs of the scheduler are likely much lower than those of your workers, but starving the scheduler of CPU will cause deadlock, and when the scheduler is stuck or dies your workers cannot get any work either. Storage-wise, the Dask scheduler does not persist much to disk, even temporarily, so its storage needs are quite low.&lt;/p&gt;
&lt;p&gt;For the workers, the specific resource needs of your task code may overtake any generalizations we can make. If nothing else, they need enough memory and CPU to deserialize each task payload and to serialize the result again to send back to the Dask scheduler. Dask workers may persist the results of computations in memory, including distributed across the memory of the cluster, which you can read more about &lt;a class="reference external" href="https://distributed.dask.org/en/latest/memory.html"&gt;here&lt;/a&gt;. Regarding storage, tasks submitted to Dask workers fundamentally should not write to local storage - the scheduler does not guarantee work will be run on a given worker - so storage costs should be directly related to the installation footprint of your workers’ dependencies plus any ephemeral storage the workers use. That ephemeral storage may include in-memory data spilled to local disk when workers run low on memory, as long as &lt;a class="reference external" href="https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.spill"&gt;that behavior isn’t disabled&lt;/a&gt;, which means that reducing memory can increase your ephemeral storage needs.&lt;/p&gt;
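&lt;p&gt;The spill behavior is controlled by a handful of thresholds. A minimal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed.yaml&lt;/span&gt;&lt;/code&gt; fragment showing them (the values here are the documented defaults at the time of writing):&lt;/p&gt;

```yaml
# ~/.config/dask/distributed.yaml
distributed:
  worker:
    memory:
      target: 0.60     # start spilling least-recently-used data to disk
      spill: 0.70      # spill based on total process memory
      pause: 0.80      # pause accepting new work
      terminate: 0.95  # restart the worker
```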
&lt;p&gt;Generally we would recommend simplifying your life and keeping your scheduler and worker nodes the same node size, but if you wanted to optimize them, use the above CPU, memory and storage patterns to give you a starting point for configuring them separately.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/30/beginners-config.md&lt;/span&gt;, line 54)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-to-choose-number-of-workers"&gt;
&lt;h1&gt;How to choose number of workers&lt;/h1&gt;
&lt;p&gt;Every dask cluster has one scheduler and any number of workers. The scheduler keeps track of what work needs to be done and what has already been completed. The workers do work, share results between themselves and report back to the scheduler. More background on what this entails is available in the &lt;a class="reference external" href="https://distributed.dask.org/en/latest/worker.html"&gt;dask.distributed documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When setting up a dask cluster you have to decide how many workers to use. It can be tempting to use many workers, but that isn’t always a good idea. If you use too many workers some may not have enough to do and spend much of their time idle. Even if they have enough to do, they might need to share data with each other which can be slow. Additionally if your machine has finite resources (rather than one node per worker), then each worker will be weaker - they might run out of memory, or take a long time to finish a task.&lt;/p&gt;
&lt;p&gt;On the other hand if you use too few workers you don’t get to take full advantage of the parallelism of dask and your work might take longer to complete overall.&lt;/p&gt;
&lt;p&gt;Before you decide how many workers to use, try using the default. In many cases dask can choose a default that makes use of the size and shape of your machine. If that doesn’t work, then you’ll need some information about the size and shape of your work. In particular you’ll want to know:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;What size is your computer or what types of compute nodes do you have access to?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How big is your data?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the structure of the computation that you are trying to do?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you are working on your local machine, then the size of the computer is fixed and knowable. If you are working on HPC or cloud instances then you can choose the resources allotted to each worker. You make the decision about the size of your cluster based on factors we discussed in &lt;a class="reference internal" href="#how-to-choose-instance-type-for-your-cluster"&gt;&lt;span class="xref myst"&gt;How to choose instance type for your cluster&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dask is often used in situations where the data are too big to fit in memory. In these cases the data are split into chunks or partitions. Each task is computed on the chunk and then the results are aggregated. You will learn about how to change the shape of your data &lt;a class="reference internal" href="#how-to-host-a-distributed-dask-cluster"&gt;&lt;span class="xref myst"&gt;below&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The structure of the computation might be the hardest to reason about. If possible, it can be helpful to try out the computation on a very small subset of the data. You can see the task graph for a particular computation by calling &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.visualize()&lt;/span&gt;&lt;/code&gt;. If the graph is too large to comfortably view inline, then take a look at the &lt;a class="reference external" href="https://docs.dask.org/en/latest/diagnostics-distributed.html"&gt;Dask dashboard&lt;/a&gt; graph tab. This shows the task graph as it runs and lights up each section. To make dask most efficient, you want a task graph that isn’t too big or too interconnected. The &lt;a class="reference external" href="https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs"&gt;dask docs&lt;/a&gt; discuss several techniques for optimizing your task graph.&lt;/p&gt;
&lt;p&gt;To pick the number of workers to use, think about how many concurrent tasks are happening at any given part of the graph. If each task contains a non-trivial amount of work, then the fastest way to run dask is to have a worker for each concurrent task. For chunked data, if each worker is able to comfortably hold one data chunk in memory and do some computation on that data, then the number of chunks should be a multiple of the number of workers. This ensures that there is always enough work for a worker to do.&lt;/p&gt;
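&lt;p&gt;A toy calculation of that rule of thumb (all numbers hypothetical):&lt;/p&gt;

```python
# Pick a chunk count that is a multiple of the worker count, so every
# scheduling round keeps every worker busy. All numbers are hypothetical.
n_workers = 8
total_rows = 10_000_000
rows_per_chunk = 250_000  # what one worker can comfortably hold in memory

n_chunks = -(-total_rows // rows_per_chunk)        # ceiling division
# Round up to the next multiple of n_workers so no worker idles at the end.
n_chunks = -(-n_chunks // n_workers) * n_workers
print(n_chunks)
```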
&lt;p&gt;If you have a highly variable number of tasks, then you can also consider using an adaptive cluster. In an adaptive cluster, you set the minimum and maximum number of workers and let the cluster add and remove workers as needed. When the scheduler determines that some workers aren’t needed anymore it asks the cluster to shut them down, and when more are needed, the scheduler asks the cluster to spin more up. This can work nicely for task graphs that start out with few input tasks then have more tasks in the middle, and then some aggregation or reductions at the end.&lt;/p&gt;
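&lt;p&gt;In Dask this is a one-liner, for example &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cluster.adapt(minimum=2,&lt;/span&gt; &lt;span class="pre"&gt;maximum=20)&lt;/span&gt;&lt;/code&gt;. The clamping decision at its core can be illustrated in plain Python (a toy function of ours, not Dask’s actual algorithm, which also weighs memory use and task duration):&lt;/p&gt;

```python
import math

def desired_workers(pending_tasks, tasks_per_worker, minimum, maximum):
    """Toy illustration of adaptive scaling: clamp the worker count
    implied by the task backlog between the configured bounds."""
    implied = math.ceil(pending_tasks / tasks_per_worker)
    return max(minimum, min(maximum, implied))

print(desired_workers(3, 10, minimum=2, maximum=20))    # quiet start: 2
print(desired_workers(500, 10, minimum=2, maximum=20))  # busy middle: 20
print(desired_workers(40, 10, minimum=2, maximum=20))   # aggregation tail: 4
```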
&lt;p&gt;Once you have started up your cluster with some workers, you can monitor their progress in the &lt;a class="reference external" href="https://docs.dask.org/en/latest/diagnostics-distributed.html"&gt;dask dashboard&lt;/a&gt;. There you can check on their memory consumption, watch their progress through the task graph, and access worker-level logs. Watching your computation in this way provides insight into potential speedups and builds intuition about the number of workers to use in the future.&lt;/p&gt;
&lt;p&gt;The tricky bit about choosing the number of workers is that in practice the size and shape of your machine, data, and task graph can change. Figuring out how many workers to use can end up feeling like endless knob-fiddling. If this is starting to drive you crazy then remember that you can always change the number of workers, even while the cluster is running.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/30/beginners-config.md&lt;/span&gt;, line 82)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-to-choose-nthreads-to-utilize-multithreading"&gt;
&lt;h1&gt;How to choose nthreads to utilize multithreading&lt;/h1&gt;
&lt;p&gt;When starting dask workers themselves, there are two very important configuration options to play against each other: how many workers and how many threads per worker. You can manipulate both on the same worker process with flags, as in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-worker&lt;/span&gt; &lt;span class="pre"&gt;--nprocs&lt;/span&gt; &lt;span class="pre"&gt;2&lt;/span&gt; &lt;span class="pre"&gt;--nthreads&lt;/span&gt; &lt;span class="pre"&gt;2&lt;/span&gt;&lt;/code&gt;, though &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;--nprocs&lt;/span&gt;&lt;/code&gt; simply spins up additional worker processes in the background, so it is cleaner to leave &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;--nprocs&lt;/span&gt;&lt;/code&gt; unset and control the total number of workers through whatever you use to launch them. We already talked about &lt;a class="reference internal" href="#how-to-choose-number-of-workers"&gt;&lt;span class="xref myst"&gt;how to choose number of workers&lt;/span&gt;&lt;/a&gt;, but you may revisit that decision if you change a worker’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;--nthreads&lt;/span&gt;&lt;/code&gt; to increase the amount of work an individual worker can do.&lt;/p&gt;
&lt;p&gt;When deciding the best number of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nthreads&lt;/span&gt;&lt;/code&gt; for your workers, it all boils down to the type of work you expect those workers to do. The fundamental principle is that multiple threads are best for sharing data between tasks, but worse for running code that doesn’t release Python’s GIL (“Global Interpreter Lock”). Increasing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nthreads&lt;/span&gt;&lt;/code&gt; for work that does not release the GIL has no effect; the worker cannot use threading to speed up computation while the GIL is held. This is a common point of confusion for new Dask users who want to increase their parallelism but don’t see any gains from raising the thread count of their workers.&lt;/p&gt;
&lt;p&gt;As discussed in &lt;a class="reference external" href="https://distributed.dask.org/en/latest/worker.html"&gt;the Dask docs on workers&lt;/a&gt;, there are some rules of thumb for when to worry about GIL contention, and thus prefer more workers over fewer, heavier workers with high &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nthreads&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;If your code is mostly pure Python (in non-optimized Python libraries) on non-numerical data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your code calls long-running computations external to Python that don’t explicitly release the GIL&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Conveniently, many Dask users run exclusively numerical computations using libraries optimized for multithreading, namely NumPy, Pandas, SciPy, and others in the PyData stack. If you mostly do numerical computations with those or similarly optimized libraries, you should favor a higher thread count. If your work is truly all numerical, you can specify as many total threads as you have cores. If some of your work causes threads to pause, for example any I/O (to write results to disk, perhaps), you can specify &lt;em&gt;more&lt;/em&gt; threads than you have cores, since some will occasionally sit idle. How many more threads than cores to use is hard to estimate and depends on your workload, but borrowing advice from &lt;a class="reference external" href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor"&gt;concurrent.futures&lt;/a&gt;, five times the number of processors on your machine is a historical upper bound for heavily I/O-dependent workloads.&lt;/p&gt;
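&lt;p&gt;As a rough sketch of these rules of thumb, here is an illustrative helper. It is not part of Dask; the workload labels and the 5x multiplier are assumptions made for the sketch:&lt;/p&gt;

```python
def suggest_nthreads(n_cores, workload):
    # Illustrative helper, not a Dask API: maps the rules of thumb
    # above onto a per-worker thread count.
    if workload == "gil_bound":
        # Pure-Python, GIL-holding work: extra threads add nothing,
        # prefer more worker processes instead.
        return 1
    if workload == "numerical":
        # NumPy/Pandas/SciPy work releases the GIL, so threads up to
        # the core count can run in parallel.
        return n_cores
    if workload == "io_heavy":
        # Threads waiting on I/O sit idle, so oversubscribe; 5x cores
        # is the historical concurrent.futures upper bound.
        return n_cores * 5
    raise ValueError(f"unknown workload: {workload}")

print(suggest_nthreads(8, "gil_bound"))  # 1
print(suggest_nthreads(8, "numerical"))  # 8
print(suggest_nthreads(8, "io_heavy"))   # 40
```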
&lt;/section&gt;
&lt;section id="how-to-chunk-arrays-and-partition-dataframes"&gt;
&lt;h1&gt;How to chunk arrays and partition DataFrames&lt;/h1&gt;
&lt;p&gt;There are many different methods of triggering work in Dask. For instance, you can wrap functions with delayed or submit work directly to the client (for a comparison of the options see &lt;a class="reference external" href="https://docs.dask.org/en/latest/user-interfaces.html"&gt;User Interfaces&lt;/a&gt;). If you are loading structured data into Dask objects, then you are likely using &lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;dask.array&lt;/a&gt; or &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;dask.dataframe&lt;/a&gt;. These modules mimic numpy and pandas respectively, making it easier to interact with large arrays and large tabular datasets.&lt;/p&gt;
&lt;p&gt;When using dask.dataframe and dask.array, computations are divided among workers by splitting the data into pieces. In dask.dataframe these pieces are called &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#partitions"&gt;partitions&lt;/a&gt; and in dask.array they are called &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-chunks.html"&gt;chunks&lt;/a&gt;, but the principle is the same. In the case of dask.array each chunk holds a numpy array and in the case of dask.dataframe each partition holds a pandas dataframe. Either way, each one contains a small part of the data, but is representative of the whole and must be small enough to comfortably fit in worker memory.&lt;/p&gt;
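&lt;p&gt;The splitting principle can be shown in plain Python with a hypothetical &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partition&lt;/span&gt;&lt;/code&gt; helper; the real implementations wrap pandas dataframes and numpy arrays rather than lists:&lt;/p&gt;

```python
def partition(seq, n):
    # Split seq into n roughly equal contiguous pieces, the same idea
    # dask.dataframe uses when dividing a dataset into partitions.
    k, r = divmod(len(seq), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        pieces.append(seq[start:end])
        start = end
    return pieces

print(partition(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```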
&lt;p&gt;Often when loading data, the partitions/chunks will be determined automatically. For instance, when reading from a directory containing many csv files, each file will become a partition. If your data are not split up by default, then it can be done manually using df.set_index or array.rechunk. If they are split up by default and you want to change the shape of the chunks, the dask-level chunks should be a multiple of the file-level chunks (read more about this &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-best-practices.html#orient-your-chunks"&gt;here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As the user, you know how the data are going to be used, so you can often partition it in ways that lead to more efficient computations. For instance if you are going to be aggregating to a monthly step, it can make sense to chunk along the time axis. If instead you are going to be looking at a particular feature at different altitudes, it might make sense to chunk along the altitude. More tips for chunking dask.arrays are described in &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-best-practices.html"&gt;Best Practices&lt;/a&gt;. Another scenario in which it might be helpful to repartition is if you have filtered the data down to a subset of the original. In that case your partitions will likely be too small. See the dask.dataframe &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead"&gt;Best Practices&lt;/a&gt; for more details on how to handle that case.&lt;/p&gt;
&lt;p&gt;When choosing the size of chunks it is best to make them neither too small, nor too big (around 100MB is often reasonable). Each chunk needs to be able to fit into the worker memory and operations on that chunk should take some non-trivial amount of time (more than 100ms). For many more recommendations take a look at the docs on &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-chunks.html"&gt;chunks&lt;/a&gt; and on &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#partitions"&gt;partitions&lt;/a&gt;.&lt;/p&gt;
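&lt;p&gt;As a back-of-the-envelope check on the ~100MB guideline, you can derive a chunk shape from the element size. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;rows_per_chunk&lt;/span&gt;&lt;/code&gt; helper below is hypothetical, not a Dask function:&lt;/p&gt;

```python
def rows_per_chunk(target_bytes, itemsize, row_width):
    # How many rows fit in one chunk of roughly target_bytes, given
    # elements of itemsize bytes and row_width elements per row.
    return max(1, target_bytes // (itemsize * row_width))

# float64 elements (8 bytes), 5000 columns, aiming for ~100 MB chunks
rows = rows_per_chunk(100 * 2**20, 8, 5000)
print(rows)  # 2621, i.e. chunks of shape (2621, 5000) at just under 100 MiB
```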
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;em&gt;We hope this helps you make decisions about whether to configure your Dask deployment differently and give you the confidence to try it out. We found all of this great information in the Dask docs, so if you are feeling inspired please follow the links we’ve sprinkled throughout and learn even more about Dask!&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/07/30/beginners-config/"/>
    <summary>Configuring a Dask cluster can seem daunting at first, but the good news is that the Dask project has a lot of built-in heuristics that do their best to anticipate and adapt to your workload, based on the machine it is deployed on and the work it receives. You can possibly get away with not configuring anything special at all for a long time. That being said, if you are looking for some tips for moving on from using Dask locally, or have a Dask cluster that you are ready to optimize with some more in-depth configuration, these tips and tricks will help guide you and link you to the best Dask docs on each topic!</summary>
    <category term="config" label="config"/>
    <category term="distributed" label="distributed"/>
    <published>2020-07-30T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters/</id>
    <title>The current state of distributed Dask clusters</title>
    <updated>2020-07-23T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;Dask enables you to build up a graph of the computation you want to perform and then executes it in parallel for you. This is great for making best use of your computer’s hardware. It is also great when you want to expand beyond the limits of a single machine.&lt;/p&gt;
&lt;p&gt;In this post we will cover:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#manual-setup"&gt;&lt;span class="xref myst"&gt;Manual cluster setup&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#cluster-managers"&gt;&lt;span class="xref myst"&gt;Review of deployment options today&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#future"&gt;&lt;span class="xref myst"&gt;Analysis of that state&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="manual-setup"&gt;

&lt;p&gt;Let’s dive in by covering the most straightforward way to set up a distributed Dask cluster.&lt;/p&gt;
&lt;section id="setup-scheduler-and-workers"&gt;
&lt;h2&gt;Setup scheduler and workers&lt;/h2&gt;
&lt;p&gt;Imagine we have three computers; we will call them &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MachineA&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MachineB&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MachineC&lt;/span&gt;&lt;/code&gt;. Each of these machines has a functioning Python environment and we have &lt;a class="reference external" href="https://docs.dask.org/en/latest/install.html"&gt;installed Dask with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;conda&lt;/span&gt; &lt;span class="pre"&gt;install&lt;/span&gt; &lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;. To pull them together into a Dask cluster, we start by running a scheduler on &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MachineA&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight-console notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;dask-scheduler
&lt;span class="go"&gt;distributed.scheduler - INFO - -----------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;distributed.scheduler - INFO - Local Directory: /tmp/scheduler-btqf8ve1&lt;/span&gt;
&lt;span class="go"&gt;distributed.scheduler - INFO - -----------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;distributed.scheduler - INFO - Clear task state&lt;/span&gt;
&lt;span class="go"&gt;distributed.scheduler - INFO -   Scheduler at: tcp://MachineA:8786&lt;/span&gt;
&lt;span class="go"&gt;distributed.scheduler - INFO -   dashboard at:               :8787&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Next we need to start a worker process on both &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MachineB&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MachineC&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight-console notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;dask-worker&lt;span class="w"&gt; &lt;/span&gt;tcp://MachineA:8786
&lt;span class="go"&gt;distributed.nanny - INFO -         Start Nanny at:    &amp;#39;tcp://127.0.0.1:51224&amp;#39;&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:51225&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:51225&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -          dashboard at:            127.0.0.1:51226&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO - Waiting to connect to:        tcp://MachineA:8786&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO - -------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -               Threads:                          4&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -                Memory:                    8.00 GB&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -       Local Directory:       /tmp/worker-h3wfwg7j&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO - -------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO -         Registered to:        tcp://MachineA:8786&lt;/span&gt;
&lt;span class="go"&gt;distributed.worker - INFO - -------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;distributed.core - INFO - Starting established connection&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If we start a worker on both of our two spare machines, Dask will autodetect the resources on each machine and make them available to the scheduler. In the example above the worker has detected 4 CPU cores and 8GB of RAM. Therefore our scheduler has access to a total of 8 cores and 16GB of RAM, and it will use these resources to run through the computation graph as quickly as possible. If we add more workers on more machines then the amount of resources available to the scheduler increases and computation times should get shorter.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note&lt;/strong&gt;: While the scheduler machine probably has the same resources as the other two, these will not be used in the computation.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Lastly we need to connect to our scheduler from our Python session.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tcp://MachineA:8786&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Creating this &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; object within the Python global namespace means that any Dask code you execute will detect this and hand the computation off to the scheduler which will then execute on the workers.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="accessing-the-dashboard"&gt;
&lt;h2&gt;Accessing the dashboard&lt;/h2&gt;
&lt;p&gt;The Dask distributed scheduler also has a dashboard which can be opened in a web browser. As you can see in the output above the default location for this is on the scheduler machine at port &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;8787&lt;/span&gt;&lt;/code&gt;. So you should be able to navigate to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;http://MachineA:8787&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;a href="https://i.imgur.com/VzQIVpI.png"&gt;
&lt;img alt="Dask dashboard" src="https://i.imgur.com/VzQIVpI.png" width="100%" align="center"&gt;
&lt;/a&gt;
&lt;p&gt;If you are using Jupyter Lab as your Python environment you are also able to open individual plots from the dashboard as windows in Jupyter Lab with the &lt;a class="reference external" href="https://github.com/dask/dask-labextension"&gt;Dask Lab Extension&lt;/a&gt;.&lt;/p&gt;
&lt;a href="https://i.imgur.com/SNk6F0H.png"&gt;
&lt;img alt="Dask Lab Extension" src="https://i.imgur.com/SNk6F0H.png" width="100%" align="center"&gt;
&lt;/a&gt;
&lt;/section&gt;
&lt;section id="recap"&gt;
&lt;h2&gt;Recap&lt;/h2&gt;
&lt;p&gt;In this minimal example we installed Dask on some machines, ran a distributed scheduler on one of them and workers on the others. We then connected to our cluster from our Python session and opened the dashboard to keep an eye on the cluster.&lt;/p&gt;
&lt;p&gt;What we haven’t covered is where these machines came from in the first place. In the rest of this post we will discuss the different ways that folks tend to run clusters out in the wild and give an overview of the various tools that exist to help you set up Dask clusters on a variety of infrastructure.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="cluster-requirements"&gt;
&lt;h1&gt;Cluster requirements&lt;/h1&gt;
&lt;p&gt;In order to run a Dask cluster you must be able to install Dask on a machine and start the scheduler and worker components. These machines need to be able to communicate via a network so that these components can speak to each other.&lt;/p&gt;
&lt;p&gt;You also need to be able to access the scheduler from your Python session via a network in order to connect the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; and access the dashboard.&lt;/p&gt;
&lt;p&gt;Lastly the Python environment in the Python session where you create your &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; must match the Python environment where the workers are running. This is because Dask uses &lt;a class="reference external" href="https://github.com/cloudpipe/cloudpickle"&gt;cloudpickle&lt;/a&gt; to serialize objects and send them to workers and to retrieve results. Therefore package versions must match in both locations.&lt;/p&gt;
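&lt;p&gt;A pickle round trip sketches what happens on the wire. This simplified example names the callable as a string, unlike the real protocol; cloudpickle extends the same mechanism to lambdas and interactively defined functions that plain pickle cannot handle:&lt;/p&gt;

```python
import pickle

# Dask serializes tasks, arguments, and results to move them between
# client, scheduler, and workers. A simplified, illustrative task:
task = {"function": "numpy.sum", "args": ([1, 2, 3],)}

payload = pickle.dumps(task)      # bytes that travel over the network
restored = pickle.loads(payload)  # reconstructed on the receiving end

print(restored == task)  # True
```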
&lt;p&gt;We will need to bear these requirements in mind as we discuss the different platforms that folks generally want to run Dask on.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cluster-types"&gt;
&lt;h1&gt;Cluster types&lt;/h1&gt;
&lt;p&gt;There are two “types” of clusters that I tend to see folks running: fixed clusters and ephemeral clusters.&lt;/p&gt;
&lt;section id="fixed-clusters"&gt;
&lt;h2&gt;Fixed clusters&lt;/h2&gt;
&lt;p&gt;One common way of setting up a cluster is to run the scheduler and worker commands as described above, but leave them running indefinitely. For the purpose of this article I’ll refer to this as a “fixed cluster”. You may use something like &lt;a class="reference external" href="https://www.freedesktop.org/wiki/Software/systemd/"&gt;systemd&lt;/a&gt; or &lt;a class="reference external" href="http://supervisord.org/"&gt;supervisord&lt;/a&gt; to manage the processes and ensure they are always running on the machines. The Dask cluster can then be treated as a service.&lt;/p&gt;
&lt;p&gt;In this paradigm once a cluster is set up folks may start their Python session, connect their client to this existing cluster, do some work and disconnect again. They might later come back to that cluster and run further work. The cluster will sit idle in the meantime.&lt;/p&gt;
&lt;p&gt;It is also common in this paradigm for multiple users to share this single cluster. However, this is not recommended, as the Dask scheduler does not manage users or clients separately and work will be executed on a first-come, first-served basis. Therefore we recommend that users use the cluster one at a time.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="ephemeral-clusters"&gt;
&lt;h2&gt;Ephemeral clusters&lt;/h2&gt;
&lt;p&gt;An ephemeral cluster is one which only exists for the duration of the work. In this case a user may SSH onto the machines, run the commands to set up the cluster, connect a client and perform work, then disconnect and exit the Dask processes. A basic way of doing this would be to create a bash script which calls &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ssh&lt;/span&gt;&lt;/code&gt; and sets up the cluster. You would run this script in the background while performing your work and then kill it once you are done. We will cover other implementations of this in the coming sections.&lt;/p&gt;
&lt;p&gt;Ephemeral clusters allow you to leverage a bunch of machines but free them up again when you are done. This is especially useful when you are using a system like a cloud service or a batch scheduler where you have limited credits, or are charged for provisioned resources.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="adaptivity"&gt;
&lt;h1&gt;Adaptivity&lt;/h1&gt;
&lt;p&gt;Ephemeral clusters are also generally easier to scale as you will likely have an automated mechanism for starting workers. The Dask scheduler maintains an estimate of how long it expects the outstanding work will take to complete. If the scheduler has a mechanism for starting and stopping workers then it will scale up the workers in an attempt to complete all outstanding work within 5 seconds. This is referred to as adaptive mode.&lt;/p&gt;
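&lt;p&gt;As a toy model of that scaling decision (not the actual scheduler code), the target worker count follows from the estimated seconds of outstanding work and the five-second goal:&lt;/p&gt;

```python
import math

def adaptive_worker_target(outstanding_seconds, target=5.0):
    # Toy model of adaptive scaling: request enough workers to clear
    # the estimated backlog within roughly the target duration.
    return max(1, math.ceil(outstanding_seconds / target))

print(adaptive_worker_target(60))  # 12 workers for ~60s of queued work
print(adaptive_worker_target(2))   # 1 worker is enough
```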
&lt;p&gt;The mechanisms for starting and stopping workers are added via plugins. Many of the implementations we are about to discuss include this logic.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="connectivity"&gt;
&lt;h1&gt;Connectivity&lt;/h1&gt;
&lt;p&gt;Dask uses TCP to communicate between client, scheduler and workers by default. This means that all of these components must be on a TCP/IP network with open routes between the machines. Many connectivity problems stem from firewalls or private networks blocking connections between certain components. An example of this would be running Dask on a cloud platform like AWS, but running the Python session and client on your laptop while sitting in a coffee shop using the free wifi. You must ensure you are able to route traffic between components, either by exposing the Dask cluster to the internet or by connecting your laptop to the private network via a VPN or tunnel.&lt;/p&gt;
&lt;p&gt;There is also &lt;a class="reference external" href="https://blog.dask.org/2019/06/09/ucx-dgx"&gt;ongoing work to add support for UCX&lt;/a&gt; to Dask, which will allow it to make use of InfiniBand or NVLink networks where they are available.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cluster-managers"&gt;
&lt;h1&gt;Cluster Managers&lt;/h1&gt;
&lt;p&gt;In the following section we are going to cover a range of cluster manager implementations which are available within the Dask community.&lt;/p&gt;
&lt;p&gt;In the Dask distributed codebase there is a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; superclass which can be subclassed to build various cluster managers for different platforms. Members of the community have taken this and built their own packages which enable creating a Dask cluster on a specific platform, for example &lt;a class="reference external" href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The design of these classes is that you import the cluster manager into your Python session and instantiate it. The object then handles starting the scheduler and worker processes on the target platform. You can then create a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; object as usual from that cluster object to connect to it.&lt;/p&gt;
&lt;p&gt;All of these cluster manager objects create ephemeral clusters: they only exist for the duration of the Python session and are then cleaned up.&lt;/p&gt;
&lt;section id="local-cluster"&gt;
&lt;h2&gt;Local Cluster&lt;/h2&gt;
&lt;p&gt;Let’s start with one of the reference implementations of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; from the Dask distributed codebase: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This cluster manager starts a scheduler on your local machine, and then starts a worker for every CPU core that it finds on the machine.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="ssh-cluster"&gt;
&lt;h2&gt;SSH Cluster&lt;/h2&gt;
&lt;p&gt;Another reference implementation is &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/ssh.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SSHCluster&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;. This is one of the most pure and simple ways of using multiple machines with Dask distributed and is very similar to our initial example in this blog post.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;MachineA&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;MachineB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;MachineC&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The first argument here is a list of machines which we can SSH into and set up a Dask cluster on. The first machine in the list will be used as the scheduler and the rest as workers.&lt;/p&gt;
&lt;p&gt;As the scheduler will likely use far fewer resources than the workers, you may even want to run it locally and make use of all three remote machines as workers.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSHCluster&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;localhost&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;MachineA&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;MachineB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;MachineC&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="speccluster"&gt;
&lt;h2&gt;SpecCluster&lt;/h2&gt;
&lt;p&gt;The last implementation that is included in the core Dask distributed library is &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SpecCluster&lt;/span&gt;&lt;/code&gt;. This is actually another superclass and is designed to be subclassed by other developers when building cluster managers. However it goes further than &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; in expecting the developer to provide a full specification for schedulers and workers as Python classes. There is also a superclass called &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ProcessInterface&lt;/span&gt;&lt;/code&gt; which is designed to be used when creating those scheduler and worker classes.&lt;/p&gt;
&lt;p&gt;Having standard interfaces means a more consistent experience for users. Many of the cluster managers we will cover next use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SpecCluster&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-kubernetes"&gt;
&lt;h2&gt;Dask Kubernetes&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://kubernetes.dask.org/en/latest/"&gt;Dask Kubernetes&lt;/a&gt; provides a cluster manager for &lt;a class="reference external" href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; called &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;KubeCluster&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Kubernetes provides high-level APIs for scheduling Linux containers on a cluster of machines. It provides abstractions for processes, containers, networks, storage, and more, to enable better use of data centre scale resources.&lt;/p&gt;
&lt;p&gt;As a Dask user, you generally don’t need to care how your cluster is set up. But if you’ve been given access to a Kubernetes cluster by your organisation or institution, you will need to understand those concepts in order to schedule your work on it.&lt;/p&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;KubeCluster&lt;/span&gt;&lt;/code&gt; cluster manager further abstracts away those concepts into the Dask terms we are familiar with.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_kubernetes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KubeCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KubeCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;cluster_specific_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In order for this code to work you will need to have &lt;a class="reference external" href="https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/"&gt;configured your Kubernetes credentials&lt;/a&gt;, in the same way that for the SSH example you will need to configure your keys.&lt;/p&gt;
&lt;p&gt;Your client will also need to be able to access the Dask scheduler, and you probably want to be able to open the dashboard in your browser. However, Kubernetes uses an overlay network, which means that the IP addresses assigned to the scheduler and workers are only routable within the cluster. This is fine for them talking to each other, but it means you won’t be able to get in from the outside.&lt;/p&gt;
&lt;p&gt;One way around this is to ensure your Python session is also running inside the Kubernetes cluster. A popular way of setting up an interactive Python environment on Kubernetes is with &lt;a class="reference external" href="https://zero-to-jupyterhub.readthedocs.io/en/latest/"&gt;Zero to Jupyter Hub&lt;/a&gt;, which gives you access to &lt;a class="reference external" href="https://jupyter.org/"&gt;Jupyter&lt;/a&gt; notebooks running within the Kubernetes cluster.&lt;/p&gt;
&lt;p&gt;The alternative is exposing your scheduler to the external network. You can do this by &lt;a class="reference external" href="https://kubernetes.io/docs/tutorials/kubernetes-basics/expose/expose-intro/"&gt;exposing the Kubernetes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Service&lt;/span&gt;&lt;/code&gt; object&lt;/a&gt; associated with the scheduler or by &lt;a class="reference external" href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;setting up and configuring an Ingress component&lt;/a&gt; for your Kubernetes cluster. Both of these options require some knowledge of Kubernetes.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-helm-chart"&gt;
&lt;h2&gt;Dask Helm chart&lt;/h2&gt;
&lt;p&gt;Another option for running Dask on a Kubernetes cluster is using the &lt;a class="reference external" href="https://github.com/dask/helm-chart"&gt;Dask Helm Chart&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is an example of a fixed cluster setup. Helm is a way of installing specific resources on a Kubernetes cluster, similar to a package manager like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apt&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;yum&lt;/span&gt;&lt;/code&gt;. The Dask Helm chart includes a Jupyter notebook, a Dask scheduler and three Dask workers. The workers can be scaled manually by interacting with the Kubernetes API but not adaptively by the Dask scheduler itself.&lt;/p&gt;
&lt;p&gt;This feels like a different approach from what we’ve seen so far. It gives you a Dask cluster which is always available, and a Jupyter notebook to drive the cluster from. You then have to take your work to the cluster’s Jupyter session rather than spawning a cluster from your existing workplace.&lt;/p&gt;
&lt;p&gt;One benefit of this approach is that because the Jupyter notebook has been set up as part of the cluster it already has the Lab Extension installed and also has been &lt;a class="reference external" href="https://github.com/dask/helm-chart/blob/f413647f90d6e278515b172c623977578a535aa2/dask/templates/dask-jupyter-deployment.yaml#L47-L48"&gt;pre-configured&lt;/a&gt; with the location of the Dask cluster. So unlike previous examples where you need to either give the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; the address of the scheduler or a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; object, in this instance it will auto-detect the cluster from environment variables that are set by the Helm chart.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# The address is loaded from an environment variable&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; without any arguments in other situations where the scheduler location has not been configured it will automatically create a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt; object and use that.&lt;/em&gt;&lt;/p&gt;
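&lt;p&gt;To illustrate that auto-detection: Dask’s configuration system maps environment variables onto configuration values, and the Helm chart points the notebook at the scheduler this way. The snippet below demonstrates only the environment-variable side using the standard library; the address is a made-up example value.&lt;/p&gt;

```python
import os

# The Helm chart sets a variable like this in the Jupyter container;
# the address here is a made-up example value.
os.environ["DASK_SCHEDULER_ADDRESS"] = "tcp://dask-scheduler:8786"

# A bare Client() can then find the scheduler through Dask's config
# system instead of creating a LocalCluster.
address = os.environ.get("DASK_SCHEDULER_ADDRESS")
print(address)  # tcp://dask-scheduler:8786
```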
&lt;/section&gt;
&lt;section id="dask-jobqueue"&gt;
&lt;h2&gt;Dask Jobqueue&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-jobqueue"&gt;Dask Jobqueue&lt;/a&gt; is a set of cluster managers aimed at HPC users.&lt;/p&gt;
&lt;p&gt;When working as a researcher or academic with access to an HPC or Supercomputer you likely have to submit work to that machine via some kind of job queueing system. This is often in the form of a bash script which contains metadata about how much resource you need on the machine and the commands you want to run.&lt;/p&gt;
&lt;p&gt;Dask Jobqueue has cluster manager objects for &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Portable_Batch_System"&gt;PBS&lt;/a&gt;, &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager"&gt;Slurm&lt;/a&gt;, and &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Oracle_Grid_Engine"&gt;SGE&lt;/a&gt;. When creating these cluster managers they will construct scripts for the batch scheduler based on your arguments and submit them using your default credentials.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;cluster_specific_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
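&lt;p&gt;To make the script-generation step concrete, here is a minimal sketch of the kind of batch script a PBS-style cluster manager renders from a resource specification. The directive names and values are illustrative, not exhaustive; the real cluster managers build their scripts from many more options.&lt;/p&gt;

```python
# A hand-rolled sketch of a PBS-style submission script; directive
# names and values are illustrative only.

def render_pbs_script(queue, cores, memory, walltime, worker_command):
    lines = [
        "#!/usr/bin/env bash",
        "#PBS -q " + queue,
        "#PBS -l select=1:ncpus=" + str(cores) + ":mem=" + memory,
        "#PBS -l walltime=" + walltime,
        worker_command,
    ]
    return "\n".join(lines)

script = render_pbs_script(
    queue="regular",
    cores=24,
    memory="100GB",
    walltime="02:00:00",
    worker_command="dask-worker tcp://scheduler:8786 --nthreads 24",
)
print(script)
```

&lt;p&gt;The cluster manager submits a script like this to the batch system on your behalf, once per worker job.&lt;/p&gt;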
&lt;p&gt;As batch systems like these often have a long wait time you may not immediately get access to your cluster object and scaling can be slow. Depending on the queueing policies it may be best to think of this as a fixed sized cluster. However if you have a responsive interactive queue then you can use this like any other autoscaling cluster manager.&lt;/p&gt;
&lt;p&gt;Again, it is expected that your Python session can connect to the IP address of the scheduler. How you ensure this will vary depending on your HPC centre’s setup.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-yarn"&gt;
&lt;h2&gt;Dask Yarn&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://yarn.dask.org/en/latest/"&gt;Dask Yarn&lt;/a&gt; is a cluster manager for traditional &lt;a class="reference external" href="https://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; systems.&lt;/p&gt;
&lt;p&gt;Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is a common piece of infrastructure in Java/Scala ecosystems for processing large volumes of data. However, you can also use its scheduling component, YARN, to schedule Dask workers and leverage the underlying hardware resources.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_yarn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;cluster_specific_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask Yarn is only intended to be used from a Hadoop edge node which will have access to the internal network of the Hadoop cluster.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-cloudprovider"&gt;
&lt;h2&gt;Dask Cloudprovider&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://cloudprovider.dask.org/en/latest/"&gt;Dask Cloudprovider&lt;/a&gt; is a collection of cluster managers for leveraging cloud native APIs.&lt;/p&gt;
&lt;p&gt;Cloud providers such as &lt;a class="reference external" href="https://aws.amazon.com/"&gt;Amazon&lt;/a&gt;, &lt;a class="reference external" href="https://azure.microsoft.com/"&gt;Microsoft&lt;/a&gt; and &lt;a class="reference external" href="https://cloud.google.com/"&gt;Google&lt;/a&gt; have many APIs available for building and running various types of infrastructure. These range from traditional virtual servers running Linux or Windows to higher level APIs that can execute small snippets of code on demand. They have batch systems, Hadoop systems, machine learning systems and more.&lt;/p&gt;
&lt;p&gt;The ideal scenario for running Dask on a cloud provider would be a service which would allow you to run the scheduler and worker with specified Python environments and then connect to them securely from the outside. Such a service doesn’t quite exist, but similar things do to varying degrees.&lt;/p&gt;
&lt;p&gt;One example is &lt;a class="reference external" href="https://aws.amazon.com/fargate/"&gt;AWS Fargate&lt;/a&gt; which is a managed container platform. You can run &lt;a class="reference external" href="https://www.docker.com/"&gt;Docker containers&lt;/a&gt; on demand, each with a unique IP address which can be public or private. This means we can run Dask scheduler and worker processes within a &lt;a class="reference external" href="https://github.com/dask/dask-docker"&gt;Dask container&lt;/a&gt; and connect to them from our Python session. This service is &lt;a class="reference external" href="https://aws.amazon.com/fargate/pricing/"&gt;billed per second&lt;/a&gt; for the requested resources, so it makes most sense as an ephemeral service which has no cost when you aren’t using it.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cloudprovider&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FargateCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FargateCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;cluster_specific_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This cluster manager uses your &lt;a class="reference external" href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration"&gt;AWS credentials&lt;/a&gt; to authenticate and request AWS resources on Fargate, and then connects your local session to the Dask cluster running on the cloud.&lt;/p&gt;
&lt;p&gt;There are even higher level services such as &lt;a class="reference external" href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; or &lt;a class="reference external" href="https://cloud.google.com/functions"&gt;Google Cloud Functions&lt;/a&gt; which allow you to execute code on demand and you are billed for the execution time of the code. These are referred to as “serverless” services as the servers are totally abstracted away. This would be perfect for our Dask cluster, as you could submit the scheduler and workers as the code to run. &lt;strong&gt;However&lt;/strong&gt;, when running these cloud functions it is not possible to get a network connection between them as they do not have routable IP addresses, so there is no way to set up a Dask cluster made of these executing functions. Maybe one day!&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-gateway"&gt;
&lt;h2&gt;Dask Gateway&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gateway.dask.org/"&gt;Dask Gateway&lt;/a&gt; is a central service for managing Dask clusters. It provides a secure API which multiple users can communicate with to request Dask servers. It can spawn Dask clusters on Kubernetes, Yarn or batch systems.&lt;/p&gt;
&lt;p&gt;This tool is targeted at IT administrators who want to enable their users to create Dask clusters, but want to maintain some centralized control instead of each user creating their own thing. This can also be useful for tracking Dask usage and setting per-user limits.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_gateway&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GatewayCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GatewayCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;cluster_specific_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For each user the commands for creating and using a gateway cluster are the same. It is down to the administrator to set up and manage the gateway server and configure &lt;a class="reference external" href="https://gateway.dask.org/authentication.html#"&gt;authentication via kerberos or Jupyter Hub&lt;/a&gt;. They should also provide &lt;a class="reference external" href="https://gateway.dask.org/configuration-user.html"&gt;configuration&lt;/a&gt; to their users so that Dask Gateway knows how to connect to the gateway server. In a large organisation or institution the IT department also likely provisions the machines that staff are using, and so should be able to drop configuration files onto users’ computers.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="local-cuda-cluster"&gt;
&lt;h2&gt;Local CUDA Cluster&lt;/h2&gt;
&lt;p&gt;The last cluster manager I’m going to cover is &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCUDACluster&lt;/span&gt;&lt;/code&gt; from the &lt;a class="reference external" href="https://github.com/rapidsai/dask-cuda"&gt;Dask CUDA&lt;/a&gt; package.&lt;/p&gt;
&lt;p&gt;This is slightly different from the other cluster managers in that it constructs a Dask cluster specifically optimised for a single piece of hardware. In this case it targets machines with GPUs, ranging from your laptop with an onboard NVIDIA GPU to an &lt;a class="reference external" href="https://www.nvidia.com/en-gb/data-center/dgx-2/"&gt;NVIDIA DGX-2&lt;/a&gt; with multiple GPUs running in your datacentre.&lt;/p&gt;
&lt;p&gt;The cluster manager closely follows &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt; in that it creates resources locally on the current machine, but instead of creating one worker per CPU core it creates one per GPU. It also changes some of the configuration defaults to ensure good performance for GPU workloads.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;cluster_specific_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This package also has an alternative Dask worker command called &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-cuda-worker&lt;/span&gt;&lt;/code&gt;, which also modifies the defaults of the Dask worker to ensure it is optimised for GPU work.&lt;/p&gt;
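&lt;p&gt;The one-worker-per-GPU layout can be sketched in plain Python: each worker process is pinned to one device by rotating the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;/code&gt; environment variable so that a different GPU comes first for each worker. The helper below is an illustration of the idea, not Dask CUDA’s actual implementation.&lt;/p&gt;

```python
# Sketch of the one-worker-per-GPU idea: each worker is pinned to one
# device by rotating the visible-devices list so a different GPU comes
# first for each worker. Illustrative only.

def visible_devices_per_worker(device_ids):
    """For each worker, list all devices with its own GPU first."""
    n = len(device_ids)
    return [
        ",".join(str(device_ids[(i + j) % n]) for j in range(n))
        for i in range(n)
    ]

# A machine with three GPUs would get three workers:
print(visible_devices_per_worker([0, 1, 2]))  # ['0,1,2', '1,2,0', '2,0,1']
```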
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/23/current-state-of-distributed-dask-clusters.md&lt;/span&gt;, line 309)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="future"&gt;
&lt;h1&gt;Future&lt;/h1&gt;
&lt;p&gt;Now that we have laid out the current state of the Dask distributed cluster ecosystem let’s discuss where we could go next.&lt;/p&gt;
&lt;p&gt;As shown at the beginning, a Dask cluster is a combination of scheduler, workers and client which together enable distributed execution of Python functions. Setting up your own cluster on your own machines is straightforward, but there is such a variety of ways to provision infrastructure that we now have a number of ways of automating this.&lt;/p&gt;
&lt;p&gt;This variation opens up a number of questions about how we can improve things.&lt;/p&gt;
&lt;section id="do-we-need-more-fixed-cluster-options"&gt;
&lt;h2&gt;Do we need more fixed cluster options?&lt;/h2&gt;
&lt;p&gt;While covering the various cluster managers we only covered one fixed cluster implementation, the Helm chart. Is there a requirement for more fixed clusters? Examples may be &lt;a class="reference external" href="https://aws.amazon.com/cloudformation/"&gt;CloudFormation&lt;/a&gt; or &lt;a class="reference external" href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; templates which follow the same structure as the Helm chart, providing a Jupyter service, Dask scheduler and fixed number of workers.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="can-we-bridge-some-gaps"&gt;
&lt;h2&gt;Can we bridge some gaps?&lt;/h2&gt;
&lt;p&gt;Could the Dask Kubernetes cluster manager connect to an existing cluster that was built using the Helm chart to then perform adaptive scaling? I’ve been asked this a lot but it is currently unclear how to get to this position. The cluster manager and Helm chart use different Kubernetes resources to achieve the same goal, so some unification would be needed before this is possible.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="are-ephemeral-clusters-too-ephemeral"&gt;
&lt;h2&gt;Are ephemeral clusters too ephemeral?&lt;/h2&gt;
&lt;p&gt;Many of the cluster managers only exist for the duration of the Python session. However some like the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;YarnCluster&lt;/span&gt;&lt;/code&gt; allow you to disconnect and reconnect from the cluster. This allows you to treat a YARN cluster more like a fixed cluster.&lt;/p&gt;
&lt;p&gt;In other circumstances the Python session may have a timeout or limit and may be killed before the Dask cluster can complete its work. Would there be benefit in letting the Dask cluster continue to exist? Once the Python session is cleaned up, the client and futures will also be garbage collected. So perhaps not.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="can-we-manage-conda-environments-better"&gt;
&lt;h2&gt;Can we manage conda environments better?&lt;/h2&gt;
&lt;p&gt;Currently it is the responsibility of the person creating the cluster to ensure that the worker’s conda environment matches the one where the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt; is going to be created. On fixed clusters this can be easier as the Python/Jupyter environment can be provided within the same set of systems. However on ephemeral clusters where you may be reaching into a cloud or batch system they may not match your laptop’s environment for example.&lt;/p&gt;
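&lt;p&gt;As one hypothetical way to surface mismatches early, the client and each worker could compute a fingerprint of their installed package versions and compare them on connection. This is a sketch of the idea, not an existing Dask feature; the package mappings are made-up examples.&lt;/p&gt;

```python
import hashlib
import json

def env_fingerprint(packages):
    """Hash a mapping of package name to version into a short digest."""
    blob = json.dumps(sorted(packages.items())).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

# Made-up example environments with one differing package version.
client_env = {"dask": "2.21.0", "numpy": "1.19.1"}
worker_env = {"dask": "2.21.0", "numpy": "1.18.0"}
print(env_fingerprint(client_env) == env_fingerprint(worker_env))  # False
```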
&lt;p&gt;Perhaps there could be integration between workers and conda to create dynamic environments on the fly. Exploring the performance impact of this would be interesting.&lt;/p&gt;
&lt;p&gt;Another option could be enabling users to start a remote Jupyter kernel on a worker. They wouldn’t have access to the same filesystem, but they would share a conda environment.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters/"/>
    <summary>Dask enables you to build up a graph of the computation you want to perform and then executes it in parallel for you. This is great for making best use of your computer’s hardware. It is also great when you want to expand beyond the limits of a single machine.</summary>
    <published>2020-07-23T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/07/21/faster-scheduling/</id>
    <title>Faster Scheduling</title>
    <updated>2020-07-21T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin (Coiled)</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 8)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;This post discusses Dask overhead costs for task scheduling,
and then lays out a rough plan for acceleration.&lt;/p&gt;
&lt;p&gt;This post is written for other maintainers, and often refers to internal
details. It is not intended for broad readability.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 16)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-does-this-problem-present"&gt;
&lt;h1&gt;How does this problem present?&lt;/h1&gt;
&lt;p&gt;When we submit large graphs there is a bit of a delay between us calling
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.compute()&lt;/span&gt;&lt;/code&gt; and work actually starting on the workers. In some cases, that
delay can affect usability and performance.&lt;/p&gt;
&lt;p&gt;Additionally, in far fewer cases, the gaps in between tasks can be an issue,
especially if those tasks are very short and for some reason can not be made
longer.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 26)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="who-cares"&gt;
&lt;h1&gt;Who cares?&lt;/h1&gt;
&lt;p&gt;First, this is a problem that affects about 1-5% of Dask users. These are people
who want to process millions of tasks relatively quickly. Let’s list a few use
cases:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Xarray/Pangeo workloads at the 10-100TB scale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NVIDIA RAPIDS workloads on large tabular data (GPUs make computing fast, so other costs become
relatively larger)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some mystery use cases inside of some hedge funds&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It does not affect the everyday user, who processes 100GB to a few TB of data,
and doesn’t mind waiting 10s for things to start running.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 40)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="coarse-breakdown-of-costs"&gt;
&lt;h1&gt;Coarse breakdown of costs&lt;/h1&gt;
&lt;p&gt;When you call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x.sum().compute()&lt;/span&gt;&lt;/code&gt; a few things happen:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph generation:&lt;/strong&gt; Some Python code in a Dask collection library, like
dask array, calls the sum function, which generates a task graph on the
client side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Optimization:&lt;/strong&gt; We then optimize that graph, also on the client
side, in order to remove unnecessary work, fuse tasks, apply important high
level optimizations, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Serialization:&lt;/strong&gt; We now pack up that graph in a form that can be
sent over to the scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Communication:&lt;/strong&gt; We fire those bytes across a wire over to the
scheduler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduler.update_graph:&lt;/strong&gt; The scheduler receives these bytes, unpacks
them, and then updates its own internal data structures&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduling:&lt;/strong&gt; The scheduler then assigns ready tasks to workers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Communicate to workers:&lt;/strong&gt; The scheduler sends out lots of smaller
messages to each of the workers with the tasks that they can perform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workers work:&lt;/strong&gt; The workers then perform this work, and start
communicating back and forth with the scheduler to receive new instructions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
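&lt;p&gt;To ground steps 1 and 6 in something concrete: at its core a Dask graph is a dictionary mapping keys to tasks, where a task is a tuple of a callable and its arguments, and string arguments may name other keys. The toy graph and naive single-threaded executor below sketch what the client generates and the scheduler ultimately dispatches; the real machinery layers optimization, serialization and distribution on top.&lt;/p&gt;

```python
from operator import add

# A toy Dask-style graph: keys map either to data or to a task tuple
# (callable, *arguments), where string arguments name other keys.
graph = {
    "a": 1,
    "b": 2,
    "partial": (add, "a", "b"),
    "total": (add, "partial", 10),
}

def get(dsk, key):
    """Naive single-threaded executor for the toy graph above."""
    value = dsk[key]
    if isinstance(value, tuple):
        func, *args = value
        return func(*(get(dsk, a) if a in dsk else a for a in args))
    return value

print(get(graph, "total"))  # 13
```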
&lt;p&gt;Generally most people today are concerned with steps 1-6. Once things get out
to the workers and progress bars start moving people tend to care a bit less
(but not zero).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 66)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-do-other-people-do"&gt;
&lt;h1&gt;What do other people do?&lt;/h1&gt;
&lt;p&gt;Let’s look at a few things that other projects do, and see if there are things
that we can learn. These are commonly suggested, but there are challenges with
most of them.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rewrite the scheduler in C++/Rust/C/Cython&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Proposal: Python is slow. Want to make it faster? Don’t use Python. See
academic projects.&lt;/p&gt;
&lt;p&gt;Challenge: This makes sense for some parts of the pipeline above, but not
for others. It also makes it harder to attract maintainers.&lt;/p&gt;
&lt;p&gt;What we should consider: Some parts of the scheduler and optimization
algorithms could be written in a lower level language, maybe Cython. We’ll
need to be careful about maintainability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed scheduling&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Proposal: The scheduler is slow, maybe have many schedulers? See Ray.&lt;/p&gt;
&lt;p&gt;Challenge: It’s actually really hard to make the kinds of decisions that
Dask has to make if scheduling state is spread on many computers.
Distributed scheduling works better when the workload is either very
uniform or highly decoupled.
Distributed scheduling is really attractive to people who like solving
interesting/hard problems.&lt;/p&gt;
&lt;p&gt;What we should consider: We can move some simple logic down to the workers.
We’ve already done this with the easy stuff though.
It’s not clear how much additional benefit there is here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build specialty scheduling around collections&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Proposal: If Dask were to become just a dataframe library or just an array
computing library then it could special-case things more effectively. See
Spark, Mars, and others.&lt;/p&gt;
&lt;p&gt;Challenge: Yes, but Dask is not a dataframe library or an array library.
The three use cases we mention above are all very different.&lt;/p&gt;
&lt;p&gt;What we should consider: modules like dask array and dask dataframe should
develop high level query blocks, and we should endeavor to
communicate these subgraphs over the wire directly so that they are more
compact.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 113)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-should-we-actually-do"&gt;
&lt;h1&gt;What should we actually do?&lt;/h1&gt;
&lt;p&gt;Because our pipeline has many stages, each of which can be slow for different
reasons, we’ll have to do many things. Additionally, this is a hard problem
because changing one piece of the project at this level has repercussions for
many other pieces. The rest of this post tries to lay out a consistent set of
changes. Let’s start with a summary:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;For Dask array/dataframe let’s use high level graphs more aggressively so
that we can communicate only abstract representations between the client
and scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But this breaks low level graph optimizations (fuse, cull, and slice
fusion in particular). We can make these unnecessary with two changes:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We can make high level graphs considerably smarter to handle cull and slice fusion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can move a bit more of the scheduling down to the workers to
replicate the advantages of low-level fusion there&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, once all of the graph manipulation happens on the scheduler, let’s
try to accelerate it, hopefully in a language that the current dev
community can understand, like Cython&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In parallel, let’s take a look at our network stack&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We’ll go into these in more depth below.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 136)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="graph-generation"&gt;
&lt;h1&gt;Graph Generation&lt;/h1&gt;
&lt;section id="high-level-graph-history"&gt;
&lt;h2&gt;High Level Graph History&lt;/h2&gt;
&lt;p&gt;A year or two ago we moved graph generation costs from user-code-typing time to
graph-optimization-time with high level graphs:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="c1"&gt;# graph generation used to happen here&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;  &lt;span class="c1"&gt;# now it happens here&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This really improved usability, and also let us do some high level
optimizations which sometimes allowed us to skip some lower-level optimization
costs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="can-we-push-this-further"&gt;
&lt;h2&gt;Can we push this further?&lt;/h2&gt;
&lt;p&gt;The first four stages of our pipeline happen on the client:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph generation:&lt;/strong&gt; Some Python code in a Dask collection library, like
dask array, calls the sum function, which generates a task graph on the
client side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Optimization:&lt;/strong&gt; We then optimize that graph, also on the client
side, in order to remove unnecessary work, fuse tasks, apply important high
level optimizations, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Serialization:&lt;/strong&gt; We now pack up that graph in a form that can be
sent over to the scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Communication:&lt;/strong&gt; We send those bytes across the wire to the
scheduler&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If we’re able to stay with the high level graph representation through these
stages all the way until graph communication, then we can communicate a far
more compact representation up to the scheduler. We can drop a lot of these
costs, at least for the high level collection APIs (delayed and client.submit
would still be slow, client.map might be ok though).&lt;/p&gt;
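&lt;p&gt;To see why this matters for message size, compare a materialized low-level
graph (one concrete task per chunk) with an abstract description of the same
operation. This is a toy sketch; real high level layers carry more metadata,
but the asymmetry is the point:&lt;/p&gt;

```python
import pickle

# Materialized low-level graph: one concrete task per chunk
low_level = {("add", i): ("add_chunk", ("x", i), 1) for i in range(100_000)}

# Abstract "high level layer": just the recipe for generating those tasks
high_level = {"op": "add", "input": "x", "rhs": 1, "npartitions": 100_000}

# The abstract form is orders of magnitude smaller on the wire
print(len(pickle.dumps(low_level)), "vs", len(pickle.dumps(high_level)), "bytes")
```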
&lt;p&gt;This has a couple of other nice benefits:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The user’s code won’t block, and we can alert the user immediately that we’re
on the job&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve centralized costs in just the scheduler,
so there is now only one place where we might have to think about low-level code&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;(some conversation here: &lt;a class="reference external" href="https://github.com/dask/distributed/issues/3872"&gt;dask/distributed#3872&lt;/a&gt;)&lt;/p&gt;
&lt;/section&gt;
&lt;section id="however-low-level-graph-optimizations-are-going-to-be-a-problem"&gt;
&lt;h2&gt;However, low-level graph optimizations are going to be a problem&lt;/h2&gt;
&lt;p&gt;In principle changing the distributed scheduler to accept a variety of graph
layer types is a tedious but straightforward problem. I’m not concerned.&lt;/p&gt;
&lt;p&gt;The bigger concern is what to do with low-level graph optimizations.
Today we have three of these that really matter:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Task fusion: this is what keeps your &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; task merged with your
subsequent blockwise tasks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Culling: this is what makes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.head()&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x[0]&lt;/span&gt;&lt;/code&gt; fast&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slice fusion: this is why &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x[:100][5]&lt;/span&gt;&lt;/code&gt; works well&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In order for us to transmit abstract graph layers up to the scheduler, we need
to remove the need for these low level graph optimizations. I think that we
can do this with a combination of two approaches:&lt;/p&gt;
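&lt;p&gt;For intuition, culling on a materialized graph amounts to a reachability
traversal over the task dict. Here is a toy version (real culling lives in
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.optimization&lt;/span&gt;&lt;/code&gt; and handles nested tasks):&lt;/p&gt;

```python
def cull(dsk, outputs):
    """Keep only the tasks needed to compute `outputs` (toy culling;
    assumes dependency keys appear directly in each task tuple)."""
    kept = {}
    stack = list(outputs)
    while stack:
        key = stack.pop()
        if key in kept:
            continue
        kept[key] = dsk[key]
        # any element of the task tuple that is itself a key is a dependency
        for part in dsk[key]:
            if part in dsk:
                stack.append(part)
    return kept

# df.head()-style computation: only the first partition is actually needed
dsk = {
    "load-0": ("read", "part0"),
    "load-1": ("read", "part1"),
    "head": ("first_rows", "load-0"),
}
print(sorted(cull(dsk, ["head"])))
```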
&lt;section id="more-clever-high-level-graph-manipulation"&gt;
&lt;h3&gt;More clever high level graph manipulation&lt;/h3&gt;
&lt;p&gt;We already do this a bit with blockwise, which has its own fusion, and which
removes much of the need for fusion generally. But other blockwise-like
operations, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_*&lt;/span&gt;&lt;/code&gt; will probably have to join the Blockwise family.&lt;/p&gt;
&lt;p&gt;Getting culling to work properly may require us to teach each of the individual
graph layers how to track dependencies in each layer type and cull themselves.
This may get tricky.&lt;/p&gt;
&lt;p&gt;Slicing is doable: we just need someone to go in, grok all of the current
slicing optimizations, and make high level graph layers for these
computations. This would be a great project for a sharp master’s student.&lt;/p&gt;
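&lt;p&gt;For intuition, the core of slice fusion is composing two indexing
operations into one. A simplified sketch (assuming positive steps and
in-bounds, non-negative indices, and ignoring bounds clamping):&lt;/p&gt;

```python
def fuse_slices(outer, inner):
    """Compose x[outer][inner] into a single index into x.
    Simplified: positive steps, non-negative in-bounds indices only."""
    o_start, o_step = outer.start or 0, outer.step or 1
    if isinstance(inner, int):
        # x[:100][5] becomes x[5]
        return o_start + inner * o_step
    start = o_start + (inner.start or 0) * o_step
    stop = outer.stop if inner.stop is None else o_start + inner.stop * o_step
    return slice(start, stop, o_step * (inner.step or 1))

data = list(range(1000))
assert data[fuse_slices(slice(0, 100), 5)] == data[:100][5]
assert data[fuse_slices(slice(10, 200, 2), slice(3, 20))] == data[10:200:2][3:20]
```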
&lt;/section&gt;
&lt;section id="send-speculative-tasks-to-the-workers"&gt;
&lt;h3&gt;Send speculative tasks to the workers&lt;/h3&gt;
&lt;p&gt;High level Blockwise fusion handles many of the use cases for low-level fusion,
but not all. For example I/O layers like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.read_parquet&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.from_zarr&lt;/span&gt;&lt;/code&gt;
aren’t fused at a high level.&lt;/p&gt;
&lt;p&gt;We can resolve this either by making them blockwise layers (this requires
expanding the blockwise abstraction, which may be hard) or alternatively we can
start sending not-yet-ready tasks to workers before all of their dependencies
are finished if we’re highly confident that we know where they’re going to go.
This would give us some of the same results as fusion, but would keep all of
the task types separate (which would be nice for diagnostics) and might still
give us some of the same performance benefits that we get from fusion.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="unpack-abstract-graph-layers-on-the-scheduler"&gt;
&lt;h2&gt;Unpack abstract graph layers on the scheduler&lt;/h2&gt;
&lt;p&gt;Once we’ve removed the need for low level optimizations and send the
abstract graph layers up to the scheduler directly, we’ll need to teach the
scheduler how to unpack those graph layers.&lt;/p&gt;
&lt;p&gt;This is a little tricky because the Scheduler can’t run user Python code (for
security reasons). We’ll have to register layer types (like blockwise,
rechunk, dataframe shuffle) that the scheduler knows about and trusts ahead of
time. We’ll still always support custom layers, and these will be at the same
speed that they’ve always been, but hopefully there will be far less need for
these if we go all-in on high level layers.&lt;/p&gt;
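&lt;p&gt;A hypothetical sketch of such a registry (the layer name and spec format
here are invented for illustration): the scheduler materializes only layer
types that were registered and trusted ahead of time.&lt;/p&gt;

```python
# Hypothetical sketch: the scheduler expands only layer types it knows and
# trusts, registered ahead of time. Names and spec format are invented.
LAYER_REGISTRY = {}

def register_layer(name):
    def decorator(unpack_fn):
        LAYER_REGISTRY[name] = unpack_fn
        return unpack_fn
    return decorator

@register_layer("blockwise-add")
def unpack_blockwise_add(spec):
    # expand the abstract spec into one concrete task per block
    return {(spec["name"], i): ("add", (spec["input"], i), spec["rhs"])
            for i in range(spec["nblocks"])}

spec = {"name": "y", "input": "x", "rhs": 1, "nblocks": 3}
tasks = LAYER_REGISTRY["blockwise-add"](spec)
print(len(tasks), "tasks materialized on the scheduler")
```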
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 240)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="rewrite-scheduler-in-low-level-language"&gt;
&lt;h1&gt;Rewrite scheduler in low-level language&lt;/h1&gt;
&lt;p&gt;Once most of the finicky bits are moved to the scheduler, we’ll have one place
where we can focus on low level graph state manipulation.&lt;/p&gt;
&lt;p&gt;Dask’s distributed scheduler is two things:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;A Tornado TCP application that receives signals from clients and workers
and sends signals out to clients and workers&lt;/p&gt;
&lt;p&gt;This is async-heavy networking code&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A complex state machine internally that responds to those state changes&lt;/p&gt;
&lt;p&gt;This is complex, data-structure-heavy Python code&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
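&lt;p&gt;A minimal sketch of what that separation could look like (invented for
illustration): the state machine is plain Python with no networking, so it can
be tested, profiled, or rewritten in Cython on its own, while a networking
layer wraps it.&lt;/p&gt;

```python
class SchedulerState:
    """Toy scheduling state machine: no sockets, no Tornado. It consumes
    events and exposes ready work; a networking layer would wrap it."""

    def __init__(self):
        self.waiting = {}  # key -> set of unfinished dependencies
        self.ready = []    # keys whose dependencies are all met

    def add_task(self, key, deps=()):
        deps = set(deps)
        if deps:
            self.waiting[key] = deps
        else:
            self.ready.append(key)

    def task_finished(self, key):
        # mark `key` done and promote tasks whose dependencies are now met
        for k in list(self.waiting):
            self.waiting[k].discard(key)
            if not self.waiting[k]:
                del self.waiting[k]
                self.ready.append(k)

state = SchedulerState()
state.add_task("x")
state.add_task("y", deps=["x"])
state.task_finished("x")
print(state.ready)
```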
&lt;section id="networking"&gt;
&lt;h2&gt;Networking&lt;/h2&gt;
&lt;p&gt;Jim has an interesting project here that shows promise: &lt;a class="reference external" href="https://github.com/jcrist/ery"&gt;jcrist/ery&lt;/a&gt;.
Reducing latency between workers and the scheduler would be good, and would
help to accelerate stages 7-8 in the pipeline listed at the top of this post.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="state-machine"&gt;
&lt;h2&gt;State machine&lt;/h2&gt;
&lt;p&gt;Rewriting the state machine in some lower-level language would be fine.
Ideally this would be in a language that is easy for the current maintainer
community to maintain (Cython?), but we may also consider making a firmer
interface here that would allow other groups to experiment safely.&lt;/p&gt;
&lt;p&gt;There are some advantages to this (more experimentation by different groups)
but also some costs (splitting of core efforts and mismatches for users).
I also suspect that splitting out probably means that we’ll lose the dashboard,
unless those other groups are very careful to expose the same state to Bokeh.&lt;/p&gt;
&lt;p&gt;There is more exploration to do here. Regardless, I think it probably makes
sense to try to isolate the state machine from the networking system.
Maybe this also makes it easier for people to profile in isolation.&lt;/p&gt;
&lt;p&gt;In speaking with a few different groups, most people have expressed reservations
about having multiple different state machine implementations. This was done in
MapReduce and Spark and resulted in difficult-to-maintain community dynamics.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 282)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="high-level-graph-optimizations"&gt;
&lt;h1&gt;High Level Graph Optimizations&lt;/h1&gt;
&lt;p&gt;Once we have everything in smarter high level graph layers,
we will also be better positioned for optimization.&lt;/p&gt;
&lt;p&gt;We’ll need a better way to write down these optimizations, with a separated
traversal system and a set of rules. A few of us have
written these things before; maybe it’s time we revisit them.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 291)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-we-need"&gt;
&lt;h1&gt;What we need&lt;/h1&gt;
&lt;p&gt;This would require some effort, but I think that it would hit several high
profile problems at once. There are a few tricky things to get right:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A framework for high level graph layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An optimization system for high level graph layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separation of the scheduler into two parts&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I think that we’ll need people who are fairly familiar with Dask to do this right.&lt;/p&gt;
&lt;p&gt;And then there is a fair amount of follow-on work:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Build a hierarchy of layers for dask dataframe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build a hierarchy of layers for dask array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build optimizations for those to remove the need for low level graph
optimizations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rewrite core parts of the scheduler in Cython&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experiment with the networking layer, maybe with a new Comm&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I’ve been thinking about the right way to enact this change.
Historically most Dask changes over the past few years have been incremental or
peripheral, due to how burdened the maintainers are. There might be enough
pressure on this problem, though, that we can get some dedicated engineering
effort from a few organizations, which might change how possible this is.
We’ve gotten 25% time from a few groups. I’m curious if we can get 100% time
for some people for a few months.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/07/21/faster-scheduling/"/>
    <summary>We lay out a plan for faster Dask scheduling: communicate abstract high level graphs to the scheduler, remove the need for low-level graph optimizations, rewrite the scheduler state machine in a lower-level language, and experiment with the network stack.</summary>
    <published>2020-07-21T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/07/17/scipy-2020-maintainers-track/</id>
    <title>Last Year in Review</title>
    <updated>2020-07-17T00:00:00+00:00</updated>
    <author>
      <name>Jacob Tomlinson (NVIDIA)</name>
    </author>
    <content type="html">&lt;p&gt;We recently enjoyed the 2020 SciPy conference from the comfort of our own homes this year. The 19th annual Scientific Computing with Python conference was a virtual conference this year due to the global pandemic. The annual SciPy Conference brought together over 1500 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.&lt;/p&gt;
&lt;p&gt;As part of the maintainers track we presented an update on Dask.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/17/scipy-2020-maintainers-track.md&lt;/span&gt;, line 14)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="video"&gt;

&lt;p&gt;You can find the video on the SciPy YouTube channel. The Dask update runs from 0:00-19:30.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/XC0M76CmzHg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/17/scipy-2020-maintainers-track.md&lt;/span&gt;, line 20)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="slides"&gt;
&lt;h1&gt;Slides&lt;/h1&gt;
&lt;script async class="speakerdeck-embed" data-id="ae0f04df5b7341eaa3e2989221be1889" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"&gt;&lt;/script&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/17/scipy-2020-maintainers-track.md&lt;/span&gt;, line 24)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="talk-summary"&gt;
&lt;h1&gt;Talk Summary&lt;/h1&gt;
&lt;p&gt;Here’s a summary of the main topics covered in the talk. You can also check out the &lt;a class="reference external" href="https://threadreaderapp.com/thread/1280885850914553856.html"&gt;original thread on Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;section id="community-overview"&gt;
&lt;h2&gt;Community overview&lt;/h2&gt;
&lt;p&gt;We’ve been trying to gauge the size of our community lately. The best proxy we have right now is the number of weekly visitors to the &lt;a class="reference external" href="https://docs.dask.org/en/latest/"&gt;Dask documentation&lt;/a&gt;, which currently stands at around 10,000.&lt;/p&gt;
&lt;img alt="Dask documentation analytics showing growth to 10,000 weekly users over the last four years" src="https://pbs.twimg.com/media/EcaS9DpWkAEBaB4.jpg" style="width: 100%;" /&gt;
&lt;p&gt;Dask also came up in the &lt;a class="reference external" href="https://www.jetbrains.com/lp/devecosystem-2020/python/"&gt;Jetbrains Python developer survey&lt;/a&gt;. We were excited to see 5% of all the Python developers who filled out the survey said they use Dask, which shows health in the PyData community as well as in Dask itself.&lt;/p&gt;
&lt;img alt="Jetbrains survey results showing Dask used by 5% of Python users, beaten only by the Spark/hadoop ecosystem" src="https://pbs.twimg.com/media/EcaTTuiX0AIT2KB.jpg" style="width: 100%;" /&gt;
&lt;p&gt;We are running &lt;a class="reference external" href="https://dask.org/survey"&gt;our own survey&lt;/a&gt; at the moment. If you are a Dask user please take a few minutes to fill it out. We would really appreciate it.&lt;/p&gt;
&lt;img alt="Link to the Dask survey" src="https://pbs.twimg.com/media/EcaTlITXYAAVs-y.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="community-events"&gt;
&lt;h2&gt;Community events&lt;/h2&gt;
&lt;p&gt;In February we had an in-person &lt;a class="reference external" href="https://blog.dask.org/2020/04/28/dask-summit"&gt;Dask Summit&lt;/a&gt; where a mixture of OSS maintainers and institutional users met. We had talks and workshops to help figure out our challenges and set our direction.&lt;/p&gt;
&lt;img alt="A room of attendees at the Dask summit" src="https://pbs.twimg.com/media/EcaUbHLXQAAHckq.jpg" style="width: 100%;" /&gt;
&lt;p&gt;The Dask community also has a &lt;a class="reference external" href="https://docs.dask.org/en/latest/support.html"&gt;monthly meeting&lt;/a&gt;! It is held on the first Thursday of the month at 10:00 US Central Time. If you’re a Dask user you are welcome to come to hear updates from maintainers and share what you’re working on.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="community-projects"&gt;
&lt;h2&gt;Community projects&lt;/h2&gt;
&lt;p&gt;There are many projects built on Dask. Looking at the preliminary results from the 2020 Dask survey shows some that are especially popular.&lt;/p&gt;
&lt;img alt="Graph showing the most popular projects built on Dask; Xarray, RAPIDS, XGBoost, Prefect and Iris" src="https://pbs.twimg.com/media/EcaVSHpX0AAMDYs.png" style="width: 100%;" /&gt;
&lt;p&gt;Let’s take a look at each of those.&lt;/p&gt;
&lt;section id="xarray"&gt;
&lt;h3&gt;Xarray&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://xarray.pydata.org/en/stable/"&gt;Xarray&lt;/a&gt; allows you to work on multi-dimensional datasets that have supporting metadata arrays in a Pandas-like way.&lt;/p&gt;
&lt;img alt="Slide showing xarray code example" src="https://pbs.twimg.com/media/EcaVbOaXkAMQ4SU.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="rapids"&gt;
&lt;h3&gt;RAPIDS&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://rapids.ai/"&gt;RAPIDS&lt;/a&gt; is an open-source suite of GPU accelerated Python libraries. Using these tools you can execute end-to-end data science and analytics pipelines entirely on GPUs. All using familiar PyData APIs.&lt;/p&gt;
&lt;img alt="Slide showing RAPIDS dataframe code example" src="https://pbs.twimg.com/media/EcaWFfDXkAEX4B_.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="blazingsql"&gt;
&lt;h3&gt;BlazingSQL&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://blazingsql.com"&gt;BlazingSQL&lt;/a&gt; builds on RAPIDS and Dask to provide an open-source distributed, GPU accelerated SQL engine.&lt;/p&gt;
&lt;img alt="Slide showing BlazingSQL code example" src="https://pbs.twimg.com/media/EcaWW_CXsAM7XP7.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="xgboost"&gt;
&lt;h3&gt;XGBoost&lt;/h3&gt;
&lt;p&gt;While &lt;a class="reference external" href="https://examples.dask.org/machine-learning/xgboost.html"&gt;XGBoost&lt;/a&gt; has been around for a long time you can now prepare your data on your Dask cluster and then bootstrap your XGBoost cluster on top of Dask and hand the distributed dataframes straight over.&lt;/p&gt;
&lt;img alt="Slide showing XGBoost code example" src="https://pbs.twimg.com/media/EcaXKlRWsAAjLYe.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="prefect"&gt;
&lt;h3&gt;Prefect&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.prefect.io/"&gt;Prefect&lt;/a&gt; is a workflow manager which is built on top of Dask’s scheduling engine. “Users organize Tasks into Flows, and Prefect takes care of the rest.”&lt;/p&gt;
&lt;img alt="Slide showing Prefect code example" src="https://pbs.twimg.com/media/EcaXlf-XYAEPY-Z.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="iris"&gt;
&lt;h3&gt;Iris&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://scitools.org.uk/iris/docs/latest/"&gt;Iris&lt;/a&gt;, part of the &lt;a class="reference external" href="https://scitools.org.uk"&gt;SciTools&lt;/a&gt; suite of tools, uses the CF data model giving you a format-agnostic interface for working with your data. It excels when working with multi-dimensional Earth Science data, where tabular representations become unwieldy and inefficient.&lt;/p&gt;
&lt;img alt="Slide showing Iris code example" src="https://pbs.twimg.com/media/EcaX3S9XsAAU-Sm.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="more-tools"&gt;
&lt;h3&gt;More tools&lt;/h3&gt;
&lt;p&gt;These are the tools our community has told us they like so far. But if you use something which didn’t make the list then head to &lt;a class="reference external" href="https://dask.org/survey"&gt;our survey&lt;/a&gt; and let us know! According to PyPI there are many more out there.&lt;/p&gt;
&lt;img alt="Screenshot of PyPI showing 239 packages with Dask in their name" src="https://pbs.twimg.com/media/EcaYZmPWoAANYhr.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="user-groups"&gt;
&lt;h2&gt;User groups&lt;/h2&gt;
&lt;p&gt;There are many user groups who use Dask. Everything from life sciences, geophysical sciences and beamline facilities to finance, retail and logistics. Check out the great &lt;a class="reference external" href="https://youtu.be/t_GRK4L-bnw"&gt;“Who uses Dask?” talk&lt;/a&gt; from &lt;a class="reference external" href="https://twitter.com/mrocklin"&gt;Matthew Rocklin&lt;/a&gt; for more info.&lt;/p&gt;
&lt;img alt="Screenshot 'Who uses Dask?' YouTube video" src="https://pbs.twimg.com/media/EcaYj2JXQAEvgV3.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="for-profit-companies"&gt;
&lt;h2&gt;For profit companies&lt;/h2&gt;
&lt;p&gt;There has been an increase in for-profit companies building tools with Dask. Including &lt;a class="reference external" href="https://coiled.io/"&gt;Coiled Computing&lt;/a&gt;, &lt;a class="reference external" href="https://www.prefect.io/"&gt;Prefect&lt;/a&gt; and &lt;a class="reference external" href="https://www.saturncloud.io/s/"&gt;Saturn Cloud&lt;/a&gt;.&lt;/p&gt;
&lt;img alt="Slide describing the for-profit companies Coiled, Prefect and Saturn Cloud" src="https://pbs.twimg.com/media/EcaZOqgX0AABFpQ.jpg" style="width: 100%;" /&gt;
&lt;p&gt;We’ve also seen large companies like Microsoft’s &lt;a class="reference external" href="https://azure.microsoft.com/en-gb/services/machine-learning/"&gt;Azure ML&lt;/a&gt; team contributing a cluster manager to &lt;a class="reference external" href="https://cloudprovider.dask.org/en/latest/#azure"&gt;Dask Cloudprovider&lt;/a&gt;. This helps folks get up and running with Dask on AzureML quicker and easier.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="recent-improvements"&gt;
&lt;h2&gt;Recent improvements&lt;/h2&gt;
&lt;section id="communications"&gt;
&lt;h3&gt;Communications&lt;/h3&gt;
&lt;p&gt;Moving on to recent improvements there has been a lot of work to get &lt;a class="reference external" href="https://www.openucx.org/"&gt;Open UCX&lt;/a&gt; supported as a protocol in Dask. Which allows worker-worker communication to be accelerated vastly with hardware that supports &lt;a class="reference external" href="https://en.wikipedia.org/wiki/InfiniBand"&gt;Infiniband&lt;/a&gt; or &lt;a class="reference external" href="https://en.wikipedia.org/wiki/NVLink"&gt;NVLink&lt;/a&gt;.&lt;/p&gt;
&lt;img alt="Slide showing worker communication comparison between UCX/Infiniband and TCP with UCX being much faster" src="https://pbs.twimg.com/media/EcaaTxiXQAE4TD0.jpg" style="width: 100%;" /&gt;
&lt;p&gt;There have also been some &lt;a class="reference external" href="https://blogs.nvidia.com/blog/2020/06/22/big-data-analytics-tpcx-bb/"&gt;recent announcements&lt;/a&gt; around NVIDIA blowing away the TPCx-BB benchmark by outperforming the current leader by 20x. This is a huge success for all the open-source projects that were involved, including Dask.&lt;/p&gt;
&lt;img alt="Slide showing TPCx-BB benchmark results" src="https://pbs.twimg.com/media/EcabNUVWoAQGy8e.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="dask-gateway"&gt;
&lt;h3&gt;Dask Gateway&lt;/h3&gt;
&lt;p&gt;We’ve seen increased adoption of &lt;a class="reference external" href="https://gateway.dask.org"&gt;Dask Gateway&lt;/a&gt;. Many institutions are using it as a way to provide their staff with on-demand Dask clusters.&lt;/p&gt;
&lt;img alt="Slide showing Dask Gateway overview" src="https://pbs.twimg.com/media/EcabpirWkAYtx-W.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="cluster-map-plot-aka-pew-pew-pew"&gt;
&lt;h3&gt;Cluster map plot (aka ‘pew pew pew’)&lt;/h3&gt;
&lt;p&gt;The update that got the most 👏 feedback from the SciPy 2020 attendees was the Cluster Map Plot (known to maintainers as the “pew pew pew” plot). This plot shows a high-level overview of your Dask cluster scheduler and workers and the communication between them.&lt;/p&gt;
&lt;p&gt;&lt;video autoplay="" loop="" controls="" poster="https://pbs.twimg.com/tweet_video_thumb/EcacHRcXkAE53eI.jpg"&gt;&lt;source src="https://video.twimg.com/tweet_video/EcacHRcXkAE53eI.mp4" type="video/mp4"&gt;&lt;img alt="" src="https://pbs.twimg.com/tweet_video_thumb/EcacHRcXkAE53eI.jpg"&gt;&lt;/video&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="next-steps"&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;section id="high-level-graph-optimization"&gt;
&lt;h3&gt;High-level graph optimization&lt;/h3&gt;
&lt;p&gt;To wrap up with what Dask is going to be doing next we are going to be continuing to work on high-level graph optimization.&lt;/p&gt;
&lt;img alt="Slide showing High Level Graph documentation page" src="https://pbs.twimg.com/media/EcacZvfWsAIfqTz.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;section id="scheduler-performance"&gt;
&lt;h3&gt;Scheduler performance&lt;/h3&gt;
&lt;p&gt;With feedback from our community we are also going to be focusing on making the &lt;a class="reference external" href="https://github.com/dask/distributed/issues/3783"&gt;Dask scheduler more performant&lt;/a&gt;. There are a few things happening, including a Rust implementation of the scheduler, dynamic task creation, and ongoing benchmarking.&lt;/p&gt;
&lt;img alt="Scheduler performance tasks including a Rust implementation, benchmarking, dynamic tasks and Cython, PyPy and C experiments" src="https://pbs.twimg.com/media/Ecacr6pWoAEd4Tx.jpg" style="width: 100%;" /&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="chan-zuckerberg-foundation-maintainer-post"&gt;
&lt;h2&gt;Chan Zuckerberg Foundation maintainer post&lt;/h2&gt;
&lt;p&gt;Lastly, I’m excited to share that with funding from the &lt;a class="reference external" href="https://chanzuckerberg.com/eoss/proposals/scaling-python-with-dask/"&gt;Chan Zuckerberg Foundation&lt;/a&gt;, Dask will be hiring a maintainer who will focus on growing usage in the biological sciences. If that is of interest to you, keep an eye on &lt;a class="reference external" href="https://twitter.com/dask_dev"&gt;our Twitter account&lt;/a&gt; for more announcements.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/07/17/scipy-2020-maintainers-track/"/>
    <summary>We recently enjoyed the 2020 SciPy conference from the comfort of our own homes. The 19th annual Scientific Computing with Python conference was held virtually due to the global pandemic, and brought together over 1500 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.</summary>
    <category term="Community" label="Community"/>
    <category term="SciPy" label="SciPy"/>
    <category term="Talk" label="Talk"/>
    <published>2020-07-17T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/05/13/large-svds/</id>
    <title>Large SVDs</title>
    <updated>2020-05-13T00:00:00+00:00</updated>
    <author>
      <name>Alistair Miles (Oxford University)</name>
    </author>
    <content type="html">
&lt;section id="summary"&gt;

&lt;p&gt;We perform Singular Value Decomposition (SVD) calculations on large datasets.&lt;/p&gt;
&lt;p&gt;We vary the computation, both by using exact and approximate methods,
and by running on both CPUs and GPUs.&lt;/p&gt;
&lt;p&gt;In the end we compute an approximate SVD of 200GB of simulated data on a multi-GPU machine in 15-20 seconds.
Then we run the same computation on a dataset stored in the cloud,
where we find that I/O is, predictably, a major bottleneck.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="svd-the-simple-case"&gt;
&lt;h1&gt;SVD - The simple case&lt;/h1&gt;
&lt;p&gt;Dask arrays contain a relatively sophisticated SVD algorithm that works in the
tall-and-skinny or short-and-fat cases, but not so well in the roughly-square
case. It works by taking QR decompositions of each block of the array,
combining the R matrices, doing another smaller SVD on those, and then
performing some matrix multiplication to get back to the full result. It’s
numerically stable and decently fast, assuming that the intermediate R
matrices of the QR decompositions mostly fit in memory.&lt;/p&gt;
&lt;p&gt;The memory constraints here are that if you have an &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt;&lt;/code&gt; by &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt; tall and
skinny array (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt; &lt;span class="pre"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt;) cut into &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;k&lt;/span&gt;&lt;/code&gt; blocks then you need to have about &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m**2&lt;/span&gt; &lt;span class="pre"&gt;*&lt;/span&gt; &lt;span class="pre"&gt;k&lt;/span&gt;&lt;/code&gt; space. This is true in many cases, including typical PCA machine learning
workloads, where you have tabular data with a few columns (hundreds at most)
and many rows.&lt;/p&gt;
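The bound above can be made concrete with a little arithmetic. A sketch for an array shaped like the example below, assuming float64 values, `m = 20` columns, and `k = 16` row blocks:

```python
# Space needed for the stacked R matrices of the blockwise QR:
# roughly m**2 * k values
m = 20                        # number of columns
k = 16                        # number of row blocks
bytes_needed = m**2 * k * 8   # 8 bytes per float64 value
print(bytes_needed)           # 51200 bytes, i.e. about 50 kB
```

For tabular data with a few columns this intermediate is tiny, which is why the exact algorithm works so comfortably in the tall-and-skinny case.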
&lt;p&gt;It’s easy to use and quite robust.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 1.60 GB &lt;/td&gt; &lt;td&gt; 100.00 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (10000000, 20) &lt;/td&gt; &lt;td&gt; (625000, 20) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 16 Tasks &lt;/td&gt;&lt;td&gt; 16 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; float64 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="75" height="170" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="25" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="7" x2="25" y2="7" /&gt;
  &lt;line x1="0" y1="15" x2="25" y2="15" /&gt;
  &lt;line x1="0" y1="22" x2="25" y2="22" /&gt;
  &lt;line x1="0" y1="30" x2="25" y2="30" /&gt;
  &lt;line x1="0" y1="37" x2="25" y2="37" /&gt;
  &lt;line x1="0" y1="45" x2="25" y2="45" /&gt;
  &lt;line x1="0" y1="52" x2="25" y2="52" /&gt;
  &lt;line x1="0" y1="60" x2="25" y2="60" /&gt;
  &lt;line x1="0" y1="67" x2="25" y2="67" /&gt;
  &lt;line x1="0" y1="75" x2="25" y2="75" /&gt;
  &lt;line x1="0" y1="82" x2="25" y2="82" /&gt;
  &lt;line x1="0" y1="90" x2="25" y2="90" /&gt;
  &lt;line x1="0" y1="97" x2="25" y2="97" /&gt;
  &lt;line x1="0" y1="105" x2="25" y2="105" /&gt;
  &lt;line x1="0" y1="112" x2="25" y2="112" /&gt;
  &lt;line x1="0" y1="120" x2="25" y2="120" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="120" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 25.412617,0.000000 25.412617,120.000000 0.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;text x="12.706308" y="140.000000" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;20&lt;/text&gt;
&lt;text x="45.412617" y="60.000000" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,45.412617,60.000000)"&gt;10000000&lt;/text&gt;
&lt;/svg&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This works fine in the short and fat case too (when you have far more columns
than rows), but we always assume that one of your dimensions is
unchunked and that the other dimension has chunks that are quite a bit
longer; otherwise things might not fit into memory.&lt;/p&gt;
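If an array arrives chunked along both dimensions, one way to satisfy that assumption is to rechunk so the short dimension becomes a single chunk. A minimal sketch with illustrative sizes:

```python
import dask.array as da

# Tall-and-skinny array accidentally chunked along both dimensions
x = da.random.random((1_000_000, 20), chunks=(125_000, 10))

# Merge the column chunks so the second dimension is unchunked
# (-1 means "a single chunk along this axis")
x = x.rechunk((125_000, -1))

u, s, v = da.linalg.svd(x)  # now safe for the tall-and-skinny algorithm
```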
&lt;/section&gt;
&lt;section id="approximate-svd"&gt;
&lt;h1&gt;Approximate SVD&lt;/h1&gt;
&lt;p&gt;If your dataset is large in both dimensions then the algorithm above won’t work
as is. However, if you don’t need exact results, or if you only need a few of
the components, then there are a number of excellent approximation algorithms.&lt;/p&gt;
&lt;p&gt;Dask array has one of these approximation algorithms implemented in the
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.linalg.svd_compressed"&gt;da.linalg.svd_compressed&lt;/a&gt;
function. And with it we can compute the approximate SVD of very large
matrices.&lt;/p&gt;
&lt;p&gt;We were recently working on a problem (explained below) and found that we were
still running out of memory when dealing with this algorithm. There were two
challenges that we ran into:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The algorithm requires multiple passes over the data, but the Dask task
scheduler was keeping the input matrix in memory after it had been loaded once
in order to avoid recomputation.
Things still worked, but Dask had to move the data to disk and back
repeatedly, which reduced performance significantly.&lt;/p&gt;
&lt;p&gt;We resolved this by including explicit recomputation steps in the algorithm.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Related chunks of data would be loaded at different times, and so would
need to stick around longer than necessary to wait for their associated
chunks.&lt;/p&gt;
&lt;p&gt;We resolved this by engaging task fusion as an optimization pass.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before diving further into the technical solution
we quickly provide the use case that was motivating this work.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="application-genomics"&gt;
&lt;h1&gt;Application - Genomics&lt;/h1&gt;
&lt;p&gt;Many studies are using genome sequencing to study genetic variation
between different individuals within a species. These include
studies of human populations, but also other species such as mice,
mosquitoes or disease-causing parasites. These studies will, in
general, find a large number of sites in the genome sequence where
individuals differ from each other. For example, humans have more
than 100 million variable sites in the genome, and modern studies
like the &lt;a class="reference external" href="https://www.ukbiobank.ac.uk/"&gt;UK BioBank&lt;/a&gt; are working towards
sequencing the genomes of 1 million individuals or more.&lt;/p&gt;
&lt;p&gt;In diploid species like humans, mice or mosquitoes, each individual
carries two genome sequences, one inherited from each parent. At each
of the 100 million variable genome sites there will be two or more
“alleles” that a single genome might carry. One way to think about
this is via the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Punnett_square"&gt;Punnett
square&lt;/a&gt;, which
represents the different possible genotypes that one individual might
carry at one of these variable sites:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://upload.wikimedia.org/wikipedia/commons/9/93/Punnett_Square_Genetic_Carriers.PNG" alt="Punnett square" height="40%" width="40%"&gt;&lt;/p&gt;
&lt;p&gt;In the above there are three possible genotypes: AA, Aa, and aa. For
computational genomics, these genotypes can be encoded as 0, 1, or 2.
In a study of a species with M genetic variants assayed in N
individual samples, we can represent these genotypes as an (M x N)
array of integers. For a modern human genetics study, the scale of
this array might approach (100 million x 1 million). (Although in
practice, the size of the first dimension (number of variants) can be
reduced somewhat, by at least an order of magnitude, because many
genetic variants will carry little information and/or be correlated
with each other.)&lt;/p&gt;
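As a toy illustration of this encoding (hypothetical values, vastly smaller than a real study), with a per-variant allele frequency computed from it:

```python
import numpy as np

# M = 4 variants x N = 3 samples, genotypes encoded as
# 0 (AA), 1 (Aa), or 2 (aa)
genotypes = np.array([
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 0, 0],
], dtype=np.uint8)

# Frequency of the "a" allele per variant: each diploid sample
# contributes two allele copies
allele_freq = genotypes.sum(axis=1) / (2 * genotypes.shape[1])
```

Variants whose allele frequency sits near 0 or 1 carry little information, which is one reason the first dimension can often be reduced in practice.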
&lt;p&gt;These genetic differences are not random, but carry information about
patterns of genetic similarity and shared ancestry between
individuals, because of the way they have been inherited through many
generations. A common task is to perform a dimensionality reduction
analysis on these data, such as a &lt;a class="reference external" href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190"&gt;principal components
analysis&lt;/a&gt;
(SVD), to identify genetic structure reflecting these differences in
degree of shared ancestry. This is an essential part of discovering
genetic variants associated with different diseases, and for learning
more about the genetic history of populations and species.&lt;/p&gt;
&lt;p&gt;Reducing the time taken to compute an analysis such as SVD allows, as in
all science, for exploring larger datasets and testing more
hypotheses in less time. Practically, this means not simply a fast
SVD but an accelerated pipeline end to end, from data loading to
analysis to understanding.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;We want to run an experiment in less time than it takes to make a cup of tea&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="performant-svds-w-dask"&gt;
&lt;h1&gt;Performant SVDs w/ Dask&lt;/h1&gt;
&lt;p&gt;Now that we have that scientific background, let’s transition back to talking about computation.&lt;/p&gt;
&lt;p&gt;To stop Dask from holding onto the data, we intentionally trigger computation
as we build up the graph. This is a bit atypical in Dask calculations (we
prefer to build up as much of the computation as possible before computing),
but useful given the multiple-pass nature of this problem. This was a fairly
easy change, and is available in &lt;a class="reference external" href="https://github.com/dask/dask/pull/5041"&gt;dask/dask #5041&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, we found that it was helpful to turn on moderately wide task
fusion.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;optimization.fuse.ave-width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
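Putting both fixes together, a minimal sketch (with sizes shrunk for illustration) might look like:

```python
import dask
import dask.array as da

# Moderately wide task fusion, as above
dask.config.set({"optimization.fuse.ave-width": 5})

x = da.random.random((10_000, 1_000), chunks=(2_000, 1_000))

# compute=True triggers the explicit intermediate computation from
# dask/dask #5041, so the input need not stay in memory between passes
u, s, v = da.linalg.svd_compressed(x, k=10, compute=True)
s = s.compute()
```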
&lt;/section&gt;
&lt;section id="then-things-work-fine"&gt;
&lt;h1&gt;Then things work fine&lt;/h1&gt;
&lt;p&gt;We’re going to try this SVD on a few different choices of hardware including:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A MacBook Pro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A DGX-2, an NVIDIA workstation with 16 high-end GPUs and fast interconnect&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A twenty-node cluster on AWS&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="macbook-pro"&gt;
&lt;h2&gt;MacBook Pro&lt;/h2&gt;
&lt;p&gt;We can happily perform an SVD on a 20GB array on a MacBook Pro&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This call is no longer entirely lazy, and it recomputes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; a couple times, but
it works, and it works using only a few GB of RAM on a consumer laptop.&lt;/p&gt;
&lt;p&gt;It takes around 2 minutes 30 seconds to compute that on a laptop.
That’s great! It was super easy to try out, didn’t require any special
hardware or setup, and in many cases is fast enough.
By working locally we can iterate quickly.&lt;/p&gt;
&lt;p&gt;Now that things work, we can experiment with different hardware.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="adding-gpus-a-15-second-svd"&gt;
&lt;h2&gt;Adding GPUs (a 15 second SVD)&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can dramatically increase performance by using a multi-GPU machine.
NVIDIA and other manufacturers now make machines with multiple GPUs co-located in the same physical box.
In the following section, we will run the calculations on a &lt;strong&gt;DGX-2&lt;/strong&gt;, a machine with 16 GPUs and fast interconnect between the GPUs.&lt;/p&gt;
&lt;p&gt;Below is almost the same code, running in significantly less time, but with the
following changes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We increase the size of the array by a factor of &lt;strong&gt;10x&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We switch out NumPy for CuPy, a GPU NumPy implementation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (NVLink will dramatically decrease transfer time between workers)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On a DGX-2 we can calculate an SVD of a 200GB Dask array in 10 to 15 seconds.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4"&gt;full notebook is here&lt;/a&gt;,
but the relevant code snippets are below:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Some GPU specific setup&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the data and run the SVD as normal&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
               &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;uint8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To see this run, we recommend viewing this screencast:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/6hmt1gARqp0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="read-dataset-from-disk"&gt;
&lt;h2&gt;Read dataset from Disk&lt;/h2&gt;
&lt;p&gt;While impressive, the computation above is mostly bound by generating random
data and then performing matrix calculations. GPUs are good at both of these
things.&lt;/p&gt;
&lt;p&gt;In practice though, our input array won’t be randomly generated. It will be
coming from some dataset stored on disk or, increasingly commonly, stored in the cloud.
To make things more realistic, we perform a similar calculation with data
stored in a &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/"&gt;Zarr format&lt;/a&gt;
in &lt;a class="reference external" href="https://cloud.google.com/storage"&gt;GCS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this &lt;a class="reference external" href="https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078"&gt;Zarr SVD example&lt;/a&gt;,
we load a 25GB GCS backed data set onto a DGX2,
run a few processing steps, then perform an SVD.
The combination of preprocessing and SVD calculations ran in 18.7 seconds and the data loading took 14.3 seconds.&lt;/p&gt;
&lt;p&gt;Again, on a DGX-2, from data loading to SVD we are running in less time than it would take to make a cup of tea.
However, the data loading can be accelerated.
From GCS we are reading data into the main memory of the machine (host memory), uncompressing the Zarr bits,
then moving the data from host memory to the GPU (device memory). Passing data back and forth between host and device memory can significantly decrease performance. Reading directly into the GPU, bypassing host memory, would improve the overall pipeline.&lt;/p&gt;
&lt;p&gt;And so we come back to a common lesson of high performance computing:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;High performance computing isn’t about doing one thing exceedingly well, it’s
about doing nothing poorly&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In this case GPUs made our computation fast enough that we now need to focus on
other components of our system, notably disk bandwidth, and a direct reader for
Zarr data to GPU memory.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cloud"&gt;
&lt;h2&gt;Cloud&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: one of the authors (Matthew Rocklin) works for Coiled Computing&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can also run this on the cloud with any number of frameworks.
In this case we used the &lt;a class="reference external" href="https://coiled.io"&gt;Coiled Cloud&lt;/a&gt; service to deploy on AWS.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;coiled_cloud&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Cloud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;
&lt;span class="n"&gt;cloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Cloud&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_cluster_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;friends&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;genomics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;16 GiB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;OMP_NUM_THREADS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;OPENBLAS_NUM_THREADS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# &amp;quot;EXTRA_PIP_PACKAGES&amp;quot;: &amp;quot;zarr&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;friends&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;typename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;genomics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# then proceed as normal&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Using 20 machines with a total of 80 CPU cores, we were able to process a
dataset 10x larger than the MacBook Pro example in about the same amount of
time. This shows near-optimal scaling for this problem, which is nice to see
given how complex the SVD calculation is.&lt;/p&gt;
&lt;p&gt;A screencast of this problem is viewable here:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/qaJcAvhgLy4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="compression"&gt;
&lt;h1&gt;Compression&lt;/h1&gt;
&lt;p&gt;One of the easiest ways for us to improve performance is to reduce the size of
this data through compression.
This data is highly compressible for two reasons:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The real-world data itself has structure and repetition
(although the random play data does not)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’re storing entries that take on only four values.
We’re using eight-bit integers when we only need two-bit integers&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s solve the second problem first.&lt;/p&gt;
&lt;section id="compression-with-bit-twiddling"&gt;
&lt;h2&gt;Compression with bit twiddling&lt;/h2&gt;
&lt;p&gt;Ideally Numpy would have a two-bit integer datatype.
Unfortunately it doesn’t, and providing one is hard because memory in computers is
generally addressed in full bytes.&lt;/p&gt;
&lt;p&gt;To work around this we can use bit arithmetic to shove four values into a single byte.
Here are functions that do that, assuming that our array is two-dimensional
and the last dimension is divisible by four.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00000011&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00001100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00110000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b11000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;back&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
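As a quick sanity check (our snippet, not from the original post), the round trip is lossless and the packed array is a quarter the size:

```python
import numpy as np

def compress(x: np.ndarray) -> np.ndarray:
    # Pack four 2-bit entries (values 0-3) into each byte
    out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
    out += x[:, 0::4]
    out += x[:, 1::4] << 2
    out += x[:, 2::4] << 4
    out += x[:, 3::4] << 6
    return out

def decompress(out: np.ndarray) -> np.ndarray:
    # Unpack each byte back into four 2-bit entries
    back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
    back[:, 0::4] = out & 0b00000011
    back[:, 1::4] = (out & 0b00001100) >> 2
    back[:, 2::4] = (out & 0b00110000) >> 4
    back[:, 3::4] = (out & 0b11000000) >> 6
    return back

# Random genotype-like data: entries in {0, 1, 2, 3}
x = np.random.randint(0, 4, size=(8, 16)).astype("uint8")
packed = compress(x)

assert packed.nbytes * 4 == x.nbytes          # four entries per byte
assert np.array_equal(decompress(packed), x)  # lossless round trip
```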
&lt;p&gt;Then, we can use these functions along with Dask to store our data in a
compressed state, but lazily decompress on-demand.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That’s it. We compress each block of our data and store that in memory.
However, the output variable &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; will decompress each chunk before
we operate on it, so we don’t need to worry about handling compressed blocks.&lt;/p&gt;
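A minimal end-to-end sketch of this pattern (our illustration; the array and chunk sizes are arbitrary, and we pass `chunks=` to `map_blocks` so the metadata of the intermediate compressed array stays correct):

```python
import numpy as np
import dask.array as da

def compress(x):
    # Pack four 2-bit entries per byte (as defined above)
    out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
    for i in range(4):
        out += x[:, i::4] << (2 * i)
    return out

def decompress(out):
    # Unpack each byte back into four 2-bit entries
    back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
    for i in range(4):
        back[:, i::4] = (out >> (2 * i)) & 0b11
    return back

x = da.random.randint(0, 4, size=(100, 400), chunks=(50, 100)).astype("uint8")

# Store compressed blocks in memory; decompress lazily, chunk by chunk
y = (
    x.map_blocks(compress, chunks=(50, 25))
     .persist()
     .map_blocks(decompress, chunks=(50, 100))
)

assert bool((y == x).all().compute())
```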
&lt;/section&gt;
&lt;section id="compression-with-zarr"&gt;
&lt;h2&gt;Compression with Zarr&lt;/h2&gt;
&lt;p&gt;A slightly more general but probably less efficient route would be to compress
our arrays with a proper compression library like Zarr.&lt;/p&gt;
&lt;p&gt;The example below shows how we do this in practice.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;zarr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;
&lt;span class="n"&gt;compressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lz4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clevel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BITSHUFFLE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zarr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Additionally, if we’re using the dask-distributed scheduler then we want to
make sure that the Blosc compression library doesn’t use additional threads.
That way we avoid parallel calls of a parallel library, which can cause
contention.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;set_no_threads_blosc&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Stop blosc from using multiple threads &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;
    &lt;span class="n"&gt;numcodecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

&lt;span class="c1"&gt;# Run on all workers&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_worker_plugin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_no_threads_blosc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This approach is more general, and probably a good trick to have up one’s
sleeve, but it doesn’t work on GPUs, which is why we ended up
going with the bit-twiddling approach in the previous section, which uses an API
that is uniformly available in both the Numpy and CuPy libraries.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;In this post we did a few things, all around a single important problem in
genomics.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We learned a bit of science&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We translated a science problem into a computational problem,
and in particular into a request to perform large singular value decompositions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We used a canned algorithm in dask.array that performed pretty well,
assuming that we’re comfortable going over the array in a few passes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We then tried that algorithm on three architectures quickly&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A Macbook Pro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A multi-GPU machine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A cluster in the cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally we talked about some tricks to pack more data into the same memory
with compression&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This problem was nice in that we got to dive deep into a technical science
problem, and yet also try a bunch of architectures quickly to investigate
hardware choices that we might make in the future.&lt;/p&gt;
&lt;p&gt;We used several technologies here today, made by several different communities
and companies. It was great to see how they all worked together seamlessly to
provide a flexible-yet-consistent experience.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/05/13/large-svds/"/>
    <category term="CuPy" label="CuPy"/>
    <category term="GPU" label="GPU"/>
    <category term="array" label="array"/>
    <published>2020-05-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/04/28/dask-summit/</id>
    <title>Dask Summit</title>
    <updated>2020-04-28T00:00:00+00:00</updated>
    <author>
      <name>Mike McCarty (Capital One Center for Machine Learning) and Matthew Rocklin (Coiled Computing)</name>
    </author>
    <content type="html">&lt;p&gt;In late February members of the Dask community gathered together in Washington, DC.
This was a mix of open source project maintainers
and active users from a broad range of institutions.
This post shares a summary of what happened at this workshop,
including slides, images, and lessons learned.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this event happened just before the widespread effects of the COVID-19
outbreak in the US and Europe. We were glad to see each other, but wouldn’t recommend doing this today.&lt;/em&gt;&lt;/p&gt;
&lt;section id="who-came"&gt;

&lt;p&gt;This was an invite-only event of fifty people, with a cap of three people per
organization. We intentionally invited an even mix of half people who
self-identified as open source maintainers, and half people who identified as
institutional users. We had attendees from academia, small startups, tech
companies, government institutions, and large enterprise. It was surprising
how much we all had in common.
We had attendees from the following organizations:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Anaconda&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Berkeley Institute for Data Science&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blue Yonder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brookhaven National Lab&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Capital One&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chan Zuckerberg Initiative&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Coiled Computing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Columbia University&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;D. E. Shaw &amp;amp; Co.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flatiron Health&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Howard Hughes Medical Institute, Janelia Research Campus&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inria&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kitware&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lawrence Berkeley National Lab&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Los Alamos National Laboratory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MetroStar Systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Microsoft&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NIMH&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NVIDIA&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;National Center for Atmospheric Research (NCAR)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;National Energy Research Scientific Computing (NERSC) Center&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prefect&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Quansight&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Related Sciences&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saturn Cloud&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smithsonian Institution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SymphonyRM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The HDF Group&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;USGS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ursa Labs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="objectives"&gt;
&lt;h1&gt;Objectives&lt;/h1&gt;
&lt;p&gt;The Dask community comes from a broad range of backgrounds.
It’s an odd bunch, all solving very different problems,
but all with a surprisingly common set of needs.
We’ve all known each other on GitHub for several years,
and have a long shared history, but many of us had never met in person.&lt;/p&gt;
&lt;p&gt;In hindsight, this workshop served two main purposes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;It helped us to see that we were all struggling with the same problems
and so helped to form direction and motivate future work&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It helped us to create social bonds and collaborations that help us manage
the day to day challenges of building and maintaining community software
across organizations&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="structure"&gt;
&lt;h1&gt;Structure&lt;/h1&gt;
&lt;p&gt;We met for three days.&lt;/p&gt;
&lt;p&gt;On days 1-2 we started with quick talks from the attendees and followed with
afternoon working sessions.&lt;/p&gt;
&lt;p&gt;Talks were short, around 10-15 minutes
(having only experts in the room meant that we could easily skip the introductory material),
and always had the same structure:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;A brief description of the domain that they’re in and why it’s important&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Example: We look at seismic readings from thousands of measurement devices around
the world to understand and predict catastrophic earthquakes&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How they use Dask to solve this problem&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Example: this means that we need to cross-correlate thousands of very
long timeseries. We use Xarray on AWS with some custom operations.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is wrong with Dask, and what they would like to see improved&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Example: It turns out that our axes labels can grow larger than what
Xarray was designed for. Also, the task graph size for Dask can become a
limitation&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The talks were grouped into six sections:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Workflow and pipelines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deployment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Imaging&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;General data analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance and tooling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Xarray&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We didn’t capture video, but we do have slides from each of the talks below.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="workflow-and-pipelines"&gt;
&lt;h1&gt;1: Workflow and Pipelines&lt;/h1&gt;
&lt;section id="blue-yonder"&gt;
&lt;h2&gt;Blue Yonder&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: ETL Pipelines for Machine Learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Florian Jetter&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also attending:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Nefta Kanilmaz&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lucas Rademaker&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSk2zAnSmzpbz5BgK70mpPmeQeV4h1IkCQh-EU8KXrZFJQGHmlMTuHvln3CmOQVTg/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="360" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="prefect"&gt;
&lt;h2&gt;Prefect&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Prefect + Dask: Parallel / Distributed Workflows&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Chris White, CTO&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="//www.slideshare.net/slideshow/embed_code/key/4wiUwkDHmdzVTW" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen&gt; &lt;/iframe&gt; &lt;div style="margin-bottom:5px"&gt; &lt;strong&gt; &lt;a href="//www.slideshare.net/ChrisWhite249/dask-prefect" title="Dask + Prefect" target="_blank"&gt;Dask + Prefect&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href="https://www.slideshare.net/ChrisWhite249" target="_blank"&gt;Chris White&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;
&lt;/section&gt;
&lt;section id="symphonyrm"&gt;
&lt;h2&gt;SymphonyRM&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Dask and Prefect for Data Science in Healthcare&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenter: Joe Schmid, CTO&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSCDbXrXtrL9vmA0hQ1NNk5DY0-3Azpcf9FbYgjoLuKV79vf_nm7wdUZl1NsL5DZqRmlUTP--u9HM56/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="deployment"&gt;
&lt;h1&gt;2: Deployment&lt;/h1&gt;
&lt;section id="quansight"&gt;
&lt;h2&gt;Quansight&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Building Cloud-based Data Science Platforms with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Dharhas Pothina&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also attending:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dhavide Aruliah&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSZ1fSrkWvzMPlx-f0qk7w2xj_uDp5q-Tg11S9UlynoohZV0VYjdFduDUrAdhptSYfpzFu9Wask1WSN/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="479" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="nvidia-and-microsoft-azure"&gt;
&lt;h2&gt;NVIDIA and Microsoft/Azure&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Native Cloud Deployment with Dask-Cloudprovider&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Jacob Tomlinson, Tom Drabas, and Code Peterson&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT-B1c0r8MWMF8wvW4lNly-qmOCqhFqKdhshXnVql6UVkYQ-aGprY3Du0VH0PJBccOmM84ncw0lDV77/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="inria"&gt;
&lt;h2&gt;Inria&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: HPC Deployments with Dask-Jobqueue&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Loïc Esteve&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://lesteve.github.io/talks/2020-dask-jobqueue-dask-workshop/slides.html" frameborder="0" width="1000" height="800"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="anaconda"&gt;
&lt;h2&gt;Anaconda&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Title: Dask Gateway&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also attending:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eric Dill&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jonathan Helmus&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;style&gt;
    iframe {
        overflow:hidden;
    }
&lt;/style&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="http://jcrist.github.io/talks/dask_summit_2020/slides.html" frameborder="1" width="600" height="355" scrolling="no"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="imaging"&gt;
&lt;h1&gt;3: Imaging&lt;/h1&gt;
&lt;section id="kitware"&gt;
&lt;h2&gt;Kitware&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Scientific Image Analysis and Visualization with ITK&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Matt McCormick&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRz2SV2G-1LEXXCF0n9vugF13s7ABpLDT-yH3WtxQEOjt2FVHE7apl3nQhqkOiLeY9kSzM_Mrs6fJOk/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="id1"&gt;
&lt;h2&gt;Kitware&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Image processing with X-rays and electrons&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Marcus Hanwell&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRT--l76IcSPlIP_N6ClUtm2ECZaxkvIGrBNyyoFmJNQu6kS6CilWoleIMCur2FQ7ZpEkkCsw7UXnRd/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="national-institutes-of-mental-health"&gt;
&lt;h2&gt;National Institutes of Mental Health&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Brain imaging&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: John Lee&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTH1X0cSjozmCDvSQ8CtcxPPYejkLROC_b92W6uwznG5litWq_MwKJzUMnAQi0Prw/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="janelia-howard-hughes-medical-institute"&gt;
&lt;h2&gt;Janelia / Howard Hughes Medical Institute&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Spark, Dask, and FlyEM HPC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Stuart Berg&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSnZ-JgHAoAOUirqmLcI3GaKyC4oVo3vThZZ4oyx8vZjJ66An09JIhbcoy6k7ufTw/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="479" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 206)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="general-data-analysis"&gt;
&lt;h1&gt;4: General Data Analysis&lt;/h1&gt;
&lt;section id="brookhaven-national-labs"&gt;
&lt;h2&gt;Brookhaven National Labs&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Dask at DOE Light Sources&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Dan Allan&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRd8PVHjW7Umjo1rUjR7XWDT95CcEoE_3jH-ceDHsN_lMv_4M2qnlFiFvtMl9SX0Eb1EFQTGkzUWCDy/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="d-e-shaw-group"&gt;
&lt;h2&gt;D.E. Shaw Group&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Dask at D.E. Shaw&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Akihiro Matsukawa&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="id2"&gt;
&lt;h2&gt;Anaconda&lt;/h2&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: INFO/1 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 220); &lt;em&gt;&lt;a href="#id2"&gt;backlink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Duplicate implicit target name: “anaconda”.&lt;/p&gt;
&lt;/aside&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Dask Dataframes and Dask-ML summary&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTs6nNsMkV92Uj4QUns1VB8pKlKSsRgUAGwvcbTOPqMazSAhxtawVNgb04YmHVFmb0z8-no-cdS8mE8/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 227)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="performance-and-tooling"&gt;
&lt;h1&gt;5: Performance and Tooling&lt;/h1&gt;
&lt;section id="berkeley-institute-for-data-science"&gt;
&lt;h2&gt;Berkeley Institute for Data Science&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Numpy APIs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Sebastian Berg&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="ursa-labs"&gt;
&lt;h2&gt;Ursa Labs&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Arrow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Joris Van den Bossche&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQY3ubjCFkMcU_b8p2xmuXN8VVR1BxxSWZDe5Vy-ftnH2CstZILvTo2pRBv5R_VDk85rNjVoWew2AJl/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="nvidia"&gt;
&lt;h2&gt;NVIDIA&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: RAPIDS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Keith Kraus&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also attending: Mike Beaumont, Richard Zamora&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQiNrupzQlSqsu95AAHqIhU1V_iVUav_0WlIp4dXdSE6Izze1BL8mkFbIzg7p8CndEi9bjWaC2OVlyu/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="id3"&gt;
&lt;h2&gt;NVIDIA&lt;/h2&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: INFO/1 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 249); &lt;em&gt;&lt;a href="#id3"&gt;backlink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Duplicate implicit target name: “nvidia”.&lt;/p&gt;
&lt;/aside&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: UCX&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Ben Zaitlen&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRU-vsXsnXgeLKdmtWZkZVV_-mOojsNesCbQKJgmWkwSjxj5ZdwkmS6X4tOt3HpFrIOfNROSlV_8l84/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 256)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="xarray"&gt;
&lt;h1&gt;6: Xarray&lt;/h1&gt;
&lt;section id="usgs-and-ncar"&gt;
&lt;h2&gt;USGS and NCAR&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Dask in Pangeo&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Rich Signell and Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vStqGiQy6pJDYhRgF-BZylQussINK5BGlhnidOVCUECo_ebYqRH9cSY4e-2z7BfFFvTfvkqq_M1jXBX/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="366" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="lbnl"&gt;
&lt;h2&gt;LBNL&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Accelerating Experimental Science with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Matt Henderson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://drive.google.com/file/d/1DVVzYmhkDhO2xs0tmxpPCkxx5c4o63bO/view"&gt;Slides&lt;/a&gt; - Fill too large to preview&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="lanl"&gt;
&lt;h2&gt;LANL&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Title: Seismic Analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Presenters: Jonathan MacCarthy&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSWAgKLxt1tBZxXjQfIRQNFPvAMFYZ-z0hkMy7euPnOHwO9pomH_gM8cKUTKXA68w/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="600" height="404" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 278)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="unstructured-time"&gt;
&lt;h1&gt;Unstructured Time&lt;/h1&gt;
&lt;p&gt;Having rapid-fire talks in the morning, followed by unstructured time in the
afternoon, was a productive combination. Below you’ll see pictures of
geo-scientists and quants talking about the same challenges, and library
maintainers from Pandas/Arrow/RAPIDS/Dask all working together on joint
solutions.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://pbs.twimg.com/media/ERykEc9XUAEFq-L?format=jpg&amp;name=large"
     width="40%"&gt;
&lt;img src="https://pbs.twimg.com/media/ERzEcEeXkAU35sg?format=jpg&amp;name=large"
    width="40%"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://pbs.twimg.com/media/ERyz7B5X0AIrDkn?format=jpg&amp;name=large"
    width="40%"&gt;
&lt;img src="https://pbs.twimg.com/media/ERzXhHnWAAE_zDA?format=jpg&amp;name=large"
    width="40%"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://pbs.twimg.com/media/ERz3GDgXsAcE6Id?format=jpg&amp;name=large"
    width="40%"&gt;
&lt;img src="https://pbs.twimg.com/media/ERz4ur2WkAAGJwm?format=jpg&amp;name=large"
    width="40%"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://pbs.twimg.com/media/ER0sZceUYAAF5fW?format=jpg&amp;name=large"
    width="40%"&gt;
&lt;img src="https://pbs.twimg.com/media/ER0yY2rX0AEFfXi?format=jpg&amp;name=large"
    width="40%"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://pbs.twimg.com/media/ERyz98YWAAAmJbE?format=jpg&amp;name=large"
    width="40%"&gt;
&lt;img src="https://pbs.twimg.com/media/ERz5S2dWoAEhFHc?format=jpg&amp;name=large"
    width="40%"&gt;&lt;/p&gt;
&lt;p&gt;We would recommend this mix of structured talks and unstructured time to
other technically diverse groups in the future. Engagement and productivity
were high throughout the workshop.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 315)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;Dask’s strength comes from this broad community of stakeholders.&lt;/p&gt;
&lt;p&gt;An early technical focus on simplicity and pragmatism allowed the project to be
quickly adopted within many different domains. As a result, the practitioners
within these domains are largely the ones driving the project forward today.
This Community Driven Development brings an incredible diversity of technical
and cultural challenges and experience that force the project to quickly evolve
in a way that is constrained towards pragmatism.&lt;/p&gt;
&lt;p&gt;There is still plenty of work to do.
Short term this workshop brought up many technical challenges that are shared
by all (simpler deployments, scheduling under task constraints, active memory
management). Longer term we need to welcome more people into this community,
both by increasing the diversity of domains, and the diversity of individuals
(the vast majority of attendees were white men in their thirties from the US
and western Europe).&lt;/p&gt;
&lt;p&gt;We’re in a good position to effect this change.
Dask’s recent growth has captured the attention of many different institutions.
Now is a critical time to be intentional about the project’s growth, to make
sure that the project and community continue to reflect a broad and ethical set
of principles.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/04/28/dask-summit.md&lt;/span&gt;, line 340)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;section id="sponsors"&gt;
&lt;h2&gt;Sponsors&lt;/h2&gt;
&lt;p&gt;Without the support of our sponsors, this workshop would not have been possible.
Thanks to Anaconda, Capital One and NVIDIA for their support and generous
donations toward this event.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="organizers"&gt;
&lt;h2&gt;Organizers&lt;/h2&gt;
&lt;p&gt;Thank you very much to the organizers who took time from their busy schedules
and worked so hard to put together this event.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Brittany Treadway (Capital One)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keith Kraus (NVIDIA)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin (Coiled Computing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mike Beaumont (NVIDIA)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mike McCarty (Capital One)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Neia Woodson (Capital One)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jake Schmitt (Capital One)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist (Anaconda)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/04/28/dask-summit/"/>
    <summary>In late February members of the Dask community gathered together in Washington, DC.
This was a mix of open source project maintainers
and active users from a broad range of institutions.
This post shares a summary of what happened at this workshop,
including slides, images, and lessons learned.</summary>
    <published>2020-04-28T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/01/14/estimating-users/</id>
    <title>Estimating Users</title>
    <updated>2020-01-14T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;People often ask me &lt;em&gt;“How many people use Dask?”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As with any non-invasive open source software, the answer to this is
&lt;em&gt;“I don’t know”&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;There are many possible proxies for user counts, like downloads, GitHub stars,
and so on, but most of them are wildly incorrect.
As a project maintainer who tries to find employment for other maintainers,
I’m incentivized to take the highest number I can find,
but that is somewhat dishonest.
That number today is in the form of this likely false statement.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Dask has 50-100k daily downloads.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This number comes from looking at the Python Package Index (PyPI)
(image from &lt;a class="reference external" href="https://pypistats.org"&gt;pypistats.org&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/dask-pypi-downloads-total.png"&gt;&lt;img src="/images/dask-pypi-downloads-total.png" width="100%" alt="Total Dask downloads"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is a huge number, but is almost certainly misleading.
Common sense tells us that there are not 100k new Dask users every day.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/01/14/estimating-users.md&lt;/span&gt;, line 31)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="bots-dominate-download-counts"&gt;
&lt;h1&gt;Bots Dominate Download Counts&lt;/h1&gt;
&lt;p&gt;If you dive in more deeply to numbers like these you will find that they are
almost entirely due to automated processes. For example, of Dask’s 100k new
users, a surprising number of them seem to be running Linux.&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/linux-reigns.png"&gt;&lt;img src="/images/linux-reigns.png" width="100%" alt="Linux dominates download counts"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;While it’s true that Dask is frequently run on Linux because it is a
distributed library, it would be odd to see every machine in that deployment
individually &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pip&lt;/span&gt; &lt;span class="pre"&gt;install&lt;/span&gt; &lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt;. It’s more likely that these downloads are the
result of automated systems, rather than individual users.&lt;/p&gt;
&lt;p&gt;Anecdotally, if you get access to fine-grained download data, you find that a
small set of IPs dominates download counts. These tend to come from
continuous integration services like Travis and Circle, from AWS, or from a
few outliers around the world (sometimes people in China try to mirror
everything).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/01/14/estimating-users.md&lt;/span&gt;, line 50)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="check-windows"&gt;
&lt;h1&gt;Check Windows&lt;/h1&gt;
&lt;p&gt;So, in an effort to avoid this effect we start looking at just Windows
downloads.&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/dask-windows-downloads.png"&gt;&lt;img src="/images/dask-windows-downloads.png" width="100%" alt="Dask Monthly Windows Downloads"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The magnitudes here seem more honest to me. These monthly numbers translate to
about 1000 downloads a day (perhaps multiplied by two or three for OSX and
Linux), which seems more in line with my expectations.&lt;/p&gt;
&lt;p&gt;However, even this is strange. The structure doesn’t match my personal
experience. Why the big change in adoption in 2018? What is the big spike in
2019? Maintainers did not notice a significant jump in users at those times.
Instead, we’ve experienced smooth, continuous growth in adoption
(this is what most long-term software growth looks like).
It’s also odd that there hasn’t been continued growth since 2018; anecdotally,
Dask seems to have grown steadily over the last few years. Phase transitions
like these don’t match observed reality (at least insofar as I have personally
observed it).&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://nbviewer.jupyter.org/gist/mrocklin/ef6f9b6a649a6d78b2221d8fdeea5f2a"&gt;&lt;em&gt;Notebook for plot available here&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/01/14/estimating-users.md&lt;/span&gt;, line 74)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="documentation-views"&gt;
&lt;h1&gt;Documentation views&lt;/h1&gt;
&lt;p&gt;My favorite metric is looking at weekly unique users to documentation.&lt;/p&gt;
&lt;p&gt;This is an over-estimate of users because many people look at the documentation
without using the project. This is also an under-estimate because many users
don’t consult our documentation on a weekly basis (oh I wish).&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/dask-weekly-users.png"&gt;&lt;img src="/images/dask-weekly-users.png" width="100%" alt="Dask weekly users on documentation"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This growth pattern matches my expectations and my experience with maintaining
a project that has steadily gained traction over several years.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Plot taken from Google Analytics&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/01/14/estimating-users.md&lt;/span&gt;, line 89)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dependencies"&gt;
&lt;h1&gt;Dependencies&lt;/h1&gt;
&lt;p&gt;It’s also important to look at dependencies of a project. For example many
users in the earth and geo sciences use Dask through another project,
&lt;a class="reference external" href="https://xarray.pydata.org"&gt;Xarray&lt;/a&gt;. These users are much less likely to touch
Dask directly, but often use Dask as infrastructure underneath the Xarray
library. We should probably add in something like half of Xarray’s users as
well.&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/xarray-weekly-users.png"&gt;&lt;img src="/images/xarray-weekly-users.png" width="100%" alt="Xarray weekly users on documentation"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Plot taken from Google Analytics, supplied by &lt;a class="reference external" href="https://joehamman.com/"&gt;Joe Hamman&lt;/a&gt; from Xarray&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/01/14/estimating-users.md&lt;/span&gt;, line 102)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="summary"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;Dask has somewhere between 100k new users every day (going by download
counts) and something like 10k users total (going by weekly unique IPs). The
10k number sounds more likely to me, perhaps bumping up to 15k once we account
for dependencies. The fact is, though, that no one really knows.&lt;/p&gt;
&lt;p&gt;Judging the use of community-maintained OSS is important as we try to value
its impact on society. It is also a fundamentally difficult problem.
I hope that this post helps to highlight how these numbers may be misleading,
and encourages us all to think more deeply about estimating impact.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/01/14/estimating-users/"/>
    <summary>People often ask me “How many people use Dask?”</summary>
    <published>2020-01-14T00:00:00+00:00</published>
  </entry>
</feed>
