Dask Working Notes - Posted in 2021

Reflections on one year as the Dask life science fellow

2021-12-15T00:00:00+00:00

Genevieve Buckley was hired as a Dask Life Science Fellow in 2021 funded by CZI. The goal was to improve Dask, with a specific focus on the life science community. This blogpost contains another progress update, and some personal reflections looking back over this year.

Progress update
Personal reflections
What’s next in Dask?

Progress update

A previous progress update for February to September 2021 is available here. Read on for a progress update for the period September to December 2021.

To summarize, between September and December 2021 inclusive, there were:

32 merged pull requests across 7 repositories (dask, distributed, dask-image, dask-tutorial, ITK, napari, and napari.github.io)
8 pending pull requests
1 new dask-image release
1 Dask tutorial run, and assisted with a second tutorial.
4 new Dask blogposts published (five, if we count this one)

Read on for a more detailed description of special projects within this time.

Dask stale issues sprint

In two weeks I was able to:

close 117 stale issues, and
identify another 25 potential easy wins for the maintainer team to investigate further.

Lots of other people did work around the same time, following up on old pull requests and other maintenance work. The sprint was very successful overall.

Dask user survey results analysis

In September I analyzed the results from the 2021 Dask user survey. This was a really fun task. Because we asked a lot more questions in 2021 (18 new questions, 43 questions in total) there was a lot more data to dig into, compared with previous years. You can read the full details about it here.

The biggest benefit from this work is that now we can use this data to prioritize improvements to the documentation and examples. The top two user requests are for more documentation and more examples from their industry. But it wasn’t until this year that we started asking what industries people worked in, so we can target new narrative documentation to the areas that need it most (geoscience, life science, and finance).

ITK compatibility with Dask

I implemented pickle serialization for itk images (ITK PR #2829). This should be one of the last major pieces of the puzzle needed to make ITK images compatible with Dask. It builds on earlier work by Matt McCormick and John Kirkham (you can read a blog post about their earlier work here).

Better cross-compatibility for Dask with other projects was a major goal of mine, so this is an important piece of work. I outline the next steps in the section What’s next in Dask?

Improve rechunking

I implemented PR #8124 to fix a bug where reshaping a Dask array can produce an output array with chunks that are much too large to fit in memory. Feedback from the life science user survey indicates that improving Dask’s performance around rechunking is a priority. This work helps to address that.

High level graph work

A major piece of work earlier this year was introducing high level graphs for array slicing and array overlap operations. That is a big effort requiring a lot of ongoing work. PR #8467 tackles one of the next steps for this work.

Find objects function for dask-image

I implemented a find_objects function for dask-image in PR #240. This implementation does not need to know the maximum label number ahead of time, a substantial improvement over the previous attempt. This is a major step forward, because it removes a major blocker to introducing scikit-image like regionprops functionality.

Blogposts

Dask blogposts published between September and December 2021 include:

Choosing good chunk sizes in Dask
- This blogpost addresses some very common concerns and questions about using Dask. I’m very pleased with this article. Thanks to several thoughtful reviewers, the final work is much stronger and more comprehensive than the twitter thread that inspired it.
- It’s also high impact work. In the Dask survey the most common request is for more documentation, and this content helps to address that. Twitter analytics also show much higher engagement with this content than for other similar tweets, indicating a demand in the community for this type of explanation.
Mosaic Image Fusion (co-authored with Volker Hilsenstein and Marvin Albert)
- This blogpost was several months in the making (started in mid-August and published in December). It’s fantastic to have people sharing some of the very cool work they do with Dask on real world problems.
CZI EOSS Update
- This blogpost shares with the community an interim progress update provided to CZI.
2021 Dask user survey results
- Discussed in more detail above, the analysis results from the Dask User Survey were published in September 2021.

Tutorials

I presented a Dask tutorial at the ResBaz Sydney online conference on the 25th of November 2021. Thanks to the ResBaz organisers and to David McFarlane, Svetlana Tkachenko, and Oksana Tkachenko for monitoring the chat for questions on the day.
Naty Clementi ran a Dask tutorial for the Women Who Code DC meetup on the 4th of November 2021. I assisted Naty, mostly by monitoring questions in the chat.

Personal reflections

Reflecting back over the whole year, there were some things that worked well and some things that were less successful.

Highlights from this year

My personal highlights include:

ITK + Dask integration work (discussed in more detail above).
A find objects function for dask-image (discussed in more detail above).
Visualization work, because it’s very high impact. We’re solving issues raised by life science groups, but the improved tools benefit EVERYONE who uses Dask.
This bugfix from dask PR #7391, because this single change fixed problems in four places at once (scikit-image, dask-ml, xgcm/xhistogram, and the cupy dask tests).
Community building, conferences, and engagement. Lots of effort went into events over this year, and it’s certainly paid dividends.

What worked well

Dask stale issues sprint

This was useful for the project, as well as useful for me. Sorting through old issues was an incredibly effective way to get familiar with who the experts are for particular topics. It would have been even better if this happened in the first few months of working on Dask, instead of the last few months.
It’s been suggested that one good way to gain familiarity is spending 6 months full time managing the issue tracker. Maybe that’s true, but the much shorter stale issue sprint was a very efficient way of getting a lot of the same benefits in a short space of time. I’d recommend it for new maintainers or triage team members.

Community building events

We had a very successful year in terms of community building and events. This included tutorials, workshops, conferences, and community outreach. Summary of major events:

Led a Dask tutorial at ResBaz Sydney 2021 in November.
Co-led a half-day tutorial on napari and Dask at the Light Microscopy Australia Meeting in August.
SciPy 2021 presentation Scaling Science: leveraging Dask for life sciences in July.
Organized the Dask Life Science workshop at the Dask Summit in May 2021. The life science workshop included 15 pre-recorded talks, and 3 interactive discussions.
Co-organised the Dask Down Under workshop for the Dask Summit in May 2021. Dask Down Under contained 5 talks, 2 tutorials, 1 panel discussion, and 1 meet and greet networking event.
Expert panelist at the VIS2021 symposium in February.

Visualization work

This has been very high impact work, and I’m pleased with what we’ve achieved. Improved tools for visualization were requested by users in our survey of the life science community. This was a high priority, because improvements to visualization tools benefit EVERYONE who uses Dask.

What didn’t work so well

Technical resources

We never really solved the problem of finding someone I could go to with technical questions. I did have people to ask about some specific projects, but in most cases I didn’t have a good way to direct questions to the right people. This is a challenging problem, especially because most Dask maintainers and contributors have full time jobs doing other things too. In my opinion, this negatively impacted the work and what we were able to achieve.

Being added to the @dask/maintenance team

There’s no point getting notifications if you don’t have GitHub permissions to do anything about them. In future I think we should add only people with at least triage or write permissions to the github teams.

Real time interaction

We tried out “Ask a maintainer” office hours for the life science community, but they were poorly attended, so we cancelled this.
We added some “Dask social chat” events to the calendar, but they were not very well attended outside of the first few. Most often, zero people attended. (There is another social chat for the Americas/Europe time zones, which is at a more convenient time for most people and might be more popular.)

Slack

Slack works well to DM specific people to set up meeting times, etc, but the public channels didn’t end up being very useful for me personally.

Lack of integration with other project teams

You can only get so much done as a solo developer. We had hoped that I would naturally end up working with teams from several different projects, but this didn’t really end up being the case. The napari project is an exception to this, and that relationship was well established before starting work for Dask. Perhaps there’s something more we could have done here to facilitate more interaction.

What’s next for Genevieve?

Genevieve will be starting a new job next year. You can find her on GitHub @GenevieveBuckley.

What’s next in Dask?

Lots of stuff has happened in Dask, but there is still lots left to do. Here is a summary of the next steps for several projects. We’d love it if new people would like to take up the torch and contribute to any of these projects.

ITK image compatibility with Dask

The next steps for the ITK + Dask project require ITK release candidate 5.3rc3 or above to become available (likely early in 2022).
When the release is available, the next step is to try to re-run the code from the original ITK blogpost.
If there’s still work to be done we’ll need to open issues for the remaining blockers. And if it all works well, we’d like someone to write a second ITK + Dask blogpost to publicize the new functionality.

Improving performance around rechunking

More performance improvements related to rechunking are required (see #7950 and #7980).

High level graph work for arrays and slicing

The high level graph work for slicing and overlapping arrays has a number of next steps. Ian Rose has written an excellent summary here. Briefly, the cull and get_output_keys methods must be implemented, then low level fusion and optimizations can be done.

Relevant links:

Implement cull method for ArrayOverlapLayer #7789
Implement get_output_keys method for ArrayOverlapLayer #7791
Array slicing HighLevelGraph layer #7655

Documentation

Dask needs better documentation for high level graphs. Both user documentation and developer documentation are required.
At some future point, it might be worthwhile integrating blogpost content from Choosing good chunk sizes in Dask into the main Dask documentation, for better discoverability.

Mosaic Image Fusion

2021-12-01T00:00:00+00:00

This blogpost shows a case study where a researcher uses Dask for mosaic image fusion. Mosaic image fusion is when you combine multiple smaller images taken at known locations and stitch them together into a single image with a very large field of view. Full code examples are available on GitHub from the DaskFusion repository: VolkerH/DaskFusion

The problem

Image mosaicing in microscopy

In optical microscopy, a single field of view captured with a 20x objective typically has a diagonal on the order of a few 100 μm (exact dimensions depend on other parts of the optical system, including the size of the camera chip). A typical sample slide has a size of 25mm by 75mm. Therefore, when imaging a whole slide, one has to acquire hundreds of images, typically with some overlap between individual tiles. With increasing magnification, the required number of images increases accordingly.

To obtain an overview one has to fuse this large number of individual image tiles into a large mosaic image. Here, we assume that the information required for positioning and alignment of the individual image tiles is known. In the example presented here, this information is available as metadata recorded by the microscope, namely the microscope stage position and the pixel scale. Alternatively, this information could also be derived from the image data directly, e.g. through a registration step that matches corresponding image features in the areas where tiles overlap.

The solution

The array that can hold the resulting mosaic image will often have a size that is too large to fit in RAM, therefore we will use Dask arrays and the map_blocks function to enable out-of-core processing. The map_blocks function will process smaller blocks (a.k.a chunks) of the output array individually, thus eliminating the need to hold the whole output array in memory. If sufficient resources are available, dask will also distribute the processing of blocks across several workers, thus we also get parallel processing for free, which can help speed up the fusion process.

Typically whenever we want to join Dask arrays, we use Stack, Concatenate, and Block. However, these are not good tools for mosaic image fusion, because:

The image tiles will be overlapping,
Tiles may not be positioned on an exact grid and will typically also have slight rotations as the alignment of stage and camera is not perfect. In the most general case, for example in panoramic photo mosaics, individual image tiles could be arbitrarily rotated or skewed.

The starting point for this mosaic prototype was some code that reads in the stage metadata for all tiles and calculates an affine transformation for each tile that would place it at the correct location in the output array.

The image below shows preliminary work placing mosaic image tiles into the correct positions using the napari image viewer. Shown here is a small example with 63 image tiles.

And here is an animation of placing the individual tiles.

To leverage processing with Dask we created a fuse function that generates a small block of the final mosaic and is invoked by map_blocks for each chunk of the output array. On each invocation of the fuse function map_blocks passes a dictionary (block_info). From the Dask documentation:

Your block function gets information about where it is in the array by accepting a special block_info or block_id keyword argument.

The basic outline of the fuse function of the mosaic workflow is as follows. For each chunk of the output array:

Determine which source image tiles intersect with the chunk.
Adjust the image tiles’ affine transformations to take the offset of the chunk within the array into account.
Load all intersecting image tiles and apply their respective adjusted affine transformation to map them into the chunk.
Blend the tiles using a simple maximum projection.
Return the blended chunk.

Using a maximum projection to blend areas with overlapping tiles can lead to artifacts such as ghost images and visible tile seams, so you would typically want to use something more sophisticated in production.
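The outline above can be sketched in a few lines. This is a simplified illustration and not the DaskFusion implementation: it assumes tiles that are already in memory and are placed by integer row/column offsets only (no rotation or general affine transformation), and the names TILES, MOSAIC_SHAPE and fuse are hypothetical. It does show the core pattern of map_blocks calling a function once per output chunk and using block_info to find out where that chunk sits in the mosaic.

```python
import numpy as np
import dask.array as da

# Hypothetical input: four overlapping 40x40 tiles, each described by
# (row_offset, col_offset, pixel_data). Real tiles would be loaded from disk
# and positioned with an affine transformation derived from stage metadata.
TILES = [
    (0, 0, np.full((40, 40), 1, dtype=np.uint8)),
    (0, 30, np.full((40, 40), 2, dtype=np.uint8)),
    (30, 0, np.full((40, 40), 3, dtype=np.uint8)),
    (30, 30, np.full((40, 40), 4, dtype=np.uint8)),
]
MOSAIC_SHAPE = (70, 70)


def fuse(block_info=None):
    """Build one chunk of the mosaic (translation-only simplification)."""
    # Where this chunk sits in the full output array: [(start, stop), ...]
    (r0, r1), (c0, c1) = block_info[None]["array-location"]
    out = np.zeros((r1 - r0, c1 - c0), dtype=np.uint8)
    for tile_r, tile_c, data in TILES:
        # Intersection of the tile and the chunk, in mosaic coordinates
        rs, re = max(r0, tile_r), min(r1, tile_r + data.shape[0])
        cs, ce = max(c0, tile_c), min(c1, tile_c + data.shape[1])
        if rs >= re or cs >= ce:
            continue  # this tile does not touch the chunk
        patch = data[rs - tile_r:re - tile_r, cs - tile_c:ce - tile_c]
        view = out[rs - r0:re - r0, cs - c0:ce - c0]
        # Blend with a simple maximum projection, written in place
        np.maximum(view, patch, out=view)
    return out


# No input arrays are passed, so the output chunks must be given explicitly.
mosaic = da.map_blocks(fuse, chunks=((35, 35), (35, 35)), dtype=np.uint8)
result = mosaic.compute()
print(result.shape)
```

Only the tiles that intersect a given chunk contribute to it, which is what keeps the memory needed per task small.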

Results

For datasets with many image tiles (~500-1000 tiles), we could speed up the mosaic generation from several hours to tens of minutes using this Dask based method (compared to a previous workflow using ImageJ plugins running on the same workstation). Due to Dask’s ability to handle data out-of-core and chunked array storage using zarr it is also possible to run the fusion on hardware with limited RAM.

Finally, we have the final mosaic fusion result.

Code

Code relating to this mosaic image fusion project can be found in the DaskFusion GitHub repository here: VolkerH/DaskFusion

There is a self-contained example available in this notebook, which downloads reduced-size example data to demonstrate the process.

What’s next?

Currently, the DaskFusion code is a proof of concept for single-channel 2D images, using a simple maximum projection to blend the tiles in overlapping areas. It is not production code. However, the same principle can be used for fusing multi-channel image volumes, such as from Light-Sheet data, if the tile chunk intersection calculation is extended to higher-dimensional arrays. Such larger datasets will benefit even more from leveraging dask, as the processing can be distributed across multiple nodes of an HPC cluster using dask jobqueue.

Also see

Marvin’s lightning talk on multi-view image fusion: 15 minute video available here on YouTube

The GitHub repository MVRegFus that Marvin talks about in the video is available here: m-albert/MVRegFus

The napari-lazy-openslide visualization plugin by Trevor Manz: “An experimental plugin to lazily load multiscale whole-slide tiff images with openslide and dask.”

For further information on alternative approaches to image stitching:

ASHLAR: Alignment by Simultaneous Harmonization of Layer / Adjacency Registration
Microscopy Image Stitching Tool (MIST)
The m2stitch python package by Yohsuke T. Fukai: “Provides robust stitching of tiled microscope images on a regular grid” (based on the MIST algorithm)

Acknowledgements

This computational work was done by Volker Hilsenstein, in conjunction with Marvin Albert. Volker Hilsenstein is a scientific software developer at EMBL in Theodore Alexandrov’s lab with a focus on spatial metabolomics and bio-image analysis.

The sample images were prepared and imaged by Mohammed Shahraz from the Alexandrov lab at EMBL Heidelberg.

Genevieve Buckley and Volker Hilsenstein wrote this blogpost.

Choosing good chunk sizes in Dask

2021-11-02T00:00:00+00:00

Confused about choosing a good chunk size for Dask arrays?

Array chunks can’t be too big (we’ll run out of memory), or too small (the overhead introduced by Dask becomes overwhelming). So how can we get it right?

It’s a two step process:

First, start by choosing a chunk size similar to data you know can be processed entirely within memory (i.e. without Dask), using these rough rules of thumb.
Then, watch the Dask dashboard task stream and worker memory plots, and adjust if needed. Here are the signs to watch out for.

What are Dask array chunks?
Too small is a problem
Too big is also a problem
Choosing an initial chunk size
- Rough rules of thumb
- Chunks should be aligned with array storage on disk
Using the Dask dashboard
- What to watch for on the dashboard
Rechunking arrays
Unmanaged memory
Thanks for reading

What are Dask array chunks?

Dask arrays are big structures, made out of many small chunks. Typically, each small chunk is an individual numpy array, and they are arranged together to make a much larger Dask array.

You can find more information about Dask array chunks on this page of the documentation: https://docs.dask.org/en/latest/array-chunks.html

How do I know what chunks my array has?

If you have a Dask array, you can use the chunksize or chunks attributes to see information about the chunks. You can also visualize this with the Dask array HTML representation.

arr.chunksize shows the largest chunk size. For arrays where you expect roughly uniform chunk sizes, this is a good way to summarize chunk size information.

arr.chunks shows fully explicit sizes of all chunks along all dimensions within the Dask array (see item 3 here). This is more verbose, and is a good choice with arrays that have irregular chunks.
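As a small illustration of the difference between the two attributes, here is an array whose shape is deliberately not a multiple of the chunk shape (the array shape and chunk shape are arbitrary choices for this example):

```python
import dask.array as da

# A 10 x 7 array split into chunks of at most 4 x 3 elements
arr = da.ones((10, 7), chunks=(4, 3))

print(arr.chunksize)  # (4, 3): the largest chunk along each dimension
print(arr.chunks)     # ((4, 4, 2), (3, 3, 1)): every chunk length, per dimension
print(arr.numblocks)  # (3, 3): number of chunks along each dimension
```

The trailing 2 and 1 in arr.chunks are the smaller edge chunks, which arr.chunksize does not reveal.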

Too small is a problem

If array chunks are too small, it’s inefficient. Why is this?

Using Dask introduces some amount of overhead for each task in your computation. This overhead is the reason the Dask best practices advise you to avoid too-large graphs. This is because if the amount of actual work done by each task is very tiny, then the percentage of overhead time vs useful work time is not good.

Typically, the Dask scheduler takes 1 millisecond to coordinate a single task. That means we want the computation time for each task to be comparatively larger, eg: seconds instead of milliseconds.

It might be hard to understand this intuitively, so here’s an analogy. Let’s imagine we’re building a house. It’s a pretty big job, and if there were only one worker it would take much too long to build. So we have a team of workers and a site foreman. The site foreman is equivalent to the Dask scheduler: their job is to tell the workers what tasks they need to do.

Say we have a big pile of bricks to build a wall, sitting in the corner of the building site. If the foreman (the Dask scheduler) tells workers to go and fetch a single brick at a time, then bring each one to where the wall is being built, you can see how this is going to be very slow and inefficient! The workers are spending most of their time moving between the wall and the pile of bricks. Much less time is going towards doing the actual work of mortaring bricks onto the wall.

Instead, we can do this in a smarter way. The foreman (Dask scheduler) can tell the workers to go and bring one full wheelbarrow load of bricks back each time. Now workers are spending much less time moving between the wall and the pile of bricks, and the wall will be finished much quicker.

Too big is also a problem

If the Dask array chunks are too big, this is also bad. Why? Chunks that are too large are bad because then you are likely to run out of working memory. You may see out of memory errors happening, or you might see performance decrease substantially as data spills to disk.

When too much data is loaded in memory on too few workers, Dask will try to spill data to disk instead of crashing. Spilling data to disk makes things run very slowly, because of all the extra read/write operations to disk. Things don’t just get a little bit slower, they get a LOT slower, so it’s smart to watch out for this.

To watch out for this, look at the worker memory plot on the Dask dashboard. Orange bars are a warning you are close to the limit, and gray means data is being spilled to disk - not good! For more tips, see the section on using the Dask dashboard below.

Choosing an initial chunk size

Rough rules of thumb

If you already created a prototype, which may not involve Dask at all, using a small subset of the data you intend to process, you’ll have a clear idea of what size of data can be processed easily for this workflow. You can use this knowledge to choose similar sized chunks in Dask.
Some people have observed that chunk sizes below 1MB are almost always bad. Chunk sizes between 100MB and 1GB are generally good. Going over 1 or 2GB means you have a really big dataset and/or a lot of memory available per core.
Upper bound: Avoid too large task graphs. More than 10,000 or 100,000 chunks may start to perform poorly.
Lower bound: To get the advantage of parallelization, you need the number of chunks to at least equal the number of worker cores available (or better, the number of worker cores times 2). Otherwise, some workers will stay idle.
The time taken to compute each task should be much larger than the time needed to schedule the task. The Dask scheduler takes roughly 1 millisecond to coordinate a single task, so a good task computation time would be measured in seconds (not milliseconds).
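To check a candidate chunking against these rules of thumb before running anything, you can build the array lazily and look at its chunk statistics. This is a rough sketch: the array shape, dtype and chunk shape are arbitrary, and the overhead figure simply multiplies the number of chunks by the roughly 1 millisecond per task quoted above, counting one task per chunk.

```python
import math

import numpy as np
import dask.array as da

# Lazy array: nothing is allocated until we compute
arr = da.zeros((20_000, 20_000), chunks=(4_000, 4_000), dtype=np.float64)

# Size of the largest chunk, in megabytes (1 MB = 1e6 bytes here)
chunk_mb = math.prod(arr.chunksize) * arr.dtype.itemsize / 1e6

# Total number of chunks in the array
n_chunks = arr.npartitions

# Back-of-envelope scheduler overhead for one task per chunk
overhead_seconds = n_chunks * 1e-3

print(chunk_mb)   # 128.0
print(n_chunks)   # 25
print(overhead_seconds)
```

Here each chunk is 128MB, inside the 100MB to 1GB range, and 25 chunks is enough to keep a machine with up to a dozen or so cores busy.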

Chunks should be aligned with array storage on disk

If you are reading data from disk, the storage structure will inform what shape your Dask array chunks should be. For best performance, choose chunks that are well aligned with the way data is stored.

From the Dask best practices on how to orient your chunks:

When reading data you should align your chunks with your storage format. Most array storage formats store data in chunks themselves. If your Dask array chunks aren’t multiples of these chunk shapes then you will have to read the same data repeatedly, which can be expensive. Note though that often storage formats choose chunk sizes that are much smaller than is ideal for Dask, closer to 1MB than 100MB. In these cases you should choose a Dask chunk size that aligns with the storage chunk size and that every Dask chunk dimension is a multiple of the storage chunk dimension.

Some examples of data storage structures on disk include:

A HDF5 or Zarr array. The size and shape of chunks/blocks stored on disk should align well with the Dask array chunks you select.
A folder full of tiff files. You might decide that each tiff file should become a single chunk in the Dask array (or that multiple tiff files should be grouped into a single chunk).
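The "one file per chunk" idea can be sketched with dask.delayed. To keep the example self-contained, load_tile below is a hypothetical stand-in that fabricates an image instead of reading a tiff file; in real code it would call an image reader, and the shape and dtype would need to match the files.

```python
import numpy as np
import dask
import dask.array as da

TILE_SHAPE = (64, 64)


def load_tile(index):
    # Hypothetical stand-in for reading the index-th tiff file from disk
    return np.full(TILE_SHAPE, index, dtype=np.uint16)


# One lazy load per file; nothing is read until compute time
lazy_tiles = [dask.delayed(load_tile)(i) for i in range(4)]
arrays = [
    da.from_delayed(tile, shape=TILE_SHAPE, dtype=np.uint16)
    for tile in lazy_tiles
]

# Stack along a new leading axis: each file becomes exactly one chunk
stack = da.stack(arrays)
print(stack.shape)   # (4, 64, 64)
print(stack.chunks)  # ((1, 1, 1, 1), (64,), (64,))
```

If single files are too small to make good chunks, a follow-up rechunk along the first axis would group several files into one chunk.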

Using the Dask dashboard

The second part of choosing a good chunk size is monitoring the Dask dashboard to see if you need to make any adjustments.

If you’re not very familiar with the Dask dashboard, or you just sometimes forget where to find certain dashboard plots (like the worker memory plot), then you’ll probably enjoy these quick video tutorials:

We recommend always having the dashboard up when you’re working with Dask. It’s a fantastic way to get a sense of what’s working well, or poorly, so you can make adjustments.

What to watch for on the dashboard

Bad signs to watch out for include:

Lots of white space in the task stream plot is a bad sign. White space means nothing is happening. Chunks may be too small.
Lots and lots of red in the task stream plot is a bad sign. Red means worker communication. Dask workers need some communication, but if they are doing almost nothing except communication then there is not much productive work going on.
On the worker memory plot, watch out for orange bars which are a sign you are getting close to the memory limit. Chunks may be too big.
On the worker memory plot, watch out for grey bars which mean data is being spilled to disk. Chunks may be too big.

Here is an example of the Dask dashboard during a good computation (time 6:12 in this video).

For comparison, here is an example of the Dask dashboard during a bad computation (time 6:57 in this video).

In this example, it’s inefficient because the chunks are much too small, so we see a lot of white space and red worker communication in the task stream plot.

Rechunking arrays

If you need to change the chunking of a Dask array in the middle of a computation, you can do that with the rechunk method.

rechunked_array = original_array.rechunk(new_chunks)

Warning: Rechunking Dask arrays comes at a cost.

The Dask graph must be rearranged to accommodate the new chunk structure. This happens immediately, and will block any other interaction with python until Dask has rearranged the task graph.
This also inserts new tasks into the Dask graph. At compute time, there are now more tasks to execute.

For these reasons, it is best to choose a good initial chunk size and avoid rechunking.
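To see the extra tasks for yourself, you can compare the size of the task graph before and after rechunking. A minimal sketch, with arbitrary array and chunk shapes (the exact task counts depend on the Dask version, so none are quoted here):

```python
import dask.array as da

original_array = da.ones((1000, 1000), chunks=(100, 100))
rechunked_array = original_array.rechunk((500, 500))

# Number of tasks in each (unoptimized) task graph
tasks_before = len(original_array.__dask_graph__())
tasks_after = len(rechunked_array.__dask_graph__())

print(original_array.chunksize, "->", rechunked_array.chunksize)
print(tasks_before, "->", tasks_after)
```

The rechunked graph still contains the tasks that create the original chunks, plus the new tasks that split and merge them into the requested layout.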

However, sometimes the data stored on disk is not well aligned and rechunking may be necessary. For an example of this, here is Draga Doncila Pop talking about chunk alignment with satellite image data.

The rechunker library can be useful in these situations:

Rechunker takes an input array (or group of arrays) stored in a persistent storage device (such as a filesystem or a cloud storage bucket) and writes out an array (or group of arrays) with the same data, but different chunking scheme, to a new location. Rechunker is designed to be used within a parallel execution framework such as Dask.

Unmanaged memory

Last, remember that you don’t only need to consider the size of the array chunks in memory, but also the working memory consumed by your analysis functions. Sometimes that is called “unmanaged memory” in Dask.

“Unmanaged memory is RAM that the Dask scheduler is not directly aware of and which can cause workers to run out of memory and cause computations to hang and crash.” – Guido Imperiale

Here are some tips for handling unmanaged memory:

Tackling unmanaged memory with Dask (Coiled blogpost) by Guido Imperiale
Handle Unmanaged Memory in Dask (8 minute video)

Thanks for reading

We hope this was helpful figuring out how to choose good chunk sizes for Dask. This blogpost was inspired by this twitter thread. If you’d like to follow Dask on Twitter, you can do that at https://twitter.com/dask_dev

CZI EOSS Update

2021-10-20T00:00:00+00:00

Dask was awarded funding last year in round 2 of the CZI Essential Open Source Software grant program. That funding was used to hire Genevieve Buckley to work on Dask with a focus on life sciences. Last month Dask submitted an interim progress report to CZI, covering the period from February to September 2021. That progress update is published verbatim below, to share with the wider Dask community.

Brief summary

The scope of work performed by the Dask fellow includes code contributions, conference presentations and tutorials, community engagement, and outreach including blogposts.

The primary deliverable of this proposal is consistency and the success of neighboring software projects

Project work to date includes:

38 pull requests merged (plus 6 draft pull requests) across 5 different repositories.
3 conferences (presentations and organising of specialist workshops)
1 half day workshop (plus another one upcoming)
Student supervision for Dask’s Google Summer of Code project
9 blogposts (plus 2 drafts for upcoming publication)

Code contributions

Code contributions are not limited to the main Dask repository, but also neighbouring software projects which use Dask as well (like the napari software project), including: dask, dask-image, dask-examples, napari, & napari.github.io.

To date, across the five repositories named above the Dask fellow has contributed:

38 pull requests
6 draft pull requests
12 closed pull requests (not merged, discarded in favour of another approach)

The Dask fellow is an official maintainer of the dask-image project, and additional milestones achieved for that project include:

The maintainer team has been grown by one (we welcome Marvin Albert to our ranks)
2 new dask-image releases in 2020

Code contribution highlights

Highlights include:

Bugfixes benefitting the broader community
- dask PR #7391: This PR fixed slicing the output from Dask’s bincount function. The impact of this fix was substantial, as it solved issues filed in four separate projects: scikit-image, dask-ml, xgcm/xhistogram and the cupy dask tests.
Expanded GPU support
- dask PR #6680: This PR provided support for different array types in the *_like array creation functions. Now users can create cupy like Dask arrays for GPU processing, or indeed any other array type (eg: sparse).
- dask-image PR #157: This PR provided GPU support for binary morphological functions in the dask-image project.
Visualization tools benefitting all Dask users
- dask PR #7716: This PR automatically displays the high level graph visualization in the jupyter notebook cell output (something already done automatically for low level graphs).
- dask PR #7763: This PR introduced a HTML representation for Dask HighLevelGraph objects. This allows users and developers a much easier way to inspect the structure and status of HighLevelGraphs.
- Further developed during the Dask Google Summer of Code project, full report available here.
High Level Graphs
- dask PR #7595: This PR introduced a high level graph layer for array overlaps. High level graphs are a tool we can use to optimize Dask’s performance.
- dask PR #7655 (ongoing): This PR introduces a high level graph for Dask array slicing operations.
Memory improvements (ongoing)
- dask PR #8124 (ongoing): This PR investigates improved automatic rechunking strategies for memory problems caused by reshaping Dask arrays.
- dask PR #7950 (ongoing): This PR aims to improve memory and performance of the tensordot function with auto-rechunking of Dask arrays.
- dask PR #7980 (ongoing): This PR aims to fix the unbounded memory use problem in tensordot, reported here.

Conferences

Notable conference events in 2021 included the SciPy conference, the Dask Summit, and VIS2021.

SciPy conference

The Dask fellow presented a talk titled “Scaling Science: leveraging Dask for life sciences” at the 2021 SciPy conference. Full recording available here.

Dask Summit

The Dask fellow organised two workshops at the 2021 Dask Summit:

Dask Down Under (co-organised with Nick Mortimer), and
The Dask life science workshop

Dask Down Under

The scope of Dask Down Under was more like a mini-conference for Australian timezones, rather than a typical workshop. Dask Down Under involved two days of events, covering:

5 talks
2 tutorials
1 panel discussion
1 meet and greet networking event

It was very well received by the community. A full report on the Dask Down Under events is available here. A YouTube playlist of the Dask Down Under events is available here on the Dask YouTube channel.

Dask life science workshop

The Dask life science workshop involved:

15 pre-recorded lightning talks
3 interactive discussion times (accessible across timezones in Europe, Oceania, and the Americas)
Asynchronous text chat throughout the Dask Summit

A full report on the Dask life science workshop is available here. A YouTube playlist of all the Dask life science lightning talks is available here on the Dask YouTube channel.

VIS2021 symposium

The Dask fellow was an invited panellist at the VIS2021 symposium in February 2021. The “Problem Solver” panel discussion covered practical problems in image analysis and how tools like Dask and napari can help solve them.

Tutorials and workshops

The Dask fellow co-presented a half-day workshop (five hours) at the 2021 Light Microscopy Australia Meeting with Juan Nunez-Iglesias. napari is an open source multidimensional image viewer built using Dask for out-of-core image processing. Workshop content is available at this link: jni/lma-2021-bioimage-analysis-python

Upcoming workshop: The Dask fellow has been invited to deliver a workshop on napari and big data using Dask at an upcoming NEUBIAS Academy. Workshop content is available at this link: GenevieveBuckley/napari-big-data-training

Google Summer of Code

The Dask fellow supervised a Google Summer of Code student in 2021. Martin Durant acted as a secondary supervisor. The project ran over a three-month period, and involved implementing a number of features to improve visualization of Dask graphs and objects. A full report on the Dask GSOC project is available here.

Blogposts

We set a goal of one blogpost per month, and exceeded it. To date, nine blogposts have been published by the Dask fellow, with another two currently in draft status.

Getting to know the life science community
Dask with PyTorch for large scale image analysis (co-authored with Nick Sofroniew)
Skeleton analysis
Life sciences at the 2021 Dask Summit
The 2021 Dask User Survey is out now
Dask Down Under (co-authored with Nick Mortimer)
Ragged output, how to handle awkward shaped results
High Level Graphs update
Google Summer of Code 2021 - Dask Project

Draft status, will be published soon:

Mosaic Image Fusion (co-authored with Volker Hilsenstein)
2021 Dask user survey results

2021 Dask User Survey

2021-09-15T00:00:00+00:00

This post presents the results of the 2021 Dask User Survey, which ran earlier this year. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.

The raw data, as well as the start of an analysis, can be found in this binder:

Let us know if you find anything in the data.

Highlights
Who are Dask users?
How people like to use Dask
Diagnostics
Stability
User satisfaction, support, and documentation
Suggestions for improvement
Previous survey results

Highlights

We had 247 responses to the survey (roughly the same as last year, which had just under 240 responses). Overall, responses were similar to previous years.

We asked 43 questions in the survey (an increase of 18 questions compared to the year before). We asked a bunch of new questions about the types of datasets people work with, the stability of Dask, and what kinds of industries people work in.

Our community wants:

More documentation and examples
More intermediate level documentation
To improve the resiliency of Dask (i.e. do computations complete?)

Users also value these features:

Improved scaling
Ease of deployment
Better scikit-learn & machine learning support

The typical Dask user

The survey shows us there is a lot of diversity in our community, and there is no one way to use Dask. That said, our hypothetical “typical” Dask user:

Works with gigabyte sized datasets
Stored on a local filesystem
Has been using Dask between 1 and 3 years
Uses Dask occasionally, not every day
Uses Dask interactively at least part of the time
Uses a compute cluster (probably)
Likes to view the Dask dashboard with a web browser
For the most part, Dask is stable enough for their needs, but improving Dask’s resiliency would be helpful
Uses the Dask dataframe, delayed, and maybe the Dask Array API, alongside numpy/pandas and other python libraries
The most useful thing that would help this person is more documentation, and more examples using Dask in their field.
They likely work in a scientific field (perhaps geoscience, life science, physics, or astronomy), or alternatively they might work in accounting, finance, insurance, or as a tech worker.

You can read the survey results from previous years here: 2020 survey results, 2019 survey results.

# Let's load in the survey data...
%matplotlib inline

from pprint import pprint
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import textwrap
import re


df2019 = (
    pd.read_csv("data/2019-user-survey-results.csv.gz", parse_dates=["Timestamp"])
      .replace({"How often do you use Dask?": "I use Dask all the time, even when I sleep"}, "Every day")
)

df2020 = (
    pd.read_csv("data/2020-user-survey-results.csv.gz")
      .assign(Timestamp=lambda df: pd.to_datetime(df['Timestamp'], format="%Y/%m/%d %H:%M:%S %p %Z").astype('datetime64[ns]'))
      .replace({"How often do you use Dask?": "I use Dask all the time, even when I sleep"}, "Every day")
)

df2021 = (
    pd.read_csv("data/2021-user-survey-results.csv.gz")
      .assign(Timestamp=lambda df: pd.to_datetime(df['Timestamp']).astype('datetime64[ns]'))
      .replace({"How often do you use Dask?": "I use Dask all the time, even when I sleep"}, "Every day")
)

common = df2019.columns.intersection(df2020.columns).intersection(df2021.columns)
added = df2021.columns.difference(df2020.columns)
dropped = df2020.columns.difference(df2021.columns)

df = pd.concat([df2019, df2020, df2021])
df['Year'] = df.Timestamp.dt.year
df = df.set_index(['Year', 'Timestamp']).sort_index()
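The `added`, `dropped`, and `common` variables above rely on pandas index set operations to compare the questions asked in each year. Here is a minimal sketch of that idea; the two shortened question lists are made up for illustration and are not the real survey columns.

```python
import pandas as pd

# Hypothetical, shortened question lists (not the real survey columns)
questions_2020 = pd.Index(["How often do you use Dask?", "Preferred Cloud?"])
questions_2021 = pd.Index(
    ["How often do you use Dask?", "Preferred Cloud?", "What industry do you work in?"]
)

added = questions_2021.difference(questions_2020)      # asked in 2021 only
dropped = questions_2020.difference(questions_2021)    # asked in 2020 only
common = questions_2020.intersection(questions_2021)   # asked in both years

print("added:", list(added))
print("dropped:", list(dropped))
print("common:", list(common))
```

Counting the entries in `added` is one way to check how many new questions a survey year introduced.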

Who are Dask users?

Most people said they use Dask occasionally, while a smaller group use Dask every day. There is a wide variety in how long people have used Dask for, with the most common response being between one and three years.

q = "How often do you use Dask?"
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(";").explode().to_frame());
ax.set(ylabel="", title=q);

q = "How long have you used Dask?"  # New question in 2021
order = ["More than 3 years", "1 - 3 years", "3 months - 1 year", "Less than 3 months", "I've never used Dask"]
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(";").explode().to_frame(), order=order);
ax.set(ylabel="", title=q);
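Most plots in this post use the same counting pattern: split each answer on its separator, explode to one row per choice, then count. Here is a minimal sketch of that pattern on made-up data; the answers and the `;` separator below are assumptions for illustration, not survey data.

```python
import pandas as pd

# Hypothetical multi-select answers (not real survey data); None marks a skipped question
answers = pd.Series(
    ["Every day;Occasionally", "Occasionally", None, "Just looking for now"],
    name="How often do you use Dask?",
)

# Drop skipped answers, split on the separator, and give each choice its own row
choices = answers.dropna().str.split(";").explode()

# Tally how often each choice was selected
counts = choices.value_counts()
print(counts)
```

Because each selected choice becomes its own row, a respondent who ticks two options is counted once for each option.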

Just over half of respondents use Dask with other people (their team or organisation), and the other half use Dask on their own.

q = "Do you use Dask as part of a larger group?"
order = [
    'I use Dask mostly on my own',
    'My team or research group also use Dask',
    'Beyond my group, many people throughout my institution use Dask',
]
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(";").explode().to_frame(), order=order)
ax.set(ylabel="", title=q);

In the last year, there has been an increase in the number of people who say that many people throughout their institution use Dask (32 people said this in 2021, compared to 19 in 2020). Between 2019 and 2020, there was a drop in the number of people who said their immediate team also uses Dask (121 people said this in 2019, compared to 94 in 2020). It’s not clear why we saw either of these changes, so it will be interesting to see what happens in future years.

q = 'Do you use Dask as part of a larger group?'
ax = sns.countplot(y=q, hue="Year", data=df.reset_index());
ax.set(ylabel="", title=q);

What industry do you work in?

There was a wide variety of industries represented in the survey.

Almost half of responses were in an industry related to science, academia, or a government laboratory. Geoscience had the most responses, while life sciences, physics, and astronomy were also popular fields.

Around 30 percent of responses were from people in business and tech. Of these, there was a roughly even split between people in accounting/finance/insurance vs other tech workers.

Around 10 percent of responses belonged to manufacturing, engineering, and other industry (energy, aerospace, etc). The remaining responses were difficult to categorise.

q = "What industry do you work in?"  # New question in 2021
data = df2021[q].dropna().str.split(";").explode().to_frame()
order = data.value_counts()[data.value_counts() > 1].keys().get_level_values(0)
ax = sns.countplot(y=q, data=data, order=order);
ax.set(ylabel="", title=q);
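The `order` line above keeps only the categories that more than one person chose, which stops one-off free-text answers from cluttering the plot. A minimal sketch of that filter, using made-up industry answers and a plain `Series.value_counts()` rather than the one-column DataFrame used above:

```python
import pandas as pd

# Hypothetical industry answers (not real survey data)
industry = pd.Series(
    ["Geoscience", "Life science", "Geoscience", "Finance",
     "Life science", "Geoscience", "Retail"],
    name="What industry do you work in?",
)

counts = industry.value_counts()             # sorted from most to least common
order = counts[counts > 1].index.tolist()    # keep categories chosen more than once
print(order)
```

Passing a list like this as `order=` to `sns.countplot` both filters and sorts the bars.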

How easy is it for you to upgrade to newer versions of Python libraries?

The majority of users are able to easily upgrade to newer versions of Python libraries when they want.

q = "How easy is it for you to upgrade to newer versions of Python libraries"
sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame()).set_ylabel('Scale from 1 (Difficult) to 4 (Easy)');

How people like to use Dask

People like to use Dask in conjunction with numpy and pandas, along with a range of other python libraries. The most popular Dask APIs are Dask Dataframes, Dask Delayed, and Dask Arrays.

The vast majority of people like to use Dask interactively with Jupyter or IPython at least part of the time, and most people view the Dask Dashboard with a web browser.

What are some other libraries that you often use with Dask?

The ten most common libraries people use with Dask are: numpy, pandas, xarray, scikit-learn, scipy, statsmodels, matplotlib, xgboost, numba, and joblib.

q = "What are some other libraries that you often use with Dask?"
data = df2021[q].dropna().str.lower().str.split(", ").explode().to_frame()
labels = data[q].value_counts().iloc[:10].index  # Series.value_counts replaces the deprecated pd.value_counts
sns.countplot(y=q, data=data, order=labels).set_ylabel('');

Dask APIs

The three most popular Dask APIs people use are:

Dask Dataframes
Dask Delayed
Dask Arrays

In 2021, we saw a small increase in the number of people who use Dask Delayed, compared with previous years. This might be a good thing: it’s possible that as people develop experience and confidence with Dask, they are more likely to start using more advanced features such as delayed. Besides this change, preferences were pretty similar to the results from previous years.

apis = df2021['Dask APIs'].str.split(", ").explode()
top = apis.value_counts().loc[lambda x: x > 10]
apis = apis[apis.isin(top.index)].reset_index()

sns.countplot(y="Dask APIs", data=apis);

Interactive or Batch?

The vast majority of people like to use Dask interactively with Jupyter or IPython at least part of the time. Less than 15% of Dask users only use Dask in batch mode (submitting scripts that run in the future).

q = 'Interactive or Batch?'
data = df2021[q].dropna()
data = data.str.replace('Interactive:  I use Dask with Jupyter or IPython when playing with data, Batch: I submit scripts that run in the future', "Interactive and Batch")
data = data.str.replace('Interactive:  I use Dask with Jupyter or IPython when playing with data', "Interactive")
data = data.str.replace('Batch: I submit scripts that run in the future', "Batch")
order = ["Interactive and Batch", "Interactive", "Batch"]
sns.countplot(y=q, data=data.explode().to_frame(), order=order).set_ylabel('');
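The chained replacements above depend on order: the combined answer has to be shortened first, because it contains both of the other strings. As an alternative sketch (not the method used above), an exact-match lookup with `Series.map` avoids that ordering concern. The four responses below are made up; only the answer wording comes from the survey.

```python
import pandas as pd

INTERACTIVE = "Interactive:  I use Dask with Jupyter or IPython when playing with data"
BATCH = "Batch: I submit scripts that run in the future"

# Hypothetical responses (not real survey data)
responses = pd.Series([INTERACTIVE, f"{INTERACTIVE}, {BATCH}", BATCH, INTERACTIVE])

# Exact-match lookup from the full answer text to a short label
short_labels = {
    f"{INTERACTIVE}, {BATCH}": "Interactive and Batch",
    INTERACTIVE: "Interactive",
    BATCH: "Batch",
}
recoded = responses.map(short_labels)

# Fraction of respondents who only use batch mode
batch_only_share = (recoded == "Batch").mean()
print(recoded.value_counts())
print(batch_only_share)
```

One trade-off: any answer that is not in the lookup becomes missing rather than passing through unchanged, so unexpected values are easy to spot.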

How do you view Dask’s dashboard?

Most people look at the Dask dashboard using a web browser. A smaller group use the dask jupyterlab extension.

A few people are still not sure what the dashboard is all about. If that’s you too, you might like to watch this 20 minute video that explains why the dashboard is super useful, or see the rest of the docs here.

q = "How do you view Dask's dashboard?"
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(", ").explode().to_frame());
ax.set(ylabel="", title=q);

Local machine or Cluster?

Roughly two thirds of respondents use a computing cluster at least part of the time.

q = 'Local machine or Cluster?'
df[q].dropna().str.contains("Cluster").astype(int).groupby("Year").mean()

Year
2019    0.654902
2020    0.666667
2021    0.630081
Name: Local machine or Cluster?, dtype: float64
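The per-year figures above come from flagging every answer that mentions a cluster and then averaging the 0/1 flags within each year. A minimal sketch of the same calculation on made-up answers (the five responses below are hypothetical):

```python
import pandas as pd

# Hypothetical answers indexed by survey year (not real survey data)
answers = pd.Series(
    [
        "Personal laptop",
        "Personal laptop, Cluster of 2-10 machines",
        "Cluster with 100+ machines",
        "Large workstation",
        "Cluster with 10-100 machines",
    ],
    index=pd.Index([2020, 2020, 2021, 2021, 2021], name="Year"),
    name="Local machine or Cluster?",
)

# 1 where any selected option mentions a cluster, else 0
uses_cluster = answers.str.contains("Cluster").astype(int)

# The mean of 0/1 flags is the share of respondents using a cluster
share_by_year = uses_cluster.groupby(level="Year").mean()
print(share_by_year)
```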

q = 'Local machine or Cluster?'
order = [
    'Personal laptop',
    'Large workstation',
    'Cluster of 2-10 machines',
    'Cluster with 10-100 machines',
    'Cluster with 100+ machines'
]
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(", ").explode().to_frame(), order=order);
ax.set(ylabel="", title=q);

If you use a cluster, how do you launch Dask?

SSH is the most common way to launch Dask on a compute cluster, followed by a HPC resource manager, then Kubernetes.

q = "If you use a cluster, how do you launch Dask? "
data = df2021[q].dropna()
data = data.str.replace("HPC resource manager (SLURM, PBS, SGE, LSF or similar)", "HPC resource manager (SLURM PBS SGE LSF or similar)", regex=False)
data = data.str.replace("I don't know, someone else does this for me", "I don't know someone else does this for me", regex=False)
data = data.str.split(", ").explode().to_frame()
order = data.value_counts()[data.value_counts() > 1].keys().get_level_values(0)
ax = sns.countplot(y=q, data=data, order=order);
ax.set(ylabel="", title=q);
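The two `str.replace` calls above are needed because some answer options contain commas themselves, so splitting on `", "` would break them into fragments. A small sketch of the problem and the workaround, using three made-up responses:

```python
import pandas as pd

HPC = "HPC resource manager (SLURM, PBS, SGE, LSF or similar)"

# Hypothetical multi-select responses joined with ", " (not real survey data)
responses = pd.Series([f"SSH, {HPC}", "Kubernetes", HPC])

# Naive split: the HPC option shatters into four fragments each time it appears
naive = responses.str.split(", ").explode()

# Workaround: remove the commas inside the known option before splitting
safe = (
    responses.str.replace(HPC, HPC.replace(",", ""), regex=False)
    .str.split(", ")
    .explode()
)

print(naive.nunique(), "distinct values after the naive split")
print(safe.value_counts())
```

`regex=False` matters here because the option text contains parentheses, which would otherwise be treated as regular expression groups.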

If you use a cluster, do you have a need for multiple worker types in the same cluster?

Of the people who use compute clusters, a little less than half have a need for multiple worker types in the same cluster. Examples of this might include mixed workers with GPU vs no GPU, mixed workers with low or high memory allocations, etc.

q = "If you use a cluster, do you have a need for multiple worker / machine types (e.g. GPU / no GPU, low / high memory) in the same cluster?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(";").explode().to_frame());
ax.set(ylabel="", title="Do you need multiple worker/machine types on a cluster?");

Datasets

How large are your datasets typically?

Dask users most commonly work with gigabyte sized datasets. Very few users work with petabyte sized datasets.

q = "How large are your datasets typically?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(", ").explode().to_frame());
ax.set(ylabel="", title=q);

Where are your datasets typically stored?

Most people store their data on a local filesystem.

q = "Where are your datasets typically stored?"  # New question in 2021
data = df2021[q].dropna().str.split(", ").explode().to_frame()
order = data.value_counts()[data.value_counts() > 1].keys().get_level_values(0)
ax = sns.countplot(y=q, data=data, order=order);
ax.set(ylabel="", title=q);

What file formats do you typically work with?

The two most common file formats (csv and parquet) are popular among Dask Dataframe users. The JSON file format is also very commonly used with Dask. The fourth and fifth most common filetypes (HDF5 and zarr) are popular among Dask Array users. This fits with what we know about the Dask Dataframe API being the most popular, with Dask Arrays close behind.

q = "What file formats do you typically work with?"  # New question in 2021
data = df2021[q].dropna().str.split(", ").explode().to_frame()
order = data.value_counts()[data.value_counts() > 1].keys().get_level_values(0)
ax = sns.countplot(y=q, data=data, order=order);
ax.set(ylabel="", title=q);

This survey question had a long tail: a very wide variety of specialized file formats were reported, most only being used by one or two individuals who replied to the survey.

A lot of these specialized file formats store image data, specific to particular fields (astronomy, geoscience, microscopy, etc.).

list(data.value_counts()[data.value_counts() == 1].keys().get_level_values(0))

['proprietary measurement format',
 'netCDF3',
 'czi',
 'specifically NetCDF4',
 'grib2',
 'in-house npy-like array format',
 'jpeg2000',
 'netCDF4 (based on HDF5)',
 'proprietary microscopy file types. Often I convert to Zarr with a loss of metadata.',
 'sas7bdat',
 'npy',
 'npy and pickle',
 'root with uproot',
 'root',
 'regular GeoTiff',
 '.npy',
 'Text',
 'VCF BAM CRAM',
 'UM',
 'CASA measurement sets',
 'Casa Tables (Radio Astronomy specific)',
 'Custom binary',
 'FITS',
 'FITS (astronomical images)',
 'FITS and a custom semi-relational table specification that I want to kill and replace with something better',
 'Feather (Arrow)',
 'GPKG',
 'GeoTIFF',
 'NetCDF4',
 'Netcdf',
 'Netcdf4',
 'PP',
 'SQL',
 'SQL query to remote DB',
 'SQL to Dataframe',
 'Seismic data (miniSEED)',
 'TFRecords',
 'TIFF',
 'Testing with all file formats. Just want it as a replacement for spark. ',
 '.raw image files',
 'ugh']

XKCD comic “Standards” https://xkcd.com/927/
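Part of the long tail comes from the same format being written in different ways (for example 'NetCDF4', 'Netcdf4', and 'netCDF4 (based on HDF5)'). The sketch below shows one possible way to merge such spellings before counting. It was not part of the original analysis, and the keyword table is a hypothetical, non-exhaustive example.

```python
from collections import Counter

# A few of the free-text answers from the list above
raw = [
    "NetCDF4", "Netcdf4", "netCDF4 (based on HDF5)",
    "FITS", "FITS (astronomical images)",
    "npy", ".npy",
]

# Hypothetical keyword -> canonical name table (illustrative, not exhaustive)
CANONICAL = {"netcdf4": "netCDF4", "fits": "FITS", "npy": "npy"}


def normalize(answer: str) -> str:
    """Map an answer to a canonical format name if it starts with a known keyword."""
    text = answer.lower().lstrip(".")
    for keyword, name in CANONICAL.items():
        if text.startswith(keyword):
            return name
    return answer  # leave unrecognized answers untouched


counts = Counter(normalize(answer) for answer in raw)
print(counts.most_common())
```

A prefix match like this is deliberately crude; answers that name several formats at once would still need manual review.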

Preferred Cloud?

The most popular cloud solution is Amazon Web Services (AWS), followed by Google Cloud Platform (GCP) and Microsoft Azure.

q = "Preferred Cloud?"
order = [
    "Amazon Web Services (AWS)",
    "Google Cloud Platform (GCP)",
    "Microsoft Azure",
    "Digital Ocean",
]
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(", ").explode().to_frame(), order=order);
ax.set(ylabel="", title=q);

Do you use Dask projects to deploy?

Among those who use dask projects to deploy, dask-jobqueue and dask helm chart are the two most popular options. There was a wide variety of projects people used for deployment.

q = "Do you use Dask projects to deploy?"
order = [
    "dask-jobqueue",
    "dask's helm chart",
    "dask-kubernetes",
    "dask's docker image at daskdev/dask",
    "dask-gateway",
    "dask-ssh",
    "dask-cloudprovider",
    "dask-yarn",
    "qhub",
    "dask-mpi",
]
ax = sns.countplot(y=q, data=df2021[q].dropna().str.lower().str.split(", ").explode().to_frame(), order=order);
ax.set(ylabel="", title=q);

Diagnostics

We saw earlier that most people like to view the Dask Dashboard using their web browser.

In the dashboard, people said the most useful diagnostics plots were:

The task stream plot
The progress plot, and
The memory usage per worker plot

q = "Which Diagnostic plots are most useful?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().str.split(', ').explode().to_frame());
ax.set(ylabel="", title=q);

We also asked some new questions about diagnostics in 2021.

We found that most people (65 percent) do not use Dask performance reports, which is a way to save the diagnostic dashboard to static HTML plots for later review.

q = "Do you use Dask's Performance reports?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].explode().to_frame(), order=["Yes", "No"]);
ax.set(ylabel="", title=q);

Very few people use Dask’s Prometheus metrics. Jacob Tomlinson has an excellent article on Monitoring Dask + RAPIDS with Prometheus + Grafana, if you’re interested in learning more about how to use this feature.

q = "Do you use Dask's Prometheus Metrics?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].explode().to_frame(), order=["Yes", "No"]);
ax.set(ylabel="", title=q);

Stability

We asked a number of questions around the stability of Dask, many of them new questions in 2021.

The majority of people said Dask was resilient enough for them (eg: computations complete). However, this is an area we could improve in, as 36 percent of people are not satisfied. This was a new question in 2021, so we can’t say how people’s opinion of Dask’s resiliency has changed over time.

q = "Is Dask resilient enough for you? (e.g. computations complete)."  # new question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame(), order=["Yes", "No"]);
ax.set(ylabel="", title="Is Dask resilient enough for you?");

Most people say Dask in general is stable enough for them (eg: between different version releases). This is similar to the survey results from previous years.

q = "Is Dask stable enough for you?"
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame(), order=["Yes", "No"]);
ax.set(ylabel="", title=q);

People also say that the API of Dask is stable enough for them too.

q = "Is Dask's API stable enough for you?"
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame(), order=["Yes", "No"]);
ax.set(ylabel="", title=q);

The vast majority of people are satisfied with the current release frequency (roughly once every two weeks).

q = "How is Dask's release frequency?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame());
ax.set(ylabel="", title=q);

Most people say they would pin their code to a long term support release, if one was available for Dask.

q = "If Dask had Long-term support (LTS) releases, would you pin your code to use them?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame(), order=["Yes", "No"]);
ax.set(ylabel="", title="Would you pin to a long term support release?");

User satisfaction, support, and documentation

We asked a bunch of new questions about user satisfaction in the 2021 survey.

How easy is Dask to use?

The majority of people say that Dask is moderately easy to use, the same as in previous surveys.

q = "On a scale of 1 - 5 (1 being hardest, 5 being easiest) how easy is Dask to use?"
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame());
ax.set(ylabel="1 = Difficult, 5 = Easy", title="How easy is Dask to use?");

How is Dask’s documentation?

Most people think that Dask’s documentation is pretty good.

q = "How is Dask's documentation?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame());
ax.set(ylabel="1 = Not good, 5 = Great", title=q);

How satisfied are you with maintainer responsiveness on GitHub?

Almost everybody who responded feels positively about Dask’s maintainer responsiveness on GitHub.

q = "How satisfied are you with maintainer responsiveness on GitHub?"  # New question in 2021
ax = sns.countplot(y=q, data=df2021[q].dropna().explode().to_frame());
ax.set(ylabel="1 = Not satisfied, 5 = Thrilled", title=q);

What Dask resources have you used for support in the last six months?

The documentation at dask.org is the first place most users look for help.

The breakdown of responses to this question in 2021 was very similar to previous years, with the exception that no-one seemed to know that the Dask YouTube channel or Gitter chat existed in 2019.

q = 'What Dask resources have you used for support in the last six months?'

resource_map = {
    "Tutorial": "Tutorial at tutorial.dask.org",
    "YouTube": "YouTube channel",
    "gitter": "Gitter chat"
}

df[q] = df[q].str.replace(';',', ')  # Make separator values consistent
d = df[q].str.split(', ').explode().replace(resource_map)
top = d.value_counts()[:8].index
d = d[d.isin(top)]

fig, ax = plt.subplots(figsize=(8, 8))
ax = sns.countplot(y=q, hue="Year", data=d.reset_index(), ax=ax);
ax.set(ylabel="", title=q);

Suggestions for improvement

Which would help you most right now?

The two top priorities people said would help most right now are both related to documentation. People want more documentation, and more examples in their field. Performance improvements were also commonly mentioned as something that would help the most right now.

q = "Which would help you most right now?"
order = [
    "More documentation",
    "More examples in my field",
    "Performance improvements",
    "New features",
    "Bug fixes",
]
ax = sns.countplot(y=q, data=df2021[q].explode().to_frame(), order=order)
ax.set(ylabel="", title=q);

How can Dask improve?

We also gave people the opportunity for a free text response to the question “How can Dask improve?”

Matt has previously written an early anecdotes blogpost that dives into the responses to this question in more detail.

He found these recurring themes:

Intermediate Documentation
Documentation Organization
Functionality
High Level Optimization
Runtime Stability and Advanced Troubleshooting

Since more documentation and examples were the two most requested improvements, I’ll summarize some of the steps forward in that area here:

Regarding more intermediate documentation, Matt says:

There is a lot of good potential material that advanced users have around performance and debugging that could be fun to publish.
Matt points out that Dask has excellent reference documentation, but lacks a lot of good narrative documentation. To address this, Julia Signell is currently investigating how we could improve the organization of Dask’s documentation (you can subscribe to this issue thread if you want to follow that discussion)
Matt comments that it’s hard to have good narrative documentation when there are so many different user narratives (i.e. Dask is used by people from many different industries). This year, we added a new question to the survey asking for the industry people work in. We added this because “More examples in my field” has been one of the top two requests for the last three years. Now we can use that information to better target narrative documentation to the areas that need it most (geoscience, life science, and finance).

q = 'What industry do you work in?'
data = df2021[df2021["Which would help you most right now?"] == "More examples in my field"]
order = data[q].value_counts()[data[q].value_counts() > 1].keys()
ax = sns.countplot(y=q, data=data[q].dropna().str.split(', ').explode().to_frame(), order=order);
ax.set(ylabel="", title="What field do you want more documentation examples for?");

What common feature requests do you care about most?

Good support for numpy and pandas is critical for most users. Users also value:

Improved scaling
Ease of deployment
Resiliency of Dask
Better scikit-learn & machine learning support

Most feature requests are similar to the survey results from previous years, although there was an increase in the number of people who say better scikit-learn/ML support is critical to them. We also added a new question about Dask’s resiliency in 2021.

In the figure below you can see how people rated the importance of each feature request, for each of the three years we’ve run this survey.

common = (df[df.columns[df.columns.str.startswith("What common feature")]]
          # Keep only the feature name inside the square brackets
          # (str.lstrip/rstrip strip character sets, not prefixes, so they are fragile here)
          .rename(columns=lambda x: re.sub(r"^.*\[(.*)\]$", r"\1", x)))
# pd.Series.value_counts replaces the deprecated top-level pd.value_counts
a = common.loc[2019].apply(pd.Series.value_counts).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': "Importance", 0: "count"}).assign(Year=2019)
b = common.loc[2020].apply(pd.Series.value_counts).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': "Importance", 0: "count"}).assign(Year=2020)
c = common.loc[2021].apply(pd.Series.value_counts).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': "Importance", 0: "count"}).assign(Year=2021)

counts = pd.concat([a, b, c], ignore_index=True)

d = common.stack().reset_index().rename(columns={"level_2": "Feature", 0: "Importance"})
order = ["Not relevant for me", "Somewhat useful", 'Critical to me']
sns.catplot(x='Importance', row="Feature", kind="count", col="Year", data=d, sharex=False, order=order);
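The figure needs the ratings in long form: one row per year, feature, and rating. The code above gets there with `stack()`; the sketch below shows the same reshaping idea with `melt()` instead, on a small made-up table (the ratings are hypothetical, not survey data).

```python
import pandas as pd

# Hypothetical ratings, one column per feature request, indexed by survey year
ratings = pd.DataFrame(
    {
        "Improve scaling": ["Critical to me", "Somewhat useful", "Critical to me"],
        "Ease of deployment": ["Somewhat useful", None, "Not relevant for me"],
    },
    index=pd.Index([2020, 2021, 2021], name="Year"),
)

# Wide -> long: one row per (Year, Feature, Importance), skipping unanswered cells
long = (
    ratings.reset_index()
    .melt(id_vars="Year", var_name="Feature", value_name="Importance")
    .dropna(subset=["Importance"])
)

# Number of respondents per year, feature, and rating
counts = long.groupby(["Year", "Feature", "Importance"]).size()
print(counts)
```

A long table like this can be passed to `sns.catplot` with `row="Feature"` and `col="Year"`, as in the plotting call above.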

Previous survey results

Thanks to everyone who took the survey!

If you want to read more about the 2021 Dask survey, the blogpost on early anecdotes from the Dask 2021 survey is available here.

You can read the survey results from previous years here: 2020 survey results, 2019 survey results.

Google Summer of Code 2021 - Dask Project

2021-08-23T00:00:00+00:00

Here’s an update on new features related to visualizing Dask graphs and HTML representations. You can try these new features today with version 2021.08.1 or above. This work was done by Freyam Mehta during the Google Summer of Code 2021. Dask took part in the program under the NumFOCUS umbrella organization.

Visualizing Dask graphs
HTML Representations

Visualizing Dask graphs

There are several new features involving Dask task graph visualization. Task graphs are a visual representation of the order and dependencies of each individual task within a dask computation. They are a very useful diagnostic tool, and have been used for a long time.

Freyam worked on making these visualizations more illustrative, engaging, and informative. The Graphviz library boasts a great set of attributes which can be modified to create a more visually appealing output.

These features primarily improve the Dask high level graph visualizations. Both low level and high level Dask graphs can be accessed with very similar methods:

Dask low level graph: result.visualize()
Dask high level graph: result.dask.visualize()

…where result is a dask object or collection.

Graphviz node size scaling

The first change you may notice in the Dask high level graphs is that node sizes now scale with the number of tasks in each layer. Layers with more tasks appear larger than the rest.

This is a helpful feature to have, because now users can get a much more intuitive sense of where the bulk of their computation takes place.

Example:

import dask.array as da

array = da.random.random((10000, 10000), chunks=(200, 200))
result = array + array.T - array.mean(axis=0)

result.dask.visualize()  # Dask high level graph

Note: this change only affects the graphviz output for Dask high level graphs. Low level graphs are left unchanged, because each visual node corresponds to one task.

Reference: Pull request #7869 by Freyam Mehta “Add node size scaling to the Graphviz output for the high level graphs”
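To get a feel for the numbers behind node size scaling, the task count of each layer can be inspected directly. This is a minimal sketch that assumes `HighLevelGraph.layers` is a mapping from layer name to layer, and that `len(layer)` gives the number of tasks in that layer. How Dask turns those counts into node sizes is not shown here.

```python
import dask.array as da

array = da.random.random((1000, 1000), chunks=(200, 200))  # 5 x 5 = 25 chunks
result = array + array.T - array.mean(axis=0)

# Map each high level graph layer name to the number of tasks it contains
task_counts = {name: len(layer) for name, layer in result.dask.layers.items()}

# Print the layers from largest to smallest
for name, n_tasks in sorted(task_counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{n_tasks:>4} tasks  {name}")
```

Layers near the top of this listing are the ones that would be drawn largest in the high level graph.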

New tooltips

Dask high level graphs now include hover tooltips to provide a brief summary of more detailed information. To use the tooltips, generate a dask high level graph (eg: result.dask.visualize()) then hover your mouse above the layer you are interested in.

Tooltips provide information such as the layer type and number of tasks associated with it. There is additional information provided for specific dask collections, like dask arrays and dataframes.

Dask array tooltip information additionally includes:

Array shape
Chunk size
Chunk type (eg: are the array chunks numpy, cupy, sparse, etc.)
Data type (eg: are the array values float, integer, boolean, etc.)

Dask dataframe tooltip information additionally includes:

Number of partitions
Dataframe type
Dataframe columns

Users have asked for a less overwhelming view into the dask task graph. We hope the high level graph view coupled with more detailed tooltip information can provide this middle ground, with enough information to be useful, but not so much as to become overwhelming (like the low level task graphs for large computations).

Note: This feature is available for SVG output. Other image formats, like .png, etc. do not support tooltips.

Reference: Pull request #7973 by Freyam Mehta “Add tooltips to graphviz”

Color by layer type

There is also a new feature enabling users to color code a high level graph according to layer type. This option can be enabled by passing the color="layer_type" keyword argument, eg: result.dask.visualize(color="layer_type"). This change is intended to make it easier for users to see which layer types predominate.

While there are no hard and fast rules about what makes a Dask computation efficient, there are some general guidelines:

Dataframe shuffles are particularly expensive operations. You can read more about this here.
Reading and writing data to/from storage/network services is often high-latency and therefore a bottleneck.
Blockwise layers are generally efficient for computation.
All layers are materialized during computation.

See the Dask best practices pages for more information on creating more efficient Dask computations.

Example:

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()

df3.dask.visualize(color="layer_type")  # Dask high level graph with colored nodes by layer type

Reference: Pull request #7974 by Freyam Mehta “Add colors to represent high level layer types”
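The layer types that the color option highlights can also be tallied in plain Python, which may be handy when Graphviz is not installed. The following is a minimal sketch, assuming that the layers attribute of a high level graph maps layer names to layer objects; the exact class names printed may differ between Dask versions.

```python
from collections import Counter

import dask.array as da

array = da.ones((10, 10), chunks=(5, 5))
result = array + array.T - array.mean(axis=0)

# result.dask is the high level graph; .layers maps layer names to layer objects
layer_types = Counter(type(layer).__name__ for layer in result.dask.layers.values())

# Print the most common layer types first
for layer_type, count in layer_types.most_common():
    print(f"{layer_type}: {count}")
```

A graph dominated by Blockwise layers is usually a good sign, in line with the guidelines above.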

Bugfix in visualize method

Freyam also fixed a bug which caused an error when users tried to call dask.visualize() with filename=None (issue #7685, fixed by pull request #7740).

The bug was fixed by adding an extra condition before it reaches the error. If the format is None, Dask now uses a default png format.

import dask
import dask.array as da

array = da.arange(10)
dask.visualize(array, filename=None)  # success

Reference: Pull request #7740 by Freyam Mehta “Fixing calling .visualize() with filename=None”

HTML representations

Dask makes use of HTML representations in several places, for example in Dask collections like the Array and Dataframe classes (for background reading, see this blogpost).

More recently, we’ve introduced HTML representations for high level graphs into Dask, and Jacob Tomlinson has implemented HTML representations in several places in the dask distributed library (for further reading, see this other blogpost).

During Freyam’s Google Summer of Code project, he extended the HTML representations for Dask high level graphs to include images, and introduced two entirely new HTML representations to the dask distributed library.

Array images in HTML repr for high level graphs

The HTML representation for dask high level graphs has been extended, and now includes SVG images of dask arrays at intermediate stages of computation.

The motivation for this feature is similar to the motivation behind adding tooltips, discussed above. Users want easier ways to access information about the way a Dask computation changes as it moves through each stage of computation. We hope this improvement to the HTML representation for Dask high level graphs will provide an at-a-glance summary of array shape and chunk size at each stage.

Example:

import dask.array as da

array = da.ones((10, 20), chunks=(5, 10))
array = array.T

array.dask  # shows the HTML representation in Jupyter

Reference: Pull request #7886 by Freyam Mehta “Add dask.array SVG to the HTML Repr”

New HTML repr for ProcessInterface class

A new HTML representation has been created for the ProcessInterface class in dask distributed.

The HTML representation displays the status, address, and external address of the process.

There are three possible status options:

Process created, not yet running (blue icon)
Process is running (green icon)
Process closed (orange icon)

The ProcessInterface class is not intended to be used directly. Instead, more typically this information will be accessed via subclasses such as the SSH scheduler or workers.

Example:

from dask.distributed import LocalCluster, Client, SSHCluster

cluster = SSHCluster(["127.0.0.1", "127.0.0.1", "127.0.0.1"])
cluster.scheduler  # HTML representation for the SSH scheduler, shown in Jupyter
cluster.workers  # dict of all the workers
# or
cluster.workers[0]  # HTML representation for the first SSH worker in the cluster

Reference: Pull request #5181 by Freyam Mehta “Add HTML Repr for ProcessInterface Class and all its subclasses”

New HTML repr for Security class

Pull request #5178 added a new HTML representation for the Security class in the dask distributed library.

The Security HTML representation shows:

Whether encryption is required
Whether the object instance was created using Security.temporary() or Security(**paths_to_keys).
- For temporary security objects, keys are generated dynamically and the only copy is kept in memory.
- For security objects created using keys stored on disk, the HTML representation will show the full filepath to the relevant security certificates on disk.

Example: temporary security object

from dask.distributed import Security

s = Security.temporary()
s  # shows the HTML representation in Jupyter

Example: security object using certificates saved to disk

from dask.distributed import Security

s = Security(require_encryption=True, tls_ca_file="ca.pem", tls_scheduler_cert="scert.pem")
s  # shows the HTML representation in Jupyter

In addition, the text representation has also been updated to reflect the same information shown in the HTML representation.

Reference: Pull request #5178 by Freyam Mehta “Add HTML Repr for Security Class”

High Level Graphs update

2021-07-07T00:00:00+00:00

There is a lot of work happening in Dask right now on high level graphs. We’d like to share a snapshot of current work in this area. This post is for people interested in technical details of behind the scenes work improving performance in Dask. You don’t need to know anything about it in order to use Dask.

Brief background
Blockwise layers progress
A high level graph for map overlap
Slicing and high level graphs
Visualization
Documentation

Brief background

What are high level graphs?

High level graphs are a more compact representation of instructions needed to generate the full low level task graph. The documentation page on Dask high level graphs is here: https://docs.dask.org/en/latest/high-level-graphs.html

Why are they useful?

High level graphs are useful for faster scheduling. Instead of sending very large task graphs between the scheduler and the workers, we can instead send the smaller high level graph representation to the worker. Reducing the amount of data that needs to be passed around allows us to improve the overall performance.
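To get a feel for the size difference, the sketch below compares the number of layers in a high level graph with the number of tasks in the fully materialized low level graph. The array sizes are arbitrary choices for illustration, and converting the graph to a dict is used here only as a simple way to force materialization.

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))  # 10 x 10 = 100 chunks
y = (x + 1).sum(axis=0)

hlg = y.dask                  # the high level graph
n_layers = len(hlg.layers)    # compact description: a handful of layers
n_tasks = len(dict(hlg))      # materialized low level task graph

print(f"{n_layers} layers describe {n_tasks} low level tasks")
```

The gap between the two numbers grows with the number of chunks, which is why sending the compact representation helps most for large computations.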

You can read more about faster scheduling in our previous blogpost. More recently, Adam Breindel has written about this over on the Coiled blog (link).

Do I need to change my code to use them?

No, you won’t need to change anything. This work is being done under the hood in Dask, and you should see some speed improvements without having to change anything in your code.

In fact, you might already be benefitting from high level graphs:

“Starting with Dask 2021.05.0, Dask DataFrame computations will start sending HighLevelGraph’s directly from the client to the scheduler by default. Because of this, users should observe a much smaller delay between when they call .compute() and when the corresponding tasks begin running on workers for large DataFrame computations” https://coiled.io/blog/dask-heartbeat-by-coiled-2021-06-10/

Read on for a snapshot of progress in other areas.

Blockwise layers progress

Summary

The Blockwise high level graph layer was introduced in the 2020.12.0 Dask release. Since then, there has been a lot of effort made to use the Blockwise high level graph layer wherever possible for improved performance, most especially for IO operations. The following is a non-exhaustive list.

Work to date

Highlights include (in no particular order):

Merged PR by Rick Zamora: Use Blockwise for DataFrame IO (parquet, csv, and orc) #7415
Merged PR by Rick Zamora: Move read_hdf to Blockwise #7625
Merged PR by Rick Zamora: Move timeseries and daily-stock to Blockwise #7615
Merged PR by John Kirkham: Rewrite da.fromfunction w/ da.blockwise #7704

Ongoing work

Lots of other work with Blockwise is currently in progress:

Ian Rose: Blockwise array creation redux #7417. This PR creates blockwise implementations for the from_array and from_zarr functions.
Rick Zamora: Move DataFrame from_array and from_pandas to Blockwise #7628
Bruce Merry: Use BlockwiseDep for map_blocks with block_id or block_info #7686
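For readers who have not used blockwise operations directly, here is a minimal sketch using the public da.blockwise function. It is not taken from any of the pull requests above; the arrays and index strings are illustrative only.

```python
import numpy as np
import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))
y = da.full((4, 6), 2.0, chunks=(2, 3))

# Apply np.add block by block: block (i, j) of x is paired with block (i, j) of y
z = da.blockwise(np.add, "ij", x, "ij", y, "ij", dtype=x.dtype)

# The newest layer in the high level graph describes the whole operation compactly
print(type(z.dask.layers[z.name]).__name__)
print(z.compute())
```

Because the operation is described by a function plus index patterns, the individual tasks do not need to be written out until they are actually required.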

A high level graph for map overlap

Summary

Investigating a high level graph for Dask’s map_overlap is a project driven by user needs. People have told us that the time taken just to generate the task graph (before any actual computation takes place) can sometimes be a big user experience problem. So, we’re looking into ways to improve it.
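As a reminder of what map_overlap does, here is a small sketch. The moving_sum function is a hypothetical example, and the array is tiny, so graph construction is fast here; the slowdown users describe shows up with many chunks and higher dimensions.

```python
import numpy as np
import dask.array as da

def moving_sum(block):
    # Three-point moving sum; needs one neighbouring value on each side
    return np.convolve(block, np.ones(3), mode="same")

x = da.arange(16, chunks=4)

# depth=1 shares one element between neighbouring chunks,
# boundary=0 pads the outer edges of the array with zeros
smoothed = x.map_overlap(moving_sum, depth=1, boundary=0, dtype=float)

# Building the graph (above) is a separate step from computing it (below)
print(len(smoothed.dask.layers), "layers before compute")
print(smoothed.compute())
```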

Work to date

Merged PR by Genevieve Buckley: A HighLevelGraph abstract layer for map_overlap #7595

This PR defers much of the computation involved in creating the Dask task graph, but does not reduce the total end-to-end computation time. Further optimization is therefore required.

Ongoing work

Followup work includes:

Find number of tasks in overlap layer without materializing the layer #7788 https://github.com/dask/dask/issues/7788
Implement cull method for ArrayOverlapLayer #7789 https://github.com/dask/dask/issues/7789 (culling is simplifying a Dask graph by removing unnecessary tasks)
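To illustrate what culling means, the sketch below uses the existing dask.optimization.cull function on a small hand-written low level graph. This is not the ArrayOverlapLayer implementation discussed in the issue; the graph and the inc function are made up for illustration.

```python
from dask.optimization import cull

def inc(x):
    return x + 1

# A small low level task graph written by hand
dsk = {
    "a": 1,
    "b": (inc, "a"),
    "c": (inc, "b"),
    "unused": (inc, "a"),  # not needed to compute "c"
}

# Keep only the tasks required to produce the requested key
culled, dependencies = cull(dsk, ["c"])
print(sorted(culled))  # ['a', 'b', 'c']
```

A layer-level cull method aims to achieve the same result without first writing out every task in the layer.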

Slicing and high level graphs

Summary

Profiling map_overlap, we saw that a lot of time is being spent in slicing operations. So, slicing was a logical next step to investigate possible performance improvements with high level graphs.

Meanwhile, Rick Zamora has been working on the dataframe side of Dask, using high level graphs to improve dataframe slicing/selections.

Work to date

A couple of minor bugfixes/improvements:

Merged PR by Genevieve Buckley: SimpleShuffleLayer should compare parts_out with set(self.parts_out) #7787
Merged PR by Genevieve Buckley: Make Layer get_output_keys officially an abstract method #7775

Ongoing work

Rick Zamora: [WIP] Add DataFrameGetitemLayer to simplify HLG Optimizations #7663
Genevieve Buckley: Array slicing HighLevelGraph layer #7655

Visualization

Summary

We’ve also put some work into making better visualizations for Dask objects (including high level graphs).

Defining a _repr_html_ method for your classes is a great way to get nice HTML output when you’re working with jupyter notebooks. You can read this post to see more neat HTML representations in other scientific python libraries.
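As a rough sketch of the idea, the class below defines both a text representation and an HTML representation. ChunkSummary is a hypothetical class invented for this example and is not part of Dask.

```python
import html

class ChunkSummary:
    """Hypothetical example class, not part of Dask."""

    def __init__(self, shape, chunks):
        self.shape = shape
        self.chunks = chunks

    def __repr__(self):
        # Plain-text fallback for terminals and logs
        return f"ChunkSummary(shape={self.shape}, chunks={self.chunks})"

    def _repr_html_(self):
        # Jupyter looks for this method and renders the returned HTML
        rows = [("Shape", self.shape), ("Chunks", self.chunks)]
        cells = "".join(
            f"<tr><th>{html.escape(str(k))}</th><td>{html.escape(str(v))}</td></tr>"
            for k, v in rows
        )
        return f"<table>{cells}</table>"

summary = ChunkSummary(shape=(10, 20), chunks=(5, 10))
print(repr(summary))
print(summary._repr_html_())
```

In a Jupyter notebook, leaving summary as the last expression in a cell displays the table instead of the plain text.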

Dask already uses HTML representations in lots of places (like the Array and Dataframe classes). We now have new HTML representations for HighLevelGraph and Layer objects, as well as Scheduler and Client objects in Dask distributed.

Work to date

Merged PR by Jacob Tomlinson: Add HTML repr to scheduler_info and incorporate into client and cluster reprs #4857
Merged PR by Jacob Tomlinson: HTML reprs Client.who_has & Client.has_what
Merged PR by Genevieve Buckley: Implementation of HTML repr for HighLevelGraph layers #7763 https://github.com/dask/dask/pull/7763
Merged PR by Genevieve Buckley: Automatically show graph visualization in jupyter notebooks #771
Merged PR by Genevieve Buckley: Adding chunks and type information to dask high level graphs #7309. This PR inserts extra information into the high level graph, so that we can create richer visualizations using this extra context later on.

Example

Before:

<dask.highlevelgraph.HighLevelGraph at 0x7f9851b7e4f0>

After (HTML representation):

After (text-only representation):

from dask.datasets import timeseries

ddf = timeseries().shuffle("id", shuffle="tasks").head(compute=False)
ddf.dask

HighLevelGraph with 3 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fc259015b80>
 0. make-timeseries-94aab6e7236cbd9828bcbfb35fe6caee
 1. simple-shuffle-cd01443e43b7a6eb9810ad67992c40b6
 2. head-1-5-simple-shuffle-cd01443e43b7a6eb9810ad67992c40b6

This gives us a much more meaningful representation, and is already being used by developers working on high level graphs.

Documentation

Finally, the documentation around high level graphs is sparse. This is because they’re relatively new, and have also been undergoing quite a bit of change. However, this makes it difficult for people to learn how high level graphs work. We’re planning to improve the documentation, for both users and developers of Dask.

If you’d like to follow these discussions, or help out, you can subscribe to the issues:

For Dask users: Update HighLevelGraph documentation #7709
For Dask developers: Document dev process around high level graphs #7755

Ragged output, how to handle awkward shaped results

2021-07-02T00:00:00+00:00

This blogpost explains some of the difficulties associated with distributed computation and ragged or irregularly shaped outputs. We present a recommended method for using Dask in these circumstances.

Background

Often, we come across workflows where analyzing the data involves searching for features (which may or may not be present) then computing some results from those features. Because we don’t know ahead of time how many features will be found, we can expect the processing output size to vary.

For distributed workloads, we need to split up the data, process it, and then recombine the results. That means ragged output can cause problems (like broadcasting errors) when Dask combines the output.

Problem constraints

In this blogpost, we’ll look at an example with the following constraints:

Input array data
A processing function requiring overlap between chunks
The output returned

Solution

The simplest strategy is a two step process:

Expand the array chunks using the overlap function.
Use map_blocks with the drop_axis keyword argument

Example code

import dask.array as da

arr = da.random.random((100, 100), chunks=(50,50))  # example input data
expanded = da.overlap.overlap(arr, depth=2, boundary="reflect")
# processing_func is any function returning ragged output (see the examples further down)
result = expanded.map_blocks(processing_func, drop_axis=1, dtype=float)
result.compute()

Multiple output types supported

This pattern supports multiple types of output from the processing function, including:

numpy arrays
pandas Series
pandas DataFrames

You can try this for yourself using any of the example processing functions below, generating dummy data output. Or, you can try out a function of your own.

# Random length, 1D output returned
import numpy as np
import pandas as pd

# function returns numpy array
def processing_func(x):
    random_length = np.random.randint(1, 7)
    return np.arange(random_length)

# function returns pandas series
def processing_func(x):
    random_length = np.random.randint(1, 7)
    output_series = np.arange(random_length)
    return pd.Series(output_series)

# function returns pandas dataframe
def processing_func(x):
    random_length = np.random.randint(1, 7)
    x_data = np.arange(random_length)
    y_data = np.random.random((random_length))
    return pd.DataFrame({"x": x_data, "y": y_data})

Why can’t I use `map_overlap` or `reduction`?

Ragged output sizes can cause broadcasting errors when the outputs are combined for some Dask functions.

However, if ragged output sizes aren’t a constraint for your particular programming problem, then you can continue to use the Dask map_overlap and reduction functions as much as you like.

Alternative solution

Dask delayed

As an alternative solution, you can use Dask delayed (a tutorial is available here).

Advantages:

Your processing function can have any type of output (it is not restricted to numpy or pandas objects)
There is more flexibility in the ways you can use Dask delayed.

Disadvantages:

You will have to handle combining the outputs yourself.
You will have to be more careful about performance:
- For example, because the code below uses delayed in a list comprehension, it’s very important for performance reasons that we pass in the expected metadata. Fortunately, dask has a make_meta function available.
- You can read more about performance considerations for Dask delayed and best practices here.

Example code:

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import dask

arr = da.ones((20, 10), chunks=(10, 10))

@dask.delayed
def processing_func(x):
    # returns dummy dataframe output
    random_length = np.random.randint(1,10)
    return pd.DataFrame({'x': np.arange(random_length),
                         'y': np.random.random(random_length)})

meta = dd.utils.make_meta([('x', np.int64), ('y', np.float64)])  # 'y' holds random floats
expanded = da.overlap.overlap(arr, depth=2, boundary="reflect")
blocks = expanded.to_delayed().ravel()
results = [dd.from_delayed(processing_func(b), meta=meta) for b in blocks]
ddf = dd.concat(results)
ddf.compute()
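One way to sanity-check the metadata is to compare it against a real output of the processing function. The sketch below does this with pandas alone. The make_frame helper is a hypothetical, non-delayed copy of the function above, and the hand-built empty DataFrame stands in for the kind of object make_meta is expected to produce.

```python
import numpy as np
import pandas as pd

def make_frame():
    # Same dummy output as the processing function above, without dask.delayed
    random_length = np.random.randint(1, 10)
    return pd.DataFrame({"x": np.arange(random_length, dtype=np.int64),
                         "y": np.random.random(random_length)})

# An empty DataFrame carrying only the expected column names and dtypes
expected_meta = pd.DataFrame({"x": pd.Series(dtype=np.int64),
                              "y": pd.Series(dtype=np.float64)})

# Compare column names and dtypes of a real output against the metadata
sample = make_frame()
meta_matches = sample.dtypes.equals(expected_meta.dtypes)
print(meta_matches)
```

If the comparison is False, the column names or dtypes in the metadata do not describe what the function really returns.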

Summing up

That’s it! We’ve learned how to avoid common errors when working with processing functions returning ragged outputs. The method recommended here works well with multiple output types including: numpy arrays, pandas series, and pandas DataFrames.

Dask Down Under

2021-06-25T00:00:00+00:00

Dask Down Under was a special event held for the first time last month during the 2021 Dask Summit. It featured talks, tutorials, and events tailored specifically for an Australian (and wider Oceania) audience.

To get involved in the new Pangeo Oceania community group, register your interest here.

What is Dask Down Under?
Who came?
Watch the talks
What’s next? Here’s how to get involved!

What is Dask Down Under?

Dask Down Under is a chance for everyone in Oceania to forge links and build community here in our backyard. Dask Down Under will feature talks, tutorials and panel discussions on using Dask to accelerate research. All levels from beginner to expert are encouraged to attend.

Dask Down Under involved two days of events:

5 talks
2 tutorials
1 panel discussion
1 meet and greet networking event

Who came?

There was a strong geoscience theme across Dask Down Under. This reflects the strong scientific community we have in these areas. People came from government organisations, universities, and industry.

We expected most attendees would be based in the Asia-Pacific region, since those were the timezones targeted by these events.

Unexpectedly, we also saw a lot of extra traffic at the talks on day one, likely from US timezones. Publicity from Dask Summit emails and tweets mentioning Dask Down Under resulted in a lot of people stopping by to watch. This more than doubled our live attendance during the first event. It was great to see so much interest coming from other parts of the world, too.

Watch the talks

You can watch the talks and tutorials from the Dask Down Under workshop on the Dask youtube channel. The full playlist for the workshop is available here.

Panel discussion

A panel discussion was held, bringing together a diverse group of users from novice to expert, academic to commercial. We hope this discussion will start a conversation about using Dask in Australia, how we build our community, contribute and stay in touch with the rest of the world. You can watch it here:

Moderator: Draga Doncila Pop
Panelists:

Ben Leighton, CSIRO
Tisham Dhar, Geoscience Australia
Genevieve Buckley, Dask life science fellow
Hugo Bowne-Anderson, Coiled

Invited talks

The full playlist for the workshop is available here.

Featured talks include:

Draga Doncila Pop: Interactive visualization and near real-time analysis on out-of-core satellite images
Tisham Dhar: Dask DevOps for Remote Sensing
Kirill Kouzoubov: Patterns for large scale temporal processing of geo-spatial data using Dask
Ben Leighton and Kim Opie: Image Processing Using Dask - Using dask and skimage to identify vegetation morphology across the Australian landscape
Nick Mortimer: Making the most of your schedule: From HPC to Local Cluster

What’s next?

Here’s how you can get involved:

Several people have discussed setting up a new Pangeo Oceania group. You can register your interest here.

> Soon we'll start holding regular Pangeo Oceania meetups for sharing information, support, training, and workflow advocacy across our region.  We look forward to you helping to shape the Pangeo Oceania community. And if you have a friend or colleague that should be here too, please share this sign-up link: http://bit.ly/Pangeo_email_signup

The Python for Atmosphere and Ocean Science (PyAOS) website provides information and resources to the user community: https://pyaos.github.io/ To keep the site up-to-date, the first ever PyAOS census is being conducted. It would be great if Python users in the atmosphere and/or ocean science community could take a few minutes to fill out the survey. https://forms.gle/L84W7bsxmP86G3Ji9

Dask Survey 2021, early anecdotes

2021-06-18T00:00:00+00:00

The annual Dask user survey is under way and currently accepting responses at dask.org/survey.

This post provides a preview into early results, focusing on anecdotal responses.

The Dask user survey helps developers focus and prioritize our larger efforts. It’s also a fascinating and rewarding dataset of anecdotal use cases of how people use Dask today. Thank you to everyone who has participated so far, you make a difference.

The survey is still open, and I encourage people to speak up about their experience. This blogpost is intended to encourage participation by giving you a sense for how it affects development, and by sharing user stories provided within the survey.

This article skips all of the quantitative data that we collect, and focuses in on direct feedback listed in the final comments. For a more quantitative analysis see the posts from previous years by Tom at 2020 Dask User Survey Results and 2019 Dask User Survey Results.

How can Dask Improve?

In this post we’re going to look at answers to this one question. This was a long-form response field asking “How can Dask Improve?”. Looking through some of the responses we see that a few of them fall into some common themes. I’ve grouped them here.

In each section we’ll include raw responses, followed up with a few comments from me in response.

Intermediate Documentation

More long-form content about the internals of Dask to understand when things don’t work and why. The “Hacking Dask” tutorial in the Dask 2021 summit was precisely the kind of content I really need, because 90% of my time with Dask is spent not understanding why I’m running out of memory and I feel like I’ve ready all the documentation pages 5 times already (although sometimes I also stumble upon a useful page I’ve never seen before).

There’s also a dearth of documentation of intermediate topics like blockwise in dask.array. (I think I ended up reverse engineering how it worked from docs, GitHub issue comments, reading the code, and black-box reverse engineering with different functions before I finally “got it”.)

Improve documentation and error messages to cover more of the 2nd-level problems that people run into beyond the first-level tutorial examples.

more examples for complex concepts (passing metadata to custom functions, for example). more examples/support for using dask arrays and cupy.

I think the hardest thing about Dask is debugging performance issues with dask delayed and complex mixing of other libraries and not knowing when things are being pickled or not. I am getting better at reading the performance reports, but I think that better documentation and tutorials surrounding understanding the reports would help me greater than new features. For example, make a tutorial that does some non-trivial dask-delayed work (ie not just computing a mean) that is written against best practices and show how the performance improves with each adopted best practice/explain why things were slow with each step. I think there could also be improvements to the performance reports to point out the slowest 5 parts of your code and what lines they are, and possibly relevant docs links.

Response

I really like this theme. We now have a solid community of intermediate-advanced Dask users that we should empower. We usually write materials that target the broad base of beginning users, but maybe we should rethink this a bit. There is a lot of good potential material that advanced users have around performance and debugging that could be fun to publish.

Documentation Organization

Documentation website is sometimes confusing to navigate, better separation of API and examples would help. Maybe this can inspire: https://documentation.divio.com/

I actually think Dask’s documentation is pretty good. But the docs could use some reorganizing – it is often difficult to find the relevant APIs. And there is an incredible amount of HPC insider knowledge that is required to launch a typical workflow - right now much of this knowledge is hidden in the github issues (which is great! but more of it could be pushed into the FAQs to make it more accessible).

More detailed documentation and examples. Start to finish examples that do not assume I know very much (about Dask, command line tools, Cloud technologies, Kubernetes, etc.).

I think an easier introduction to delayed/bags and additional examples for more complex use-cases could be helpful.

Response

We get alternating praise and scorn for our documentation. We have what I would call excellent reference documentation. In fact, if anyone wants to build a dynamic distributed task scheduler today I’m going to claim that distributed.dask.org is probably the most comprehensive reference out there.

However, we lack good narrative documentation, which is the concern raised by most of these comments. This is hard to do because Dask is used in so many different user narratives. It’s challenging to orient the Dask documentation around all of them simultaneously.

I appreciated the direct reference in the first comment to a website with a framework. In general I’d love to talk to people who lay out documentation semi-professionally and learn more.

Functionality

Here is a soup of various feature requests. There are a few themes among them.

Have a better pandas support (like multi-index), which can help me migrate my existing code to Dask.

I’d like to see better support for actors. I think having a remote object is a common use case.

Improve Dataframes - multi index!! More feature parity with Pandas API.

Maybe a little less machine learning, more “classical” big data applications (CDF, PDEs, particle physics etc.). Not everything is map-reducable.

Better database integration. Re-writing an SQL query in SQL Alchemy can be very impractical. Would also be great if there were better ways to ensure the process didn’t die from misjudging how much memory was needed per chunk.

Better diagnostic tools; what operations are bottlenecking a task graph? Support for multiindex.

I do work that regularly requires sorting a DataFrame by multiple columns. Pandas can do this single-core; H2O and Spark can do this multicore and distributed. But dask cannot sort_values() on multiple columns at all (such as df.sort_values([ "col1", "col2" ,"col3" ], ascending=False)).

Type-hints! It is very tedious using Dask in a huge ML-Application without even having the option to do some static type-checking.

Additionally it is very frustrating that Dask tries to mimic Pandas API, but then 40% of the API doesn’t work (isn’t implemented), or deviates so far from the Pandas API that some parameters aren’t implemented. Only way to find out about that is to read the docs. With some typehints one could mitigate much of this trial-and-error process when switching from Pandas to Dask.

It’s hard to track everything around dask!!! Actors are a bit unloved, but I find them super useful

Type annotations for all methods for better IDE (VSCode) support

I think the Actor model could use a little love

Response

Interesting trends, not many that I would have expected:

MultiIndex (well, this was expected)
Actors
Type hinting for IDE support
SQL access

High Level Optimization

Needs better physical data independence. Manual data chunking, memory management, query optimization are all a big hassle. Automate those more.

Dask makes it easy for users with no parallel computing experience to scale up quickly (me), but we have no sense of how to judge our resource needs. It’d be great if Dask had some tools or tutorials that helped me judge the size of my problem (e.g. memory usage). These may already exist, but examples of how to do it may be hard to find.

Runtime Stability and Advanced Troubleshooting

Stability is the most important factor

I have answered no to the Long Term Support version of dask but often the really great opportunities are those that arre on demand. The problem is that when these fixes are released, their not well advertised and something under the hood has changed. So, it ends up breaking something else or my particular knowledge of the workings are no longer correct. Dask maintainers have a bit of a weird clique and it can feel as a newbie or a learner that your talked down to or in reality. They don’t have the time to help someone. So they should probably have some more maintainers answering some of the more mundane questions via the blog or via some other method, Things we have seen people do wrong or having difficulty in . A bit of basic, a bit of intermediate and a bit of advanced. If the underlying dask API has changed, then these should be updated with new posts with updates of what has changed. Showing a breakdown of doing it the hard way. So people can see what is done step by step with standard workflows that work. Then vs dask, with less boilerplate and/or speed improvement. If there are places where speed isn’t improved. Show that the difference of where it doesnt work alongside the workflow where it might.

We have long deployed dask clusters (weeks to months) and have noticed that they sometimes go into a wonky state. We’ve been unable to identify root cause(s). Redeployment is simple and easy when it does occur, but slightly annoying nonetheless.

My biggest pain point is the scheduler, as I tend to spend time writing infrastructure to manage the scheduler and breaking apart / rewriting tasks graphs to minimize impact on the scheduler.

As my answers make clear (and from previous conversations with Matt, James, and Genevieve) the biggest improvement I’d like to see is stable releases. Stable from both a runtime point of view (i.e. rock solid Dask distributed), and from an API point of view (so I don’t have to fix my code every couple of weeks). So a big +1 to LTS releases.

Better error handling/descriptions of errors, better interoperability between (slightly) different versions

If something goes wrong (in Dask, the batch system, or the interaction between Dask and the batch system), the problem is very opaque and difficult to diagnose. Dask needs significant additional documentation, and probably additional features, to make debugging easier and more transparent.

Better ways of getting out logs of worker memory usage, especially after dask crashes/failures. Ways of getting performance reports written to log files, rather than html files which don’t write if the dask client process fails.

Two big problems for me are when dask fails determining what when wrong and how to fix it.

Response

Stability definitely took a dive last December. I’m feeling good right now though. There is a lot of good work that should be merged in and released in the next few weeks that I think will significantly improve many of the common pain points.

However, there are still many significant improvements yet to be made. I particularly like the theme above of reporting and logging when things fail. We’re ok at this today, but there is a lot of room for growth.

What’s Next?

Do the views above fully express your thoughts on where Dask should go, or is there something missing?

Share your perspective at dask.org/survey. The whole process should take less than five minutes.

The evolution of a Dask Distributed user

2021-06-01T00:00:00+00:00

This week was the 2021 Dask Summit and one of the workshops that we ran covered many deployment options for Dask Distributed.

We covered local deployments, SSH, Hadoop, Kubernetes, the Cloud and managed services, but one question that came up a few times was “where do I start?”.

I wanted to share the journey that I’ve seen many Dask users take in the hopes that you may recognize yourself as being somewhere along this path and it may inform you where to look next.

As a user who is new to Dask you’re likely working your way through the documentation or perhaps a tutorial.

We often introduce the concept of the distributed scheduler early on, but you don’t need it to get initial benefits from Dask. Switching from Pandas to Dask for larger than memory datasets is a common entry point and performs perfectly well using the default threaded scheduler.

# Switching from this
import pandas as pd
df = pd.read_csv('/data/.../2018-*-*.csv')
df.groupby(df.account_id).balance.sum()

# To this
import dask.dataframe as dd
df = dd.read_csv('/data/.../2018-*-*.csv')
df.groupby(df.account_id).balance.sum().compute()

But by the time you’re a few pages into the documentation you’re already being encouraged to create Client() and LocalCluster() objects.

Note: When you create a Client() with no arguments or config set, Dask will launch a LocalCluster() object for you under the hood. So Client() is often equivalent to Client(LocalCluster()).
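The equivalence in the note can be made explicit. The sketch below creates the LocalCluster by hand, which also lets you choose how many workers it starts. The worker counts, processes=False, and dashboard_address=None are illustrative choices that keep the example lightweight, not recommendations.

```python
from dask.distributed import Client, LocalCluster
import dask.array as da

# Build the local cluster explicitly instead of letting Client() do it.
# processes=False keeps the workers in this process (assumption: fine for a demo).
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,  # skip the dashboard for this small example
) as cluster:
    with Client(cluster) as client:
        # Work submitted while the client is active runs on the cluster
        total = int(da.arange(10, chunks=5).sum().compute())

print(total)
```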

This is a common place for users to settle: launch a local distributed scheduler and do your work, maximising the resources on your local machine.

from dask.distributed import Client
import dask.dataframe as dd

client = Client()

df = dd.read_csv('/data/.../2018-*-*.csv')
df.groupby(df.account_id).balance.sum().compute()

Breaking free from your machine

Once you get used to task graphs and work scheduling you may begin thinking about how you can expand your computation beyond your local machine.

Our code doesn’t really need to change much. We are already connecting a client and doing Dask work; all we need are more networked machines with the same user environments, data, etc.

Personally I used to work in an organisation where every researcher was given a Linux desktop under their desk. These machines were on a LAN and had Active Directory and user home directories stored on a storage server. This meant you could sit down at any desk and log in and have a consistent experience. This also meant you could SSH to another machine on the network and your home directory would be there with all your files including your data and conda environments.

This is a common setup in many organisations and it can be tempting to SSH onto the machines of folks who may not be fully utilising their machine and run your work there. And I’m sure you ask first, right?

Organisations may also have servers in racks designated for computational use and the setup will be similar. You can SSH onto them and home directories and data are available via network storage.

With Dask Distributed you can start to expand your workload onto these machines using SSHCluster. All you need is your SSH keys set up so you can log into those machines without a password.

from dask.distributed import Client, SSHCluster
import dask.dataframe as dd

cluster = SSHCluster(
    [
        "localhost",
        "alices-desktop.lan",
        "bobs-desktop.lan",
        "team-server.lan",
    ]
)
client = Client(cluster)

df = dd.read_csv("/data/.../2018-*-*.csv")
df.groupby(df.account_id).balance.sum().compute()

Now the same workload can run on all of the CPUs in our little ad-hoc cluster, using all the memory and pulling data from the same shared storage.

Moving to a compute platform

Using (and abusing) hardware like desktops and shared servers will get you reasonably far, but probably to the dismay of your IT team.

Organisations who have many users trying to perform large compute workloads will probably be thinking about or already have some kind of platform that is designated for running this work.

The platforms your organisation has will be the result of many somewhat arbitrary technology choices. What programming languages does your company use? What deals did vendors offer at the time of procurement? What skills do the current IT staff have? What did your CTO have for breakfast the day they chose a vendor?

I’m not saying these decisions are made thoughtlessly, but the criteria that are considered are often orthogonal to how the resource will ultimately be used by you. At Dask we support whatever platform decisions your organisations make. We try to build deployment tools for as many popular platforms as we can including:

Hadoop via dask-yarn
Kubernetes via dask-kubernetes and the helm chart
HPC (with schedulers like SLURM, PBS and SGE) via dask-jobqueue
Cloud platforms (including AWS, Azure and GCP) with dask-cloudprovider

As a user within an organisation you may have been onboarded to one of these platforms. You’ve probably been given some credentials and a little training on how to launch jobs on it.

The dask-foo tools listed above are designed to sit on top of those platforms and submit jobs on your behalf as if they were individual compute jobs. But instead of submitting a Python script to the platform we submit Dask schedulers and workers and then connect to them to leverage the provisioned resource. Clusters on top of clusters.

With this approach your IT team has full control over the compute resource. They can ensure folks get their fair share with quotas and queues. But you as a user get the same Dask experience you are used to on your local machine.

Your data may be in a slightly different place on these platforms though. Perhaps you are on the cloud and your data is in object storage, for example. Thankfully, with tools built on fsspec like s3fs or adlfs, we can read this data in pretty much the same way. So still not much change to your workflow.

from dask.distributed import Client
from dask_cloudprovider.azure import AzureVMCluster
import dask.dataframe as dd

cluster = AzureVMCluster(resource_group="<resource group>",
                         vnet="<vnet>",
                         security_group="<security group>",
                         n_workers=10)
client = Client(cluster)

df = dd.read_csv("adl://.../2018-*-*.csv")
df.groupby(df.account_id).balance.sum().compute()

Centralizing your Dask resources

When your organisation gets enough folks adopting and using Dask it may be time for your IT team to step in and provide you with a managed service. Having many users submitting many ad-hoc clusters in a myriad of ways is likely to be less efficient than a centrally managed and, more importantly, ordained service from IT.

The motivation to move to a managed service is often driven at the organisational level rather than by individuals. Once you’ve reached this stage of Dask usage you’re probably quite comfortable with your workflows and it may be inconvenient to change them. However the level of Dask deployment knowledge you’ve acquired to reach this stage is probably quite large, and as Dask usage at your organization grows it’s not practical to expect everyone to reach the same level of competency.

At the end of the day being an expert in deploying distributed systems is probably not listed in your job description and you probably have something more important to be getting on with like data science, finance, physics, biology or whatever it is Dask is helping you do.

You may also be feeling some pressure from IT. You are running clusters on top of clusters, and to them your Dask cluster is a black box. This can make them uncomfortable, as they are the ones responsible for this hardware. It is common to feel constrained by your IT team; I know because I’ve been a sysadmin and used to constrain folks. But the motivations of your IT team are good ones: they are trying to save the organisation money, make best use of limited resources and ultimately get the IT out of your way so that you can get on with your job. So lean into this, engage with them, share your Dask knowledge and offer to become a pilot user for whatever solution they end up building.

One approach you could recommend they take is to deploy Dask Gateway. This can be deployed by an administrator and provides a central hub which launches Dask clusters on behalf of users. It supports many types of authentication so it can hook into whatever your organisation uses and supports many of the same backend compute platforms that the standalone tools do, including Kubernetes, Hadoop and HPC.

This will allow them to ensure security settings are correct and consistent across clusters. If you are using containers they probably want you to use some official images which are regularly updated and vulnerability scanned. It may also give them more insight into what types of workloads folks are running and help them plan future systems more accurately. Using Dask Gateway puts the control of, and responsibility for, these things on their side of the fence.

Users will need to authenticate with the gateway, but then can launch Dask clusters in a platform agnostic way.

from dask.distributed import Client
from dask_gateway import Gateway
import dask.dataframe as dd

gateway = Gateway(
    address="http://daskgateway.myorg.com",
    auth="kerberos"
)
cluster = gateway.new_cluster()
client = Client(cluster)

df = dd.read_csv("/data/.../2018-*-*.csv")
df.groupby(df.account_id).balance.sum().compute()

Again, reading your data requires some knowledge of how it is stored on the underlying compute platform the gateway is using, but the changes required are minimal.

Managed services

If your organisation is too small to have an IT team to manage this for you, or you just have a preference for managed services, there are startups popping up to provide this to you as a service including Coiled and Saturn Cloud.

Future platforms

Today the large cloud vendors have managed data science platforms including AWS Sagemaker, Azure Machine Learning and Google Cloud AI Platform. But these do not include Dask as a service.

These cloud services are focussed on batch processing and machine learning today, but these clouds also have managed services for Spark and other compute cluster offerings. With Dask’s increasing popularity it wouldn’t surprise me if managed Dask services are released by these cloud vendors in the years to follow.

Summary

One of the most powerful features of Dask is that your code can stay pretty much the same regardless of how big or complex the distributed compute cluster is. It scales from a single machine to thousands of servers with ease.

But scaling up requires both user and organisational growth and folks already seem to be treading a common path on that journey.

Hopefully this post will give you an idea of where you are on that path and where to jump to next. Whether you’re new to the community and discovering the power of multi-core computing or an old hand who is trying to wrangle hundreds of users who all love Dask, good luck!

The 2021 Dask User Survey is out now

2021-05-25T00:00:00+00:00

The Dask User Survey is out again! Tell us how you use Dask, and help us make it better for everyone.

Click this link to take the survey.

Feedback from users is very important. It helps give us a clear picture of who our users are and what is important to them. Your responses will inform prioritization for Dask development and improve the experience for the Dask community.

We expect the survey to take no more than 5-10 minutes. It has the following short sections:

How do you use Dask?
How could Dask improve?
What other tools do you use with Dask?
Optional: What do you work on?

Survey results from previous years

We will also publish answers to non-sensitive questions in our annual survey review to help keep everyone informed.

You can see the results from previous user surveys here:

Life sciences at the 2021 Dask Summit

2021-05-24T00:00:00+00:00

The Dask life science workshop ran as part of the 2021 Dask Summit. Lightning talks from this workshop are available here, and you can read on for a summary of the event.

What is the Dask life science workshop?

The Dask life science workshop ran as part of the 2021 Dask Summit. Currently many people in life sciences use Dask, but individual groups are relatively isolated from one another. This workshop gave us an opportunity to learn from each other, as well as opportunities to identify common frustrations and areas for improvement.

The workshop involved:

Pre-recorded lightning talks
Interactive discussion times (accessible across timezones in Europe, Oceania, and the Americas)
Asynchronous text chat throughout the Dask Summit

If I missed it, how can I catch up?

If you missed the Dask Summit, you can catch up on YouTube. There is a playlist of all the life science lightning talks available here.

You can also join our #life-science channel on Slack: Click here for an invitation link.

Who came?

We invited attendees at the life science workshop to do a short Q&A about their work with Dask. This is a small subset of the people who joined us; many people came to the conference and did not do a Q&A.

The responses give us an overview of the diversity of work people in the community are doing. In no particular order, here are some of those Q&As:

Name: Tom White
Timezone: EU/UK
What kind of science do you work on? Statistical genetics
Something you’ve tried (or would like to try) with Dask? Run per-row linear regressions at scale.
What do you want to do next with Dask? Collaborative optimization of a public workflow (GWAS).
Lightning talk: click here

Name: Giovanni Palla
Affiliation: Helmholtz Center Munich
Timezone: Europe
What kind of science do you work on? Computational Biology and Spatial transcriptomics
Something you’ve tried (or would like to try) with Dask? dask-image for image processing.
What do you want to do next with Dask? Further integration with Squidpy.
Lightning talk: click here

Name: Isaac Virshup
Affiliation: University of Melbourne. Open source projects Scanpy and AnnData
Timezone: AEST
What kind of science do you work on? Single cell omics data.
Something you’ve tried (or would like to try) with Dask?
I’ve used dask for some nested embarrassingly parallel calculations. Having an intelligent scheduler with good monitoring made this task as easy as it should be, especially compared with multiprocessing or joblib.
What do you want to do next with Dask?
I would love to get AnnData, a container for working with single cell assays integrated with dask. Dataset sizes in this field are constantly increasing, and it would be good to be able to work with the coolest new dataset regardless of available RAM.
Since we rely heavily on sparse arrays, a key step towards this will be getting better sparse array support (CSC and CSR especially) inside dask. After all, it’s not great if our strategy for scaling out requires many times the total memory! As a maintainer, I’m interested in hearing people’s experience with distributing tools that integrate well with dask.
Lightning talk: click here

Name: Anna Kreshuk
Affiliation: European Molecular Biology Laboratory
Timezone: CEST (GMT+2)
What kind of science do you work on? Machine learning for microscopy image analysis.
Something you’ve tried (or would like to try) with Dask? We run a lot of image processing workflows and want to see how Dask can be exploited in this context.

Name: Beth Cimini
Affiliation: Broad Institute
Timezone: US-East
What kind of science do you work on? User friendly image analysis tools for microscopy imaging.
Something you’ve tried (or would like to try) with Dask? Making Dask work in CellProfiler, to make it easy to analyze big images in high throughput!
Lightning talk: click here

Name: Volker Hilsenstein
Affiliation: EMBL / Alexandrov lab
Timezone: Central European Summer Time
What kind of science do you work on? Spatial Metabolomics, combining microscopy and mass spectrometry.
Something I would like to try with dask: fusing large mosaics of individual images or image volumes for which affine transformations into a joint coordinate system are available.

Name: Marvin Albert
Affiliation: University of Zurich
Timezone: UTC/GMT +2
What kind of science do you work on? Life sciences / image analysis
Something you’ve tried (or would like to try) with Dask? What do you want to do next with Dask? Parallelise / reduce the memory footprint of image processing tasks and define workflows that can run on different compute environments.
Lightning talk: click here

Name: Jordao Bragantini
Affiliation: CZ Biohub
Timezone: Pacific Daylight Time (UTC -7)
What kind of science do you work on? Light-sheet microscopy
Something you’ve tried (or would like to try) with Dask? Image processing of very large data.
What do you want to do next with Dask? Implement algorithms for cell segmentation.
Lightning talk: click here

Name: Josh Moore
Affiliation: Open Microscopy Environment (OME)
Timezone: CEST
What kind of science do you work on? Bioimaging (infrastructure for RDM)
Something you’ve tried (or would like to try) with Dask? Accessing large image (Zarr) volumes over HTTP, primarily.
What do you want to do next with Dask? Improve pre-fetching for typical usage patterns, possibly integrating multiscale data (i.e. google maps zooming)
Lightning talk: click here

Name: Jackson Maxfield Brown
Timezone: PST
What kind of science do you work in? Cell biology, specifically microscopy and computational biology.
Something you’ve tried (or would like to try) with Dask? Built a metadata aware / backed microscopy imaging reading library that uses Dask to read any size image w/ chunking by metadata dimension information. As well as TB-scale image processing pipelines using Dask + Prefect.
What do you want to do next with Dask? Tighter integration with other libraries. I see cuCim from the RAPIDS team and would love to extend work with them to have a more general “bio-image-spec” so we can all play nicely together.
Lightning talk: click here

Name: Gregory R. Lee
Affiliation: Quansight
Timezone: EST (UTC-5)
What kind of science do you work on? Scientific software development (with a background doing research in magnetic resonance imaging).
Something you’ve tried (or would like to try) with Dask?
In past research work, I used Dask primarily in two scenarios, both on a single workstation:

To achieve multi-threading by processing image blocks in parallel on the CPU (e.g. like in dask-image)
Serial blockwise processing of large volumetric data on the GPU (i.e. CuPy arrays of 10-100 GB in size) to reduce peak memory requirements.

What do you want to do next with Dask?
Audit scikit-image functions to determine which can easily be accelerated using block-wise approaches as in dask-image. Ideally a subset of functions would work directly with dask-arrays as inputs rather than requiring users to learn about Dask’s map_overlap, etc. to use this feature.
Lightning talk: click here

What’s next?

Dask is now considering holding “office hours” for the life science community. If we can find enough maintainers able to host one-hour Q&A sessions, then we’ll trial this for a short period of time.

Stability of the Dask library

2021-05-21T00:00:00+00:00

Dask is moving fast these days. Sometimes we break things as a result.

Historically this hasn’t been a problem. According to our survey last year, most users were fairly happy with Dask’s stability.

However the last year has seen a lot of evolution of the project, which in turn causes code churn. This can cause friction for downstream users today, but also means more-than-incremental changes for the future. We’ve optimized a little bit for long-term growth over short-term stability.

There are two structural things driving some of these changes:

An increase in computational scale
An increase in organizational scale

Computational Scale

Dask today is used across a wider range of problems, a more diverse set of hardware, and at larger scales more routinely than before.

Addressing this increase in scale across many dimensions has caused us to redesign Dask’s internal infrastructure in several ways.

We’ve changed how Dask graphs are represented and communicated to the scheduler
We’ve pulled out Dask’s internal state machines and made them more formalized
We’ve rewritten large chunks of the scheduler in Cython
We’ve overhauled how we serialize messages that go between all Dask servers
We’re now tracking memory with much finer granularity than we did before
… and more

We’ve been doing all of these internal changes with minimal impact to the myriad of downstream user communities (Xarray, Prefect, RAPIDS, XGBoost, …). This is largely due to those downstream developer communities, who help to identify, isolate, and work through the subtle tremors that occur on the surface when we make these subsurface shifts.

Organizational scale

Historically Dask’s core was maintained by a relatively small set of people, mostly at Anaconda. There were dozens of developers that worked on various dask-foo projects, but only a small group that thought about things like serialization, state machines, and so on. In particular I personally tracked every issue and knew the entire project. Whenever a potential conflict arose I was usually able to identify it early.

This has all changed dramatically.

First, there are now several multi-company teams working on different parts of Dask internals.

Second, we’ve also taken some time to redesign parts of Dask internals to make them more maintainable. Dask scheduling is like a finely made clock. Historically parts of that clock were built and designed by individuals with a craftsman-like approach. Now we’re redesigning things with more of a group mindset. This results in more maintainable designs, but it also means that we’re taking apart the clock and putting it back together. It takes a little while to find all of the missing parts :)

How this affects you today

This all started around when we switched to Calendar Versioning at the end of last year (Dask version 2.30.1 rolled over into 2020.12.0 last December). You may have noticed:

an increased sensitivity to version mismatches (as we change the Dask protocol different versions of Dask can no longer talk to each other well)
releases with stability issues (2020.12 was particularly rough)
tighter pinning between dask and distributed versions during releases
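If you suspect a version mismatch, the client can report what each part of the cluster is running. The following is a minimal sketch using an in-process local cluster, so every component trivially matches; the cluster settings are illustrative assumptions. On a real multi-machine deployment the same call compares the scheduler, the workers, and your client.

```python
from dask.distributed import Client, LocalCluster

# A throwaway in-process cluster, used here only so the example runs anywhere
with LocalCluster(
    n_workers=1,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as cluster:
    with Client(cluster) as client:
        # check=True asks distributed to warn or raise if key packages differ
        versions = client.get_versions(check=True)

scheduler_dask = versions["scheduler"]["packages"]["dask"]
client_dask = versions["client"]["packages"]["dask"]
print("scheduler:", scheduler_dask, "client:", client_dask)
```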

How this will affect you

We’ve merged in a PR to change the default behavior when moving high level graphs to the scheduler for Dask Dataframes. This should result in much less delay when submitting large computations and almost no delay in optimization. It also opens up a conduit for us to send a lot more semantic information to the scheduler about your computation, which can result in new visualizations and smarter scheduling in the future.

It will also probably break some things.

To be clear, all tests pass among Dask, distributed, xarray, prefect, rapids, and other downstream projects. We’ve done our homework here, but almost certainly we’ve missed something.

This is only one of several larger changes happening in the coming months. We appreciate your patience and your engagement as we make some of these larger shifts. For better or worse end users are the final testing suite :)

Skeleton analysis

2021-05-07T00:00:00+00:00

In this blogpost, we show how to modify a skeleton network analysis with Dask to work with constrained RAM (eg: on your laptop). This makes it more accessible: it can run on a small laptop, instead of requiring access to a supercomputing cluster. Example code is also provided here.

Skeleton structures are everywhere
The scientific problem
The compute problem
Our approach
Results
Limitations
Problems encountered
How we solved them
What’s next
How you can help

Skeleton structures are everywhere

Lots of biological structures have a skeleton or network-like shape. We see these in all kinds of places, including:

blood vessel branching
the branching of airways
neuron networks in the brain
the root structure of plants
the capillaries in leaves
… and many more

Analysing the structure of these skeletons can give us important information about the biology of that system.

The scientific problem

For this blogpost, we will look at the blood vessels inside of a lung. This data was shared with us by Marcus Kitchen, Andrew Stainsby, and their team of collaborators.

This research group focusses on lung development. We want to compare the blood vessels in a healthy lung, against a lung from a hernia model. In the hernia model the lung is underdeveloped, squashed, and smaller.

The compute problem

These image volumes have a shape of roughly 1000x1000x1000 pixels. That doesn’t seem huge, but the analysis consumes so much RAM that it crashes when run on a laptop.

If you’re running out of RAM, there are two possible approaches:

Get more RAM. Run things on a bigger computer, or move things to a supercomputing cluster. This has the advantage that you don’t need to rewrite your code, but it does require access to more powerful computer hardware.
Manage the RAM you’ve got. Dask is good for this. If we use Dask, and some reasonable chunking of our arrays, we can manage things so that we never hit the RAM ceiling and crash. This has the advantage that you don’t need to buy more computer hardware, but it will require re-writing some code.
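Some back-of-the-envelope arithmetic shows why chunking helps. The sketch below estimates the memory needed for one dense copy of a 1000x1000x1000 volume versus one chunk. The dtypes and the 250x250x250 chunk shape are assumptions chosen for illustration; the post does not state the actual dtype or chunking, and real processing steps can hold several intermediate copies at once.

```python
import math
import numpy as np

def nbytes(shape, dtype):
    """Bytes needed to hold one dense array of this shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize

volume_shape = (1000, 1000, 1000)
chunk_shape = (250, 250, 250)  # hypothetical chunking

for dtype in (np.uint8, np.float64):
    volume_gb = nbytes(volume_shape, dtype) / 1e9
    chunk_mb = nbytes(chunk_shape, dtype) / 1e6
    print(f"{np.dtype(dtype).name}: volume {volume_gb} GB, one chunk {chunk_mb} MB")

# Number of chunks needed to tile the whole volume
n_chunks = math.prod(v // c for v, c in zip(volume_shape, chunk_shape))
print("chunks:", n_chunks)
```

Under these assumptions a single float64 copy of the volume is 8 GB, while one chunk is 125 MB, so only a few chunks need to be in memory at any moment.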

Our approach

We took the second approach, using Dask so we can run our analysis on a small laptop with constrained RAM without crashing. This makes it accessible to more people.

All the image pre-processing steps will be done with dask-image, and the skeletonize function of scikit-image.

We use skan as the backbone of our analysis pipeline. skan is a library for skeleton image analysis. Given a skeleton image, it can describe statistics of the branches. To make it fast, the library is accelerated with numba (if you’re curious, you can hear more about that in this talk and its related notebook).

There is an example notebook containing the full details of the skeleton analysis available here. You can read on to hear just the highlights.

Results

The statistics from the blood vessel branches in the healthy and herniated lung show clear differences between the two.

Most striking is the difference in the number of blood vessel branches. The herniated lung has less than 40% of the number of blood vessel branches in the healthy lung.

There are also quantitative differences in the sizes of the blood vessels. Here is a violin plot showing the distribution of the distances between the start and end points of each blood vessel branch. We can see that overall the blood vessel branches start and end closer together in the herniated lung. This is consistent with what we might expect, since the healthy lung is more well developed than the lung from the hernia model and the hernia has compressed that lung into a smaller overall space.

EDIT: This blogpost previously described the euclidean distance violin plot as measuring the thickness of the blood vessels. This is incorrect, and the mistake was not caught in the review process before publication. This post has been updated to correctly describe the euclidean-distance measurement as the distance between the start and end of branches, as if you pulled a string taut between those points. An alternative measurement, branch-length, describes the total branch length, including any winding twists and turns.
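The difference between the two measurements is easy to see on a toy branch. This is a small NumPy sketch with made-up coordinates; it is not how skan computes its statistics, only an illustration of the two definitions.

```python
import numpy as np

# Hypothetical branch: three points tracing a bent path in 2D
path = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])

# Euclidean distance: straight line between the first and last point
euclidean_distance = np.linalg.norm(path[-1] - path[0])

# Branch length: sum of the lengths of every step along the path
steps = np.diff(path, axis=0)
branch_length = np.linalg.norm(steps, axis=1).sum()

print(euclidean_distance, branch_length)
```

Here the end points are 6 units apart, but the path between them is 10 units long because of the bend.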

Limitations

We rely on one big assumption: once skeletonized the reduced non-zero pixel data will fit into memory. While this holds true for datasets of this size (the cropped rabbit lung datasets are roughly 1000 x 1000 x 1000 pixels), it may not hold true for much larger data.

Dask computation is also triggered at a few points through our prototype workflow. Ideally all computation would be delayed until the very final stage.

Problems encountered

This project was originally intended to be a quick & easy one. Famous last words!

What I wanted to do was to put the image data in a Dask array, and then use the map_overlap function to do the image filtering, thresholding, skeletonizing, and skeleton analysis. What I soon found was that although the image filtering, thresholding, and skeletonization worked well, the skeleton analysis step had some problems:

Dask’s map_overlap function doesn’t handle ragged or non-uniformly shaped results from different image chunks very well, and…
Internal functions in the skan library were written in a way that was incompatible with distributed computation.

How we solved them

Problem 1: The skeletonize function from scikit-image crashes due to lack of RAM

The skeletonize function of scikit-image is very memory intensive, and was crashing on a laptop with 16GB RAM.

We solved this by:

Putting our image data into a Dask array with dask-image imread,
Rechunking the Dask array. We need to change the chunk shapes from 2D slices to small cuboid volumes, so the next step in the computation is efficient. We can choose the overall size of the chunks so that we can stay under the memory threshold needed for skeletonize.
Finally, we run the skeletonize function on the Dask array chunks using the map_overlap function. By limiting the size of the array chunks, we stay under our memory threshold!
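The rechunk-then-map_overlap steps can be sketched on synthetic data. In this sketch the array of ones stands in for the image stack, the chunk sizes are arbitrary, and stand_in_filter is a hypothetical placeholder for the skeletonize function, so that the example runs without scikit-image or any image files.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for a stack of 2D image slices: one slice per chunk
volume = da.ones((64, 64, 64), chunks=(1, 64, 64), dtype=np.uint8)

# Rechunk from thin 2D slices into small cuboid volumes
cubes = volume.rechunk((32, 32, 32))
print(cubes.numblocks)

def stand_in_filter(block):
    # Placeholder for a memory-hungry function such as skeletonize
    return block.copy()

# Each call only ever sees one cuboid plus a 1 pixel halo from its neighbours
result = cubes.map_overlap(
    stand_in_filter, depth=1, boundary="none", dtype=volume.dtype
)
out = result.compute()
print(out.shape)
```

Peak memory per task is governed by the chunk shape, which is why choosing smaller cuboids keeps each call under the memory threshold.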

Problem 2: Ragged or non-uniform output from Dask array chunks

The skeleton analysis functions will return results with ragged or non-uniform length for each image chunk. This is unsurprising, because different chunks will have different numbers of non-zero pixels in our skeleton shape.

When working with Dask arrays, there are two very commonly used functions: map_blocks and map_overlap. Here’s what happens when we try a function with ragged outputs with map_blocks versus map_overlap.

import dask.array as da
import numpy as np

x = da.ones((100, 10), chunks=(10, 10))

def foo(a):  # our dummy analysis function
    random_length = np.random.randint(1, 7)
    return np.arange(random_length)

With map_blocks, everything works well:

result = da.map_blocks(foo, x, drop_axis=1)
result.compute()  # this works well

But if we need some overlap for function foo to work correctly, then we run into problems:

result = da.map_overlap(foo, x, depth=1, drop_axis=1)
result.compute()  # incorrect results

Here, the first and last element of the results from foo are trimmed off before the results are concatenated, which we don’t want! Setting the keyword argument trim=False would help avoid this problem, except then we get an error:

result = da.map_overlap(foo, x, depth=1, trim=False, drop_axis=1)
result.compute()  # ValueError

Unfortunately for us, it’s really important to have a 1 pixel overlap in our array chunks, so that we can tell if a skeleton branch is ending or continuing on into the next chunk.

There’s some complexity in the way map_overlap results are concatenated back together so rather than diving into that, a more straightforward solution is to use Dask delayed instead. Chris Roat shows a nice example of how we can use Dask delayed in a list comprehension that is then concatenated with Dask (link to original discussion).

import numpy as np
import pandas as pd

import dask
import dask.array as da
import dask.dataframe as dd

x = da.ones((20, 10), chunks=(10, 10))

@dask.delayed
def foo(a):
    size = np.random.randint(1,10)  # Make each dataframe a different size
    return pd.DataFrame({'x': np.arange(size),
                         'y': np.arange(10, 10+size)})

meta = dd.utils.make_meta([('x', np.int64), ('y', np.int64)])
blocks = x.to_delayed().ravel()  # no overlap
results = [dd.from_delayed(foo(b), meta=meta) for b in blocks]
ddf = dd.concat(results)
ddf.compute()

Warning: It’s very important to pass in a meta keyword argument to the function from_delayed. Without it, things will be extremely inefficient!

If the meta keyword argument is not given, Dask will try and work out what it should be. Ordinarily that might be a good thing, but inside a list comprehension that means those tasks are computed slowly and sequentially before the main computation even begins, which is horribly inefficient. Since we know ahead of time what kinds of results we expect from our analysis function (we just don’t know the length of each set of results), we can use the utils.make_meta function to help us here.

Problem 3: Grabbing the image chunks with an overlap

Now that we’re using Dask delayed to piece together our skeleton analysis results, it’s up to us to handle the array chunks overlap ourselves.

We’ll do that by modifying Dask’s dask.array.core.slices_from_chunks function, into something that will be able to handle an overlap. Some special handling is required at the boundaries of the Dask array, so that we don’t try to slice past the edge of the array.

Here’s what that looks like (gist):

from itertools import product
from dask.array.slicing import cached_cumsum

def slices_from_chunks_overlap(chunks, array_shape, depth=1):
    cumdims = [cached_cumsum(bds, initial_zero=True) for bds in chunks]

    slices = []
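    # Walk one axis at a time: chunk start offsets, chunk sizes, axis length
    for starts, shapes, maxshape in zip(cumdims, chunks, array_shape):
        inner_slices = []
        for s, dim in zip(starts, shapes):
            slice_start = s
            slice_stop = s + dim
            # Extend by `depth` only on sides that have a neighbouring chunk,
            # so we never try to slice past the edge of the array
            if slice_start > 0:
                slice_start -= depth
            if slice_stop < maxshape:
                slice_stop += depth
            inner_slices.append(slice(slice_start, slice_stop))
        slices.append(inner_slices)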

    return list(product(*slices))
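
As a quick sanity check of the intended boundary behaviour, here is a dependency-free sketch that computes the same kind of overlapping slices for the (20, 10) array with (10, 10) chunks used earlier. The helper name overlap_slices is hypothetical and not part of Dask; it assumes each chunk should be padded by depth only on sides that have a neighbouring chunk.

```python
from itertools import accumulate, product


def overlap_slices(chunks, depth=1):
    """Hypothetical helper: one tuple of slices per chunk, padded by `depth`
    on every side that has a neighbouring chunk."""
    per_axis = []
    for sizes in chunks:
        axis_len = sum(sizes)
        # Start offset of each chunk along this axis, e.g. (10, 10) -> [0, 10]
        starts = [0, *accumulate(sizes)][:-1]
        axis_slices = []
        for start, size in zip(starts, sizes):
            lo = max(start - depth, 0)                # clamp at the lower edge
            hi = min(start + size + depth, axis_len)  # clamp at the upper edge
            axis_slices.append(slice(lo, hi))
        per_axis.append(axis_slices)
    return list(product(*per_axis))


# Same chunk layout as da.ones((20, 10), chunks=(10, 10))
slices = overlap_slices(((10, 10), (10,)), depth=1)
for s in slices:
    print(s)
```

The two chunks share the rows 9 and 10, which is the one pixel of overlap on each side of the chunk boundary, while the outer edges of the array are left untouched.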

Now that we can slice an image chunk plus an extra pixel of overlap, all we need is a way to do that for all the chunks in an array. Drawing inspiration from this block iteration we make a similar iterator.

import functools
import operator

block_iter = zip(
    np.ndindex(*image.numblocks),
    map(functools.partial(operator.getitem, image),
        slices_from_chunks_overlap(image.chunks, image.shape, depth=1))
)

meta = dd.utils.make_meta([('row', np.int64), ('col', np.int64), ('data', np.float64)])
intermediate_results = [dd.from_delayed(skeleton_graph_func(block), meta=meta) for _, block in block_iter]
results = dd.concat(intermediate_results)
results = results.drop_duplicates()  # we need to drop duplicates because it counts pixels in the overlapping region twice

With these results, we’re able to create the sparse skeleton graph.

Problem 4: Summary statistics with skan

Skeleton branch statistics can be calculated with the skan summarize function. The problem here is that the function expects a Skeleton object instance, but initializing a Skeleton object calls methods that are not compatible with distributed analysis.

We’ll solve this problem by first initializing a Skeleton object instance with a tiny dummy dataset, then overwriting the attributes of the skeleton object with our real results. This is a hack, but it lets us achieve our goal: summary branch statistics for our large dataset.

First we make a Skeleton object instance with dummy data:

from skan import Skeleton
from skan._testdata import skeleton0

skeleton_object = Skeleton(skeleton0)  # initialize with dummy data

Then we overwrite the attributes with the previously calculated results:

skeleton_object.skeleton_image = ...
skeleton_object.graph = ...
skeleton_object.coordinates = ...
skeleton_object.degrees = ...
skeleton_object.distances = ...
...

Then finally we can calculate the summary branch statistics:

from skan import summarize

statistics = summarize(skeleton_object)
statistics.head()

	skeleton-id	node-id-src	node-id-dst	branch-distance	branch-type	mean-pixel-value	stdev-pixel-value	image-coord-src-0	image-coord-src-1	image-coord-src-2	image-coord-dst-0	image-coord-dst-1	image-coord-dst-2	coord-src-0	coord-src-1	coord-src-2	coord-dst-0	coord-dst-1	coord-dst-2	euclidean-distance
0	1	1	2	1	2	0.474584	0.00262514	22	400	595	22	400	596	22	400	595	22	400	596	1
1	2	3	9	8.19615	2	0.464662	0.00299629	37	400	622	43	392	590	37	400	622	43	392	590	33.5261
2	3	10	11	1	2	0.483393	0.00771038	49	391	589	50	391	589	49	391	589	50	391	589	1
3	5	13	19	6.82843	2	0.464325	0.0139064	52	389	588	55	385	588	52	389	588	55	385	588	5
4	7	21	23	2	2	0.45862	0.0104024	57	382	587	58	380	586	57	382	587	58	380	586	2.44949

statistics.describe()

	skeleton-id	node-id-src	node-id-dst	branch-distance	branch-type	mean-pixel-value	stdev-pixel-value	image-coord-src-0	image-coord-src-1	image-coord-src-2	image-coord-dst-0	image-coord-dst-1	image-coord-dst-2	coord-src-0	coord-src-1	coord-src-2	coord-dst-0	coord-dst-1	coord-dst-2	euclidean-distance
count	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095	1095
mean	2089.38	11520.1	11608.6	22.9079	2.00091	0.663422	0.0418607	591.939	430.303	377.409	594.325	436.596	373.419	591.939	430.303	377.409	594.325	436.596	373.419	190.13
std	636.377	6057.61	6061.18	24.2646	0.0302199	0.242828	0.0559064	174.04	194.499	97.0219	173.353	188.708	96.8276	174.04	194.499	97.0219	173.353	188.708	96.8276	151.171
min	1	1	2	1	2	0.414659	6.79493e-06	22	39	116	22	39	114	22	39	116	22	39	114	0
25%	1586	6215.5	6429.5	1.73205	2	0.482	0.00710439	468.5	278.5	313	475	299.5	307	468.5	278.5	313	475	299.5	307	72.6946
50%	2431	11977	12010	16.6814	2	0.552626	0.0189069	626	405	388	627	410	381	626	405	388	627	410	381	161.059
75%	2542.5	16526.5	16583	35.0433	2	0.768359	0.0528814	732	579	434	734	590	432	732	579	434	734	590	432	265.948
max	8034	26820	26822	197.147	3	1.29687	0.357193	976	833	622	976	841	606	976	833	622	976	841	606	737.835

Success!

We’ve achieved distributed skeleton analysis with Dask. You can see the example notebook containing the full details of the skeleton analysis here.

What’s next?

A good next step is modifying the skan library code so that it directly supports distributed skeleton analysis.

How you can help

If you’d like to get involved, there are a couple of options:

Try a similar analysis on your own data. The notebook with the full example code is available here. You can share or ask questions in the Dask slack or on twitter.
Help add support for distributed skeleton analysis to skan. Head on over to the skan issues page and leave a comment if you’d like to join in.

Dask with PyTorch for large scale image analysis

2021-03-29T00:00:00+00:00

This post explores applying a pre-trained PyTorch model in parallel with Dask Array.

We cover a simple example applying a pre-trained UNet to a stack of images to generate features for every pixel.

A Worked Example

Let’s start with an example applying a pre-trained UNet to a stack of light sheet microscopy data.

In this example, we:

Load the image data from Zarr into a multi-chunked Dask array
Load a pre-trained PyTorch model that featurizes images
Construct a function to apply the model onto each chunk
Apply that function across the Dask array with the dask.array.map_blocks function.
Store the result back into Zarr format

Step 1. Load the image data

First, we load the image data into a Dask array.

The example dataset we’re using here is lattice lightsheet microscopy of the tail region of a zebrafish embryo. It is described in this Science paper (see Figure 4), and provided with permission from Srigokul Upadhyayula.

Liu et al. 2018 “Observing the cell in its native state: Imaging subcellular dynamics in multicellular organisms” Science, Vol. 360, Issue 6386, eaaq1392 DOI: 10.1126/science.aaq1392 (link)

This is the same data that we analysed in our last blogpost on Dask and ITK. You should note the similarities to that workflow even though we are now using new libraries and performing different analyses.

cd '/Users/nicholassofroniew/Github/image-demos/data/LLSM'

# Load our data
import dask.array as da
imgs = da.from_zarr("AOLLSM_m4_560nm.zarr")
imgs

dask.array<from-zarr, shape=(20, 199, 768, 1024), dtype=float32, chunksize=(1, 1, 768, 1024)>

Step 2. Load a pre-trained PyTorch model

Next, we load our pre-trained UNet model.

This UNet model takes in a 2D image and returns a 2D x 16 array, where each pixel is now associated with a feature vector of length 16.

We thank Mars Huang for training this particular UNet on a corpus of biological images to produce biologically relevant feature vectors, as part of his work on interactive bio-image segmentation. These features can then be used for more downstream image processing tasks such as image segmentation.

# Load our pretrained UNet
import torch
from segmentify.model import UNet, layers

def load_unet(path):
    """Load a pretrained UNet model."""

    # load in saved model
    pth = torch.load(path)
    model_args = pth['model_args']
    model_state = pth['model_state']
    model = UNet(**model_args)
    model.load_state_dict(model_state)

    # remove last layer and activation
    model.segment = layers.Identity()
    model.activate = layers.Identity()
    model.eval()

    return model

model = load_unet("HPA_3.pth")

Step 3. Construct a function to apply the model to each chunk

We make a function to apply our pre-trained UNet model to each chunk of the Dask array.

Because Dask arrays are just made out of Numpy arrays which are easily converted to Torch arrays, we’re able to leverage the power of machine learning at scale.

# Apply UNet featurization
import numpy as np

def unet_featurize(image, model):
    """Featurize pixels in an image using pretrained UNet model.
    """
    import numpy as np
    import torch

    # Extract the 2D image data from the Dask array
    # Original Dask array dimensions were (time, z-slice, y, x)
    img = image[0, 0, ...]

    # Put the data into a shape PyTorch expects
    # Expected dimensions are (Batch x Channel x Width x Height)
    img = img[None, None, ...]

    # convert image to torch Tensor
    img = torch.Tensor(img).float()

    # pass image through model
    with torch.no_grad():
        features = model(img).numpy()

    # generate feature vectors (w,h,f)
    features = np.transpose(features, (0,2,3,1))[0]

    # Add back the leading length-one dimensions
    result = features[None, None, ...]

    return result

Note: Very observant readers might notice that the steps for extracting the 2D image data and then putting it into a shape PyTorch expects appear to be redundant. It is redundant for our particular example, but that might easily not have been the case.

To explain this in more detail, the UNet expects 4D input, with dimensions (Batch x Channel x Width x Height). The original Dask array dimensions were (time, z-slice, y, x). In our example it just so happens those match in a way that makes removing and then adding the leading dimensions redundant, but depending on the shape of the original Dask array this might not have been true.
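
The shape bookkeeping described above can be traced with plain NumPy. This is only a sketch: the small array sizes are made up, and fake_features is a hypothetical stand-in for the model output (no PyTorch model is run here).

```python
import numpy as np

# One Dask chunk as a NumPy array: (time, z-slice, y, x)
chunk = np.zeros((1, 1, 6, 8), dtype=np.float32)

# Drop the two leading length-one dimensions to get the 2D image
img = chunk[0, 0, ...]
print(img.shape)       # 2D image

# Add batch and channel dimensions for the model input
batch = img[None, None, ...]
print(batch.shape)     # 4D input

# Hypothetical stand-in for the model output: (batch, features, y, x)
fake_features = np.zeros((1, 16, 6, 8), dtype=np.float32)

# Move the feature axis last and drop the batch dimension
features = np.transpose(fake_features, (0, 2, 3, 1))[0]

# Add back the leading length-one dimensions so the result lines up
# with the (time, z-slice, y, x, feature) layout of the output array
result = features[None, None, ...]
print(result.shape)
```

In this case batch has exactly the same shape as chunk, which is why the two steps look redundant; with a chunk shaped differently from the model input they would not be.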

Step 4. Apply that function across the Dask array

Now we apply that function to the data in our Dask array using dask.array.map_blocks.

# Apply UNet featurization
out = da.map_blocks(unet_featurize, imgs, model, dtype=np.float32, chunks=(1, 1, imgs.shape[2], imgs.shape[3], 16), new_axis=-1)
out

dask.array<unet_featurize, shape=(20, 199, 768, 1024, 16), dtype=float32, chunksize=(1, 1, 768, 1024, 16)>
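
To see how the chunks and new_axis arguments fit together without needing the model or the dataset, here is a minimal sketch on a tiny array. The function fake_featurize is a hypothetical stand-in that just repeats each pixel 16 times, and new_axis=4 (the explicit position of the added axis) is an assumption made for this toy example.

```python
import numpy as np
import dask.array as da


def fake_featurize(block):
    """Hypothetical stand-in for the UNet: (1, 1, y, x) -> (1, 1, y, x, 16)."""
    return np.repeat(block[..., None], 16, axis=-1)


# Tiny stand-in for the image stack: one (y, x) plane per chunk
x = da.ones((2, 3, 4, 5), chunks=(1, 1, 4, 5), dtype=np.float32)

out = da.map_blocks(
    fake_featurize,
    x,
    dtype=np.float32,
    chunks=(1, 1, 4, 5, 16),  # shape of each output block
    new_axis=4,               # the function adds a trailing feature axis
)
print(out)
computed = out.compute()
print(computed.shape)
```

Because the function changes the shape of each block, Dask cannot infer the output chunks on its own, so they are stated explicitly.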

Step 5. Store the result back into Zarr format

Last, we store the result from the UNet model featurization as a zarr array.

# Trigger computation and store
out.to_zarr("AOLLSM_featurized.zarr", overwrite=True)

Now that we’ve saved our output, these features can be used for more downstream image processing tasks such as image segmentation.

Summing up

Here we’ve shown how to apply a pre-trained PyTorch model to a Dask array of image data.

Because our Dask array chunks are Numpy arrays, they can be easily converted to Torch arrays. This way, we’re able to leverage the power of machine learning at scale.

This workflow was very similar to our example using the dask.array.map_blocks function with ITK to perform image deconvolution. This shows you can easily adapt the same type of workflow to achieve many different types of analysis with Dask.

Image segmentation with Dask

2021-03-19T00:00:00+00:00

We look at how to create a basic image segmentation pipeline, using the dask-image library.

Just show me the code
Image segmentation pipeline
Custom functions
Scaling up computation
Bonus content: using arrays on GPU
How you can get involved

The content of this blog post originally appeared as a conference talk in 2020.

Just show me the code

If you want to run this yourself, you’ll need to download the example data from the Broad Bioimage Benchmark Collection: https://bbbc.broadinstitute.org/BBBC039

And install these requirements:

pip install "dask-image>=0.4.0" tifffile

Here’s our full pipeline:

import numpy as np
from dask_image.imread import imread
from dask_image import ndfilters, ndmorph, ndmeasure

images = imread('data/BBBC039/images/*.tif')
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])
thresh = ndfilters.threshold_local(smoothed, block_size=images.chunksize)
threshold_images = smoothed > thresh
structuring_element = np.array([[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 1, 0], [1, 1, 1], [0, 1, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0]]])
binary_images = ndmorph.binary_opening(threshold_images, structure=structuring_element)
label_images, num_features = ndmeasure.label(binary_images)
index = np.arange(num_features) + 1  # [1, 2, 3, ..., num_features], skipping the background label 0
area = ndmeasure.area(images, label_images, index)
mean_intensity = ndmeasure.mean(images, label_images, index)

You can keep reading for a step by step walkthrough of this image segmentation pipeline, or you can skip ahead to the sections on custom functions, scaling up computation, or GPU acceleration.

Image segmentation pipeline

Our basic image segmentation pipeline has five steps:

Reading in data
Filtering images
Segmenting objects
Morphological operations
Measuring objects

Set up your python environment

Before we begin, we’ll need to set up our python virtual environment.

At a minimum, you’ll need:

pip install "dask-image>=0.4.0" tifffile matplotlib

Optionally, you can also install the napari image viewer to visualize the image segmentation.

pip install "napari[all]"

To use napari from IPython or jupyter, run the %gui qt magic in a cell before calling napari. See the napari documentation for more details.

Download the example data

We’ll use the publicly available image dataset BBBC039 (Caicedo et al. 2018), available from the Broad Bioimage Benchmark Collection (Ljosa et al., Nature Methods, 2012). You can download the dataset here: https://bbbc.broadinstitute.org/BBBC039

These are fluorescence microscopy images, where we see the nuclei in individual cells.

Step 1: Reading in data

Step one in our image segmentation pipeline is to read in the image data. We can do that with the dask-image imread function.

We pass the path to the folder full of *.tif images from our example data.

from dask_image.imread import imread

images = imread('data/BBBC039/images/*.tif')

By default, each individual .tif file on disk has become one chunk in our Dask array.

Step 2: Filtering images

Denoising images with a small amount of blur can improve segmentation later on. This is a common first step in a lot of image segmentation pipelines. We can do this with the dask-image gaussian_filter function.

from dask_image import ndfilters

smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])

Step 3: Segmenting objects

Next, we want to separate the objects in our images from the background. There are lots of different ways we could do this. Because we have fluorescent microscopy images, we’ll use a thresholding method.

Absolute threshold

We could set an absolute threshold value, where we’d consider pixels with intensity values below this threshold to be part of the background.

absolute_threshold = smoothed > 160

Let’s have a look at these images with the napari image viewer. First we’ll need to use the %gui qt magic:

%gui qt

And now we can look at the images with napari:

import napari

viewer = napari.Viewer()
viewer.add_image(absolute_threshold)
viewer.add_image(images, contrast_limits=[0, 2000])

But there’s a problem here.

When we look at the results for different image frames, it becomes clear that there is no “one size fits all” we can use for an absolute threshold value. Some images in the dataset have quite bright backgrounds, others have fluorescent nuclei with low brightness. We’ll have to try a different kind of thresholding method.

Local threshold

We can improve the segmentation using a local thresholding method.

If we calculate a threshold value independently for each image frame then we can avoid the problem caused by fluctuating background intensity between frames.

thresh = ndfilters.threshold_local(smoothed, images.chunksize)
threshold_images = smoothed > thresh

# Let's take a look at the images with napari
viewer.add_image(threshold_images)

The results here look much better: this is a much cleaner separation of nuclei from the background, and it looks good for all the image frames.
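
To make the difference between the two approaches concrete, here is a small NumPy sketch on synthetic data. It is a deliberate simplification: the two frames are made up, and a per-frame mean is used as a stand-in threshold, which is not the method threshold_local uses.

```python
import numpy as np

# Two synthetic 4x4 frames, each with a 2x2 bright "nucleus" in the middle
frame_a = np.full((4, 4), 100.0)   # dim background
frame_a[1:3, 1:3] = 300.0
frame_b = np.full((4, 4), 400.0)   # bright background
frame_b[1:3, 1:3] = 900.0
frames = np.stack([frame_a, frame_b])

# One absolute threshold for every frame
absolute_mask = frames > 160

# A separate threshold per frame (here simply the frame mean)
per_frame_threshold = frames.mean(axis=(1, 2), keepdims=True)
per_frame_mask = frames > per_frame_threshold

# Number of pixels flagged as foreground in each frame
absolute_counts = absolute_mask.sum(axis=(1, 2))
per_frame_counts = per_frame_mask.sum(axis=(1, 2))
print(absolute_counts)
print(per_frame_counts)
```

With the absolute threshold, every pixel of the bright-background frame is flagged as foreground, while the per-frame threshold picks out the four nucleus pixels in both frames.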

Step 4: Morphological operations

Now that we have a binary mask from our threshold, we can clean it up a bit with some morphological operations.

Morphological operations are changes we make to the shape of structures in a binary image. We’ll briefly describe some of the basic concepts here, but for a more detailed reference you can look at this excellent page of the OpenCV documentation.

Erosion is an operation where the edges of structures in a binary image are eaten away, or eroded.

Image credit: OpenCV documentation

Dilation is the opposite of an erosion. With dilation, the edges of structures in a binary image are expanded.

Image credit: OpenCV documentation

We can combine morphological operations in different ways to get useful effects.

A morphological opening operation is an erosion, followed by a dilation.

Image credit: OpenCV documentation

In the example image above, we can see the left hand side has a noisy, speckled background. If the structuring element used for the morphological operations is larger than the size of the noisy speckles, they will disappear completely in the first erosion step. Then when it is time to do the second dilation step, there’s nothing left of the noise in the background to dilate. So we have removed the noise in the background, while the major structures we are interested in (in this example, the J shape) are restored almost perfectly.

Let’s use this morphological opening trick to clean up the binary images in our segmentation pipeline.

from dask_image import ndmorph
import numpy as np

structuring_element = np.array([
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]]])
binary_images = ndmorph.binary_opening(threshold_images, structure=structuring_element)

You’ll notice here that we need to be a little bit careful about the structuring element. All our image frames are combined in a single Dask array, but we only want to apply the morphological operation independently to each frame. To do this, we sandwich the default 2D structuring element between two layers of zeros. This means the neighbouring image frames have no effect on the result.

# Default 2D structuring element

[[0, 1, 0],
 [1, 1, 1],
 [0, 1, 0]]
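
The same "sandwich" can also be built programmatically, which may be easier to read than the nested literal. This is a small NumPy sketch; the variable name cross_2d is just illustrative.

```python
import numpy as np

# Default 2D structuring element (a cross)
cross_2d = np.array([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]])

# Sandwich it between two layers of zeros along the frame axis,
# so neighbouring frames never contribute to the result
empty = np.zeros_like(cross_2d)
structuring_element = np.stack([empty, cross_2d, empty])

print(structuring_element.shape)
print(structuring_element)
```

The first axis of the structuring element lines up with the frame axis of the image stack, so only the middle layer (the current frame) has any non-zero entries.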

Step 5: Measuring objects

The last step in any image processing pipeline is to make some kind of measurement. We’ll turn our binary mask into a label image, and then measure the intensity and size of those objects.

For the sake of keeping the computation time in this tutorial nice and quick, we’ll measure only a small subset of the data. Let’s measure all the objects in the first three image frames (roughly 300 nuclei).

from dask_image import ndmeasure

# Create labelled mask
label_images, num_features = ndmeasure.label(binary_images[:3], structuring_element)
index = np.arange(num_features) + 1  # [1, 2, 3, ..., num_features]

Here’s a screenshot of the label image generated from our mask.

>>> print("Number of nuclei:", num_features.compute())

Number of nuclei: 271

Measure objects in images

The dask-image ndmeasure subpackage includes a number of different measurement functions. In this example, we’ll choose to measure:

The area in pixels of each object, and
The average intensity of each object.

area = ndmeasure.area(images[:3], label_images, index)
mean_intensity = ndmeasure.mean(images[:3], label_images, index)
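
To illustrate what these two measurements mean, here is a tiny NumPy sketch that computes the area and mean intensity per label on a made-up image. It uses np.bincount as a stand-in and is not how dask-image implements these functions.

```python
import numpy as np

# Made-up intensity image and matching label image (0 is background)
image = np.array([[10, 10, 0, 0],
                  [10, 10, 0, 50],
                  [0, 0, 0, 50]], dtype=float)
labels = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 2],
                   [0, 0, 0, 2]])

# Labels to measure: 1, 2, ... up to the largest label
index = np.arange(1, labels.max() + 1)

# Pixel count and summed intensity for every label value
counts = np.bincount(labels.ravel())
sums = np.bincount(labels.ravel(), weights=image.ravel())

area = counts[index]                          # area in pixels per object
mean_intensity = sums[index] / counts[index]  # average intensity per object
print(area, mean_intensity)
```

Object 1 covers four pixels with a mean intensity of 10, and object 2 covers two pixels with a mean intensity of 50.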

Run computation and plot results

import matplotlib.pyplot as plt

plt.scatter(area, mean_intensity, alpha=0.5)
plt.gca().update(dict(title="Area vs mean intensity", xlabel='Area (pixels)', ylabel='Mean intensity'))
plt.show()

Custom functions

What if you want to do something that isn’t included in the dask-image API? There are several options we can use to write custom functions.

dask map_overlap / map_blocks
dask delayed
scikit-image apply_parallel()

Dask map_overlap and map_blocks

The Dask array map_overlap and map_blocks are what is used to build most of the functions in dask-image. You can use them yourself too. They will apply a function to each chunk in a Dask array.

import dask.array as da

def my_custom_function(chunk):
    # ... does something really cool with one chunk (a NumPy array)
    return chunk

result = da.map_overlap(my_custom_function, my_dask_array, depth=1)

You can read more about overlapping computations here.

Dask delayed

If you want more flexibility and fine-grained control over your computation, then you can use Dask delayed. You can get started with the Dask delayed tutorial here.

scikit-image apply_parallel function

If you’re a person who does a lot of image processing in python, one tool you’re likely to already be using is scikit-image. I’d like to draw your attention to the apply_parallel function available in scikit-image. It uses map-overlap, and can be very helpful.

It’s useful not only when you have big data, but also in cases where your data fits into memory but the computation you want to apply to the data is memory intensive. This might cause you to exceed the available RAM, and apply_parallel is great for these situations too.

Scaling up computation

When you want to scale up from a laptop onto a supercomputing cluster, you can use dask-distributed to handle that.

from dask.distributed import Client

# Setup a local cluster
# By default this sets up 1 worker per core
client = Client()
client.cluster

See the documentation here to get set up for your system.

Bonus content: using arrays on GPU

We’ve recently been adding GPU support to dask-image.

We’re able to add GPU support using a library called CuPy. CuPy is an array library with a numpy-like API, accelerated with NVIDIA CUDA. Instead of having Dask arrays which contain numpy chunks, we can have Dask arrays containing cupy chunks instead. This blogpost explains the benefits of GPU acceleration and gives some benchmarks for computations on CPU, a single GPU, and multiple GPUs.

GPU support available in dask-image

From dask-image version 0.6.0, there is GPU array support for four of the six subpackages:

imread
ndfilters
ndinterp
ndmorph

Subpackages of dask-image that do not yet have GPU support are:

ndfourier
ndmeasure

We hope to add GPU support to these in the future.

An example

Here’s an example of an image convolution with Dask on the CPU:

# CPU example
import numpy as np
import dask.array as da
from dask_image.ndfilters import convolve

s = (10, 10)
a = da.from_array(np.arange(int(np.prod(s))).reshape(s), chunks=5)
w = np.ones(a.ndim * (3,), dtype=np.float32)
result = convolve(a, w)
result.compute()

And here’s the same example of an image convolution with Dask on the GPU. The only thing necessary to change is the type of input arrays.

# Same example moved to the GPU
import cupy  # <- import cupy instead of numpy (version >=7.7.0)
import dask.array as da
from dask_image.ndfilters import convolve

s = (10, 10)
a = da.from_array(cupy.arange(int(cupy.prod(cupy.array(s)))).reshape(s), chunks=5)  # <- cupy dask array
w = cupy.ones(a.ndim * (3,))  # <- cupy array
result = convolve(a, w)
result.compute()

You can’t mix arrays on the CPU and arrays on the GPU in the same computation. This is why the weights w must be a cupy array in the second example above.

Additionally, you can transfer data between the CPU and GPU. So in situations where the GPU speedup is larger than the cost associated with transferring data, this may be useful to do.

Reading in images onto the GPU

Generally, we want to start our image processing by reading in data from images stored on disk. We can use the imread function with the arraytype="cupy" keyword argument to do this.

from dask_image.imread import imread

images = imread('data/BBBC039/images/*.tif')
images_on_gpu = imread('data/BBBC039/images/*.tif', arraytype="cupy")

How you can get involved

Create and share your own segmentation or image processing workflows with Dask (join the current discussion on segmentation or propose a new blogpost topic here)

Contribute to adding GPU support to dask-image: https://github.com/dask/dask-image/issues/133

Measuring Dask memory usage with dask-memusage

2021-03-11T00:00:00+00:00

Using too many computing resources can get expensive when you’re scaling up in the cloud.

To give a real example, I was working on the image processing pipeline for a spatial gene sequencing device, which could report not just which genes were being expressed but also where they were in a 3D volume of cells. In order to get this information, a specialized microscope took snapshots of the cell culture or tissue, and the resulting data was run through a Dask pipeline.

The pipeline was fairly slow, so I did some back-of-the-envelope math to figure out what our computing costs would be once we started running more data for customers. It turned out that we’d be using 70% of our revenue just paying for cloud computing!

Clearly I needed to optimize this code.

When we think about the bottlenecks in large-scale computation, we often focus on CPU: we want to use more CPU cores in order to get faster results. Paying for all that CPU can be expensive, as in this case, and I did successfully reduce CPU usage by quite a lot.

But high memory usage was also a problem, and fixing that problem led me to build a series of tools, tools that can also help you optimize and reduce your Dask memory usage.

In the rest of this article you will learn:

How high memory usage can drive up your computing costs.
How a tool called dask-memusage can help you find peak memory usage of the tasks in your Dask execution graph.
How to further pinpoint high memory usage using the Fil memory profiler, so you can reduce memory usage.

As a reminder, I was working on a Dask pipeline that processed data from a specialized microscope. The resulting data volume was quite large, and certain subsets of images had to be processed together as a unit. From a computational standpoint, we effectively had a series of inputs X0, X1, X2, … that could be independently processed by a function f().

The internal processing of f() could not easily be parallelized further. From a CPU scheduling perspective this was fine: it was still an embarrassingly parallel problem, given the large number of X inputs.

For example, if I provisioned a virtual machine with 4 CPU cores, to process the data I could start four processes, and each would max out a single core. If I had 12 inputs and each processing step took about the same time, they might run as follows:

CPU0: f(X0), f(X4), f(X8)
CPU1: f(X1), f(X5), f(X9)
CPU2: f(X2), f(X6), f(X10)
CPU3: f(X3), f(X7), f(X11)

If I could make f() faster, the pipeline as a whole would also run faster.

CPU is not the only resource used in computation, however: RAM can also be a bottleneck. For example, let’s say each call to f(Xi) took 12GB of RAM. That means to fully utilize 4 CPUs, I would need 48GB of RAM—but what if my computer only has 16GB of RAM?

Even though my computer has 4 CPUs, I can only utilize one CPU on a computer with 16GB RAM, because I don’t have enough RAM to run more than one task in parallel. In practice, these tasks ran in the cloud, where I could ensure the necessary RAM/core ratio was preserved by choosing the right pre-configured VM instances. And on some clouds you can freely set the amount of RAM and number of CPU cores for each virtual machine you spin up.
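
The arithmetic above can be written down as a tiny helper. This is only an illustrative sketch: the function name max_parallel_tasks is hypothetical, and it assumes every task needs the same amount of RAM at peak.

```python
def max_parallel_tasks(cpu_cores, ram_gb, ram_per_task_gb):
    """Number of tasks that can run at once without running out of CPU or RAM."""
    return min(cpu_cores, int(ram_gb // ram_per_task_gb))


# 4 CPU cores, 12GB of RAM per task, and different amounts of total RAM
for ram_gb in (16, 24, 48):
    usable = max_parallel_tasks(cpu_cores=4, ram_gb=ram_gb, ram_per_task_gb=12)
    print(f"{ram_gb}GB RAM: {usable} task(s) in parallel, {4 - usable} core(s) idle")
```

With 16GB of RAM only one task fits, so three of the four cores sit idle; it takes 48GB before all four cores can be kept busy.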

However, I didn’t quite know how much memory was used at peak, so I’d had to limit parallelism to reduce out-of-memory errors. As a result, the default virtual machines we were using had half their CPUs resting idle, resources we were paying for but not using.

In order to provision hardware appropriately and max out all the CPUs, I needed to know how much peak memory each task was using. And to do that, I created a new tool.

Measuring peak task memory usage with dask-memusage

dask-memusage is a tool for measuring peak memory usage for each task in the Dask execution graph.

Per task because Dask executes code as a graph of tasks, and the graph determines how much parallelism can be used.
Peak memory is important, because that is the bottleneck. It doesn’t matter if average memory usage per task is 4GB: if two parallel tasks in the graph need 12GB each at the same time, you’re going to need 24GB of RAM if you want to run both tasks on the same computer.
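
To see why peak memory differs from the memory still in use when a task finishes, here is a small standard-library sketch using tracemalloc. Note that this is a different mechanism from the one dask-memusage uses (it only tracks allocations made through Python's allocators), and the task function is made up for illustration.

```python
import tracemalloc


def task():
    """Made-up task that briefly needs a large temporary buffer."""
    big = bytearray(20 * 1024 * 1024)  # roughly 20MB of temporary data
    summary = len(big)
    del big                            # the buffer is released before returning
    return summary


tracemalloc.start()
result = task()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"still allocated after the task: {current / 1e6:.1f}MB")
print(f"peak during the task: {peak / 1e6:.1f}MB")
```

The memory still allocated at the end is tiny, but the machine needed room for the whole temporary buffer while the task was running, and that peak is the number that matters for provisioning.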

Using dask-memusage

Since the gene sequencing code is proprietary and quite complex, let’s use a different example. We’re going to count the occurrence of words in some text files, and then report the top-10 most common words in each file. You can imagine combining the data later on, but we won’t bother with that in this simple example.

import sys
import gc
from time import sleep
from pathlib import Path
from dask.bag import from_sequence
from collections import Counter
from dask.distributed import Client, LocalCluster
import dask_memusage


def calculate_top_10(file_path: Path):
    gc.collect()  # See notes below

    # Load the file
    with open(file_path) as f:
        data = f.read()

    # Count the words
    counts = Counter()
    for word in data.split():
        counts[word.strip(".,'\"").lower()] += 1

    # Choose the top 10:
    by_count = sorted(counts.items(), key=lambda x: x[1])
    sleep(0.1)  # See notes below
    return (file_path.name, by_count[-10:])


def main(directory):
    # Setup the calculation:

    # Create a 4-process cluster (running locally). Note only one thread
    # per-worker: because polling is per-process, you can't run multiple
    # threads per worker, otherwise you'll get results that combine memory
    # usage of multiple tasks.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1,
                           memory_limit=None)
    # Install dask-memusage:
    dask_memusage.install(cluster.scheduler, "memusage.csv")
    client = Client(cluster)

    # Create the task graph:
    files = from_sequence(Path(directory).iterdir())
    graph = files.map(calculate_top_10)
    graph.visualize(filename="example2.png", rankdir="TD")

    # Run the calculations:
    for result in graph.compute():
        print(result)
    # ... do something with results ...


if __name__ == '__main__':
    main(sys.argv[1])

Here’s what the task graph looks like:

Plenty of parallelism!

We can run the program on some files:

$ pip install dask[bag] dask_memusage
$ python example2.py files/
('frankenstein.txt', [('that', 1016), ('was', 1021), ('in', 1180), ('a', 1438), ('my', 1751), ('to', 2164), ('i', 2754), ('of', 2761), ('and', 3025), ('the', 4339)])
('pride_and_prejudice.txt', [('she', 1660), ('i', 1730), ('was', 1832), ('in', 1904), ('a', 1981), ('her', 2142), ('and', 3503), ('of', 3705), ('to', 4188), ('the', 4492)])
('greatgatsby.txt', [('that', 564), ('was', 760), ('he', 770), ('in', 849), ('i', 999), ('to', 1197), ('of', 1224), ('a', 1440), ('and', 1565), ('the', 2543)])
('big.txt', [('his', 40032), ('was', 45356), ('that', 47924), ('he', 48276), ('a', 83228), ('in', 86832), ('to', 114184), ('and', 152284), ('of', 159888), ('the', 314908)])

As one would expect, the most common words are stop words, but there is still some variation in order.

Next, let’s look at the results from dask-memusage.

dask-memusage output, and how it works

You’ll notice that the actual use of dask-memusage involves just one extra line, other than the import:

dask_memusage.install(cluster.scheduler, "memusage.csv")

What this will do is poll the process at 10ms intervals for peak memory usage, broken down by task. In this case, here’s what memusage.csv looks like:

task_key,min_memory_mb,max_memory_mb
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 3)",51.2421875,51.2421875
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 0)",51.70703125,51.70703125
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 1)",51.28125,51.78515625
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 2)",51.30859375,51.30859375
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 2)",56.19140625,56.19140625
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 0)",51.70703125,54.26953125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 1)",52.30078125,52.30078125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 3)",51.48046875,384.00390625

For each task in the graph we are told minimum memory usage and peak memory usage, in MB.

In more readable form:

task_key	min_memory_mb	max_memory_mb
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 3)"	51.2421875	51.2421875
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 0)"	51.70703125	51.70703125
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 1)"	51.28125	51.78515625
"('from_sequence-3637e6ff937ef8488894df60a80f62ed', 2)"	51.30859375	51.30859375
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 2)"	56.19140625	56.19140625
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 0)"	51.70703125	54.26953125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 1)"	52.30078125	52.30078125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 3)"	51.48046875	384.00390625

The bottom four lines are the interesting ones; all four start with a minimum memory usage of ~50MB RAM, and then memory may or may not increase as the code runs. How much it increases presumably depends on the size of the files; most of them are quite small, so memory usage doesn’t change much. One file uses much more maximum memory than the others, 384MB of RAM; presumably it’s big.txt which is 25MB, since the other files are all smaller than 1MB.
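To see at a glance which tasks grew the most, the CSV can be loaded with the standard library and sorted by the gap between minimum and peak memory. This is a minimal sketch, assuming the three-column layout shown in the sample output; the helper name `rank_tasks_by_growth` is hypothetical and is not part of dask-memusage.

```python
import csv
import io


def rank_tasks_by_growth(csv_file):
    """Return (task_key, growth_mb, peak_mb) tuples, largest growth first.

    Assumes the columns task_key, min_memory_mb and max_memory_mb.
    `rank_tasks_by_growth` is a hypothetical helper for illustration.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        low = float(row["min_memory_mb"])
        high = float(row["max_memory_mb"])
        rows.append((row["task_key"], high - low, high))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Two rows copied from the sample memusage.csv in this post.
SAMPLE = '''task_key,min_memory_mb,max_memory_mb
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 1)",52.30078125,52.30078125
"('calculate_top_10-afc867e38c3bd0aac8c18bb00d3634ca', 3)",51.48046875,384.00390625
'''

ranked = rank_tasks_by_growth(io.StringIO(SAMPLE))
for task_key, growth_mb, peak_mb in ranked:
    print(f"{task_key}: +{growth_mb:.1f} MB (peak {peak_mb:.1f} MB)")
```

With a real run you would pass an open file instead, for example `open("memusage.csv", newline="")`.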

The mechanism used, polling peak process memory, has some limitations:

You’ll notice there’s a gc.collect() at the top of calculate_top_10(); this ensures we don’t count memory from previous code that hasn’t been cleaned up yet.
There’s also a sleep() at the bottom of calculate_top_10(). Because polling is used, tasks that run too quickly won’t get accurate information—the polling happens every 10ms or so, so you want to sleep at least 20ms.
Finally, because polling is per-process, you can’t run multiple threads per worker, otherwise you’ll get results that combine memory usage of multiple tasks.

Interpreting the data

What we’ve learned is that memory usage of calculate_top_10() grows with file size; this can be used to characterize the memory requirements for the workload. That is, we can create a model that links data input sizes and required RAM, and then we can calculate the required RAM for any given level of parallelism. And that can guide our choice of hardware, if we assume one task per CPU core.

Going back to my original motivating problem, the gene sequencing pipeline: using the data from dask-memusage, I was able to come up with a formula saying “for this size input, this much memory is necessary”. Whenever we ran a batch job we could therefore set the parallelism as high as possible given the number of CPUs and RAM on the machine.
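The post does not give the actual formula, so the following is only a sketch of the general idea, assuming peak memory grows roughly linearly with input size and that each task runs in its own single-threaded worker. The function names and all numbers in the usage lines are hypothetical.

```python
import math


def fit_linear_model(samples):
    """Least-squares fit of peak_mb = base_mb + slope * input_mb.

    `samples` is a list of (input_mb, peak_mb) pairs, for example file
    sizes paired with max_memory_mb values from dask-memusage.
    The linear form is an assumption, not something the data guarantees.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = cov_xy / var_x
    base_mb = mean_y - slope * mean_x
    return base_mb, slope


def max_parallelism(input_mb, model, total_ram_mb, cpu_count):
    """Largest number of concurrent tasks that fits in RAM, capped by CPUs."""
    base_mb, slope = model
    needed_mb = base_mb + slope * input_mb
    return max(1, min(cpu_count, math.floor(total_ram_mb / needed_mb)))


# Illustrative measurements only; these are not real results.
model = fit_linear_model([(0, 50), (10, 150), (20, 250)])
workers = max_parallelism(30, model, total_ram_mb=1000, cpu_count=8)
print(model, workers)
```

Here the machine has eight cores, but RAM is the binding constraint, so the sketch settles on fewer workers than cores. A real model should be checked against more measurements and given some safety margin.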

While this allowed for more parallelism, it still wasn’t sufficient: processing was still using a huge amount of RAM, RAM that we had to pay for either with time (by using fewer CPUs) or money (by paying for more expensive virtual machines with more RAM). So the next step was to reduce memory usage.

Reducing memory usage with Fil

If we look at the dask-memusage output for our word-counting example, the memory usage seems rather high: for a 25MB file, we’re using roughly 330MB of RAM on top of the ~50MB baseline to count words. Thinking through how an ideal version of this code might work, we ought to be able to process the file with much less memory (for example we could redesign our code to process the file line by line, reducing memory).

And that’s another way in which dask-memusage can be helpful: it can point us at specific code that needs memory usage optimized, at the granularity of a task. A task can be a rather large chunk of code, though, so the next step is to use a memory profiler that can point to specific lines of code.

When working on the gene sequencing tool I used the memory_profiler package, and while that worked, and I managed to reduce memory usage quite a bit, I found it quite difficult to use. It turns out that for batch data processing, the typical use case for Dask, you want a different kind of memory profiler.

So after I’d left that job, I created a memory profiler called Fil that is expressly designed for finding peak memory usage. Unlike dask-memusage, which can be run on production workloads, Fil slows down your execution and has other limitations I’m currently working on (it doesn’t support multiple processes, as of March 2021), so for now it’s better used for manual profiling.

We can write a little script, example3.py, that only runs on big.txt:

from pathlib import Path
from example2 import calculate_top_10

calculate_top_10(Path("files/big.txt"))

Run it under Fil:

pip install filprofiler
fil-profile run example3.py

And the result shows us where the bulk of the memory is being allocated:

Reading in the file takes 8% of memory, but data.split() is responsible for 84% of memory. Perhaps we shouldn’t be loading the whole file into memory and splitting the whole file into words, and instead we should be processing the file line by line. A good next step if this were real code would be to fix the way calculate_top_10() is implemented.
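As a rough illustration of the line-by-line idea, here is a hypothetical variant of the function. The name `calculate_top_10_streaming` and the simple lowercase normalisation are assumptions; any word clean-up done by the original calculate_top_10() would need to be carried over. It only ever holds one line plus the running counts in memory, rather than the whole file and the full list of words.

```python
import tempfile
from collections import Counter
from pathlib import Path


def calculate_top_10_streaming(file_path: Path):
    """Count words one line at a time instead of reading the whole file.

    Hypothetical rewrite for illustration, not the original implementation.
    """
    counts = Counter()
    with open(file_path) as f:
        for line in f:  # only one line is held in memory at a time
            for word in line.split():
                counts[word.lower()] += 1
    # Ten most common words, ordered from least to most frequent to
    # match the shape of the output shown earlier in this post.
    top_10 = counts.most_common(10)
    return (file_path.name, top_10[::-1])


# Small self-contained demo using a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("the cat and the dog\nand the bird\n")
    result = calculate_top_10_streaming(sample)

print(result)
```

The Counter still grows with the number of distinct words, so this reduces rather than eliminates the dependence on input size. Rerunning Fil on the new version is the way to confirm whether the peak actually dropped.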

Next steps

What should you do if your Dask workload is using too much memory?

If you’re running Dask workloads with the Distributed backend, and you’re fine with only having one thread per worker, running with dask-memusage will give you real-world per-task memory usage on production workloads. You can then use the resulting information in a variety of ways:

As a starting point for optimizing memory usage. Once you know which tasks use the most memory, you can then use Fil to figure out which lines of code are responsible and then use a variety of techniques to reduce memory usage.
When possible, you can fine tune your chunking size; smaller chunks will use less memory. If you’re using Dask Arrays you can set the chunk size; with Dask Dataframes you can ensure good partition sizes.
You can fine tune your hardware configuration, so you’re not wasting RAM or CPU cores. For example, on AWS you can choose a variety of instance sizes with different RAM/CPU ratios, one of which may match your workload characteristics.
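For the chunk size point, a back-of-the-envelope estimate can help when picking a starting value. The sketch below assumes dense arrays with a fixed element size; `chunk_megabytes` is a hypothetical helper, not a Dask function.

```python
import math


def chunk_megabytes(chunk_shape, itemsize_bytes):
    """Approximate in-memory size of one dense array chunk, in MB (2**20 bytes)."""
    return math.prod(chunk_shape) * itemsize_bytes / 2**20


# float64 elements take 8 bytes each.
large = chunk_megabytes((4000, 4000), 8)
small = chunk_megabytes((1000, 1000), 8)
print(f"{large:.1f} MB vs {small:.1f} MB per chunk")
```

A task often holds several chunks at once (inputs, intermediates and outputs), so real per-task usage is usually some multiple of this figure. The measurements from dask-memusage tell you what that multiple is for your workload.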

In my original use case, the gene sequencing pipeline, I was able to use a combination of lower memory use and lower CPU use to reduce costs to a much more modest level. And when doing R&D, I was able to get faster results with the same hardware costs.

You can learn more about dask-memusage here, and learn more about the Fil memory profiler here.

Getting to know the life science community

2021-03-04T00:00:00+00:00

Dask wants to better support the needs of life scientists. We’ve been getting to know the community, in order to better understand:

Who is out there?
What kind of problems are they trying to solve?

We’ve learned that:

Lots of people want more examples tailored to their specific scientific domain.
Better integration of Dask into other software is considered very important.
Managing memory constraints when working with big data is a common pain point.

Our strategic plan for this year involves three parallel streams:

INFRASTRUCTURE (60%) - improvements to Dask, or to other software with many life science users.
OUTREACH (20%) - blogposts, talks, webinars, tutorials, and examples.
APPLICATIONS (20%) - the application of Dask to a specific life science problem, collaborating with individual labs or groups.

If you still want to have your say, it’s not too late - click this link to get in touch!

Background
What we learned
- From Dask users
- From other software libraries
Opportunities we see
Strategic plan
Limitations
Methods

Background

Recently Dask won some funding to hire a developer (Genevieve Buckley) to improve Dask specifically for life sciences.

Working with scientists is a really great way to drive growth in open source projects. Both scientists and software developers benefit. Early on, Dask had a lot of success integrating with the geosciences community. It’d be great to see similar success for life sciences too.

There are several areas of life science where we see Dask being used today:

Biological image processing
Single cell analysis
Statistical genetics
…and many more

We’ve solicited feedback from the life science community, to come up with a strategic plan to direct our effort over the next year.

What we learned

From Dask users

When we talked to individual Dask users, we heard a lot of similar themes in their comments.

People wanted:

Better documentation and examples
Better support for working with constrained resources
Better interoperability with other software tools

The most common request was for better documentation with more examples. People across many different areas of life science all said this could help them a lot. A corresponding challenge here is the multitude of different areas of life science, all of which require targeted documentation.

GPU support was also commonly mentioned. Comments about GPUs fit into two of the categories above: GPU memory is often a constraint, and life scientists also want it to be easier to apply deep learning models to their data.

From other software libraries

We didn’t only talk with individual users of Dask. We also spoke to developers of scientific software projects.

Why would other software libraries adopt Dask?

Software projects wanted to solve problems related to:

Easier deployment to distributed clusters
Managing memory when processing large datasets
Parallelization of existing functionality

Dask is good at solving those kinds of problems, and might be a good fit for these projects.

Who we’ve talked to

Some of the software projects we spoke to include:

Current status

napari is a Python-based image viewer. Dask is already well-integrated with napari. Areas of opportunity here include:

Improved documentation about how to work efficiently with Dask arrays in napari.
Smarter caching of neighbouring image chunks to avoid lag.
Guides for how to create plugins for napari, so the community can grow.

sgkit is a statistical genetics toolkit. Dask is already well-integrated with sgkit. The developers would like improved infrastructure in the main Dask repositories that they can benefit from. Wishlist items include:

Better ways to understand how things like array chunks change as they move through a Dask computation.
Better high level graph visualizations. Graph visualizations showing all the low level operations can be overwhelming.
Better ways to identify inefficient areas in Dask computations.
Stability when new versions of Dask are released.
Making it easier to run Dask in the cloud. They are currently using dask-cloudprovider and finding that very useful.

scanpy is a library for single cell analysis in Python. It is built together with anndata, an annotated data structure.

Data size is less of an issue for scanpy users, although anndata developers do think support for Dask would be a useful thing to add.
Support for sparse arrays is very important for these communities.

squidpy is a tool for the analysis and visualization of spatial molecular data. It builds on top of scanpy and anndata. Because squidpy involves large imaging data on top of what we’d normally see for datasets in scanpy/anndata, this is a project with a large area of opportunity for Dask.

Integrating Dask with the squidpy ImageContainer class is a good first step towards handling large image data within the available RAM constraints.

ilastik does not currently use Dask at all. They are curious to see if Dask can make it easier to scale up from a single machine to a cluster. Users generally train an ilastik model interactively, and then want to apply it to many images. This second step is often when people want an easy way to scale up the computing resources available.

CellProfiler is a pipeline tool for image processing. They have briefly experimented with Dask before.

Primarily, they want to parallelize existing functionality.
Most common pipelines fall into three major “user stories” where focussing effort would make the most impact:
1. Image processing
2. Object processing
3. Measurements

Opportunities we see

Because large scientific software projects have many users, improvements here would be high value for the scientific community. This is a huge area of opportunity. We plan to collaborate with these developer communities as much as possible to drive this forward.

Another area of opportunity is improving infrastructure for high level graph visualizations. Power users and novices alike would benefit from better tools for identifying areas of inefficiency in Dask computations.

Finally, continuing to build support for Dask arrays with non-numpy chunks is also a high impact area of opportunity. In particular, support for sparse arrays, and support for arrays on the GPU were highlighted as very important to the life science community.

Strategic direction

We’re going to manage this project with three parallel streams:

INFRASTRUCTURE (60%)
OUTREACH (20%)
APPLICATIONS (20%)

Each stream will likely have one primary project at any time, with many more queued. Within each stream, proposed projects will be ranked according to: level of impact, time commitment required, and the availability of other developer resources.

Infrastructure

Infrastructure projects are improvements to either:

Projects housed within the Dask organisation, or
Other software projects involving Dask with large numbers of life science users

We’ll aim to spend around 60% of project effort on infrastructure.

Outreach

Outreach activities include blogposts, talks, webinars, tutorials, and creating examples for documentation. We aim to spend around 20% of project effort on outreach.

If you have outreach ideas you want to share (perhaps you run a student group or popular meetup) then you can get in touch with us here.

Applications

The final stream focusses on the application of Dask to a specific problem in life science.

These projects generally involve collaborating with individual labs or groups, and have an end goal of summarizing their workflow in a blogpost. This feeds back into our outreach, so others in the community can learn from it.

Ideally these are short term projects, so we can showcase many different applications of Dask. We aim to spend around 20% of project effort on applications.

If you use Dask and have an example in mind you’d like to share, then you can get in touch with us here.

How will we know what success looks like?

The role of Dask Life Science Fellow has a very broad scope, so there are a lot of different ways we could be successful within this space.

Some indicators of success are:

Bugs being clearly described, or bottlenecks clearly identified
Bug fixes
Improvements or new features made to Dask infrastructure
Improvements or new features made in related project repositories
Better integration or support for Dask made in related project repositories for life sciences
Better documentation with examples tailored to specific areas of life science
Blogposts written (ideally in collaboration with Dask users)
Talks given
Webinars produced
Tutorials created

We won’t have the time or the resources to do all the things, but we will be able to make an impact by focussing on a subset.

Limitations

The information we discovered talking to the life science community is likely to be biased in a few different ways.

My (Genevieve’s) network is strongest among imaging scientists, and among people in Australia. It’s much less strong for other fields in life science, as my original training is in physics.

The Dask project has strong links to other open source Python projects, including scientific software. The Dask developer community also has strong links to companies including NVIDIA, Quansight, and others. They are likely to be over-represented among the people we spoke to.

It’s much harder to find people who aren’t using Dask at all yet but have problems that would be a good fit for it. These people are very unlikely to be, say, following Dask on Twitter, and probably won’t be aware that we’re looking for them.

I don’t think there are any perfect solutions to these problems. We’ve tried to mitigate these effects by using loose second and third degree connections to spread awareness, as well as posting in public science forums.

Methods

We used a variety of approaches to gather feedback from the life science community.

A short survey was created to gather comments
It was advertised by the @dask_dev Twitter account
We asked related software projects to consider retweeting for reach (example)
We posted in scientific Slack groups and online public forums
We emailed other life scientists in our network, asking them to let their networks know too
We contacted a number of life science researchers directly.
We contacted several other scientific software groups directly and spoke with the developers.

Join the discussion

Come join us in the Dask Slack! We have a #life-science channel so there’s a place to discuss things relevant to the Dask life science community. You can request an invite to the Slack here.

Dask User Summit 2021

2021-03-03T00:00:00+00:00

Dask is organizing a user summit in mid-May. This will be a remote event focused on bringing together developers and users of Dask and the distributed PyData stack in different domains.

User Summits like this are particularly important for a project like Dask which serves such a diverse set of use cases. Dask’s user communities include industries like finance, government, health, geoscience, imaging, machine learning, and more. These communities often have very similar problems, but don’t often communicate with each other.

User summits provide a venue for disparate domains to connect over shared technology challenges. Often a solution designed for one domain is useful for others. As technologists, this sharing is critical in order to promote consistent and high quality software solutions across domains, rather than siloed solutions.

We organized a summit a year ago, focusing mainly on developers. This was a fantastic time and resulted in a surprising amount of consensus building and forward movement both in technological and domain-specific directions.

For more on our summit last year, see this post.

Organization

We’ve asked NumFOCUS to organize this event for us. NumFOCUS runs the highly successful and community oriented PyData conference series, and had great success with their remote-first PyData Global conference late last year.

Tickets are intended to be reasonably priced on a sliding scale, with assistance given to any in need.

Open CFP

I would like to encourage people to submit proposals to talk at summit.dask.org.

I would like to especially extend an invitation to those who are new to the Dask community, or new to speaking in general. This year we’re especially trying to highlight use cases of Dask, rather than developers pushing the technology forward (although these talks are of course welcome as well).

If you have an idea for a talk then please submit something and we’ll work together on making it fit. Alternatively, if you have a colleague that you think would enjoy or grow from speaking then I encourage you to encourage them as well.

Workshops

Finally, I’m excited about an experiment that we’re running this year with workshops. These are intended to be two-hour blocks of time dedicated to a particular topic, organized by a specific community member (perhaps you?). If you have a consistent theme for a set of 3-5 talks then this option gives you the ability to curate and control a dedicated block of the conference. You can invite your colleagues and collaborators. We’ll handle the conference infrastructure while you handle the content.

We stole this structure from workshops at larger academic conferences. We think that it fits Dask well specifically because of the federated nature of our community. We hope that it gives space for sub-communities to assemble and better establish cohesive working groups.

Themes in the past have included topics like Pangeo, RAPIDS, workflow management, imaging, and performance.

Apply to speak

Again, I encourage you and your colleagues to submit applications to speak this year in May. The proposal page is at https://summit.dask.org/present/#guidelines
