<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged performance</title>
  <updated>2026-03-05T15:05:26.695014+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/performance/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2023/08/25/dask-expr-introduction/</id>
    <title>High Level Query Optimization in Dask</title>
    <updated>2023-08-25T00:00:00+00:00</updated>
    <author>
      <name>Patrick Hoefler</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work was engineered and supported by &lt;a class="reference external" href="https://coiled.io/?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-expr"&gt;Coiled&lt;/a&gt; and &lt;a class="reference external" href="https://www.nvidia.com/"&gt;NVIDIA&lt;/a&gt;. Thanks to &lt;a class="reference external" href="https://github.com/phofl"&gt;Patrick Hoefler&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/rjzamora"&gt;Rick Zamora&lt;/a&gt;, in particular. Original version of this post appears on &lt;a class="reference external" href="https://blog.coiled.io/blog/dask-expr-introduction.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-expr"&gt;blog.coiled.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Expression tree encoded by dask-expr" src="/images/dask_expr/dask-expr-introduction-title.png" style="width: 70%;"/&gt;
&lt;/figure&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 16)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="introduction"&gt;

&lt;p&gt;Dask DataFrame doesn’t currently optimize your code for you (as Spark or a SQL database would).
This means that users often run much more computation than necessary. Let’s look at a common example
that looks fine at first glance but is actually quite inefficient.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# unnecessarily reads all rows and columns&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hvfhs_license_num&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# could push the filter into the read parquet call&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                                &lt;span class="c1"&gt;# should read only necessary columns&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We can make this run much faster with a few simple steps:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;==&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tips&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Currently, Dask DataFrame won’t perform this optimization automatically, but a new effort built around
logical query planning in Dask DataFrame will. This article introduces some of
those changes, which are developed in &lt;a class="reference external" href="https://github.com/dask-contrib/dask-expr"&gt;dask-expr&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can install and try &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; with:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are using the &lt;a class="reference external" href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page"&gt;NYC Taxi&lt;/a&gt;
dataset in this post.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 60)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-expressions"&gt;
&lt;h1&gt;Dask Expressions&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask-contrib/dask-expr"&gt;Dask expressions&lt;/a&gt; provides a logical query planning layer on
top of Dask DataFrames. Let’s look at our initial example and investigate how we can improve the efficiency
through a query optimization layer. As noted initially, there are a couple of things that aren’t ideal:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We are reading all rows into memory instead of filtering while reading the parquet files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We are reading all columns into memory instead of only the columns that are necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We are applying the filter and the aggregation onto all columns instead of only &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query optimization layer from &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; can help us with that. It will look at this expression
and determine that not all rows are needed. An intermediate layer will transpile the filter into
a valid filter-expression for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;==&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This still reads every column into memory and will compute the sum of every numeric column. The
next optimization step is to push the column selection into the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; call as well.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;==&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a basic example that you could rewrite by hand. Use cases closer to real
workflows might have hundreds of columns, which makes rewriting them by hand tedious and error-prone
if you need a non-trivial subset of them.&lt;/p&gt;
&lt;p&gt;Let’s take a look at how we can achieve this. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; records the expression as given by the
user in an expression tree:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;Projection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tips&amp;#39;&lt;/span&gt;
  &lt;span class="n"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;ReadParquet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://coiled-datasets/uber-lyft-tlc/&amp;#39;&lt;/span&gt;
      &lt;span class="n"&gt;EQ&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HV0003&amp;#39;&lt;/span&gt;
        &lt;span class="n"&gt;Projection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hvfhs_license_num&amp;#39;&lt;/span&gt;
          &lt;span class="n"&gt;ReadParquet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://coiled-datasets/uber-lyft-tlc/&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This tree represents the expression as given. We can observe that we would read the whole dataset into
memory before applying the projections and filters. One observation of note: it looks as though we
are reading the dataset twice, but Dask fuses tasks that perform the same work, so the data
is only read once. Let’s reorder the expression to make it more efficient:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;simplify&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="n"&gt;ReadParquet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://coiled-datasets/uber-lyft-tlc/&amp;#39;&lt;/span&gt;
               &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tips&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
               &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hvfhs_license_num&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;==&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;HV0003&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This looks quite a bit simpler. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; reordered the query and pushed the filter and the column
projection into the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; call. We were able to remove quite a few steps from our expression
tree and make the remaining expressions more efficient as well. These are the steps that
we performed manually in the beginning; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; performs them for arbitrarily many columns without
any extra burden on the user.&lt;/p&gt;
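To build intuition for what such a rewrite looks like internally, here is a toy sketch of projection pushdown on a two-node expression tree. This is illustrative only; the class names and `simplify` function below are hypothetical and do not reflect the actual dask-expr implementation.

```python
# Toy sketch of projection pushdown -- illustrative only, not the
# real dask-expr machinery. All class names here are hypothetical.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReadParquet:
    path: str
    columns: Optional[List[str]] = None


@dataclass
class Projection:
    frame: object
    columns: List[str]


def simplify(expr):
    # A Projection sitting directly on top of a ReadParquet can be
    # absorbed into the read itself, so only those columns are loaded.
    if isinstance(expr, Projection) and isinstance(expr.frame, ReadParquet):
        return ReadParquet(expr.frame.path, columns=expr.columns)
    return expr


expr = Projection(ReadParquet("s3://coiled-datasets/uber-lyft-tlc/"), ["tips"])
print(simplify(expr))
# ReadParquet(path='s3://coiled-datasets/uber-lyft-tlc/', columns=['tips'])
```

The real optimizer applies rules like this repeatedly over the whole tree until nothing changes, which is how the filter and the projection both end up inside the read.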
&lt;p&gt;These are only the two most common and easiest-to-illustrate optimization techniques from &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt;.
Some other useful optimizations are already available:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;len(...)&lt;/span&gt;&lt;/code&gt; will only use the Index to compute the length; additionally we can ignore many operations
that won’t change the shape of a DataFrame, like a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;replace&lt;/span&gt;&lt;/code&gt; call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;set_index&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sort_values&lt;/span&gt;&lt;/code&gt; won’t eagerly trigger computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Better informed selection of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;merge&lt;/span&gt;&lt;/code&gt; algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are still adding more optimization techniques to make Dask DataFrame queries more efficient.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 145)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="try-it-out"&gt;
&lt;h1&gt;Try it out&lt;/h1&gt;
&lt;p&gt;The project is in a state where interested users should try it out. We have published a couple of
releases. The API covers a big chunk of the Dask DataFrame API, and we keep adding more.
We have already observed very impressive performance improvements for workflows that benefit
from query optimization. Memory usage is down for these workflows as well.&lt;/p&gt;
&lt;p&gt;We are very much looking for feedback and potential avenues to improve the library. Please give it
a shot and share your experience with us.&lt;/p&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; is not integrated into the main Dask DataFrame implementation yet. You can install it
with:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The API is very similar to what Dask DataFrame provides: it exposes mostly the same methods,
so in most cases you can use it as a drop-in replacement.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_expr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You can find a list of supported operations in the
&lt;a class="reference external" href="https://github.com/dask-contrib/dask-expr#api-coverage"&gt;Readme&lt;/a&gt;. This project is still very much
in progress. The API might change without warning. We are aiming for weekly releases to push new
features out as fast as possible.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 174)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-are-we-adding-this-now"&gt;
&lt;h1&gt;Why are we adding this now?&lt;/h1&gt;
&lt;p&gt;Historically, Dask focused on flexibility and smart scheduling instead of query optimization.
The distributed scheduler built into Dask uses sophisticated algorithms to ensure ideal scheduling
of individual tasks. It tries to ensure that your resources are utilized as efficiently as possible.
The graph construction process enables Dask users to build very
flexible and complicated graphs that reach beyond SQL operations. The flexibility that is provided
by the &lt;a class="reference external" href="https://docs.dask.org/en/latest/futures.html"&gt;Dask futures API&lt;/a&gt; requires very intelligent
algorithms, but it enables users to build highly sophisticated graphs. The following picture shows
the graph for a credit risk model:&lt;/p&gt;
&lt;a href="/images/dask_expr/graph_credit_risk_model.png"&gt;
&lt;img src="/images/dask_expr/graph_credit_risk_model.png"
     width="70%"
     alt="Computation graph representing a credit risk model"&gt;&lt;/a&gt;
&lt;p&gt;The powerful scheduler and its physical optimizations enable us to build very
complicated programs that then run efficiently. Unfortunately, these optimizations
cannot avoid scheduling work that was never necessary in the first place. This is where the current effort
to build high level query optimization into Dask comes in.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 195)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Dask comes with a very smart distributed scheduler but without much logical query planning. This
is something we are now rectifying by building a high level query optimizer into Dask
DataFrame. We expect to improve performance and reduce memory usage for the average Dask workflow.&lt;/p&gt;
&lt;p&gt;This API is ready for interested users to play around with. It covers a good chunk of the DataFrame
API. The library is under active development, and we expect to add many more interesting things over
the coming weeks and months.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/08/25/dask-expr-introduction/"/>
    <summary>This work was engineered and supported by Coiled and NVIDIA. Thanks to Patrick Hoefler and Rick Zamora, in particular. Original version of this post appears on blog.coiled.io</summary>
    <category term="dask" label="dask"/>
    <category term="performance" label="performance"/>
    <category term="queryoptimizer" label="query optimizer"/>
    <published>2023-08-25T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes/</id>
    <title>Choosing good chunk sizes in Dask</title>
    <updated>2021-11-02T00:00:00+00:00</updated>
    <author>
      <name>Genevieve Buckley</name>
    </author>
    <content type="html">
&lt;section id="summary"&gt;

&lt;p&gt;Confused about choosing &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-best-practices.html#select-a-good-chunk-size"&gt;a good chunk size&lt;/a&gt; for Dask arrays?&lt;/p&gt;
&lt;p&gt;Array chunks can’t be too big (we’ll run out of memory), or too small (the overhead introduced by Dask becomes overwhelming). So how can we get it right?&lt;/p&gt;
&lt;p&gt;It’s a two step process:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;First, start by choosing a chunk size similar to data you know can be processed entirely within memory (i.e. without Dask), using these &lt;a class="reference internal" href="#rough-rules-of-thumb"&gt;&lt;span class="xref myst"&gt;rough rules of thumb&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, watch the Dask dashboard task stream and worker memory plots, and adjust if needed. &lt;a class="reference internal" href="#what-to-watch-for-on-the-dashboard"&gt;&lt;span class="xref myst"&gt;Here are the signs to watch out for&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
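The arithmetic behind step one is simple enough to sketch. The memory footprint of one chunk is just the product of the chunk shape times the element size; the numbers below assume float64 data (8 bytes per element), and the chunk shapes are arbitrary examples, not recommendations.

```python
def chunk_nbytes(chunk_shape, itemsize=8):
    """Bytes occupied by one chunk (default: float64, 8 bytes/element)."""
    n = 1
    for dim in chunk_shape:
        n *= dim
    return n * itemsize


# A (1000, 1000) float64 chunk is 8 MB -- comfortably in memory,
# but possibly small enough that per-task overhead starts to matter.
print(chunk_nbytes((1000, 1000)) / 1e6)  # 8.0
# A (5000, 5000) float64 chunk is 200 MB.
print(chunk_nbytes((5000, 5000)) / 1e6)  # 200.0
```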
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 20)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="contents"&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#what-are-dask-array-chunks"&gt;&lt;span class="xref myst"&gt;What are Dask array chunks?&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#too-small-is-a-problemg"&gt;&lt;span class="xref myst"&gt;Too small is a problem&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#too-big-is-also-a-problem"&gt;&lt;span class="xref myst"&gt;Too big is also a problem&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#choosing-an-initial-chunk-size"&gt;&lt;span class="xref myst"&gt;Choosing an initial chunk size&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#rough-rules-of-thumb"&gt;&lt;span class="xref myst"&gt;Rough rules of thumb&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#chunks-should-be-aligned-with-array-storage-on-disk"&gt;&lt;span class="xref myst"&gt;Chunks should be aligned with array storage on disk&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#using-the-dask-dashboard"&gt;&lt;span class="xref myst"&gt;Using the Dask dashboard&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#what-to-watch-for-on-the-dashboard"&gt;&lt;span class="xref myst"&gt;What to watch for on the dashboard
&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#rechunking-arrays"&gt;&lt;span class="xref myst"&gt;Rechunking arrays&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#unmanaged-memory"&gt;&lt;span class="xref myst"&gt;Unmanaged memory&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#thanks-for-reading"&gt;&lt;span class="xref myst"&gt;Thanks for reading&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 35)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-are-dask-array-chunks"&gt;
&lt;h1&gt;What are Dask array chunks?&lt;/h1&gt;
&lt;p&gt;Dask arrays are big structures, made out of many small chunks.
Typically, each small chunk is an individual &lt;a class="reference external" href="https://numpy.org/"&gt;numpy array&lt;/a&gt;, and they are arranged together to make a much larger Dask array.&lt;/p&gt;
&lt;img src="https://raw.githubusercontent.com/dask/dask/ac01ddc9074365e40d888f80f5bcd955ba01e872/docs/source/images/dask-array-black-text.svg" alt="Diagram: Dask array chunks" width="400" height="300" /&gt;
&lt;p&gt;You can find more information about Dask array chunks on this page of the documentation: &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-chunks.html"&gt;https://docs.dask.org/en/latest/array-chunks.html&lt;/a&gt;&lt;/p&gt;
&lt;section id="how-do-i-know-what-chunks-my-array-has"&gt;
&lt;h2&gt;How do I know what chunks my array has?&lt;/h2&gt;
&lt;p&gt;If you have a Dask array, you can use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chunksize&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chunks&lt;/span&gt;&lt;/code&gt; attributes to see information about the chunks. You can also visualize this with the Dask array HTML representation.&lt;/p&gt;
&lt;img src="/images/choosing-good-chunk-sizes/examine-dask-array-chunks.png" alt="Visualizating Dask array chunks with the HTML repr" width="611" height="523" /&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;arr.chunksize&lt;/span&gt;&lt;/code&gt; shows the largest chunk size. For arrays where you expect roughly uniform chunk sizes, this is a good way to summarize chunk size information.&lt;/p&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;arr.chunks&lt;/span&gt;&lt;/code&gt; shows fully explicit sizes of all chunks along all dimensions within the Dask array (see &lt;a class="reference external" href="https://docs.dask.org/en/stable/array-chunks.html#specifying-chunk-shapes"&gt;item 3 here&lt;/a&gt;). This is more verbose, and is a good choice with arrays that have irregular chunks.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 54)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="too-small-is-a-problem"&gt;
&lt;h1&gt;Too small is a problem&lt;/h1&gt;
&lt;p&gt;If array chunks are too small, it’s inefficient. Why is this?&lt;/p&gt;
&lt;p&gt;Using Dask introduces some amount of overhead for each task in your computation.
This overhead is the reason the Dask best practices advise you to &lt;a class="reference external" href="https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs"&gt;avoid too-large graphs&lt;/a&gt;:
if the amount of actual work done by each task is tiny, overhead dominates and the ratio of useful work to total time suffers.&lt;/p&gt;
&lt;p&gt;Typically, the Dask scheduler takes about 1 millisecond to coordinate a single task. That means we want the computation time for each task to be comparatively larger, e.g. seconds instead of milliseconds.&lt;/p&gt;
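That rule of thumb is easy to quantify. Taking the roughly 1 ms per-task coordination cost at face value (an assumed figure, actual overhead varies by deployment), the fraction of wall time lost to overhead is overhead / (overhead + work):

```python
OVERHEAD_PER_TASK = 0.001  # seconds; the ~1 ms figure quoted above (assumed)


def overhead_fraction(task_duration_s):
    """Fraction of a task's wall time spent on scheduler overhead."""
    return OVERHEAD_PER_TASK / (OVERHEAD_PER_TASK + task_duration_s)


# 1 ms of real work per task: half of all time is overhead.
print(overhead_fraction(0.001))  # 0.5
# 1 s of real work per task: overhead is roughly 0.1%.
print(round(overhead_fraction(1.0), 4))  # 0.001
```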
&lt;p&gt;It might be hard to understand this intuitively, so here’s an analogy. Let’s imagine we’re building a house. It’s a pretty big job, and if there were only one worker it would take much too long to build.
So we have a team of workers and a site foreman. The site foreman is equivalent to the Dask scheduler: their job is to tell the workers what tasks they need to do.&lt;/p&gt;
&lt;p&gt;Say we have a big pile of bricks to build a wall, sitting in the corner of the building site.
If the foreman (the Dask scheduler) tells workers to go and fetch a single brick at a time, then bring each one to where the wall is being built, you can see how this is going to be very slow and inefficient! The workers are spending most of their time moving between the wall and the pile of bricks. Much less time is going towards doing the actual work of mortaring bricks onto the wall.&lt;/p&gt;
&lt;p&gt;Instead, we can do this in a smarter way. The foreman (Dask scheduler) can tell the workers to go and bring one full wheelbarrow load of bricks back each time. Now workers are spending much less time moving between the wall and the pile of bricks, and the wall will be finished much quicker.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 72)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="too-big-is-also-a-problem"&gt;
&lt;h1&gt;Too big is also a problem&lt;/h1&gt;
&lt;p&gt;If the Dask array chunks are too big, this is also bad. Why?
Chunks that are too large make it likely that workers will run out of memory.
You may see out-of-memory errors, or performance may decrease substantially as data spills to disk.&lt;/p&gt;
&lt;p&gt;When too much data is loaded into memory on too few workers, Dask will try to spill data to disk instead of crashing.
Spilling makes things run very slowly because of all the extra read/write operations to disk. Things don’t just get a little slower, they get a LOT slower, so it’s smart to watch out for this.&lt;/p&gt;
&lt;p&gt;To watch out for this, look at the &lt;strong&gt;worker memory plot&lt;/strong&gt; on the Dask dashboard.
Orange bars are a warning that you are close to the memory limit, and grey bars mean data is being spilled to disk - not good!
For more tips, see the section on &lt;a class="reference internal" href="#using-the-dask-dashboard"&gt;&lt;span class="xref myst"&gt;using the Dask dashboard&lt;/span&gt;&lt;/a&gt; below.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 85)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="choosing-an-initial-chunk-size"&gt;
&lt;h1&gt;Choosing an initial chunk size&lt;/h1&gt;
&lt;section id="rough-rules-of-thumb"&gt;
&lt;h2&gt;Rough rules of thumb&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;If you already created a prototype, which may not involve Dask at all, using a small subset of the data you intend to process, you’ll have a clear idea of what size of data can be processed easily for this workflow. You can use this knowledge to choose similar sized chunks in Dask.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some people have observed that chunk sizes below 1MB are almost always bad. Chunk sizes between 100MB and 1GB are generally good; going over 1 or 2GB means you have a really big dataset and/or a lot of memory available per core.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upper bound: Avoid too large task graphs. More than 10,000 or 100,000 chunks may start to perform poorly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower bound: To get the advantage of parallelization, you need the number of chunks to at least equal the number of worker cores available (or better, the number of worker cores times 2). Otherwise, some workers will stay idle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The time taken to compute each task should be much larger than the time needed to schedule the task. The Dask scheduler takes roughly 1 millisecond to coordinate a single task, so a good task computation time would be measured in seconds (not milliseconds).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
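&lt;p&gt;The rules of thumb above reduce to simple arithmetic. For example, here is a sketch of sizing square 2D chunks of float64 data to land near the 100MB-1GB sweet spot (the 128 MiB target here is just an illustrative choice within that range):&lt;/p&gt;

```python
import math

# Back-of-the-envelope sizing for a 2D float64 array, targeting
# ~128 MiB per chunk (within the 100MB-1GB rule of thumb).
target_bytes = 128 * 2**20   # 128 MiB
itemsize = 8                 # bytes per float64 element

elements_per_chunk = target_bytes // itemsize
side = int(math.sqrt(elements_per_chunk))  # edge length of a square chunk

print(elements_per_chunk)  # 16777216 elements per chunk
print(side)                # 4096, i.e. chunks=(4096, 4096)
```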
&lt;/section&gt;
&lt;section id="chunks-should-be-aligned-with-array-storage-on-disk"&gt;
&lt;h2&gt;Chunks should be aligned with array storage on disk&lt;/h2&gt;
&lt;p&gt;If you are reading data from disk, the storage structure will inform what shape your Dask array chunks should be. For best performance, choose chunks that are well aligned with the way data is stored.&lt;/p&gt;
&lt;p&gt;From the Dask best practices on how to &lt;a class="reference external" href="https://docs.dask.org/en/stable/array-best-practices.html#orient-your-chunks"&gt;orient your chunks&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;When reading data you should align your chunks with your storage format. Most array storage formats store data in chunks themselves. If your Dask array chunks aren’t multiples of these chunk shapes then you will have to read the same data repeatedly, which can be expensive. Note though that often storage formats choose chunk sizes that are much smaller than is ideal for Dask, closer to 1MB than 100MB. In these cases you should choose a Dask chunk size that aligns with the storage chunk size and that every Dask chunk dimension is a multiple of the storage chunk dimension.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Some examples of data storage structures on disk include:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;A HDF5 or &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/api/core.html"&gt;Zarr array&lt;/a&gt;. The size and shape of chunks/blocks stored on disk should align well with the Dask array chunks you select.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A folder full of tiff files. You might decide that each tiff file should become a single chunk in the Dask array (or that multiple tiff files should be grouped into a single chunk).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
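&lt;p&gt;To make the "multiple of the storage chunk" advice concrete, here is a hypothetical helper (not a Dask API) that snaps a desired Dask chunk length along one dimension down to the nearest multiple of the storage chunk length:&lt;/p&gt;

```python
# Hypothetical helper (not part of Dask): choose a Dask chunk length along
# one dimension as the largest multiple of the storage chunk length that
# fits within a target element count.
def aligned_chunk(storage_chunk, target_elements):
    multiples = max(1, target_elements // storage_chunk)
    return storage_chunk * multiples

# Storage format uses chunks of 256 along this axis; we want ~4096 elements:
print(aligned_chunk(256, 4096))  # 4096 (a 16x multiple of 256)

# With an awkward storage chunk of 300, we settle for 13 x 300:
print(aligned_chunk(300, 4096))  # 3900
```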
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 108)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="using-the-dask-dashboard"&gt;
&lt;h1&gt;Using the Dask dashboard&lt;/h1&gt;
&lt;p&gt;The second part of choosing a good chunk size is monitoring the Dask dashboard to see if you need to make any adjustments.&lt;/p&gt;
&lt;p&gt;If you’re not very familiar with the Dask dashboard, or you just sometimes forget where to find certain dashboard plots (like the worker memory plot), then you’ll probably enjoy these quick video tutorials:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=N_GqzcuGLCY"&gt;Intro to the Dask dashboard (18 minute video)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=EX_voquHdk0"&gt;Dask Jupyterlab extension (6 minute video)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://distributed.dask.org/en/latest/diagnosing-performance.html"&gt;Dask dashboard documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We recommend always having the dashboard up when you’re working with Dask.
It’s a fantastic way to get a sense of what’s working well, or poorly, so you can make adjustments.&lt;/p&gt;
&lt;section id="what-to-watch-for-on-the-dashboard"&gt;
&lt;h2&gt;What to watch for on the dashboard&lt;/h2&gt;
&lt;p&gt;Bad signs to watch out for include:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Lots of white space in the task stream plot is a bad sign. White space means nothing is happening. Chunks may be too small.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lots and lots of red in the task stream plot is a bad sign. Red means worker communication. Dask workers need some communication, but if they are doing almost nothing except communication then there is not much productive work going on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the worker memory plot, watch out for orange bars which are a sign you are getting close to the memory limit. Chunks may be too big.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the worker memory plot, watch out for grey bars which mean data is being spilled to disk. Chunks may be too big.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is an example of the Dask dashboard during a good computation (&lt;a class="reference external" href="https://youtu.be/N_GqzcuGLCY?t=372"&gt;time 6:12 in this video&lt;/a&gt;).
&lt;img alt="Visualizating Dask array chunks with the HTML repr" src="https://blog.dask.org/_images/good-dask-dashboard.png" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison, here is an example of the Dask dashboard during a bad computation (&lt;a class="reference external" href="https://youtu.be/N_GqzcuGLCY?t=417"&gt;time 6:57 in this video&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In this example, it’s inefficient because the chunks are much too small, so we see a lot of white space and red worker communication in the task stream plot.
&lt;img alt="Visualizating Dask array chunks with the HTML repr" src="https://blog.dask.org/_images/bad-dask-dashboard-zoomedin.png" /&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 138)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="rechunking-arrays"&gt;
&lt;h1&gt;Rechunking arrays&lt;/h1&gt;
&lt;p&gt;If you need to change the chunking of a Dask array in the middle of a computation, you can do that with the &lt;a class="reference external" href="https://docs.dask.org/en/latest/generated/dask.array.rechunk.html"&gt;rechunk&lt;/a&gt; method.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;rechunked_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_array&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Rechunking Dask arrays comes at a cost.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The Dask graph must be rearranged to accommodate the new chunk structure. This happens immediately, and will block any other interaction with Python until Dask has rearranged the task graph.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This also inserts new tasks into the Dask graph. At compute time, there are now more tasks to execute.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For these reasons, it is best to choose a good initial chunk size and avoid rechunking.&lt;/p&gt;
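&lt;p&gt;As a concrete sketch of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;rechunk&lt;/span&gt;&lt;/code&gt; call above (assuming &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; is installed; the shapes are illustrative):&lt;/p&gt;

```python
import dask.array as da

# Start with tall, skinny chunks, as you might get from
# column-oriented storage...
arr = da.ones((4000, 4000), chunks=(4000, 250))

# ...and rechunk to square 1000x1000 blocks better suited to the computation.
# This rearranges the task graph and adds tasks, so it is not free.
rechunked = arr.rechunk((1000, 1000))

print(arr.chunksize)        # (4000, 250)
print(rechunked.chunksize)  # (1000, 1000)
```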
&lt;p&gt;However, sometimes the way the data is stored on disk is not well aligned with your computation, and rechunking may be necessary.
For an example of this, here is Draga Doncila Pop &lt;a class="reference external" href="https://youtu.be/10Ws59NGDaE?t=833"&gt;talking about chunk alignment&lt;/a&gt; with satellite image data.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://rechunker.readthedocs.io/en/latest/"&gt;rechunker&lt;/a&gt; library can be useful in these situations:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Rechunker takes an input array (or group of arrays) stored in a persistent storage device (such as a filesystem or a cloud storage bucket) and writes out an array (or group of arrays) with the same data, but different chunking scheme, to a new location. Rechunker is designed to be used within a parallel execution framework such as Dask.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 160)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="unmanaged-memory"&gt;
&lt;h1&gt;Unmanaged memory&lt;/h1&gt;
&lt;p&gt;Last, remember that you don’t only need to consider the size of the array chunks in memory, but also the working memory consumed by your analysis functions. Sometimes that is called “unmanaged memory” in Dask.&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;“Unmanaged memory is RAM that the Dask scheduler is not directly aware of and which can cause workers to run out of memory and cause computations to hang and crash.” – Guido Imperiale&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Here are some tips for handling unmanaged memory:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://coiled.io/blog/tackling-unmanaged-memory-with-dask/"&gt;Tackling unmanaged memory with Dask (Coiled blogpost)&lt;/a&gt; by Guido Imperiale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://youtu.be/nwR6iGR0mb0"&gt;Handle Unmanaged Memory in Dask (8 minute video)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2021/11/02/choosing-dask-chunk-sizes.md&lt;/span&gt;, line 171)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="thanks-for-reading"&gt;
&lt;h1&gt;Thanks for reading&lt;/h1&gt;
&lt;p&gt;We hope this was helpful for figuring out how to choose good chunk sizes for Dask. This blogpost was inspired by &lt;a class="reference external" href="https://twitter.com/DataNerdery/status/1424953376043790341"&gt;this twitter thread&lt;/a&gt;. If you’d like to follow Dask on Twitter, you can do that at &lt;a class="reference external" href="https://twitter.com/dask_dev"&gt;https://twitter.com/dask_dev&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes/"/>
    <category term="performance" label="performance"/>
    <published>2021-11-02T00:00:00+00:00</published>
  </entry>
</feed>
