Dask Working Notes - Posted in 2023

High Level Query Optimization in Dask

2023-08-25T00:00:00+00:00

This work was engineered and supported by Coiled and NVIDIA. Thanks to Patrick Hoefler and Rick Zamora, in particular. Original version of this post appears on blog.coiled.io

Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would). This means that users waste a lot of computation. Let’s look at a common example which looks ok at first glance, but is actually pretty inefficient.

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",  # unnecessarily reads all rows and columns
)
result = (
    df[df.hvfhs_license_num == "HV0003"]    # could push the filter into the read parquet call
    .sum(numeric_only=True)
    ["tips"]                                # should read only necessary columns
)

We can make this run much faster with a few simple steps:

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    filters=[("hvfhs_license_num", "==", "HV0003")],
    columns=["tips"],
)
result = df.tips.sum()

Dask DataFrame does not currently perform this rewrite for you, but a new effort built around logical query planning will. This article introduces some of these changes, which are being developed in dask-expr.

You can install and try dask-expr with:

pip install dask-expr

We are using the NYC Taxi dataset in this post.

Dask Expressions

Dask expressions provide a logical query planning layer on top of Dask DataFrames. Let’s look at our initial example and investigate how we can improve the efficiency through a query optimization layer. As noted initially, there are a couple of things that aren’t ideal:

We are reading all rows into memory instead of filtering while reading the parquet files.
We are reading all columns into memory instead of only the columns that are necessary.
We are applying the filter and the aggregation onto all columns instead of only "tips".

The query optimization layer from dask-expr can help us with that. It will look at this expression and determine that not all rows are needed. An intermediate layer will transpile the filter into a valid filter-expression for read_parquet:

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    filters=[("hvfhs_license_num", "==", "HV0003")],
)
result = df.sum(numeric_only=True)["tips"]

This still reads every column into memory and will compute the sum of every numeric column. The next optimization step is to push the column selection into the read_parquet call as well.

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    columns=["tips"],
    filters=[("hvfhs_license_num", "==", "HV0003")],
)
result = df.sum(numeric_only=True)

This is a basic example that you could rewrite by hand. Use cases that are closer to real workflows might potentially have hundreds of columns, which makes rewriting them very strenuous if you need a non-trivial subset of them.

Let’s take a look at how we can achieve this. dask-expr records the expression as given by the user in an expression tree:

result.pprint()

Projection: columns='tips'
  Sum: numeric_only=True
    Filter:
      ReadParquet: path='s3://coiled-datasets/uber-lyft-tlc/'
      EQ: right='HV0003'
        Projection: columns='hvfhs_license_num'
          ReadParquet: path='s3://coiled-datasets/uber-lyft-tlc/'

This tree represents the expression as is. We can observe that we would read the whole dataset into memory before we apply the projections and filters. One observation of note: it seems like we are reading the dataset twice, but Dask is able to fuse tasks that do the same work, so nothing is computed twice. Let’s reorder the expression to make it more efficient:

result.simplify().pprint()

Sum: numeric_only=True
  ReadParquet: path='s3://coiled-datasets/uber-lyft-tlc/'
               columns=['tips']
               filters=[('hvfhs_license_num', '==', 'HV0003')]

This looks quite a bit simpler. dask-expr reordered the query and pushed the filter and the column projection into the read_parquet call. We were able to remove quite a few steps from our expression tree and make the remaining expressions more efficient as well. This represents the steps that we did manually in the beginning. dask-expr performs these steps for arbitrarily many columns without increasing the burden on the developers.
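To make the rewrite concrete, here is a minimal sketch of rule-based simplification on a toy expression tree. It is not how dask-expr is implemented: the class names mirror the printed tree, but the `simplify` function, the tuple-based predicate, and the rewrite rules are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReadParquet:
    path: str
    columns: Optional[Tuple[str, ...]] = None  # None means "read every column"
    filters: Tuple[tuple, ...] = ()            # (column, op, value) triples


@dataclass(frozen=True)
class Filter:
    frame: object
    predicate: tuple  # one (column, op, value) triple


@dataclass(frozen=True)
class Sum:
    frame: object


@dataclass(frozen=True)
class Projection:
    frame: object
    columns: Tuple[str, ...]


def simplify(expr):
    """Rewrite the tree bottom-up, pushing work into ReadParquet."""
    if isinstance(expr, ReadParquet):
        return expr
    child = simplify(expr.frame)
    if isinstance(expr, Filter) and isinstance(child, ReadParquet):
        # Rule 1: a filter directly on a read becomes a read-time filter.
        return ReadParquet(child.path, child.columns, child.filters + (expr.predicate,))
    if isinstance(expr, Projection) and isinstance(child, Sum):
        # Rule 2a: a column-wise sum commutes with column selection.
        return simplify(Sum(Projection(child.frame, expr.columns)))
    if isinstance(expr, Projection) and isinstance(child, ReadParquet):
        # Rule 2b: a projection directly on a read becomes a column list.
        return ReadParquet(child.path, expr.columns, child.filters)
    # No rule applies: keep the node, but attach the simplified child.
    return replace(expr, frame=child)


path = "s3://coiled-datasets/uber-lyft-tlc/"
predicate = ("hvfhs_license_num", "==", "HV0003")
tree = Projection(Sum(Filter(ReadParquet(path), predicate)), ("tips",))
optimized = simplify(tree)
print(optimized)
```

The toy keeps each rule local to one parent/child pair, which is what lets the same small set of rules handle trees of any depth.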

These are only the two most common and easy to illustrate optimization techniques from dask-expr. Some other useful optimizations are already available:

len(...) will only use the Index to compute the length; additionally we can ignore many operations that won’t change the shape of a DataFrame, like a replace call.
set_index and sort_values won’t eagerly trigger computations.
Better informed selection of merge algorithms.
…

We are still adding more optimization techniques to make Dask DataFrame queries more efficient.

Try it out

The project is in a state where interested users should try it out. We published a couple of releases. The API covers a big chunk of the Dask DataFrame API, and we keep adding more. We have already observed very impressive performance improvements for workflows that would benefit from query optimization. Memory usage is down for these workflows as well.

We are very much looking for feedback and potential avenues to improve the library. Please give it a shot and share your experience with us.

dask-expr is not integrated into the main Dask DataFrame implementation yet. You can install it with:

pip install dask-expr

The API is very similar to what Dask DataFrame provides: it exposes mostly the same methods, so in most cases you can use them unchanged after importing the library:

import dask_expr as dd

You can find a list of supported operations in the Readme. This project is still very much in progress. The API might change without warning. We are aiming for weekly releases to push new features out as fast as possible.

Why are we adding this now?

Historically, Dask focused on flexibility and smart scheduling instead of query optimization. The distributed scheduler built into Dask uses sophisticated algorithms to ensure ideal scheduling of individual tasks. It tries to ensure that your resources are utilized as efficiently as possible. The graph construction process enables Dask users to build very flexible and complicated graphs that reach beyond SQL operations. The flexibility that is provided by the Dask futures API requires very intelligent algorithms, but it enables users to build highly sophisticated graphs. The following picture shows the graph for a credit risk model:

The nature of the powerful scheduler and the physical optimizations enables us to build very complicated programs that will then run efficiently. Unfortunately, the nature of these optimizations does not enable us to avoid scheduling work that is not necessary. This is where the current effort to build high level query optimization into Dask comes in.

Conclusion

Dask comes with a very smart distributed scheduler but without much logical query planning. This is something we are rectifying now through building a high level query optimizer into Dask DataFrame. We expect to improve performance and reduce memory usage for an average Dask workflow.

This API is ready for interested users to play around with. It covers a good chunk of the DataFrame API. The library is under active development, and we expect to add many more interesting things over the coming weeks and months.

Upstream testing in Dask

2023-04-18T00:00:00+00:00

Original version of this post appears on blog.coiled.io

Dask has deep integrations with other libraries in the PyData ecosystem like NumPy, pandas, Zarr, PyArrow, and more. Part of providing a good experience for Dask users is making sure that Dask continues to work well with this community of libraries as they push out new releases. This post walks through how Dask maintainers proactively ensure Dask continuously works with its surrounding ecosystem.

Dask has a dedicated CI build that runs Dask’s normal test suite once a day with unreleased, nightly versions of several libraries installed. This lets us check whether or not a recent change in a library like NumPy or pandas breaks some aspect of Dask’s functionality.

To increase visibility when such a breakage occurs, as part of the upstream CI build, an issue is automatically opened that provides a summary of what tests failed and links to the build logs for the corresponding failure (here’s an example issue).

This makes it less likely that a failing upstream build goes unnoticed.

How things can break and are fixed

There are usually two different ways in which things break. Either:

A library made an intentional change in behavior and a corresponding compatibility change needs to be made in Dask (the next section has an example of this case).
There was some unintentional consequence of a change made in a library that resulted in a breakage in Dask.

When the latter case occurs, Dask maintainers can then engage with other library maintainers to resolve the unintended breakage. This all happens before any libraries push out a new release, so no user code breaks.

Example: pandas 2.0

One specific example of this process in action is the recent pandas 2.0 release. This is a major version release and contains significant breaking changes like removing deprecated functionality.

As these breaking changes were merged into pandas, we started seeing related failures in Dask’s upstream CI build. Dask maintainers were then able to add a variety of compatibility changes so that Dask works well with pandas 2.0 immediately.
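A compatibility change of this kind often boils down to a version check that selects between an old and a new code path. The sketch below is a hypothetical, standard-library-only illustration of that pattern; `parse_version` and `pick_append_strategy` are made-up names and not part of Dask. It relies on one fact about pandas 2.0: `DataFrame.append` was removed, so code has to use `pd.concat` instead.

```python
def parse_version(text):
    """Turn '2.0.0' or '2.1.0.dev0+123' into a comparable tuple of ints."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1' or 'dev0'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def pick_append_strategy(pandas_version):
    """Hypothetical shim: choose the row-appending API for a pandas version."""
    if parse_version(pandas_version) >= (2, 0):
        return "pd.concat"         # DataFrame.append no longer exists
    return "DataFrame.append"      # still available (deprecated since 1.4)


for version in ["1.5.3", "2.0.0rc1", "2.1.0.dev0+123"]:
    print(version, "->", pick_append_strategy(version))
```

Running the test suite against nightly builds is what reveals that such a switch is needed before the new pandas release ships.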

Acknowledgements

Special thanks to Justus Magin for his work on the xarray-contrib/issue-from-pytest-log GitHub action. We’ve found this to be really convenient for easily opening up GitHub issues when test failures occur.

Also, thanks to Irina Truong (Coiled), Patrick Hoefler (Coiled), and Matthew Roeschke (NVIDIA) for their efforts ensuring pandas and Dask continue to work together.

Do you need consistent environments between the client, scheduler and workers?

2023-04-14T00:00:00+00:00

Update May 3rd 2023: Clarify GPU recommendations.

With release 2023.4.0 of dask and distributed, we are making a change which may require the Dask scheduler to have the same software and hardware capabilities as the client and workers.

It has always been recommended that your client and workers have a consistent software and hardware environment so that data structures and dependencies can be pickled and passed between them. However, recent changes to the Dask scheduler mean that we now also require your scheduler to have the same consistent environment as everything else.

For most users, this change should go unnoticed as it is common to run all Dask components in the same conda environment or docker image and typically on homogeneous machines.

However, for folks who may have optimized their schedulers to use cut-down environments, or for users with specialized hardware such as GPUs available on their client/workers but not the scheduler, there may be some impact.

What will the impact be?

If you run into errors such as "RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments." please ensure your software environment is consistent between your client, scheduler and workers.

If you are passing GPU objects between the client and workers, we now recommend that your scheduler has a GPU too. This recommendation is just so that GPU-backed objects contained in Dask graphs can be deserialized on the scheduler if necessary. Typically the GPU available to the scheduler doesn’t need to be as powerful as long as it has similar CUDA compute capabilities. For example, for cost optimization reasons you may want to use A100s on your client and workers and a T4 on your scheduler.

Users who do not have a GPU on the client and are leveraging GPU workers shouldn’t run into this as the GPU objects will only exist on the workers.

Why are we doing this?

The reason we now suggest that you have the same hardware/software capabilities on the scheduler is that we are giving the scheduler the ability to deserialize graphs before distributing them to the workers. This will allow the scheduler to make smarter scheduling decisions in the future by having a better understanding of the operation it is performing.

The downside to this is that graphs can contain complex Python objects created by any number of dependencies on the client side, so in order for the scheduler to deserialize them it needs to have the same libraries installed. Equally, if the client-side packages create GPU objects then the scheduler will also need one.
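The failure mode can be reproduced with nothing but the standard library. The sketch below is an illustration only and does not involve Dask: `client_only_lib` is a made-up module that exists while we pickle (the "client") and is removed before we unpickle (a "scheduler" without that package installed).

```python
import pickle
import sys
import types

# Build a throwaway module at runtime to stand in for a package
# that is installed on the client but not on the scheduler.
client_only = types.ModuleType("client_only_lib")
source = (
    "class Payload:\n"
    "    def __init__(self, value):\n"
    "        self.value = value\n"
)
exec(source, client_only.__dict__)
sys.modules["client_only_lib"] = client_only

# "Client" side: the object pickles fine because its module is importable.
blob = pickle.dumps(client_only.Payload(42))

# "Scheduler" side: the package is missing, so unpickling cannot
# import the module that defines the class.
del sys.modules["client_only_lib"]
try:
    pickle.loads(blob)
    outcome = "ok"
except ModuleNotFoundError as err:
    outcome = f"failed: {err}"

print(outcome)
```

Pickle stores only a reference to the class (module name plus class name), not its code, which is why the receiving side needs the same libraries installed.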

We are sure you’ll agree that this breakage for a small percentage of users will be worth it for the long-term improvements to Dask.

Deep Dive into creating a Dask DataFrame Collection with from_map

2023-04-12T00:00:00+00:00

Dask DataFrame provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g. read_csv) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, from_map now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, from_map now makes custom collection creation both easier and more performant.

The from_map API was added to Dask DataFrame in v2022.05.1 with the intention of replacing from_delayed as the recommended means of custom DataFrame creation. At its core, from_map simply converts each element of an iterable object (iterable) into a distinct Dask DataFrame partition, using a common function (func):

dd.from_map(func: Callable, iterable: Iterable) -> dd.DataFrame

The overall behavior is essentially the Dask DataFrame equivalent of the standard-Python map function:

map(func: Callable, iterable: Iterable) -> Iterator

Note that both from_map and map actually support an arbitrary number of iterable inputs. However, we will only focus on the use of a single iterable argument in this post.

A simple example

To better understand the behavior of from_map, let’s consider the simple case that we want to interact with Feather-formatted data created with the following Pandas code:

import pandas as pd

size = 3
paths = ["./data.0.feather", "./data.1.feather"]
for i, path in enumerate(paths):
    index = range(i * size, i * size + size)
    a = [i] * size
    b = list("xyz")
    df = pd.DataFrame({"A": a, "B": b, "index": index})
    df.to_feather(path)

Since Dask does not yet offer a dedicated read_feather function (as of dask-2023.3.1), most users would assume that the only option to create a Dask DataFrame collection is to use dask.delayed. The “best practice” for creating a collection in this case, however, is to wrap pd.read_feather or cudf.read_feather in a from_map call like so:

>>> import dask.dataframe as dd
>>> ddf = dd.from_map(pd.read_feather, paths)
>>> ddf
Dask DataFrame Structure:
                   A       B  index
npartitions=2
               int64  object  int64
                 ...     ...    ...
                 ...     ...    ...
Dask Name: read_feather, 1 graph layer

Which produces the following Pandas (or cuDF) object after computation:

>>> ddf.compute()
   A  B  index
0  0  x      0
1  0  y      1
2  0  z      2
0  1  x      3
1  1  y      4
2  1  z      5

Although the same output can be achieved using the conventional dd.from_delayed strategy, using from_map will improve the available opportunities for task-graph optimization within Dask.

Performance considerations: Specifying `meta` and `divisions`

Although func and iterable are the only required arguments to from_map, one can significantly improve the overall performance of a workflow by specifying optional arguments like meta and divisions.

Due to the lazy nature of Dask DataFrame, each collection is required to carry around a schema (column name and dtype information) in the form of an empty Pandas (or cuDF) object. If meta is not directly provided to the from_map function, the schema will need to be populated by eagerly materializing the first partition, which can increase the apparent latency of the from_map API call itself. For this reason, it is always recommended to specify an explicit meta argument if the expected column names and dtypes are known a priori.

While passing in a meta argument is likely to reduce the from_map API call latency, passing in a divisions argument makes it possible to reduce the end-to-end compute time. This is because, by specifying divisions, we are allowing Dask DataFrame to track useful per-partition min/max statistics. Therefore, if the overall workflow involves grouping or joining on the index, Dask can avoid the need to perform unnecessary shuffling operations.
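To see why divisions are useful, recall that they are a sequence of npartitions + 1 index values: the minimum index of each partition, followed by the maximum index of the last one. The following standard-library sketch (not Dask's own code; `partition_of` is a hypothetical helper) shows how such boundaries let us locate the single partition that can contain a given index value. The divisions `(0, 3, 5)` assume the toy Feather data above with the "index" column used as the index.

```python
import bisect


def partition_of(divisions, value):
    """Return the position of the partition whose index range holds value."""
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    i = bisect.bisect_right(divisions, value) - 1
    # The final division is the inclusive upper bound of the last partition.
    return min(i, len(divisions) - 2)


divisions = (0, 3, 5)  # partition 0 holds 0..2, partition 1 holds 3..5
for value in [0, 2, 3, 5]:
    print(value, "-> partition", partition_of(divisions, value))
```

Without divisions, every partition would have to be scanned (or shuffled) to answer the same question.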

Using `from_map` to implement a custom API

Although it is currently difficult to automatically extract division information from the metadata of an arbitrary Feather dataset, from_map makes it relatively easy to implement your own highly-functional read_feather API using PyArrow. For example, the following code is all that one needs to enable lazy Feather IO with both column projection and index selection:

def from_arrow(table):
    """(Optional) Utility to enforce 'backend' configuration"""
    from dask import config

    if config.get("dataframe.backend") == "cudf":
        import cudf

        return cudf.DataFrame.from_arrow(table)
    else:
        return table.to_pandas()


def read_feather(paths, columns=None, index=None):
    """Create a Dask DataFrame from Feather files

    Example of a "custom" `from_map` IO function

    Parameters
    ----------
    paths: list
        List of Feather-formatted paths. Each path will
        be mapped to a distinct DataFrame partition.
    columns: list or None, default None
        Optional list of columns to select from each file.
    index: str or None, default None
        Optional column name to set as the DataFrame index.

    Returns
    -------
    dask.dataframe.DataFrame
    """
    import dask.dataframe as dd
    import pyarrow.dataset as ds

    # Step 1: Extract `meta` from the dataset
    dataset = ds.dataset(paths, format="feather")
    meta = from_arrow(dataset.schema.empty_table())
    meta = meta.set_index(index) if index else meta
    columns = columns or list(meta.columns)
    meta = meta[columns]

    # Step 2: Define the `func` argument
    def func(frag, columns=None, index=None):
        # Create a Pandas DataFrame from a dataset fragment
        # NOTE: In practice, this function should
        # always be defined outside `read_feather`
        assert columns is not None
        read_columns = columns
        if index and index not in columns:
            read_columns = columns + [index]
        df = from_arrow(frag.to_table(columns=read_columns))
        df = df.set_index(index) if index else df
        return df[columns] if columns else df

    # Step 3: Define the `iterable` argument
    iterable = dataset.get_fragments()

    # Step 4: Call `from_map`
    return dd.from_map(
        func,
        iterable,
        meta=meta,
        index=index,  # `func` kwarg
        columns=columns,  # `func` kwarg
    )

Here we see that using from_map to enable completely-lazy collection creation only requires four steps. First, we use pyarrow.dataset to define a meta argument for from_map, so that we can avoid the unnecessary overhead of an eager read operation. For some file formats and/or applications, it may also be possible to calculate divisions at this point. However, as explained above, such information is not readily available for this particular example.

The second step is to define the underlying function (func) that we will use to produce each of our final DataFrame partitions. Third, we define one or more iterable objects containing the unique information needed to produce each partition (iterable). In this case, the only iterable object corresponds to a generator of pyarrow.dataset fragments, which is essentially a wrapper around the input path list.

The fourth and final step is to use the final func, iterable, and meta information to call the from_map API. Note that we also use this opportunity to specify additional key-word arguments, like columns and index. In contrast to the iterable positional arguments, which are always mapped to func, these key-word arguments will be broadcast to every call.
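The difference between mapped positional arguments and broadcast key-word arguments can be sketched in plain Python. The `toy_from_map` helper below is a hypothetical stand-in, not the Dask implementation: it builds one deferred call per partition, taking one element from each iterable and passing the same key-word arguments to every call.

```python
from functools import partial


def toy_from_map(func, *iterables, **kwargs):
    """Return one zero-argument callable ("task") per partition."""
    # Positional inputs are zipped (mapped); kwargs are shared (broadcast).
    return [partial(func, *args, **kwargs) for args in zip(*iterables)]


def load(path, suffix, columns=None):
    # Stand-in for real IO: just report what would be read.
    return {"source": path + suffix, "columns": columns}


tasks = toy_from_map(load, ["a", "b"], [".0", ".1"], columns=["A"])
results = [task() for task in tasks]  # nothing runs until a task is called
print(results)
```

Nothing is executed when `toy_from_map` returns, which mirrors the lazy behavior the post describes for the real API.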

Using the read_feather implementation above, it becomes both easy and efficient to convert an arbitrary Feather dataset into a lazy Dask DataFrame collection:

>>> ddf = read_feather(paths, columns=["A"], index="index")
>>> ddf
Dask DataFrame Structure:
                   A
npartitions=2
               int64
                 ...
                 ...
Dask Name: func, 1 graph layer
>>> ddf.compute()
       A
index
0      0
1      0
2      0
3      1
4      1
5      1

Advanced: Enhancing column projection

Although a read_feather implementation like the one above is likely to meet the basic needs of most applications, users will often leave out the columns argument in practice. For example:

a = read_feather(paths)["A"]

For code like this, as the implementation currently stands, each IO task would be forced to read in an entire Feather file, and then select the "A" column from a Pandas/cuDF DataFrame only after it had already been read into memory. The additional overhead is insignificant for the toy dataset used here. However, avoiding this kind of unnecessary IO can lead to dramatic performance improvements in real-world applications.

So, how can we modify our read_feather implementation to take advantage of external column-projection operations (like ddf["A"])? The good news is that from_map is already equipped with the necessary graph-optimization hooks to handle this, so long as the func object satisfies the DataFrameIOFunction protocol:

@runtime_checkable
class DataFrameIOFunction(Protocol):
    """DataFrame IO function with projectable columns
    Enables column projection in ``DataFrameIOLayer``.
    """

    @property
    def columns(self):
        """Return the current column projection"""
        raise NotImplementedError

    def project_columns(self, columns):
        """Return a new DataFrameIOFunction object
        with a new column projection
        """
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        """Return a new DataFrame partition"""
        raise NotImplementedError

That is, all we need to do is change “Step 2” of our implementation to use the following code instead:

    from dask.dataframe.io.utils import DataFrameIOFunction

    class ReadFeather(DataFrameIOFunction):
        """Create a Pandas/cuDF DataFrame from a dataset fragment"""
        def __init__(self, columns, index):
            self._columns = columns
            self.index = index

        @property
        def columns(self):
            return self._columns

        def project_columns(self, columns):
            # Replace this object with one that will only read `columns`
            if columns != self.columns:
                return ReadFeather(columns, self.index)
            return self

        def __call__(self, frag):
            # Same logic as original `func`
            read_columns = self.columns
            if self.index and self.index not in self.columns:
                read_columns = self.columns + [self.index]
            df = from_arrow(frag.to_table(columns=read_columns))
            df = df.set_index(self.index) if self.index else df
            return df[self.columns] if self.columns else df

    func = ReadFeather(columns, index)
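To illustrate how an optimizer can take advantage of such an object, here is a self-contained toy that uses only the standard library. The names `Projectable`, `ReadRows`, and `optimize` are hypothetical and only mimic the shape of the protocol above; Dask's actual optimization pass is more involved.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Projectable(Protocol):
    """Toy counterpart of the protocol shown above."""

    @property
    def columns(self): ...

    def project_columns(self, columns): ...

    def __call__(self, *args, **kwargs): ...


class ReadRows:
    """Toy IO function: 'reads' a row (a dict), keeping selected columns."""

    def __init__(self, columns):
        self._columns = list(columns)

    @property
    def columns(self):
        return self._columns

    def project_columns(self, columns):
        # Return a new reader only when the projection actually changes.
        if list(columns) == self._columns:
            return self
        return ReadRows(columns)

    def __call__(self, row):
        return {name: row[name] for name in self._columns}


def optimize(func, requested_columns):
    """Swap in a narrower reader when the function supports projection."""
    if isinstance(func, Projectable):
        return func.project_columns(requested_columns)
    return func  # plain functions are left untouched


reader = ReadRows(["A", "B", "index"])
narrowed = optimize(reader, ["A"])
print(narrowed({"A": 0, "B": "x", "index": 0}))
```

Because the check is structural, any object with these three members qualifies, while an ordinary function passes through unchanged.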

Conclusion

It is now easier than ever to create a Dask DataFrame collection from an arbitrary data source. Although the dask.delayed API has already enabled similar functionality for many years, from_map now makes it possible to implement a custom IO function without sacrificing any of the high-level graph optimizations leveraged by the rest of the Dask DataFrame API.

Start experimenting with from_map today, and let us know how it goes!

Shuffling large data at constant memory in Dask

2023-03-15T00:00:00+00:00

This work was engineered and supported by Coiled. In particular, thanks to Florian Jetter, Gabe Joseph, Hendrik Makait, and Matt Rocklin. Original version of this post appears on blog.coiled.io

With release 2023.2.1, dask.dataframe introduces a new shuffling method called P2P, making sorts, merges, and joins faster and using constant memory. Benchmarks show impressive improvements:

P2P shuffling (blue) uses constant memory while task-based shuffling (orange) scales linearly.

This article describes the problem, the new solution, and the impact on performance.

Shuffling is a key primitive in data processing systems. It is used whenever we move a dataset around in an all-to-all fashion, such as occurs in sorting, dataframe joins, or array rechunking. Shuffling is a challenging computation to run efficiently, with lots of small tasks sharding the data.

Task-based shuffling scales poorly

While systems like distributed databases and Apache Spark use a dedicated shuffle service to move data around the cluster, Dask builds on task-based scheduling. Task-based systems are more flexible for general-purpose parallel computing but less suitable for all-to-all shuffling. This results in three main issues:

Scheduler strain: The scheduler hangs due to the sheer number of tasks required for shuffling.
Full dataset materialization: Workers materialize the entire dataset causing the cluster to run out of memory.
Many small operations: Intermediate tasks are too small and bring down hardware performance.

Together, these issues make large-scale shuffles inefficient, causing users to over-provision clusters. Fortunately, we can design a system to address these concerns. Early work on this started back in 2021 and has now matured into the P2P shuffling system:

P2P shuffling

With release 2023.2.1, Dask introduces a new shuffle method called P2P, making sorts, merges, and joins run faster and in constant memory.

This system is designed with three aspects, mirroring the problems listed above:

Peer-to-Peer communication: Reduce the involvement of the scheduler.

Shuffling becomes an O(n) operation from the scheduler’s perspective, removing a key bottleneck.

Disk by default: Store data as it arrives on disk, efficiently.

Dask can now shuffle datasets of arbitrary size in small memory, reducing the need to fiddle with right-sizing clusters.

Buffer network and disk with memory: Avoid many small writes by buffering sensitive hardware with in-memory stores.

Shuffling involves CPU, network, and disk, each of which brings its own bottlenecks when dealing with many small pieces of data. We use memory judiciously to trade off between these bottlenecks, balancing network and disk I/O to maximize the overall throughput.
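The buffering idea can be sketched with a toy in-memory buffer. This is an illustration of the general technique only, not the P2P implementation: `ShardBuffer`, its `flush_bytes` threshold, and the list that stands in for disk are all assumptions.

```python
from collections import defaultdict


class ShardBuffer:
    """Collect small shards per output partition and flush them in batches."""

    def __init__(self, flush_bytes):
        self.flush_bytes = flush_bytes
        self.pending = defaultdict(list)  # partition id -> list of shards
        self.pending_size = 0
        self.writes = []                  # stand-in for writes to disk

    def put(self, partition, shard):
        self.pending[partition].append(shard)
        self.pending_size += len(shard)
        if self.pending_size >= self.flush_bytes:
            self.flush()

    def flush(self):
        # One larger write per partition instead of one write per shard.
        for partition, shards in sorted(self.pending.items()):
            self.writes.append((partition, b"".join(shards)))
        self.pending.clear()
        self.pending_size = 0


buffer = ShardBuffer(flush_bytes=8)
for i in range(12):
    buffer.put(i % 2, b"ab")  # twelve 2-byte shards for two partitions
buffer.flush()                # write out anything still pending
print(len(buffer.writes), "writes instead of 12")
```

A larger threshold means fewer, bigger writes at the cost of more memory held in the buffer, which is the trade-off the paragraph above describes.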

In addition to these three aspects, P2P shuffling implements many minor optimizations to improve performance, and it relies on pyarrow>=7.0 for efficient data handling and (de)serialization.

Results

To evaluate P2P shuffling, we benchmark against task-based shuffling on common workloads from our benchmark suite. For more information on this benchmark suite, see the GitHub repository or the latest test results.

Memory stability

The biggest benefit of P2P shuffling is constant memory usage. Memory usage drops and stays constant across all workloads:

P2P shuffling (blue) uses constant memory while task-based shuffling (orange) scales linearly.

For the tested workloads, we saw up to 10x lower memory. For even larger workloads, this gap only increases.

Performance and Speed

In the above plot, we can see two performance improvements:

Faster execution: Workloads run up to 45% faster.
Quicker startup: Smaller graphs mean P2P shuffling starts sooner.

Comparing scheduler dashboards for P2P and task-based shuffling side-by-side helps to understand what causes these performance gains:

Task-based shuffling (left) shows bigger gaps in the task stream and spilling compared to P2P shuffling (right).

Graph size: 10x fewer tasks are faster to deploy and easier on the scheduler.
Controlled I/O: P2P shuffling handles I/O explicitly, which avoids less performant spilling by Dask.
Fewer interruptions: I/O is well balanced, so we see fewer gaps in work in the task stream.

Overall, P2P shuffling leverages our hardware far more effectively.

Changing defaults

These results benefit most users, so P2P shuffling is now the default starting from release 2023.2.1, as long as pyarrow>=7.0.0 is installed. Dataframe operations like the following will benefit:

df.set_index(...)
df.merge(...)
df.groupby(...).apply(...)

Keep old behavior

If P2P shuffling does not work for you, you can deactivate it by setting the dataframe.shuffle.method config value to "tasks" or explicitly setting a keyword argument, for example:

with the yaml config

dataframe:
  shuffle:
    method: tasks

or when using a cluster manager

import dask
from dask.distributed import LocalCluster

# The dataframe.shuffle.method config is available since 2023.3.1
with dask.config.set({"dataframe.shuffle.method": "tasks"}):
    cluster = LocalCluster(...)  # many cluster managers send current dask config automatically

For more information on deactivating P2P shuffle, see the discussion #7509.

What about arrays?

While the original motivation was to optimize large-scale dataframe joins, the P2P system is useful for all problems requiring lots of communication between tasks. For array workloads, this often occurs when rechunking data, such as when organizing a matrix by rows when it was stored by columns. Similar to dataframe joins, array rechunking has been inefficient in the past, which has become such a problem that the array community built specialized tools like rechunker to avoid it entirely.

There is a naive implementation of array rechunking using the P2P system available for experimental use. Benchmarking this implementation shows mixed results:

👍 Constant memory use: As with dataframe operations, memory use is constant.
❓ Variable runtime: The runtime of workloads may increase with P2P.
👎 Memory overhead: There is a large memory overhead for many small partitions.

The constant memory use is a very promising result. There are several ways to tackle the current limitations. We expect this to improve as we work with collaborators from the array computing community.

Next steps

Development on P2P shuffling is not done yet. For the future we plan the following:

dask.array:

While the early prototype of array rechunking is promising, it’s not there yet. We plan to do the following:
- Intelligently select which algorithm to use (task-based rechunking is better sometimes)
- Work with collaborators on rechunking to improve performance
- Seek out other use cases, like map_overlap, where this might be helpful

Failure recovery:

Make P2P shuffling resilient to worker loss; currently, it has to restart entirely.

Performance tuning:

Performance today is good, but not yet at peak hardware speeds. We can improve this in a few ways:
- Hiding disk I/O
- Using more memory when appropriate for smaller shuffles
- Improved batching of small operations
- More efficient serialization

To follow along with the development, subscribe to the tracking issue on GitHub.

Summary

Shuffling data is a common operation in dataframe workloads. Since 2023.2.1, Dask defaults to P2P shuffling for distributed clusters and shuffles data faster and at constant memory. This improvement unlocks previously un-runnable workload sizes and efficiently uses your clusters. Finally, P2P shuffling demonstrates extending Dask to add new paradigms while leveraging the foundations of its distributed engine.

Share results in this discussion thread or follow development at this tracking issue.

Managing dask workloads with Flyte

2023-02-13T00:00:00+00:00

It is now possible to manage dask workloads using Flyte 🎉!

The major advantages are:

Each Flyte task spins up its own ephemeral dask cluster using a Docker image tailored to the task, ensuring consistency in the Python environment across the client, scheduler, and workers.
Flyte will use the existing Kubernetes infrastructure to spin up dask clusters.
Spot/Preemptible instances are natively supported.
The whole dask task can be cached.
Enabling dask support in an already running Flyte setup can be done in just a few minutes.

This is what a Flyte task backed by a dask cluster with four workers looks like:

from typing import List

from distributed import Client
from flytekit import task, Resources
from flytekitplugins.dask import Dask, WorkerGroup, Scheduler


def inc(x):
    return x + 1


@task(
    task_config=Dask(
        scheduler=Scheduler(
            requests=Resources(cpu="2")
        ),
        workers=WorkerGroup(
            number_of_workers=4,
            limits=Resources(cpu="8", mem="32Gi")
        )
    )
)
def increment_numbers(list_length: int) -> List[int]:
    client = Client()
    futures = client.map(inc, range(list_length))
    return client.gather(futures)

This task can run locally using a standard distributed.Client() and can scale to arbitrary cluster sizes once registered with Flyte.

Flyte is a Kubernetes-native workflow orchestration engine. Originally developed at Lyft, it is now open source (GitHub) and a graduated project under the Linux Foundation. It stands out among similar tools such as Airflow or Argo due to its key features, which include:

Caching/Memoization of previously executed tasks for improved performance
Kubernetes native
Workflow definitions in Python, not, e.g., YAML
Strong typing between tasks and workflows using Protobuf
Dynamic generation of workflow DAGs at runtime
Ability to run workflows locally

A simple workflow would look something like the following:

from typing import List

import pandas as pd

from flytekit import task, workflow, Resources
from flytekitplugins.dask import Dask, WorkerGroup, Scheduler


@task(
    task_config=Dask(
        scheduler=Scheduler(
            requests=Resources(cpu="2")
        ),
        workers=WorkerGroup(
            number_of_workers=4,
            limits=Resources(cpu="8", mem="32Gi")
        )
    )
)
def expensive_data_preparation(input_files: List[str]) -> pd.DataFrame:
    # Expensive, highly parallel `dask` code
    ...
    return pd.DataFrame(...)  # Some large DataFrame, Flyte will handle serialization


@task
def train(input_data: pd.DataFrame) -> str:
    # Model training, can also use GPU, etc.
    ...
    return "s3://path-to-model"


@workflow
def train_model(input_files: List[str]) -> str:
    prepared_data = expensive_data_preparation(input_files=input_files)
    return train(input_data=prepared_data)

In the above, expensive_data_preparation() and train() each run in their own Pod(s) in Kubernetes, while the train_model() workflow is a DSL that builds a Directed Acyclic Graph (DAG) of the workflow. The DAG determines the order of tasks based on their inputs and outputs. Input and output types (based on the type hints) are validated at registration time to avoid surprises at runtime.
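The two ideas in this paragraph, ordering tasks by their data dependencies and checking type hints before anything runs, can be illustrated with a toy sketch using only the standard library. This is not how Flyte implements either step (Flyte has its own type system and compiler). The functions `prepare`, `train_toy`, `check_types`, and `execution_order` and the `edges` wiring are hypothetical stand-ins.

```python
from graphlib import TopologicalSorter
from typing import List, get_type_hints


def prepare(input_files: List[str]) -> dict:
    # Stand-in for an expensive preparation step.
    return {"rows": len(input_files)}


def train_toy(input_data: dict) -> str:
    # Stand-in for a training step that consumes the prepared data.
    return "model-trained-on-%d-rows" % input_data["rows"]


# Hypothetical wiring: (producer, consumer, consumer argument fed by the producer).
edges = [(prepare, train_toy, "input_data")]


def check_types(edges):
    """Raise TypeError if a producer's return hint differs from the consumer's argument hint."""
    for producer, consumer, arg in edges:
        produced = get_type_hints(producer)["return"]
        expected = get_type_hints(consumer)[arg]
        if produced != expected:
            raise TypeError(
                f"{producer.__name__} -> {consumer.__name__}.{arg}: "
                f"{produced} does not match {expected}"
            )


def execution_order(edges):
    """Return task names in an order that respects the data dependencies."""
    graph = TopologicalSorter()
    for producer, consumer, _ in edges:
        graph.add(consumer.__name__, producer.__name__)  # consumer depends on producer
    return list(graph.static_order())


check_types(edges)              # runs before any task executes
order = execution_order(edges)
print(order)                    # ['prepare', 'train_toy']
```

The point of checking hints up front is that a mismatch surfaces when the workflow is defined, not minutes or hours into a run.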

After registration with Flyte, the workflow can be started from the UI:

Why use the `dask` plugin for Flyte?

At first glance, Flyte and dask look similar in what they are trying to achieve: both can create a DAG from user functions, manage inputs and outputs, and so on. However, the major conceptual difference lies in their approach. While dask has long-lived workers to run tasks, each Flyte task runs in its own dedicated Kubernetes Pod, which adds significant overhead to task runtime.

While dask tasks incur an overhead of around one millisecond (refer to the docs), spinning up a new Kubernetes pod takes several seconds. The long-lived nature of the dask workers allows for optimization of the DAG, running tasks that operate on the same data on the same node, reducing the need for inter-worker data serialization (known as shuffling). With Flyte tasks being ephemeral, this optimization is not possible, and task outputs are serialized to a blob storage instead.
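A rough calculation shows why this difference matters for fine-grained work and stops mattering for coarse-grained work. The sketch below uses the one-millisecond figure quoted above; the 5-second pod startup time is an assumed stand-in for "several seconds", and `overhead_fraction` is a hypothetical helper.

```python
DASK_TASK_OVERHEAD_S = 0.001  # roughly 1 ms per task, as quoted above
POD_STARTUP_S = 5.0           # assumption: a stand-in for "several seconds"


def overhead_fraction(n_tasks, work_per_task_s, overhead_per_task_s):
    """Share of total time spent on per-task overhead rather than useful work."""
    work = n_tasks * work_per_task_s
    overhead = n_tasks * overhead_per_task_s
    return overhead / (work + overhead)


# Compare short, medium, and long tasks under both execution models.
for work_s in (0.1, 60.0, 3600.0):
    dask_share = overhead_fraction(1000, work_s, DASK_TASK_OVERHEAD_S)
    pod_share = overhead_fraction(1000, work_s, POD_STARTUP_S)
    print(f"{work_s:>7.1f} s/task  dask: {dask_share:7.3%}  pod per task: {pod_share:7.3%}")
```

Under these assumptions, a pod per task is prohibitive for sub-second tasks but negligible for hour-long ones, which is why it makes sense to let Flyte orchestrate a few coarse steps and let dask handle the many small tasks inside each step.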

Given the limitations discussed above, why use Flyte? Flyte is not intended to replace tools such as dask or Apache Spark, but rather provides an orchestration layer on top. While workloads can be run directly in Flyte, such as training a single GPU model, Flyte offers numerous integrations with other popular data processing tools.

With Flyte managing the dask cluster lifecycle, each dask Flyte task will run on its own dedicated dask cluster made up of Kubernetes pods. When the Flyte task is triggered from the UI, Flyte will spin up a dask cluster tailored to the task, which will then be used to execute the user code. This enables the use of different Docker images with varying dependencies for different tasks, whilst always ensuring that the dependencies of the client, scheduler, and workers are consistent.

What prerequisites are required to run `dask` tasks in Flyte?

The Kubernetes cluster needs to have the dask operator installed.
Flyte version 1.3.0 or higher is required.
The dask plugin needs to be enabled in the Flyte propeller config. (refer to the docs)
The Docker image associated with the task must have the flytekitplugins-dask package installed in its Python environment.

How do things work under the hood?

Note: The following is for reference only and is not necessary for users who only use the plugin. However, it could be useful for easier debugging.

At a high level, the following steps occur when a dask task is initiated in Flyte:

A FlyteWorkflow Custom Resource (CR) is created in Kubernetes.
Flyte Propeller (a Kubernetes Operator) detects the creation of the workflow.
The operator inspects the task’s spec and identifies it as a dask task. It verifies if it has the required plugin associated with it and locates the dask plugin.
The dask plugin within Flyte Propeller picks up the task definition and creates a DaskJob Custom Resource using the dask-k8s-operator-go-client.
The dask operator picks up the DaskJob resource and runs the job accordingly. It spins up a pod to run the client/job-runner, one for the scheduler, and additional worker pods as designated in the Flyte task decorator.
While the dask task is running, Flyte Propeller continuously monitors the DaskJob resource, waiting on it to report success or failure. Once the job has finished or the Flyte task has been terminated, all dask related resources will be cleaned up.

Useful links

In case there are any questions or concerns, don’t hesitate to reach out. You can contact Bernhard Stadlbauer via the Flyte Slack or via GitHub.

I would like to give shoutouts to Jacob Tomlinson (Dask) and Dan Rammer (Flyte) for all of the help I’ve received. This would not have been possible without your support!

Easy CPU/GPU Arrays and Dataframes

2023-02-02T00:00:00+00:00

This article was originally posted on the RAPIDS blog.

It’s now easy to switch between CPU (NumPy / Pandas) and GPU (CuPy / cuDF) in Dask. As of Dask 2022.10.0, users can optionally select the backend engine for input IO and data creation. In the short term, the goal of the backend-configuration system is to enable Dask users to write code that will run on both CPU and GPU systems.

The preferred backend can be configured using the array.backend and dataframe.backend options with the standard Dask configuration system:

dask.config.set({"array.backend": "cupy"})
dask.config.set({"dataframe.backend": "cudf"})

To see how users can easily switch between NumPy and CuPy, let’s start by creating an array of ones:

>>> with dask.config.set({"array.backend": "cupy"}):
...     darr = da.ones(10, chunks=(5,))  # Get cupy-backed collection
...
>>> darr
dask.array<ones_like, shape=(10,), dtype=float64, chunksize=(5,), chunktype=cupy.ndarray>

The chunktype informs us that the array is constructed with cupy.ndarray objects instead of numpy.ndarray objects.

We’ve also improved the user experience for random array creation. Previously, if a user wanted to create a CuPy-backed Dask array, they were required to define an explicit RandomState object in Dask using CuPy. For example, the following code worked prior to Dask 2022.10.0, but seems rather verbose:

>>> import cupy
>>> import dask.array as da
>>> rs = da.random.RandomState(RandomState=cupy.random.RandomState)
>>>
>>> darr = rs.randint(0, 3, size=(10, 20), chunks=(2, 5))
>>> darr
dask.array<randint, shape=(10, 20), dtype=int64, chunksize=(2, 5), chunktype=cupy.ndarray>

Now, we can leverage the array.backend configuration to create a CuPy-backed dask array for random data:

>>> with dask.config.set({"array.backend": "cupy"}):
...     darr = da.random.randint(0, 3, size=(10, 20), chunks=(2, 5))  # Get cupy-backed collection
...
>>> darr
dask.array<randint, shape=(10, 20), dtype=int64, chunksize=(2, 5), chunktype=cupy.ndarray>

Using array.backend is significantly easier and more ergonomic. It supports all basic array creation methods, including ones, zeros, empty, full, arange, and random.

Note: from_array, from_zarr, and from_tiledb have not yet been implemented with this functionality.

Dispatching for Dataframe Creation

When creating Dask Dataframes backed by either Pandas or cuDF, the starting point is often an input I/O method such as read_csv or read_parquet. We’ll start by constructing a dataframe on the fly with from_dict:

>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     data = {"a": range(10), "b": range(10)}
...     ddf = dd.from_dict(data, npartitions=2)
...
>>> ddf
<dask_cudf.DataFrame | 2 tasks | 2 npartitions>

Here we can tell we have a cuDF backed dataframe and we are using dask-cudf because the repr shows us the type: <dask_cudf.DataFrame | 2 tasks | 2 npartitions>. Let’s also demonstrate the read functionality by generating CSV and Parquet data.

ddf.to_csv('example.csv', single_file=True)
ddf.to_parquet('example.parquet')

Now we are simply repeating the config setting but instead using the read_csv and read_parquet methods:

>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     ddf = dd.read_csv('example.csv')
...     print(type(ddf))
...
<class 'dask_cudf.core.DataFrame'>
>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     ddf = dd.read_parquet('example.parquet')
...     type(ddf)
...
<class 'dask_cudf.core.DataFrame'>
>>>

Why is this Useful ?

As hardware evolves in exciting and exotic ways (GPUs, TPUs, IPUs, etc.), we want to provide the same interface and treat hardware as an abstraction. For example, many PyTorch workflows start with the following:

device = 'cuda' if torch.cuda.is_available() else 'cpu'

What follows is typically standard, hardware-agnostic PyTorch. This is incredibly powerful because the user (in most cases) should not need to care what hardware sits underneath. As such, it enables the user to develop PyTorch anywhere and everywhere. The new Dask backend selection configuration gives users a similar freedom.
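A Dask analogue of the PyTorch one-liner might look like the sketch below. The `pick_backends` helper is hypothetical. It only checks whether the GPU libraries are importable, which is a simplifying assumption: an installed package does not guarantee that a usable GPU is present.

```python
import importlib.util


def pick_backends():
    """Choose Dask backend names based on which libraries are importable."""

    def installed(name):
        # find_spec returns None when a top-level package cannot be found.
        return importlib.util.find_spec(name) is not None

    return {
        "array.backend": "cupy" if installed("cupy") else "numpy",
        "dataframe.backend": "cudf" if installed("cudf") else "pandas",
    }


backends = pick_backends()
print(backends)

# With Dask installed, the result can be passed to the configuration system
# shown earlier in this post:
#     import dask
#     dask.config.set(backends)
```

The rest of the program can then use da.ones, dd.read_parquet, and similar creation functions without mentioning the hardware again.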

Conclusion

Our long-term goal for this feature is to enable Dask users to use any backend library in dask.array and dask.dataframe, as long as that library conforms to the minimal “array” or “dataframe” standard, respectively, defined by the data-api consortium.

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU-acceleration to your project, please reach out on GitHub or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.
