<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged dataframe</title>
  <updated>2026-03-05T15:05:26.195095+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/dataframe/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2024/05/30/dask-is-fast/</id>
    <title>Dask DataFrame is Fast Now</title>
    <updated>2024-05-30T00:00:00+00:00</updated>
    <author>
      <name>Patrick Hoefler</name>
    </author>
    <content type="html">&lt;meta content="Dask DataFrame is faster and more reliable, especially for TBs of data. This is due to engineering improvements like adding a query optimizer, integrating with Apache Arrow, and more efficient data shuffling." name="description" /&gt;
&lt;p&gt;&lt;em&gt;This work was engineered and supported by &lt;a class="reference external" href="https://coiled.io/?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-expr"&gt;Coiled&lt;/a&gt; and &lt;a class="reference external" href="https://www.nvidia.com/"&gt;NVIDIA&lt;/a&gt;. Thanks to &lt;a class="reference external" href="https://github.com/phofl"&gt;Patrick Hoefler&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/rjzamora"&gt;Rick Zamora&lt;/a&gt;, in particular. Original version of this post appears on &lt;a class="reference external" href="https://docs.coiled.io/blog/dask-dataframe-is-fast.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-is-fast"&gt;docs.coiled.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Bar chart showing a nearly 20x improvement in Dask DataFrame performance with the addition of Arrow stings, more efficient shufffling, and a query optimizer." src="/images/dask-improvement.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Performance Improvements for Dask DataFrames&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="intro"&gt;

&lt;p&gt;Dask DataFrame scales out pandas DataFrames to operate at the 100GB-100TB scale.&lt;/p&gt;
&lt;p&gt;Historically, &lt;a class="reference external" href="https://www.dask.org/"&gt;Dask&lt;/a&gt; was pretty slow compared to other tools in this space (like Spark). Due to a number of improvements focused on performance,
it’s now pretty fast (about 20x faster than before). The new implementation moved Dask from getting destroyed by
Spark on every benchmark to regularly outperforming Spark on TPC-H queries by a significant margin.&lt;/p&gt;
&lt;p&gt;Dask DataFrame workloads historically struggled in several areas. Performance and memory usage were
common pain points, and shuffling was unstable for bigger datasets, which made scaling out
hard. Writing efficient code required understanding too much of Dask’s internals.&lt;/p&gt;
&lt;p&gt;The new implementation changed all of this. Things that didn’t work were rewritten from scratch, and existing
implementations were improved upon. This puts Dask DataFrames on a solid foundation that
allows faster iteration cycles in the future.&lt;/p&gt;
&lt;p&gt;We’ll go through the three most prominent changes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#arrow"&gt;&lt;span class="xref myst"&gt;Apache Arrow support&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#p2p"&gt;&lt;span class="xref myst"&gt;Faster joins&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#optimizer"&gt;&lt;span class="xref myst"&gt;Query optimization&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We’ll cover how these changes impact performance and make it easier to use Dask efficiently, even for users that are new to distributed computing. We’ll also discuss plans for future improvements.&lt;/p&gt;
&lt;p&gt;&lt;a name="arrow"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 47)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="apache-arrow-support-efficient-string-datatype"&gt;
&lt;h1&gt;1. Apache Arrow Support: Efficient String Datatype&lt;/h1&gt;
&lt;p&gt;A Dask DataFrame consists of many pandas DataFrames. Historically, pandas used NumPy for numeric data,
but Python objects for text data, which are inefficient and blow up memory usage. Operations on
object data also hold the GIL, which doesn’t matter much for pandas itself, but is a catastrophe for
performance in a parallel system like Dask.&lt;/p&gt;
&lt;p&gt;The pandas 2.0 release introduced support for general-purpose Arrow datatypes, so &lt;a class="reference external" href="https://docs.coiled.io/blog/pyarrow-in-pandas-and-dask.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-is-fast"&gt;Dask now uses PyArrow-backed
strings by default&lt;/a&gt;. These are &lt;em&gt;much&lt;/em&gt; better. PyArrow strings reduce memory usage by up to 80% and
unlock multi-threading for string operations. Workloads that previously
struggled with available memory now fit comfortably in much less space, and are
a lot faster because they no longer constantly spill excess data to disk.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Bar chart comparing memory usage (GB) for Dask DataFrame with and without PyArrow strings. Dask DataFrame uses up to 80% less memory with PyArrow strings." src="/images/arrow-strings-memory-usage.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Memory Usage of the Legacy DataFrames Compared with Arrow Strings&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;a name="p2p"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 69)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="faster-joins-with-a-new-shuffle-algorithm"&gt;
&lt;h1&gt;2. Faster Joins with a New Shuffle Algorithm&lt;/h1&gt;
&lt;p&gt;Shuffling is an essential component of distributed systems to enable sorting, joins, and complex group by operations. It is an all-to-all, network-intensive operation that’s often the most expensive component in a workflow. Dask has a new shuffling system, which greatly impacts overall performance, especially on complex, data-intensive workloads.&lt;/p&gt;
&lt;p&gt;A shuffle operation is intrinsically an all-to-all communication operation where every
input partition has to provide a tiny slice of data to every output partition. Dask was already
using its own task-based algorithm that managed to reduce the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;O(n&lt;/span&gt; &lt;span class="pre"&gt;*&lt;/span&gt; &lt;span class="pre"&gt;n)&lt;/span&gt;&lt;/code&gt; task
complexity to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;O(log(n)&lt;/span&gt; &lt;span class="pre"&gt;*&lt;/span&gt; &lt;span class="pre"&gt;n)&lt;/span&gt;&lt;/code&gt; where &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt;&lt;/code&gt; is the number of partitions. This was a drastic
reduction in the number of tasks, but the non-linear scaling ultimately did not allow Dask to process
arbitrarily large datasets.&lt;/p&gt;
&lt;p&gt;Dask introduced a new &lt;a class="reference external" href="https://docs.coiled.io/blog/shuffling-large-data-at-constant-memory.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-is-fast"&gt;P2P (peer-to-peer) shuffle method&lt;/a&gt; that reduced the task complexity to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;O(n)&lt;/span&gt;&lt;/code&gt;,
which scales linearly with the size of the dataset and the size of the cluster. It also
incorporates an efficient disk integration that makes it easy to shuffle datasets much
larger than memory. The new system is extremely stable and “just works” across any scale of data.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Line plot of memory usage (GB) over time (seconds) comparing Dask DataFrame with the peer-to-peer shuffling to task-based shuffling. With P2P shuffling, Dask DataFrame memory usage remains consistently low throughout the computation." src="/images/shuffle-memory-comparison.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Memory Usage of the Legacy Shuffle Compared with P2P&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
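&lt;p&gt;To get a feel for why task counts matter, here is a toy calculation of how many tasks each approach creates for n partitions. This is illustrative only, not Dask’s actual bookkeeping:&lt;/p&gt;

```python
import math

# Illustrative task counts for shuffling n partitions; not Dask's
# actual internals, just the scaling behavior of each approach.
def naive_tasks(n):
    return n * n  # every input partition sends to every output partition

def task_based_tasks(n, split=32):
    # legacy multi-stage tree shuffle: roughly O(n log n) tasks
    stages = max(1, math.ceil(math.log(n, split)))
    return n * stages

def p2p_tasks(n):
    return n  # one transfer task per partition

for n in (100, 1_000, 10_000):
    print(n, naive_tasks(n), task_based_tasks(n), p2p_tasks(n))
```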
&lt;p&gt;&lt;a name="optimizer"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 94)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="optimizer"&gt;
&lt;h1&gt;3. Optimizer&lt;/h1&gt;
&lt;p&gt;Dask itself is lazy, which means that it registers your whole query before doing any actual work.
This is a powerful concept that enables many optimizations, but historically Dask didn’t take advantage of this
knowledge. Dask also did a poor job of hiding internal complexity, leaving
users to navigate the difficulties of distributed computing and large-scale queries
on their own. That made writing efficient code painful for non-experts.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org/en/stable/changelog.html#query-planning"&gt;The Dask release in March&lt;/a&gt; includes a complete re-implementation of the DataFrame API to support query optimization. This is a big deal.
The new engine centers around a query optimizer that rewrites your code to make it more efficient and
better tailored to Dask’s strengths. Let’s dive into some optimization strategies, how they make
Dask run faster and scale better.&lt;/p&gt;
&lt;p&gt;We will start with a couple general-purpose optimizations:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#column-projection"&gt;&lt;span class="xref myst"&gt;Column projection&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#filter-pushdown"&gt;&lt;span class="xref myst"&gt;Filter pushdown&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And then dive into more specific techniques that are tailored to distributed systems generally
and Dask more specifically:&lt;/p&gt;
&lt;ol class="arabic simple" start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#partition-resizing"&gt;&lt;span class="xref myst"&gt;Automatic partition resizing&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="#trivial-merge"&gt;&lt;span class="xref myst"&gt;Trivial merge and join operations&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a name="column-projection"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;section id="column-projection"&gt;
&lt;h2&gt;3.1 Column Projection&lt;/h2&gt;
&lt;p&gt;Most datasets have more columns than needed. Dropping them requires foresight (“What columns will I need for this query? 🤔”) so most people don’t think about this when loading data. This is bad for performance because carrying around lots of excess data slows everything down.
Column Projection drops columns as soon as they aren’t needed anymore. It’s a straightforward optimization, but highly beneficial.&lt;/p&gt;
&lt;p&gt;The legacy implementation always read all columns from storage and only dropped them when explicitly told to by the user.
Simply operating on less data is a big win for performance and memory usage.&lt;/p&gt;
&lt;p&gt;The optimizer looks at the query and figures out which columns are needed for each operation.
It looks at the final step of the query and then works backwards step by
step to the data source, injecting drop operations to get rid of unnecessary columns.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Diagram explaining how column projection works for the Dask DataFrame optimizer." src="/images/projection.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Only a subset of columns is needed. The replace operation doesn't need access to all columns, so Dask drops the unnecessary ones directly in the IO step.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
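&lt;p&gt;A heavily simplified sketch of such a backward pass. This is toy code with an invented plan format, not Dask’s actual optimizer:&lt;/p&gt;

```python
# Walk a linear plan from the final step back to the data source,
# accumulating the columns each step needs, then push that set into
# the read. The plan format here is invented for illustration.
plan = [
    ("read_parquet", {"columns": None}),  # source: all columns
    ("assign", {"uses": {"price", "qty"}, "creates": {"total"}}),
    ("groupby_mean", {"uses": {"region", "total"}}),
]

def project_columns(plan):
    needed = set()
    for name, step in reversed(plan):
        if name == "read_parquet":
            step["columns"] = sorted(needed)  # inject the projection
        else:
            needed -= step.get("creates", set())
            needed |= step["uses"]
    return plan

optimized = project_columns(plan)
print(optimized[0])  # read_parquet now only loads price, qty, region
```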
&lt;p&gt;&lt;a name="filter-pushdown"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="filter-pushdown"&gt;
&lt;h2&gt;3.2 Filter Pushdown&lt;/h2&gt;
&lt;p&gt;Filter pushdown is another general-purpose optimization with
the same goal as column projection: operate on less data. The legacy implementation did not reorder filter operations. The new implementation executes filter operations as early as
possible while maintaining the same results.&lt;/p&gt;
&lt;p&gt;The optimizer identifies every filter in the query and looks at the preceding operation to see whether the
filter can be moved closer to the data source. It repeats this until it finds an operation that
can’t be switched with a filter. This is a bit harder than
column projection, because Dask has to make sure that the operations don’t change the values of the
DataFrame. For example, switching a filter and a merge operation is fine (values don’t change), but switching a filter
and a replace operation is invalid, because the values might change and rows that would previously have been filtered out now won’t be, or vice versa.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Diagram explaining how filter pushdown (or predicate pushdown) works to reduce the amount of data being processed in the Dask DataFrame query optimizer." src="/images/filter.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Initially, the filter happens after the Dropna, but Dask can execute the filter before Dropna without changing the result. This allows Dask to push the filter into the IO step.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Additionally, if the filter is selective enough, Dask can drop complete files in the IO step.
This is the best-case scenario: an earlier filter brings a huge performance improvement and even
means less data has to be read from remote storage.&lt;/p&gt;
&lt;p&gt;&lt;a name="partition-resizing"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="automatically-resizing-partitions"&gt;
&lt;h2&gt;3.3 Automatically Resizing Partitions&lt;/h2&gt;
&lt;p&gt;In addition to implementing the common optimization techniques described above, we’ve also improved a
common pain point specific to distributed systems generally and Dask users specifically: optimal partition sizes.&lt;/p&gt;
&lt;p&gt;Dask DataFrames consist of many small pandas DataFrames called &lt;em&gt;partitions&lt;/em&gt;. Often, the number of
partitions is decided for you and Dask users are advised to manually “repartition” after reducing
or expanding their data (for example by dropping columns, filtering data, or expanding with joins) (see the &lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe-best-practices.html#repartition-to-reduce-overhead"&gt;Dask docs&lt;/a&gt;).
Without this extra step,
the (usually small) overhead from Dask can become a bottleneck if the pandas DataFrames
become too small, making Dask workflows painfully slow.&lt;/p&gt;
&lt;p&gt;Manually controlling the partition size is a difficult task that we, as Dask users, shouldn’t have
to worry about. It is also slow because it requires network transfer of some partitions.
Dask DataFrame now automatically does two things to help when the partitions get
too small:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Keeps the size of each partition constant, based on the ratio of data you want to compute vs.
the original file size. If, for example, you filter out 80% of the original dataset, Dask will
automatically combine the resulting smaller partitions into fewer, larger partitions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Combines too-small partitions into larger partitions, based on an absolute minimum
(default is 75 MB). If, for example, your original dataset is split into many tiny files,
Dask will automatically combine them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Diagram representing how column selection can reduce the size of a partition when reading parquet data." src="/images/automatic_repartitioning_1.png" style="width: 800px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Select two columns that take up 40 MB of memory out of the 200 MB from the whole parquet file.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The optimizer will look at the number of columns and the size of the data within those. It
calculates a ratio that is used to combine multiple files into one partition.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Diagram showing how automatic repartitioning works for Dask DataFrame." src="/images/automatic_repartitioning_2.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;The ratio of 40/200 results in combining five files into a single partition.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This step is currently limited to IO operations (like reading in a parquet dataset), but we plan
to extend it to other operations that allow cheaply combining partitions.&lt;/p&gt;
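&lt;p&gt;The heuristic from the figure can be sketched like this. This is toy code using the numbers from the example above, not Dask’s exact rule:&lt;/p&gt;

```python
def files_per_partition(selected_fraction):
    # Keep partitions near the original file size: if column selection
    # keeps only a fraction of each file, combine about 1/fraction files.
    return max(1, round(1 / selected_fraction))

# Selecting 40 MB of columns out of a 200 MB parquet file:
print(files_per_partition(40 / 200))  # 5 files per partition
```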
&lt;p&gt;&lt;a name="trivial-merge"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="trivial-merge-and-join-operations"&gt;
&lt;h2&gt;3.4 Trivial Merge and Join Operations&lt;/h2&gt;
&lt;p&gt;Merge and join operations are typically cheap on a single machine with pandas but expensive in a
distributed setting. Merging data in shared memory is cheap, while merging data across a network is quite slow,
due to the shuffle operations explained earlier.&lt;/p&gt;
&lt;p&gt;This is one of the most expensive operations in a distributed system. The legacy implementation triggered
a network transfer of both input DataFrames for every merge operation. This is sometimes necessary, but very
expensive.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Diagram representing how an expensive shuffle operation is avoided automatically if two DataFrames are already aligned before joining." src="/images/avoiding-shuffles.png" style="width: 800px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Both joins are performed on the same column. The left DataFrame is already properly partitioned after the first join, so Dask can avoid shuffling again with the new implementation.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The optimizer will determine when shuffling is necessary versus when
a trivial join is sufficient because the data is already aligned properly. This can make individual merges
an order of magnitude faster. This also applies to other operations that normally require a shuffle
like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby().apply()&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Dask merges used to be inefficient, which caused long runtimes. The optimizer fixes this for
the trivial case where these operations happen after each other, but the technique isn’t very
advanced yet. There is still a lot of potential for improvement.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Diagram showing how the query optimizer for Dask DataFrame automatically shuffles data earlier to make a groupby aggregation more efficient." src="/images/avoiding-shuffles-advanced.png" style="width: 800px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;The current implementation shuffles both branches that originate from the same table. Injecting a shuffle node further up avoids one of the expensive operations.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The optimizer will look at the expression and inject shuffle nodes where doing so avoids
redundant shuffles later in the query.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 250)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="how-do-the-improvements-stack-up-compared-to-the-legacy-implementation"&gt;
&lt;h1&gt;How do the improvements stack up compared to the legacy implementation?&lt;/h1&gt;
&lt;p&gt;Dask is now 20x faster than before. This improvement applies to the entire
DataFrame API (not just isolated components), with no known
performance regressions. Dask now runs workloads that were impossible to
complete in an acceptable timeframe before. This performance boost is due to many
improvements all layered on top of each other. It’s not about doing one thing
especially well, but about doing nothing especially poorly.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Bar chart showing a nearly 20x improvement in Dask DataFrame performance with the addition of Arrow stings, more efficient shufffling, and a query optimizer." src="/images/dask-improvement.png" style="width: 600px;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Performance Improvements on Query 3 of the TPC-H Benchmarks from &lt;a class="reference external" href="https://github.com/coiled/benchmarks/tree/main/tests/tpch"&gt;https://github.com/coiled/benchmarks/tree/main/tests/tpch&lt;/a&gt; &lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Performance, while the most enticing improvement, is not the only thing that got better. The
optimizer hides a lot of complexity from the user and makes the transition from pandas to Dask a
lot easier because it’s now much more difficult to write poorly performing code.
The whole system is more robust.&lt;/p&gt;
&lt;p&gt;The new architecture of the API is a lot easier to work with as well. The legacy implementation leaked
a lot of internal complexities into high-level API implementations, making changes cumbersome. Improvements
are almost trivial to add now.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 275)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-s-to-come"&gt;
&lt;h1&gt;What’s to come?&lt;/h1&gt;
&lt;p&gt;Dask DataFrame changed a lot over the last 18 months. The legacy API was often difficult to work with and
struggled with scaling out. The new implementation dropped things that didn’t work and
improved existing implementations. The heavy lifting is finished now, which allows for
faster iteration cycles to improve upon the status quo. Incremental improvements are now
trivial to add.&lt;/p&gt;
&lt;p&gt;A few things that are on the immediate roadmap:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto repartitioning:&lt;/strong&gt; this is partially implemented, but there is more potential to choose a more
efficient partition size during optimization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster Joins:&lt;/strong&gt; there’s still lots of fine-tuning to be done here.
For example, there is a PR in flight with a 30-40% improvement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Join Reordering:&lt;/strong&gt; Dask doesn’t do this yet, but it’s on the immediate roadmap.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/05/30/dask-is-fast.md&lt;/span&gt;, line 291)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="learn-more"&gt;
&lt;h1&gt;Learn more&lt;/h1&gt;
&lt;p&gt;This article focuses on a number of improvements to Dask DataFrame and how much faster and more reliable it is as a result. If you’re choosing between Dask and other popular DataFrame tools, you might also consider:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.coiled.io/blog/tpch.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-is-fast"&gt;DataFrames at Scale Comparison: TPC-H&lt;/a&gt; which compares Dask, Spark, Polars, and DuckDB performance on datasets ranging from 10 GB to 10 TB both locally and on the cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2024/05/30/dask-is-fast/"/>
    <summary>This work was engineered and supported by Coiled and NVIDIA. Thanks to Patrick Hoefler and Rick Zamora, in particular. Original version of this post appears on docs.coiled.io</summary>
    <category term="dask" label="dask"/>
    <category term="dataframe" label="dataframe"/>
    <category term="queryoptimizer" label="query optimizer"/>
    <published>2024-05-30T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/04/14/scheduler-environment-requirements/</id>
    <title>Do you need consistent environments between the client, scheduler and workers?</title>
    <updated>2023-04-14T00:00:00+00:00</updated>
    <author>
      <name>Florian Jetter</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;Update May 3rd 2023: &lt;a class="reference external" href="https://github.com/dask/dask-blog/pull/166"&gt;Clarify GPU recommendations&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;With the release &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;2023.4.0&lt;/span&gt;&lt;/code&gt; of dask and distributed we are making a change which may require the Dask scheduler to have the same software and hardware capabilities as the client and workers.&lt;/p&gt;
&lt;p&gt;It has always been recommended that your client and workers have a consistent software and hardware environment so that data structures and dependencies can be pickled and passed between them. However, recent changes to the Dask scheduler mean that we now also require your scheduler to have the same consistent environment as everything else.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/14/scheduler-environment-requirements.md&lt;/span&gt;, line 15)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-does-this-mean-for-me"&gt;

&lt;p&gt;For most users, this change should go unnoticed, as it is common to run all Dask components in the same conda environment or docker image, typically on homogeneous machines.&lt;/p&gt;
&lt;p&gt;However, for folks who may have optimized their schedulers to use cut-down environments, or for users with specialized hardware such as GPUs available on their client/workers but not the scheduler, there may be some impact.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/14/scheduler-environment-requirements.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-will-the-impact-be"&gt;
&lt;h1&gt;What will the impact be?&lt;/h1&gt;
&lt;p&gt;If you run into errors such as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;quot;RuntimeError:&lt;/span&gt; &lt;span class="pre"&gt;Error&lt;/span&gt; &lt;span class="pre"&gt;during&lt;/span&gt; &lt;span class="pre"&gt;deserialization&lt;/span&gt; &lt;span class="pre"&gt;of&lt;/span&gt; &lt;span class="pre"&gt;the&lt;/span&gt; &lt;span class="pre"&gt;task&lt;/span&gt; &lt;span class="pre"&gt;graph.&lt;/span&gt; &lt;span class="pre"&gt;This&lt;/span&gt; &lt;span class="pre"&gt;frequently&lt;/span&gt; &lt;span class="pre"&gt;occurs&lt;/span&gt; &lt;span class="pre"&gt;if&lt;/span&gt; &lt;span class="pre"&gt;the&lt;/span&gt; &lt;span class="pre"&gt;Scheduler&lt;/span&gt; &lt;span class="pre"&gt;and&lt;/span&gt; &lt;span class="pre"&gt;Client&lt;/span&gt; &lt;span class="pre"&gt;have&lt;/span&gt; &lt;span class="pre"&gt;different&lt;/span&gt; &lt;span class="pre"&gt;environments.&amp;quot;&lt;/span&gt;&lt;/code&gt; please ensure your software environment is consistent between your client, scheduler and workers.&lt;/p&gt;
&lt;p&gt;If you are passing GPU objects between the client and workers we now recommend that your scheduler has a GPU too. This recommendation is just so that GPU-backed objects contained in Dask graphs can be deserialized on the scheduler if necessary. Typically the GPU available to the scheduler doesn’t need to be as powerful as long as it has &lt;a class="reference external" href="https://en.wikipedia.org/wiki/CUDA#GPUs_supported"&gt;similar CUDA compute capabilities&lt;/a&gt;. For example for cost optimization reasons you may want to use A100s on your client and workers and a T4 on your scheduler.&lt;/p&gt;
&lt;p&gt;Users who do not have a GPU on the client and are leveraging GPU workers shouldn’t run into this as the GPU objects will only exist on the workers.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/14/scheduler-environment-requirements.md&lt;/span&gt;, line 29)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-are-we-doing-this"&gt;
&lt;h1&gt;Why are we doing this?&lt;/h1&gt;
&lt;p&gt;The reason we now suggest that you have the same hardware/software capabilities on the scheduler is that we are giving the scheduler the ability to deserialize graphs before distributing them to the workers. This will allow the scheduler to make smarter scheduling decisions in the future by having a better understanding of the operation it is performing.&lt;/p&gt;
&lt;p&gt;The downside to this is that graphs can contain complex Python objects created by any number of dependencies on the client side, so in order for the scheduler to deserialize them it needs to have the same libraries installed. Equally, if the client-side packages create GPU objects then the scheduler will also need one.&lt;/p&gt;
&lt;p&gt;We are sure you’ll agree that this breakage for a small percentage of users will be worth it for the long-term improvements to Dask.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/04/14/scheduler-environment-requirements/"/>
    <summary>Update May 3rd 2023: Clarify GPU recommendations.</summary>
    <category term="IO" label="IO"/>
    <category term="dataframe" label="dataframe"/>
    <published>2023-04-14T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/04/12/from-map/</id>
    <title>Deep Dive into creating a Dask DataFrame Collection with from_map</title>
    <updated>2023-04-12T00:00:00+00:00</updated>
    <author>
      <name>Rick Zamora</name>
    </author>
<content type="html">&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe.html"&gt;Dask DataFrame&lt;/a&gt; provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_csv&lt;/span&gt;&lt;/code&gt;) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, &lt;a class="reference external" href="https://docs.dask.org/en/stable/generated/dask.dataframe.from_map.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; makes custom collection creation both easier and more performant.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 11)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-is-from-map"&gt;
&lt;h1&gt;What is &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;?&lt;/h1&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API was added to Dask DataFrame in v2022.05.1 with the intention of replacing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_delayed&lt;/span&gt;&lt;/code&gt; as the recommended means of custom DataFrame creation. At its core, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; simply converts each element of an iterable object (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;inputs&lt;/span&gt;&lt;/code&gt;) into a distinct Dask DataFrame partition, using a common function (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The overall behavior is essentially the Dask DataFrame equivalent of the standard-Python &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that both &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map&lt;/span&gt;&lt;/code&gt; actually support an arbitrary number of iterable inputs. However, we will only focus on the use of a single iterable argument in this post.&lt;/p&gt;
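&lt;p&gt;To make the analogy concrete, here is the plain-&lt;code&gt;map&lt;/code&gt; version of partition creation with a stand-in loader (the &lt;code&gt;load_partition&lt;/code&gt; function and its fake rows are illustrative only): each input element yields one “partition”, and &lt;code&gt;dd.from_map&lt;/code&gt; does the same thing lazily, yielding Dask DataFrame partitions instead of in-memory objects.&lt;/p&gt;

```python
def load_partition(path):
    # Stand-in for a real loader such as pd.read_feather;
    # here each "partition" is just a list of fake row labels
    return [f"{path}:row{i}" for i in range(3)]

paths = ["data.0.feather", "data.1.feather"]

# map: one output element per input element
partitions = list(map(load_partition, paths))
assert len(partitions) == len(paths)
# dd.from_map(load_partition, paths) is the DataFrame analogue:
# one Dask partition per path, with each load deferred until compute()
```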
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 27)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="a-simple-example"&gt;
&lt;h1&gt;A simple example&lt;/h1&gt;
&lt;p&gt;To better understand the behavior of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;, let’s consider the simple case that we want to interact with Feather-formatted data created with the following Pandas code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./data.0.feather&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;./data.1.feather&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;xyz&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;index&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Since Dask does not yet offer a dedicated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; function (as of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-2023.3.1&lt;/span&gt;&lt;/code&gt;), most users would assume that the only option to create a Dask DataFrame collection is to use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt;. The “best practice” for creating a collection in this case, however, is to wrap &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pd.read_feather&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf.read_feather&lt;/span&gt;&lt;/code&gt; in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; call like so:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   A       B  index&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;               int64  object  int64&lt;/span&gt;
&lt;span class="go"&gt;                 ...     ...    ...&lt;/span&gt;
&lt;span class="go"&gt;                 ...     ...    ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: read_feather, 1 graph layer&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Which produces the following Pandas (or cuDF) object after computation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;   A  B  index&lt;/span&gt;
&lt;span class="go"&gt;0  0  x      0&lt;/span&gt;
&lt;span class="go"&gt;1  0  y      1&lt;/span&gt;
&lt;span class="go"&gt;2  0  z      2&lt;/span&gt;
&lt;span class="go"&gt;0  1  x      3&lt;/span&gt;
&lt;span class="go"&gt;1  1  y      4&lt;/span&gt;
&lt;span class="go"&gt;2  1  z      5&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Although the same output can be achieved using the conventional &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.from_delayed&lt;/span&gt;&lt;/code&gt; strategy, using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; will improve the available opportunities for task-graph optimization within Dask.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 74)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performance-considerations-specifying-meta-and-divisions"&gt;
&lt;h1&gt;Performance considerations: Specifying &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;Although &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;iterable&lt;/span&gt;&lt;/code&gt; are the only &lt;em&gt;required&lt;/em&gt; arguments to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;, one can significantly improve the overall performance of a workflow by specifying optional arguments like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Due to the lazy nature of Dask DataFrame, each collection is required to carry around a schema (column name and dtype information) in the form of an empty Pandas (or cuDF) object. If &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; is not directly provided to the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; function, the schema will need to be populated by eagerly materializing the first partition, which can increase the apparent latency of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API call itself. For this reason, it is always recommended to specify an explicit &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; argument if the expected column names and dtypes are known a priori.&lt;/p&gt;
&lt;p&gt;While passing in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; argument is likely to reduce the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API call latency, passing in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt; argument makes it possible to reduce the end-to-end compute time. This is because, by specifying &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt;, we allow Dask DataFrame to track useful per-partition min/max statistics. If the overall workflow involves grouping or joining on the index, Dask can then avoid unnecessary shuffling operations.&lt;/p&gt;
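&lt;p&gt;Why do known &lt;code&gt;divisions&lt;/code&gt; help so much? They are the sorted boundary values of the partition index, so the partition holding any given key can be found with a binary search rather than a shuffle. A minimal sketch of that lookup (the &lt;code&gt;partition_for&lt;/code&gt; helper is ours, for illustration only):&lt;/p&gt;

```python
import bisect

# Divisions for 3 partitions: partition i covers [divisions[i], divisions[i+1]),
# with the final boundary being an inclusive upper bound
divisions = (0, 3, 6, 8)

def partition_for(key, divisions):
    """Find which partition holds `key`, given sorted division boundaries."""
    i = bisect.bisect_right(divisions, key) - 1
    # clamp: keys equal to the last division belong to the last partition
    return min(i, len(divisions) - 2)

assert partition_for(0, divisions) == 0
assert partition_for(4, divisions) == 1
assert partition_for(8, divisions) == 2
```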
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 82)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="using-from-map-to-implement-a-custom-api"&gt;
&lt;h1&gt;Using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; to implement a custom API&lt;/h1&gt;
&lt;p&gt;Although it is currently difficult to automatically extract division information from the metadata of an arbitrary Feather dataset, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; makes it relatively easy to implement your own highly-functional &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; API using &lt;a class="reference external" href="https://arrow.apache.org/docs/python/index.html"&gt;PyArrow&lt;/a&gt;. For example, the following code is all that one needs to enable lazy Feather IO with both column projection and index selection:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;(Optional) Utility to enforce &amp;#39;backend&amp;#39; configuration&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cudf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cudf&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Create a Dask DataFrame from Feather files&lt;/span&gt;

&lt;span class="sd"&gt;    Example of a &amp;quot;custom&amp;quot; `from_map` IO function&lt;/span&gt;

&lt;span class="sd"&gt;    Parameters&lt;/span&gt;
&lt;span class="sd"&gt;    ----------&lt;/span&gt;
&lt;span class="sd"&gt;    paths: list&lt;/span&gt;
&lt;span class="sd"&gt;        List of Feather-formatted paths. Each path will&lt;/span&gt;
&lt;span class="sd"&gt;        be mapped to a distinct DataFrame partition.&lt;/span&gt;
&lt;span class="sd"&gt;    columns: list or None, default None&lt;/span&gt;
&lt;span class="sd"&gt;        Optional list of columns to select from each file.&lt;/span&gt;
&lt;span class="sd"&gt;    index: str or None, default None&lt;/span&gt;
&lt;span class="sd"&gt;        Optional column name to set as the DataFrame index.&lt;/span&gt;

&lt;span class="sd"&gt;    Returns&lt;/span&gt;
&lt;span class="sd"&gt;    -------&lt;/span&gt;
&lt;span class="sd"&gt;    dask.dataframe.DataFrame&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pyarrow.dataset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;ds&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 1: Extract `meta` from the dataset&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;feather&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_table&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Define the `func` argument&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a Pandas DataFrame from a dataset fragment&lt;/span&gt;
        &lt;span class="c1"&gt;# NOTE: In practice, this function should&lt;/span&gt;
        &lt;span class="c1"&gt;# always be defined outside `read_feather`&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;read_columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Define the `iterable` argument&lt;/span&gt;
    &lt;span class="n"&gt;iterable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_fragments&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Call `from_map`&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# `func` kwarg&lt;/span&gt;
        &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# `func` kwarg&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here we see that using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; to enable completely-lazy collection creation only requires four steps. First, we use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow.dataset&lt;/span&gt;&lt;/code&gt; to define a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; argument for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;, so that we can avoid the unnecessary overhead of an eager read operation. For some file formats and/or applications, it may also be possible to calculate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt; at this point. However, as explained above, such information is not readily available for this particular example.&lt;/p&gt;
&lt;p&gt;The second step is to define the underlying function (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;) that we will use to produce each of our final DataFrame partitions. Third, we define one or more iterable objects containing the unique information needed to produce each partition (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;iterable&lt;/span&gt;&lt;/code&gt;). In this case, the only iterable object corresponds to a generator of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow.dataset&lt;/span&gt;&lt;/code&gt; fragments, which is essentially a wrapper around the input path list.&lt;/p&gt;
&lt;p&gt;The fourth and final step is to use the final &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;iterable&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; information to call the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API. Note that we also use this opportunity to specify additional keyword arguments, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;columns&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;index&lt;/span&gt;&lt;/code&gt;. In contrast to the iterable positional arguments, which are mapped element-by-element onto &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;, these keyword arguments are broadcast to every call.&lt;/p&gt;
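&lt;p&gt;The distinction between mapped and broadcast arguments can be shown without Dask at all: each element of the iterable produces one call, while the same keyword arguments are handed unchanged to every call (the names below are illustrative):&lt;/p&gt;

```python
def func(frag, columns=None, index=None):
    # Echo the arguments so we can see what each call received
    return (frag, columns, index)

fragments = ["frag0", "frag1"]

# Mapped: one call per element of `fragments`.
# Broadcast: the same columns/index kwargs reach every call,
# mirroring how from_map treats extra keyword arguments.
results = [func(frag, columns=["A"], index="index") for frag in fragments]
assert results == [
    ("frag0", ["A"], "index"),
    ("frag1", ["A"], "index"),
]
```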
&lt;p&gt;Using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; implementation above, it becomes both easy and efficient to convert an arbitrary Feather dataset into a lazy Dask DataFrame collection:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;index&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   A&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;               int64&lt;/span&gt;
&lt;span class="go"&gt;                 ...&lt;/span&gt;
&lt;span class="go"&gt;                 ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: func, 1 graph layer&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;       A&lt;/span&gt;
&lt;span class="go"&gt;index&lt;/span&gt;
&lt;span class="go"&gt;0      0&lt;/span&gt;
&lt;span class="go"&gt;1      0&lt;/span&gt;
&lt;span class="go"&gt;2      0&lt;/span&gt;
&lt;span class="go"&gt;3      1&lt;/span&gt;
&lt;span class="go"&gt;4      1&lt;/span&gt;
&lt;span class="go"&gt;5      1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 183)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="advanced-enhancing-column-projection"&gt;
&lt;h1&gt;Advanced: Enhancing column projection&lt;/h1&gt;
&lt;p&gt;Although a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; implementation like the one above is likely to meet the basic needs of most applications, users will often leave out the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;columns&lt;/span&gt;&lt;/code&gt; argument in practice. For example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For code like this, as the implementation currently stands, each IO task would be forced to read in an entire Feather file, and then select the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;/code&gt; column from a Pandas/cuDF DataFrame only after it had already been read into memory. The additional overhead is insignificant for the toy dataset used here. However, avoiding this kind of unnecessary IO can lead to dramatic performance improvements in real-world applications.&lt;/p&gt;
&lt;p&gt;So, how can we modify our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; implementation to take advantage of external column-projection operations (like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf[&amp;quot;A&amp;quot;]&lt;/span&gt;&lt;/code&gt;)? The good news is that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; is already equipped with the necessary graph-optimization hooks to handle this, so long as the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt; object satisfies the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DataFrameIOFunction&lt;/span&gt;&lt;/code&gt; protocol:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@runtime_checkable&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;DataFrameIOFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;DataFrame IO function with projectable columns&lt;/span&gt;
&lt;span class="sd"&gt;    Enables column projection in ``DataFrameIOLayer``.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return the current column projection&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;project_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return a new DataFrameIOFunction object&lt;/span&gt;
&lt;span class="sd"&gt;        with a new column projection&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return a new DataFrame partition&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That is, all we need to do is change “Step 2” of our implementation to use the following code instead:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe.io.utils&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrameIOFunction&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ReadFeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DataFrameIOFunction&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Create a Pandas/cuDF DataFrame from a dataset fragment&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;

        &lt;span class="nd"&gt;@property&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_columns&lt;/span&gt;

        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;project_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Replace this object with one that will only read `columns`&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ReadFeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;

        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Same logic as original `func`&lt;/span&gt;
            &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;read_columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReadFeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
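&lt;p&gt;To see why the protocol matters, the optimization can be mimicked in isolation. The sketch below uses toy names (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ReadAll&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;optimize&lt;/span&gt;&lt;/code&gt; are hypothetical, not Dask internals): a graph optimizer that detects the protocol can replace the IO function with a projected one before any data is read.&lt;/p&gt;

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class DataFrameIOFunction(Protocol):
    """IO function with projectable columns (same shape as Dask's protocol)."""
    @property
    def columns(self): ...
    def project_columns(self, columns): ...
    def __call__(self, *args, **kwargs): ...

class ReadAll:
    """Toy IO function: 'reads' a dict of columns, optionally projected."""
    def __init__(self, columns=None):
        self._columns = columns

    @property
    def columns(self):
        return self._columns

    def project_columns(self, columns):
        # Return a new IO function that only reads `columns`
        return ReadAll(columns) if columns != self._columns else self

    def __call__(self, data):
        keep = self._columns if self._columns is not None else list(data)
        return {k: data[k] for k in keep}

def optimize(func, projection):
    # Miniature version of what the graph optimization does: if the IO
    # function supports projection, swap it out before any data is read.
    return func.project_columns(projection) if isinstance(func, DataFrameIOFunction) else func

func = optimize(ReadAll(), ["A"])
print(func({"A": [1], "B": [2]}))  # {'A': [1]}
```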
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 251)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;It is now easier than ever to create a Dask DataFrame collection from an arbitrary data source. Although the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt; API has already enabled similar functionality for many years, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; now makes it possible to implement a custom IO function without sacrificing any of the high-level graph optimizations leveraged by the rest of the Dask DataFrame API.&lt;/p&gt;
&lt;p&gt;Start experimenting with &lt;a class="reference external" href="https://docs.dask.org/en/stable/generated/dask.dataframe.from_map.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; today, and let us know how it goes!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/04/12/from-map/"/>
    <summary>Dask DataFrame provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g read_csv) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, from_map now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, from_map now makes custom collection creation both easier and more performant.</summary>
    <category term="IO" label="IO"/>
    <category term="dataframe" label="dataframe"/>
    <published>2023-04-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument/</id>
    <title>Understanding Dask’s meta keyword argument</title>
    <updated>2022-08-09T00:00:00+00:00</updated>
    <author>
      <name>Pavithra Eswaramoorthy</name>
    </author>
    <content type="html">&lt;p&gt;If you have worked with Dask DataFrames or Dask Arrays, you have probably come across the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; keyword argument. Perhaps, while using methods like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apply()&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeseries&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;my_custom_arithmetic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;


&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;my_computation&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;my_custom_arithmetic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Output:&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#                        id    name         x         y  my_computation&lt;/span&gt;
&lt;span class="c1"&gt;# timestamp&lt;/span&gt;
&lt;span class="c1"&gt;# 2000-01-01 00:00:00  1055  Victor -0.575374  0.868320        2.067696&lt;/span&gt;
&lt;span class="c1"&gt;# 2000-01-01 00:00:01   994   Zelda  0.963684  0.972240        0.000000&lt;/span&gt;
&lt;span class="c1"&gt;# 2000-01-01 00:00:02   982  George -0.997531 -0.876222        0.000000&lt;/span&gt;
&lt;span class="c1"&gt;# 2000-01-01 00:00:03   981  Ingrid  0.852159 -0.419733        0.000000&lt;/span&gt;
&lt;span class="c1"&gt;# 2000-01-01 00:00:04  1029   Jerry -0.839431 -0.736572       -0.768500&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You might have also seen one or more of the following warnings/errors:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="ne"&gt;UserWarning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;so&lt;/span&gt; &lt;span class="n"&gt;Dask&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;running&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;small&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;guess&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;It&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;possible&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;Dask&lt;/span&gt; &lt;span class="n"&gt;will&lt;/span&gt; &lt;span class="n"&gt;guess&lt;/span&gt; &lt;span class="n"&gt;incorrectly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;ValueError: Metadata inference failed in …
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If the above messages look familiar, this blog post is for you. :)&lt;/p&gt;
&lt;p&gt;We will discuss:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;what the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; keyword argument is,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;why Dask needs &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;, and&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;how to use it effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will look at &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; mainly in the context of Dask DataFrames; however, similar principles also apply to Dask Arrays.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2022/08/09/understanding-meta-keyword-argument.md&lt;/span&gt;, line 65)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-is-meta"&gt;

&lt;p&gt;Before answering this, let’s quickly discuss &lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe.html"&gt;Dask DataFrames&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A Dask DataFrame is a lazy object composed of multiple &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"&gt;pandas DataFrames&lt;/a&gt;, where each pandas DataFrame is called a “partition”. The partitions are stacked along the index, and Dask keeps track of them using “divisions”: a tuple containing the starting index value of each partition, plus the final index value of the last one.&lt;/p&gt;
&lt;img src="https://docs.dask.org/en/stable/_images/dask-dataframe.svg" alt="Dask DataFrame consists of multiple pandas DataFrames" width="50%"&gt;
&lt;p&gt;When you create a Dask DataFrame, you usually see something like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;        &lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;        &lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   x      y&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;0              int64  int64&lt;/span&gt;
&lt;span class="go"&gt;3                ...    ...&lt;/span&gt;
&lt;span class="go"&gt;5                ...    ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: from_pandas, 2 tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, Dask has created the structure of the DataFrame using some “metadata” information about the &lt;em&gt;column names&lt;/em&gt; and their &lt;em&gt;datatypes&lt;/em&gt;. This metadata information is called &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;. Dask uses &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; for understanding Dask operations and creating accurate task graphs (i.e., the logic of your computation).&lt;/p&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; &lt;em&gt;keyword argument&lt;/em&gt; in various Dask DataFrame functions allows you to explicitly share this metadata information with Dask. Note that the keyword argument is concerned with the metadata of the &lt;em&gt;output&lt;/em&gt; of those functions.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2022/08/09/understanding-meta-keyword-argument.md&lt;/span&gt;, line 99)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-does-dask-need-meta"&gt;
&lt;h1&gt;Why does Dask need &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;?&lt;/h1&gt;
&lt;p&gt;Dask computations are evaluated &lt;em&gt;lazily&lt;/em&gt;. This means Dask builds the logic and flow of the computation, called a task graph, immediately, but evaluates it only when necessary – usually, on calling &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.compute()&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;An example task graph generated to compute the sum of the DataFrame:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="go"&gt;Dask Series Structure:&lt;/span&gt;
&lt;span class="go"&gt;npartitions=1&lt;/span&gt;
&lt;span class="go"&gt;x    int64&lt;/span&gt;
&lt;span class="go"&gt;y      ...&lt;/span&gt;
&lt;span class="go"&gt;dtype: int64&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: dataframe-sum-agg, 5 tasks&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/understanding-meta-task-graph.png" alt="Dask task graph, starts with two partitions thatare input to a dataframe-sum-chunk task each. Their results are input to a single dataframe-sum-agg which produces the final output." width="50%"&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;x    15&lt;/span&gt;
&lt;span class="go"&gt;y    75&lt;/span&gt;
&lt;span class="go"&gt;dtype: int64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a single operation, but Dask workflows usually have multiple such operations chained together. To create the task graph effectively, Dask needs to know the structure and datatypes of the DataFrame after each operation, especially because Dask does not know the actual values of the DataFrame yet.&lt;/p&gt;
&lt;p&gt;This is where &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; comes in.&lt;/p&gt;
&lt;p&gt;In the above example, the Dask DataFrame changed into a Series after &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum()&lt;/span&gt;&lt;/code&gt;. Dask knows this (even before we call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute()&lt;/span&gt;&lt;/code&gt;) only because of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Internally, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; is represented as an empty pandas &lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe.html"&gt;DataFrame&lt;/a&gt; or &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html"&gt;Series&lt;/a&gt;, which has the same structure as the Dask DataFrame. To learn more about how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; is defined internally, check out the &lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe-design.html#metadata"&gt;DataFrame Internal Design documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To see the actual metadata information for a collection, you can look at the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt; attribute[1]:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;
&lt;span class="go"&gt;Series([], dtype: int64)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2022/08/09/understanding-meta-keyword-argument.md&lt;/span&gt;, line 142)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-to-specify-meta"&gt;
&lt;h1&gt;How to specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;?&lt;/h1&gt;
&lt;p&gt;You can specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; in a few different ways, but the recommended way for Dask DataFrame is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;“An empty &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pd.DataFrame&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pd.Series&lt;/span&gt;&lt;/code&gt; that matches the dtypes and column names of the output.”&lt;/p&gt;
&lt;p&gt;~ &lt;a class="reference external" href="https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DataFrame.apply()&lt;/span&gt;&lt;/code&gt; docstring&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;meta_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;meta_df&lt;/span&gt;

&lt;span class="go"&gt;Empty DataFrame&lt;/span&gt;
&lt;span class="go"&gt;Columns: [x, y]&lt;/span&gt;
&lt;span class="go"&gt;Index: []&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf2&lt;/span&gt;
&lt;span class="go"&gt;   x  y&lt;/span&gt;
&lt;span class="go"&gt;0  0  0&lt;/span&gt;
&lt;span class="go"&gt;1  1  1&lt;/span&gt;
&lt;span class="go"&gt;2  2  2&lt;/span&gt;
&lt;span class="go"&gt;3  0  3&lt;/span&gt;
&lt;span class="go"&gt;4  1  4&lt;/span&gt;
&lt;span class="go"&gt;5  2  5&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe-design.html#metadata"&gt;other ways you can describe &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For a DataFrame, you can specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; as a:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Python dictionary: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;{column_name_1:&lt;/span&gt; &lt;span class="pre"&gt;dtype_1,&lt;/span&gt; &lt;span class="pre"&gt;column_name_2:&lt;/span&gt; &lt;span class="pre"&gt;dtype_2,&lt;/span&gt; &lt;span class="pre"&gt;…}&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Iterable of tuples: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;[(column_name_1,&lt;/span&gt; &lt;span class="pre"&gt;dtype_1),&lt;/span&gt; &lt;span class="pre"&gt;(column_name_2,&lt;/span&gt; &lt;span class="pre"&gt;dtype_2),&lt;/span&gt; &lt;span class="pre"&gt;…]&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that when describing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; as a dictionary or an iterable of tuples, the order of the column names matters. Dask uses the same order to create the pandas DataFrame for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;. If the order does not match the actual output, you will see the following error:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;provided&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For a Series output, you can specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; using a single tuple: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(column_name,&lt;/span&gt; &lt;span class="pre"&gt;dtype)&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You should &lt;strong&gt;not&lt;/strong&gt; describe &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; using just a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dtype&lt;/span&gt;&lt;/code&gt; (like: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta=&amp;quot;int64&amp;quot;&lt;/span&gt;&lt;/code&gt;), even for scalar outputs. If you do, you will see the following warning:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;FutureWarning: Meta is not valid, `map_partitions` and `map_overlap` expects output to be a pandas object. Try passing a pandas object as meta or a dict or tuple representing the (name, dtype) of the columns. In the future the meta you passed will not work.
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;During operations like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_partitions&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apply&lt;/span&gt;&lt;/code&gt; (which uses &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_partitions&lt;/span&gt;&lt;/code&gt; internally), Dask coerces the scalar output of each partition into a pandas object. So, the output of functions that take &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; will never be scalar.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;pandas.core.series.Series&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, the Dask DataFrame &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf&lt;/span&gt;&lt;/code&gt; has only one partition. Hence, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;len(x)&lt;/span&gt;&lt;/code&gt; on that one partition would result in a scalar output of integer dtype. However, when we compute it, we see a pandas Series. This confirms that Dask is coercing the outputs to pandas objects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Another note:&lt;/strong&gt; Dask Array may not always do this conversion. You can check the &lt;a class="reference external" href="https://docs.dask.org/en/stable/array-api.html"&gt;API reference&lt;/a&gt; for your particular Array operation for details.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;my_arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;my_arr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;10&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2022/08/09/understanding-meta-keyword-argument.md&lt;/span&gt;, line 215)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="meta-does-not-force-the-structure-or-dtypes"&gt;
&lt;h1&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; does not &lt;em&gt;force&lt;/em&gt; the structure or dtypes&lt;/h1&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; can be thought of as a suggestion to Dask. Dask uses this &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; to generate the task graph until it can infer the actual metadata from the values. It &lt;strong&gt;does not&lt;/strong&gt; force the output to have the structure or dtype of the specified &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Consider the following example, and remember that we defined &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf&lt;/span&gt;&lt;/code&gt; with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;y&lt;/span&gt;&lt;/code&gt; column names in the previous sections.&lt;/p&gt;
&lt;p&gt;If we provide different column names (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;a&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;b&lt;/span&gt;&lt;/code&gt;) in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; description, Dask uses these new names to create the task graph:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;meta_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   a      b&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;0              int64  int64&lt;/span&gt;
&lt;span class="go"&gt;3                ...    ...&lt;/span&gt;
&lt;span class="go"&gt;5                ...    ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: apply, 4 tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However, if we compute &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;result&lt;/span&gt;&lt;/code&gt;, we will get the following error:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;ValueError: The columns in the computed data do not match the columns in the provided metadata&lt;/span&gt;
&lt;span class="go"&gt;  Extra:   [&amp;#39;x&amp;#39;, &amp;#39;y&amp;#39;]&lt;/span&gt;
&lt;span class="go"&gt;  Missing: [&amp;#39;a&amp;#39;, &amp;#39;b&amp;#39;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;While computing, Dask evaluates the actual metadata, which has columns &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;y&lt;/span&gt;&lt;/code&gt;. This does not match the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; that we provided, so Dask raises a helpful error message. Notice that Dask does not change the output to have columns &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;a&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;b&lt;/span&gt;&lt;/code&gt;; it uses those names only in the intermediate task graph.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2022/08/09/understanding-meta-keyword-argument.md&lt;/span&gt;, line 247)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="using-meta-directly"&gt;
&lt;h1&gt;Using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt; directly&lt;/h1&gt;
&lt;p&gt;In some rare cases, you can also set the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt; attribute[1] directly on a Dask DataFrame. For example, if the DataFrame was created with incorrect dtypes, like:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;        &lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;        &lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;object&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Note the “object” dtype&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                    x       y&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;0              object  object&lt;/span&gt;
&lt;span class="go"&gt;3                 ...     ...&lt;/span&gt;
&lt;span class="go"&gt;5                 ...     ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: from_pandas, 2 tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The values are clearly integers but the dtype says &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;object&lt;/span&gt;&lt;/code&gt;, so we can’t perform integer operations like addition:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="go"&gt;ValueError: Metadata inference failed in `add`.&lt;/span&gt;

&lt;span class="go"&gt;Original error is below:&lt;/span&gt;
&lt;span class="go"&gt;------------------------&lt;/span&gt;
&lt;span class="go"&gt;TypeError(&amp;#39;can only concatenate str (not &amp;quot;int&amp;quot;) to str&amp;#39;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, we can explicitly define &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt;[1]:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;int64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then, perform the addition:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   x      y&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;0              int64  int64&lt;/span&gt;
&lt;span class="go"&gt;3                ...    ...&lt;/span&gt;
&lt;span class="go"&gt;5                ...    ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: add, 4 tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Thanks for reading!&lt;/p&gt;
&lt;p&gt;Have you run into issues with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; before? Please let us know on &lt;a class="reference external" href="https://dask.discourse.group/"&gt;Discourse&lt;/a&gt;, and we will consider including it here, or updating the Dask documentation. :)&lt;/p&gt;
&lt;p&gt;[1] NOTE: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt; is not a public property, so we recommend using it only when necessary. There is &lt;a class="reference external" href="https://github.com/dask/dask/issues/8585"&gt;an ongoing discussion&lt;/a&gt; around creating public methods to get, set, and view the information in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;._meta&lt;/span&gt;&lt;/code&gt;, and this blog post will be updated to use the public methods when they’re created.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument/"/>
    <summary>If you have worked with Dask DataFrames or Dask Arrays, you have probably come across the meta keyword argument. Perhaps, while using methods like apply():</summary>
    <category term="dataframe" label="dataframe"/>
    <published>2022-08-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/10/08/df-groupby/</id>
    <title>DataFrame Groupby Aggregations</title>
    <updated>2019-10-08T00:00:00+00:00</updated>
    <author>
      <name>Benjamin Zaitlen &amp; James Bourbeau</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/10/08/df-groupby.md&lt;/span&gt;, line 10)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="groupby-aggregations-with-dask"&gt;

&lt;p&gt;In this post we’ll dive into how Dask computes groupby aggregations. These are commonly used operations for ETL and analysis in which we split data into groups, apply a function to each group independently, and then combine the results back together. In the PyData/R world this is often referred to as the split-apply-combine strategy (first coined by &lt;a class="reference external" href="https://www.jstatsoft.org/article/view/v040i01"&gt;Hadley Wickham&lt;/a&gt;) and is used widely throughout the &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html"&gt;Pandas ecosystem&lt;/a&gt;.&lt;/p&gt;
&lt;div align="center"&gt;
  &lt;a href="/images/split-apply-combine.png"&gt;
    &lt;img src="/images/split-apply-combine.png" width="80%" align="center"&gt;
  &lt;/a&gt;
  &lt;p align="center"&gt;&lt;i&gt;Image courtesy of swcarpentry.github.io&lt;/i&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Dask leverages this idea using a similarly catchy name: apply-concat-apply or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; for short. Here we’ll explore the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; strategy in both simple and complex operations.&lt;/p&gt;
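&lt;p&gt;Before diving in, the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; idea can be sketched in plain pandas (an illustrative toy, not Dask’s actual implementation): apply a groupby aggregation to each partition, concatenate the partial results, then apply a final aggregation to merge groups that spanned partitions.&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 3, 3, 1], "b": [1, 3, 10, 3, 2, 1]})

# Split the frame into two "partitions"
partitions = [df.iloc[:3], df.iloc[3:]]

# Apply: groupby-sum on each partition independently
partials = [part.groupby("a").b.sum() for part in partitions]

# Concat: gather the partial results
combined = pd.concat(partials)

# Apply again: a final groupby-sum merges groups that spanned partitions
total = combined.groupby(level=0).sum()
```

&lt;p&gt;The final &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total&lt;/span&gt;&lt;/code&gt; matches &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.groupby(&amp;quot;a&amp;quot;).b.sum()&lt;/span&gt;&lt;/code&gt; computed on the whole frame.&lt;/p&gt;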
&lt;p&gt;First, recall that a Dask DataFrame is a &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#internal-design"&gt;collection&lt;/a&gt; of DataFrame objects (e.g. each &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#partitions"&gt;partition&lt;/a&gt; of a Dask DataFrame is a Pandas DataFrame). For example, let’s say we have the following Pandas DataFrame:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                       &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
&lt;span class="go"&gt;     a   b   c&lt;/span&gt;
&lt;span class="go"&gt;0    1   1   2&lt;/span&gt;
&lt;span class="go"&gt;1    1   3   4&lt;/span&gt;
&lt;span class="go"&gt;2    2  10   5&lt;/span&gt;
&lt;span class="go"&gt;3    3   3   2&lt;/span&gt;
&lt;span class="go"&gt;4    3   2   3&lt;/span&gt;
&lt;span class="go"&gt;5    1   1   5&lt;/span&gt;
&lt;span class="go"&gt;6    1   3   2&lt;/span&gt;
&lt;span class="go"&gt;7    2  10   3&lt;/span&gt;
&lt;span class="go"&gt;8    3   3   9&lt;/span&gt;
&lt;span class="go"&gt;9    3   3   2&lt;/span&gt;
&lt;span class="go"&gt;10  99  12  44&lt;/span&gt;
&lt;span class="go"&gt;11  10   0  33&lt;/span&gt;
&lt;span class="go"&gt;12   1   9   2&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To create a Dask DataFrame with three partitions from this data, we could partition &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; between the indices of: (0, 4), (5, 9), and (10, 12). We can perform this partitioning with Dask by using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_pandas&lt;/span&gt;&lt;/code&gt; function with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;npartitions=3&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 3 partitions are simply 3 individual Pandas DataFrames:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;   a   b  c&lt;/span&gt;
&lt;span class="go"&gt;0  1   1  2&lt;/span&gt;
&lt;span class="go"&gt;1  1   3  4&lt;/span&gt;
&lt;span class="go"&gt;2  2  10  5&lt;/span&gt;
&lt;span class="go"&gt;3  3   3  2&lt;/span&gt;
&lt;span class="go"&gt;4  3   2  3&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="apply-concat-apply"&gt;
&lt;h1&gt;Apply-concat-apply&lt;/h1&gt;
&lt;p&gt;When Dask applies a function or algorithm (e.g. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;, etc.) to a Dask DataFrame, it does so by applying that operation to all the constituent partitions independently, collecting (or concatenating) the outputs into intermediate results, and then applying the operation again to those intermediate results to produce a final result. Dask re-uses this same apply-concat-apply methodology for many of its internal DataFrame calculations.&lt;/p&gt;
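&lt;p&gt;The pattern itself can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, with a helper name of our own choosing, not Dask's actual implementation:&lt;/p&gt;

```python
import pandas as pd

def apply_concat_apply(partitions, chunk, aggregate):
    # Simplified sketch of the apply-concat-apply pattern:
    # apply `chunk` to every partition independently, concatenate
    # the intermediate results, then apply `aggregate` once.
    intermediates = [chunk(part) for part in partitions]
    return aggregate(pd.concat(intermediates))

# The groupby-sum from this post, expressed with the helper:
df = pd.DataFrame(dict(a=[1, 1, 2, 3, 3, 1, 1, 2, 3, 3, 99, 10, 1],
                       b=[1, 3, 10, 3, 2, 1, 3, 10, 3, 3, 12, 0, 9],
                       c=[2, 4, 5, 2, 3, 5, 2, 3, 9, 2, 44, 33, 2]))
result = apply_concat_apply(
    [df[:5], df[5:10], df[10:]],
    chunk=lambda part: part.groupby(['a', 'b']).c.sum(),
    aggregate=lambda sr: sr.groupby(level=[0, 1]).sum(),
)
```

&lt;p&gt;The result matches &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; computed on the whole frame at once.&lt;/p&gt;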
&lt;p&gt;Let’s break down how Dask computes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf.groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; by going through each step in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; process. We’ll begin by splitting our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; Pandas DataFrame into three partitions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="apply"&gt;
&lt;h2&gt;Apply&lt;/h2&gt;
&lt;p&gt;Next we perform the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; operation on each of our three partitions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;These operations each produce a Series with a &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html"&gt;MultiIndex&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr1
a  b
1  1     2
   3     4
2  10    5
3  2     3
   3     2
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr2
a  b
1  1      5
   3      2
2  10     3
3  3     11
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr3
a   b
1   9      2
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="the-concat"&gt;
&lt;h2&gt;The Concat&lt;/h2&gt;
&lt;p&gt;After the first &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apply&lt;/span&gt;&lt;/code&gt;, the next step is to concatenate the intermediate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr1&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr2&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr3&lt;/span&gt;&lt;/code&gt; results. This is fairly straightforward to do using the Pandas &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;concat&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr_concat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sr1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr_concat&lt;/span&gt;
&lt;span class="go"&gt;a   b&lt;/span&gt;
&lt;span class="go"&gt;1   1      2&lt;/span&gt;
&lt;span class="go"&gt;    3      4&lt;/span&gt;
&lt;span class="go"&gt;2   10     5&lt;/span&gt;
&lt;span class="go"&gt;3   2      3&lt;/span&gt;
&lt;span class="go"&gt;    3      2&lt;/span&gt;
&lt;span class="go"&gt;1   1      5&lt;/span&gt;
&lt;span class="go"&gt;    3      2&lt;/span&gt;
&lt;span class="go"&gt;2   10     3&lt;/span&gt;
&lt;span class="go"&gt;3   3     11&lt;/span&gt;
&lt;span class="go"&gt;1   9      2&lt;/span&gt;
&lt;span class="go"&gt;10  0     33&lt;/span&gt;
&lt;span class="go"&gt;99  12    44&lt;/span&gt;
&lt;span class="go"&gt;Name: c, dtype: int64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="apply-redux"&gt;
&lt;h2&gt;Apply Redux&lt;/h2&gt;
&lt;p&gt;Our final step is to apply the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; operation again on the concatenated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr_concat&lt;/span&gt;&lt;/code&gt; Series. However we no longer have columns &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;a&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;b&lt;/span&gt;&lt;/code&gt;, so how should we proceed?&lt;/p&gt;
&lt;p&gt;Zooming out a bit, our goal is to add together the values that share the same index. For example, there are two rows with the index &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(1,&lt;/span&gt; &lt;span class="pre"&gt;1)&lt;/span&gt;&lt;/code&gt;, with corresponding values 2 and 5. So how can we group the indices with the same value? A MultiIndex uses &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex"&gt;levels&lt;/a&gt; to define what the value is at a given index. Dask &lt;a class="reference external" href="https://github.com/dask/dask/blob/973c6e1b2e38c2d9d6e8c75fb9b4ab7a0d07e6a7/dask/dataframe/groupby.py#L69-L75"&gt;determines&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dask/dask/blob/973c6e1b2e38c2d9d6e8c75fb9b4ab7a0d07e6a7/dask/dataframe/groupby.py#L1065"&gt;uses these levels&lt;/a&gt; in the final apply step of the apply-concat-apply calculation. In our case, the level is &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;[0,&lt;/span&gt; &lt;span class="pre"&gt;1]&lt;/span&gt;&lt;/code&gt;: we want the index at both the 0th level and the 1st level, and if we group by both, we will have effectively grouped the same indices together:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sr_concat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; total
a   b
1   1      7
    3      6
    9      2
2   10     8
3   2      3
    3     13
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; ddf.groupby(['a', 'b']).c.sum().compute()
a   b
1   1      7
    3      6
2   10     8
3   2      3
    3     13
1   9      2
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; df.groupby(['a', 'b']).c.sum()
a   b
1   1      7
    3      6
    9      2
2   10     8
3   2      3
    3     13
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Additionally, we can easily examine the steps of this apply-concat-apply calculation by &lt;a class="reference external" href="https://docs.dask.org/en/latest/graphviz.html"&gt;visualizing the task graph&lt;/a&gt; for the computation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/sum.svg"&gt;
  &lt;img src="/images/sum.svg" width="80%"&gt;
&lt;/a&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; is a rather straightforward calculation. What about something a bit more complex, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/mean.svg"&gt;
  &lt;img src="/images/mean.svg" width="80%"&gt;
&lt;/a&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; is a good example of an operation which doesn’t directly fit the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; model – concatenating &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; values and taking the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; again will yield incorrect results. As with any style of computation – vectorization, Map/Reduce, etc. – we sometimes need to fit the calculation to the model creatively. In the case of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; we can often break the calculation down into constituent parts. For &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;, these are &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;count&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[ \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}\]&lt;/div&gt;
&lt;p&gt;From the task graph above, we can see that there are two independent tasks for each partition: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-count-chunk&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-sum-chunk&lt;/span&gt;&lt;/code&gt;. The results are then aggregated into two final nodes, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-count-agg&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-sum-agg&lt;/span&gt;&lt;/code&gt;, before we finally calculate the mean: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total&lt;/span&gt; &lt;span class="pre"&gt;sum&lt;/span&gt; &lt;span class="pre"&gt;/&lt;/span&gt; &lt;span class="pre"&gt;total&lt;/span&gt; &lt;span class="pre"&gt;count&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
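&lt;p&gt;The same decomposition can be reproduced by hand with plain Pandas. The following is an illustrative sketch of what those chunk and agg nodes compute, not Dask's internal code:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 2, 3, 3, 1, 1, 2, 3, 3, 99, 10, 1],
                       b=[1, 3, 10, 3, 2, 1, 3, 10, 3, 3, 12, 0, 9],
                       c=[2, 4, 5, 2, 3, 5, 2, 3, 9, 2, 44, 33, 2]))
parts = [df[:5], df[5:10], df[10:]]

# chunk step: per-partition sums and counts
part_sums = pd.concat([p.groupby(['a', 'b']).c.sum() for p in parts])
part_counts = pd.concat([p.groupby(['a', 'b']).c.count() for p in parts])

# agg step: combine the intermediates by index level
total_sum = part_sums.groupby(level=[0, 1]).sum()
total_count = part_counts.groupby(level=[0, 1]).sum()

# final step: total sum / total count gives the groupby-mean
mean = total_sum / total_count
```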
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/10/08/df-groupby/"/>
    <summary>A step-by-step look at how Dask DataFrame computes groupby aggregations with the apply-concat-apply pattern.</summary>
    <category term="dask" label="dask"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-10-08T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/03/04/building-gpu-groupbys/</id>
    <title>Building GPU Groupby-Aggregations for Dask</title>
    <updated>2019-03-04T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/04/building-gpu-groupbys.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations
like the following to work well.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This post describes the kind of work we had to do as a model for future
development.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="plan"&gt;
&lt;h1&gt;Plan&lt;/h1&gt;
&lt;p&gt;As outlined in a previous post, &lt;a class="reference internal" href="../../../2019/01/13/dask-cudf-first-steps.html"&gt;&lt;span class="xref myst"&gt;Dask, Pandas, and GPUs: first
steps&lt;/span&gt;&lt;/a&gt;, our plan to produce
distributed GPU dataframes was to combine &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;Dask
DataFrame&lt;/a&gt; with
&lt;a class="reference external" href="https://rapids.ai"&gt;cudf&lt;/a&gt;. In particular, we had to&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;change Dask DataFrame so that it would parallelize not just around the
Pandas DataFrames that it works with today, but around anything that looked
enough like a Pandas DataFrame&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;change cuDF so that it would look enough like a Pandas DataFrame to fit
within the algorithms in Dask DataFrame&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="changes"&gt;
&lt;h1&gt;Changes&lt;/h1&gt;
&lt;p&gt;On the Dask side this mostly meant&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Replacing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;isinstance(df,&lt;/span&gt; &lt;span class="pre"&gt;pd.DataFrame)&lt;/span&gt;&lt;/code&gt; checks with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_dataframe_like(df)&lt;/span&gt;&lt;/code&gt;
checks (after defining a suitable
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_dataframe_like&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_series_like&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_index_like&lt;/span&gt;&lt;/code&gt; functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoiding some more exotic functionality in Pandas, and instead trying to
use more common functionality that we can expect to be in most DataFrame
implementations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
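&lt;p&gt;The duck-typing idea behind those checks can be sketched briefly. This is a simplified illustration, not Dask's actual &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_dataframe_like&lt;/span&gt;&lt;/code&gt;, and the attribute list is our own guess at a minimal set:&lt;/p&gt;

```python
def is_dataframe_like(obj):
    # Instead of `isinstance(obj, pd.DataFrame)`, accept any object that
    # exposes the core DataFrame attributes the algorithms rely on, so
    # that cuDF DataFrames (and other Pandas look-alikes) qualify too.
    return all(hasattr(obj, attr)
               for attr in ("dtypes", "columns", "groupby", "head"))
```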
&lt;p&gt;On the cuDF side this meant making dozens of small changes to align the cuDF API
with the Pandas API, and to add in missing features.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dask Changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4359"&gt;Remove explicit pandas checks and provide cudf lazy registration #4359&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4375"&gt;Replace isinstance(…, pandas) with is_dataframe_like #4375&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4395"&gt;Add has_parallel_type&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4396"&gt;Lazily register more cudf functions and move to backends file #4396&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4418"&gt;Avoid checking against types in is_dataframe_like #4418&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4470"&gt;Replace cudf-specific code with dask-cudf import #4470&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4482"&gt;Avoid groupby.agg(callable) in groupby-var #4482&lt;/a&gt; – this one is notable in that by simplifying our Pandas usage we actually got a significant speedup on the Pandas side.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cuDF Changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/529"&gt;Build DataFrames from CUDA array libraries #529&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/534"&gt;Groupby AttributeError&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/556"&gt;Support comparison operations on Indexes #556&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/568"&gt;Support byte ranges in read_csv (and other formats) #568&lt;/a&gt;:w&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/824"&gt;Allow “df.index = some_index” #824&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/828"&gt;Support indexing on groupby objects #828&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/831"&gt;Support df.reset_index(drop=True) #831&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/879"&gt;Add Series.groupby #879&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/880"&gt;Support Dataframe/Series groupby level=0 #880&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/900"&gt;Implement division on DataFrame objects #900&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/934"&gt;Groupby objects aren’t indexable by column names #934&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/937"&gt;Support comparisons on index operations #937&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/944"&gt;Add DataFrame.rename #944&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/967"&gt;Set the index of a dataframe/series #967&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/968"&gt;Support concat(…, axis=1) #968&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/969"&gt;Support indexing with a pandas index from columns #969&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/970"&gt;Support indexing a dataframe with another boolean dataframe #970&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I don’t really expect anyone to go through all of those issues, but my hope is
that by skimming over the issue titles people will get a sense for the kinds of
changes we’re making here. It’s a large number of small things.&lt;/p&gt;
&lt;p&gt;Also, kudos to &lt;a class="reference external" href="https://github.com/thomcom"&gt;Thomson Comer&lt;/a&gt; who solved most of
the cuDF issues above.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="there-are-still-some-pending-issues"&gt;
&lt;h1&gt;There are still some pending issues&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/1055"&gt;Square Root #1055&lt;/a&gt;, needed for groupby-std&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/483"&gt;cuDF needs multi-index support for columns #483&lt;/a&gt;, needed for:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;gropuby&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sum&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;], &amp;#39;&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;: [&amp;#39;&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;, &amp;#39;&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;]})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
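&lt;p&gt;For readers unfamiliar with the multi-index issue, here is what that aggregation looks like in plain pandas (a CPU-only illustration, not cuDF): the dict-of-lists form returns a result whose columns are a two-level MultiIndex, which is the structure cuDF could not yet represent.&lt;/p&gt;

```python
# Plain-pandas illustration of the missing feature (hypothetical toy data,
# not from the benchmarks in this post).
import pandas as pd

df = pd.DataFrame({"id": [0, 0, 1], "x": [1.0, 2.0, 3.0], "y": [4, 5, 6]})
result = df.groupby("id").agg({"x": ["sum", "mean"], "y": ["min", "max"]})

# The columns come back as a two-level MultiIndex: (column, aggregation).
print(list(result.columns))
# → [('x', 'sum'), ('x', 'mean'), ('y', 'min'), ('y', 'max')]
```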
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/04/building-gpu-groupbys.md&lt;/span&gt;, line 92)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="but-things-mostly-work"&gt;
&lt;h1&gt;But things mostly work&lt;/h1&gt;
&lt;p&gt;But generally things work pretty well today:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cudf&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;yellow_tripdata_2016-*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;passenger_count&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="mf"&gt;0.625424&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="mf"&gt;4.976895&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="mf"&gt;4.470014&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="mf"&gt;5.955262&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;    &lt;span class="mf"&gt;4.328076&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;    &lt;span class="mf"&gt;3.079661&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;    &lt;span class="mf"&gt;2.998077&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;    &lt;span class="mf"&gt;3.147452&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;    &lt;span class="mf"&gt;5.165570&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;    &lt;span class="mf"&gt;5.916169&lt;/span&gt;
&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;float64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/04/building-gpu-groupbys.md&lt;/span&gt;, line 119)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="experience"&gt;
&lt;h1&gt;Experience&lt;/h1&gt;
&lt;p&gt;First, most of this work was handled by the cuDF developers (as may be
evident from the relative lengths of the issue lists above). When we started
this process it felt like a never-ending stream of tiny issues: we weren’t
able to see the next set of issues until we had finished the current set.
Fortunately, most of them were easy to fix, and the work got easier as we
went on.&lt;/p&gt;
&lt;p&gt;Additionally, the changes above fixed much more than groupby-aggregations.
From the perspective of someone accustomed to Pandas,
the cuDF library is starting to feel more reliable: we hit missing
functionality less frequently when using cuDF for other operations.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/03/04/building-gpu-groupbys.md&lt;/span&gt;, line 133)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-s-next"&gt;
&lt;h1&gt;What’s next?&lt;/h1&gt;
&lt;p&gt;More recently we’ve been working on the various join/merge operations in Dask
DataFrame, like indexed joins on a sorted column, joins between large and small
dataframes (a common special case), and so on. Getting these algorithms from
the mainline Dask DataFrame codebase to work with cuDF is surfacing a
similar set of issues to what we saw above with groupby-aggregations, but so
far the list is much smaller. We hope this trend continues as we move on to
other functionality like I/O, time-series operations, rolling windows, and
so on.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/03/04/building-gpu-groupbys/"/>
    <summary>Notes on making Dask DataFrame groupby-aggregations work on GPUs with cuDF: issues resolved, issues still pending, and plans for joins and other operations.</summary>
    <category term="GPU" label="GPU"/>
    <category term="RAPIDS" label="RAPIDS"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-03-04T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/29/cudf-joins/</id>
    <title>Single-Node Multi-GPU Dataframe Joins</title>
    <updated>2019-01-29T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;We experiment with single-node multi-GPU joins using cuDF and Dask. We find
that the in-GPU computation is faster than communication. We also present
context and plans for near-future work, including improving high performance
communication in Dask with UCX.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/6e2c33c33b32bc324e3965212f202f66"&gt;Here is a notebook of the experiment in this post&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 18)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In a recent post we showed how Dask + cuDF could accelerate reading CSV files
using multiple GPUs in parallel. That operation quickly became bound by the
speed of our disk after we added a few GPUs. Now we try a very different kind
of operation, multi-GPU joins.&lt;/p&gt;
&lt;p&gt;This workload can be communication-heavy, especially if the column on which we
are joining is not sorted nicely, and so provides a good example at the
opposite extreme from CSV parsing.&lt;/p&gt;
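&lt;p&gt;To see why, here is a toy sketch (not Dask’s actual shuffle code) of the routing step: rows are assigned to output partitions by hashing the join key, so every input partition may need to send rows to every output partition, an all-to-all communication pattern.&lt;/p&gt;

```python
# Toy sketch of hash-based shuffle routing. Matching keys always land on
# the same output partition, no matter which input partition held them,
# which is what makes a distributed hash join possible -- and what forces
# the all-to-all data movement.
def owner(key: int, n_partitions: int = 8) -> int:
    """Return the output partition that owns this join key."""
    return hash(key) % n_partitions

# The same key is always routed to the same partition:
assert owner(12345) == owner(12345)

# But neighboring keys scatter across partitions:
print([owner(k) for k in range(16)])
```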
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 29)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="benchmark"&gt;
&lt;h1&gt;Benchmark&lt;/h1&gt;
&lt;section id="construct-random-data-using-the-cpu"&gt;
&lt;h2&gt;Construct random data using the CPU&lt;/h2&gt;
&lt;p&gt;Here we use Dask array and Dask dataframe to construct two random tables with a
shared &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;id&lt;/span&gt;&lt;/code&gt; column. We can play with the number of rows of each table and the
number of keys to make the join challenging in a variety of ways.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;n_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000000000&lt;/span&gt;
&lt;span class="n"&gt;n_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000000&lt;/span&gt;

&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;n_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;

&lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dask_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="send-to-the-gpus"&gt;
&lt;h2&gt;Send to the GPUs&lt;/h2&gt;
&lt;p&gt;We have two Dask dataframes composed of many Pandas dataframes of our random
data. We now map the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf.from_pandas&lt;/span&gt;&lt;/code&gt; function across these to make a Dask
dataframe of cuDF dataframes.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cudf&lt;/span&gt;

&lt;span class="n"&gt;gleft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gright&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# persist data in device memory&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What’s nice here is that there wasn’t any special
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask_pandas_dataframe_to_dask_cudf_dataframe&lt;/span&gt;&lt;/code&gt; function. Dask composed nicely
with cuDF; we didn’t need to do anything special to support it.&lt;/p&gt;
&lt;p&gt;We also persisted the data in device memory.&lt;/p&gt;
&lt;p&gt;After this, simple operations are easy and fast and use our eight GPUs.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# this takes 250ms&lt;/span&gt;
&lt;span class="go"&gt;500004719.254711&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="join"&gt;
&lt;h2&gt;Join&lt;/h2&gt;
&lt;p&gt;We’ll use standard Pandas syntax to merge the datasets, persist the result in
RAM, and then wait&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gleft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gright&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# this is lazy&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 95)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="profile-and-analyze-results"&gt;
&lt;h1&gt;Profile and analyze results&lt;/h1&gt;
&lt;p&gt;We now look at the Dask diagnostic plots for this computation.&lt;/p&gt;
&lt;section id="task-stream-and-communication"&gt;
&lt;h2&gt;Task stream and communication&lt;/h2&gt;
&lt;p&gt;When we look at Dask’s task stream plot we see that each of our eight threads
(each of which manages a single GPU) spent most of its time in communication
(red is communication time). The actual merge and concat tasks are quite fast
relative to the data transfer time.&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/dask-cudf-joins.html"
        width="800"
        height="400"&gt;&lt;/iframe&gt;
&lt;p&gt;That’s not too surprising. For this computation I’ve turned off any attempt to
communicate between devices (more on this below) so the data is being moved
from the GPU to the CPU memory, then serialized and put onto a TCP socket.
We’re moving tens of GB on a single machine, so we’re seeing about 1GB/s total
throughput of the system, which is typical for TCP-on-localhost in Python.&lt;/p&gt;
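&lt;p&gt;A quick back-of-envelope check of those numbers (assuming 8 bytes per float64/int64 value and the table sizes from the setup above):&lt;/p&gt;

```python
# Rough size and transfer-time estimate for the benchmark tables.
# Assumptions: 8 bytes per value, one data column plus one id column,
# and the row counts used when constructing the random data earlier.
left_rows, right_rows = 1_000_000_000, 10_000_000
bytes_per_row = 2 * 8  # one float64 column + one int64 id column

left_gb = left_rows * bytes_per_row / 1e9    # ≈ 16 GB
right_gb = right_rows * bytes_per_row / 1e9  # ≈ 0.16 GB

throughput_gb_s = 1.0  # typical TCP-on-localhost throughput in Python
print(f"moving the left table alone takes ~{left_gb / throughput_gb_s:.0f}s")
```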
&lt;/section&gt;
&lt;section id="flamegraph-of-computation"&gt;
&lt;h2&gt;Flamegraph of computation&lt;/h2&gt;
&lt;p&gt;We can also look more deeply at the computational costs in Dask’s
flamegraph-style plot. This shows which lines of our functions were taking up
the most time (down to the Python level at least).&lt;/p&gt;
&lt;iframe src="http://matthewrocklin.com/raw-host/dask-cudf-join-profile.html"
        width="800"
        height="400"&gt;&lt;/iframe&gt;
&lt;p&gt;This &lt;a class="reference external" href="http://www.brendangregg.com/flamegraphs.html"&gt;Flame graph&lt;/a&gt; shows which
lines of cudf code we spent time on while computing (excluding the main
communication costs mentioned above). It may be interesting for those trying
to further optimize performance. It shows that most of our costs are in memory
allocation. Like communication, this has actually also been addressed by RAPIDS’
optional memory management pool; it just isn’t the default yet, so I didn’t use
it here.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 134)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="plans-for-efficient-communication"&gt;
&lt;h1&gt;Plans for efficient communication&lt;/h1&gt;
&lt;p&gt;The cuDF library actually has a decent approach to single-node multi-GPU
communication that I’ve intentionally turned off for this experiment. That
approach cleverly used Dask to communicate device pointer information using
Dask’s normal channels (this is small and fast) and then used that information
to initiate a side-channel communication for the bulk of the data. This
approach was effective, but somewhat fragile. I’m inclined to move on from it
in favor of …&lt;/p&gt;
&lt;p&gt;UCX. The &lt;a class="reference external" href="http://www.openucx.org/"&gt;UCX&lt;/a&gt; project provides a single API that
wraps around several transports like TCP, Infiniband, shared memory, and also
GPU-specific transports. UCX claims to find the best way to communicate data
between two points given the hardware available. If Dask were able to use this
for communication then it would provide both efficient GPU-to-GPU communication
on a single machine, and also efficient inter-machine communication when
efficient networking hardware like Infiniband was present, even outside the
context of GPUs.&lt;/p&gt;
&lt;p&gt;There is some work we need to do here:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We need to make a Python wrapper around UCX&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need to make an optional &lt;a class="reference external" href="https://distributed.dask.org/en/latest/communications.html"&gt;Dask Comm&lt;/a&gt;
around this ucx-py library that allows users to specify endpoints like
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ucx://path-to-scheduler&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need to make Python memoryview-like objects that refer to device memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
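&lt;p&gt;As a hypothetical sketch (this is not an existing Dask API), such a comm layer would dispatch on the address scheme, much as Dask already distinguishes transports by URL prefix:&lt;/p&gt;

```python
# Hypothetical sketch of dispatching on an address scheme; the "ucx"
# entry is an assumption about the planned Dask Comm, not existing API.
from urllib.parse import urlparse

TRANSPORTS = {
    "tcp": "plain TCP sockets",
    "ucx": "UCX (picks the best transport the hardware offers)",
}

def pick_transport(address: str) -> str:
    """Choose a transport from an endpoint like 'ucx://path-to-scheduler'."""
    scheme = urlparse(address).scheme or "tcp"
    return TRANSPORTS[scheme]

print(pick_transport("tcp://127.0.0.1:8786"))    # plain TCP sockets
print(pick_transport("ucx://path-to-scheduler"))
```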
&lt;p&gt;This work is already in progress by &lt;a class="reference external" href="https://github.com/Akshay-Venkatesh"&gt;Akshay
Venkatesh&lt;/a&gt;, who works on UCX, and &lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom
Augspurger&lt;/a&gt;, a core Dask/Pandas developer. I
suspect that they’ll write about it soon. I’m looking forward to seeing what
comes of it, both for Dask and for high performance Python generally.&lt;/p&gt;
&lt;p&gt;It’s worth pointing out that this effort won’t just help GPU users. It should
help anyone on advanced networking hardware, including the mainstream
scientific HPC community.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 172)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="id1"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: INFO/1 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/29/cudf-joins.md&lt;/span&gt;, line 172); &lt;em&gt;&lt;a href="#id1"&gt;backlink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Duplicate implicit target name: “summary”.&lt;/p&gt;
&lt;/aside&gt;
&lt;p&gt;Single-node multi-GPU joins have a lot of promise. In fact, earlier RAPIDS
developers got this running much faster than I was able to do above through the
clever communication tricks I briefly mentioned. The main purpose of this post
is to provide a benchmark for joins that we can use in the future, and to
highlight when communication can be essential in parallel computing.&lt;/p&gt;
&lt;p&gt;Now that GPUs have accelerated the computation time of each of our chunks of
work we increasingly find that other systems become the bottleneck. We didn’t
care as much about communication before because computational costs were
comparable. Now that computation is an order of magnitude cheaper, other
aspects of our stack become much more important.&lt;/p&gt;
&lt;p&gt;I’m looking forward to seeing where this goes.&lt;/p&gt;
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help!&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging and high impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. NVIDIA’s RAPIDS team is looking to hire engineers for Dask
development with GPUs and other data analytics library development projects.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/29/cudf-joins/"/>
    <summary>We experiment with single-node multi-GPU joins using cuDF and Dask, find that in-GPU computation is faster than communication, and outline plans for high performance communication in Dask with UCX.</summary>
    <category term="GPU" label="GPU"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-01-29T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/22/dask-extension-arrays/</id>
    <title>Extension Arrays in Dask DataFrame</title>
    <updated>2019-01-22T00:00:00+00:00</updated>
    <author>
      <name>Tom Augspurger</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 11)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;Dask DataFrame works well with pandas’ new Extension Array interface, including
third-party extension arrays. This lets Dask&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;easily support pandas’ new extension arrays, like their new &lt;a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support"&gt;nullable integer
array&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;support third-party extension arrays, like &lt;a class="reference external" href="https://cyberpandas.readthedocs.io"&gt;cyberpandas’s&lt;/a&gt;
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IPArray&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
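&lt;p&gt;For example, the nullable integer array lets integer data hold missing values without being upcast to float64:&lt;/p&gt;

```python
# The nullable integer extension array (pandas >= 0.24): integer data
# can contain missing values while keeping an integer dtype.
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, False, True]
```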
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="background"&gt;
&lt;h1&gt;Background&lt;/h1&gt;
&lt;p&gt;Pandas 0.23 introduced the &lt;a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.24/extending.html#extension-types"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ExtensionArray&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;, a way to store things other
than a simple NumPy array in a DataFrame or Series. Internally pandas uses this
for data types that aren’t handled natively by NumPy like datetimes with
timezones, Categorical, or (the new!) nullable integer arrays.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2000&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;periods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;US/Central&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="go"&gt;0   2000-01-01 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;1   2000-01-02 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;2   2000-01-03 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;3   2000-01-04 00:00:00-06:00&lt;/span&gt;
&lt;span class="go"&gt;dtype: datetime64[ns, US/Central]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; has always supported the extension types that pandas defines.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Dask Series Structure:&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;0    datetime64[ns, US/Central]&lt;/span&gt;
&lt;span class="go"&gt;2                           ...&lt;/span&gt;
&lt;span class="go"&gt;3                           ...&lt;/span&gt;
&lt;span class="go"&gt;dtype: datetime64[ns, US/Central]&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: from_pandas, 2 tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 52)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-challenge"&gt;
&lt;h1&gt;The Challenge&lt;/h1&gt;
&lt;p&gt;Newer versions of pandas allow third-party libraries to write custom extension
arrays. These arrays can be placed inside a DataFrame or Series, and work
just as well as any extension array defined within pandas itself. However,
third-party extension arrays provide a slight challenge for Dask.&lt;/p&gt;
&lt;p&gt;Recall: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; is lazy. We use a familiar pandas-like API to build up
a task graph, rather than executing immediately. But if Dask DataFrame is lazy,
then how do things like the following work?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="go"&gt;Index([&amp;#39;B&amp;#39;], dtype=&amp;#39;object&amp;#39;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf[['B']]&lt;/span&gt;&lt;/code&gt; (lazily) selects the column &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'B'&lt;/span&gt;&lt;/code&gt; from the dataframe. But accessing
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.columns&lt;/span&gt;&lt;/code&gt; &lt;em&gt;immediately&lt;/em&gt; returns a pandas Index object with just the selected
columns.&lt;/p&gt;
&lt;p&gt;No real computation has happened (you could just as easily swap out the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_pandas&lt;/span&gt;&lt;/code&gt; for a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.read_parquet&lt;/span&gt;&lt;/code&gt; on a larger-than-memory dataset, and the
behavior would be the same). Dask is able to do these kinds of “metadata-only”
computations, where the output depends only on the columns and the dtypes,
without executing the task graph. Internally, Dask does this by keeping a pair
of dummy pandas DataFrames on each Dask DataFrame.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;
&lt;span class="go"&gt;Empty DataFrame&lt;/span&gt;
&lt;span class="go"&gt;Columns: [A, B]&lt;/span&gt;
&lt;span class="go"&gt;Index: []&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_meta_nonempty&lt;/span&gt;
&lt;span class="go"&gt;ddf._meta_nonempty&lt;/span&gt;
&lt;span class="go"&gt;   A  B&lt;/span&gt;
&lt;span class="go"&gt;0  1  1&lt;/span&gt;
&lt;span class="go"&gt;1  1  1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We need the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;, since some operations in pandas behave differently
on an Empty DataFrame than on a non-empty one (either by design or,
occasionally, a bug in pandas).&lt;/p&gt;
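As a small illustration (not from the post), here is a pandas operation whose result type differs on an empty versus a non-empty input, which is exactly why inferring metadata from an empty frame alone can go wrong:

```python
import pandas as pd

# Summing an empty object-dtype Series falls back to the integer 0,
# while summing a non-empty Series of strings concatenates them.
empty = pd.Series([], dtype=object)
full = pd.Series(["a", "b"], dtype=object)

assert empty.sum() == 0        # result is an int
assert full.sum() == "ab"      # result is a str
```

Inferring the output dtype from the empty version would give a different answer than running on real data, so Dask consults the non-empty dummy frame instead.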
&lt;p&gt;The issue with third-party extension arrays is that Dask doesn’t know what
values to put in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;. We’re quite happy to do it for each NumPy
dtype and each of pandas’ own extension dtypes. But any third-party library
could create an ExtensionArray for any type, and Dask would have no way of
knowing what’s a valid value for it.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 104)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-solution"&gt;
&lt;h1&gt;The Solution&lt;/h1&gt;
&lt;p&gt;Rather than Dask guessing what values to use for the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;, extension
array authors (or users) can register their extension dtype with Dask. Once
registered, Dask will be able to generate the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_meta_nonempty&lt;/span&gt;&lt;/code&gt;, and things
should work fine from there. For example, we can register the dummy &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DecimalArray&lt;/span&gt;&lt;/code&gt;
that pandas uses for testing (this isn’t part of pandas’ public API) with Dask.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;decimal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas.tests.extension.decimal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DecimalDtype&lt;/span&gt;

&lt;span class="c1"&gt;# The actual registration that would be done in the 3rd-party library&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe.extensions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_array_nonempty&lt;/span&gt;


&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecimalDtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_from_sequence&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NaN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                                       &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now users of that extension type can place those arrays inside a Dask DataFrame
or Series.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                                      &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)])})&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                     A&lt;/span&gt;
&lt;span class="go"&gt;npartitions=1&lt;/span&gt;
&lt;span class="go"&gt;0              decimal&lt;/span&gt;
&lt;span class="go"&gt;2                  ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: from_pandas, 1 tasks&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;
&lt;span class="go"&gt;A    decimal&lt;/span&gt;
&lt;span class="go"&gt;dtype: object&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And from there, the usual operations work just as they would in pandas.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;random&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choices&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DecimalArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                                              &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                                             &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                   &lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,))})&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;In [35]: ddf.groupby(&amp;quot;A&amp;quot;).B.mean().compute()&lt;/span&gt;
&lt;span class="go"&gt;Out[35]:&lt;/span&gt;
&lt;span class="go"&gt;A&lt;/span&gt;
&lt;span class="go"&gt;1.0    1.50&lt;/span&gt;
&lt;span class="go"&gt;2.0    1.48&lt;/span&gt;
&lt;span class="go"&gt;Name: B, dtype: float64&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/22/dask-extension-arrays.md&lt;/span&gt;, line 165)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-real-lesson"&gt;
&lt;h1&gt;The Real Lesson&lt;/h1&gt;
&lt;p&gt;It’s neat that Dask now supports extension arrays. But to me, the exciting thing
is just how little work this took. The
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4379/files"&gt;PR&lt;/a&gt; implementing support for
third-party extension arrays is quite short: it just defines the object that
third parties register with, and uses it to generate the data when the dtype is
detected. Supporting the three new extension arrays in pandas 0.24.0
(&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IntegerArray&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;PeriodArray&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IntervalArray&lt;/span&gt;&lt;/code&gt;) takes a handful of lines
of code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Interval&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;IntervalArray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_breaks&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;closed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;closed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;period_array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2001&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@make_array_nonempty&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_IntegerDtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;integer_array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
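The registry itself is small. As a rough sketch of how such a type-keyed dispatch object can work (hypothetical code, not Dask's actual implementation), under the assumption that handlers are looked up by the type of their argument:

```python
class Dispatch:
    """Toy type-keyed dispatch registry, in the same spirit as the
    object Dask exposes for extension-dtype registration (sketch only)."""

    def __init__(self):
        self._lookup = {}

    def register(self, cls):
        # Decorator: associate a handler function with a class.
        def decorator(func):
            self._lookup[cls] = func
            return func
        return decorator

    def __call__(self, arg):
        # Walk the MRO so subclasses of a registered type also match.
        for typ in type(arg).__mro__:
            if typ in self._lookup:
                return self._lookup[typ](arg)
        raise TypeError(f"no dispatch registered for {type(arg)!r}")


make_array_nonempty = Dispatch()


# A toy registration: pretend ``str`` is a dtype and return sample values.
@make_array_nonempty.register(str)
def _(dtype):
    return ["a", "b"]
```

Third-party libraries then only need one such `register` call per dtype, which is why adding support for a new extension array stays so cheap.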
&lt;p&gt;Dask benefits directly from improvements made to pandas. Dask didn’t have to
build out a new parallel extension array interface, and reimplement all the new
extension arrays using the parallel interface. We just re-used what pandas
already did, and it fits into the existing Dask structure.&lt;/p&gt;
&lt;p&gt;For third-party extension array authors, like &lt;a class="reference external" href="https://cyberpandas.readthedocs.io"&gt;cyberpandas&lt;/a&gt;, the
work is similarly minimal. They don’t need to re-implement everything from the
ground up just to play well with Dask.&lt;/p&gt;
&lt;p&gt;This highlights the importance of one of the Dask project’s core values: working
with the community. If you visit &lt;a class="reference external" href="https://dask.org"&gt;dask.org&lt;/a&gt;, you’ll see
phrases like&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Integrates with existing projects&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Built with the broader community&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;At the start of Dask, the developers &lt;em&gt;could&lt;/em&gt; have gone off and re-written pandas
or NumPy from scratch to be parallel friendly (though we’d probably still be
working on that part today, since that’s such a massive undertaking). Instead,
the Dask developers worked with the community, occasionally nudging it in
directions that would help out Dask. For example, many places in pandas &lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2015/03/10/PyData-GIL"&gt;held
the GIL&lt;/a&gt;, preventing
thread-based parallelism. Rather than abandoning pandas, the Dask and pandas
developers worked together to release the GIL where possible when it was a
bottleneck for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt;. This benefited Dask and anyone else trying to
do thread-based parallelism with pandas DataFrames.&lt;/p&gt;
&lt;p&gt;And now, when pandas introduces new features like nullable integers,
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; just needs to register them as extension types and immediately
benefits from them. And third-party extension array authors can do the same for
their extension arrays.&lt;/p&gt;
&lt;p&gt;If you’re writing an ExtensionArray, make sure to add it to the &lt;a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.24/ecosystem.html#extension-data-types"&gt;pandas
ecosystem&lt;/a&gt; page, and register it with Dask!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/22/dask-extension-arrays/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="dask" label="dask"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-01-22T00:00:00+00:00</published>
  </entry>
</feed>
