Dask Working Notes - Posts tagged IO

Do you need consistent environments between the client, scheduler and workers?

2023-04-14T00:00:00+00:00

Update May 3rd 2023: Clarify GPU recommendations.

With the 2023.4.0 release of dask and distributed, we are making a change which may require the Dask scheduler to have the same software and hardware capabilities as the client and workers.

It has always been recommended that your client and workers have a consistent software and hardware environment so that data structures and dependencies can be pickled and passed between them. However, recent changes to the Dask scheduler mean that we now also require your scheduler to have the same consistent environment as everything else.

For most users, this change should go unnoticed, as it is common to run all Dask components in the same conda environment or docker image, typically on homogeneous machines.

However, for folks who may have optimized their schedulers to use cut-down environments, or for users with specialized hardware such as GPUs available on their client/workers but not the scheduler, there may be some impact.

What will the impact be?

If you run into errors such as "RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments." please ensure your software environment is consistent between your client, scheduler and workers.
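
One rough way to spot a mismatch before it surfaces as a deserialization error is to compare package versions across the three environments. The sketch below uses only the standard library; the helper names `environment_fingerprint` and `diff_environments` are hypothetical and are not part of Dask, and the version strings are made up for illustration. (distributed also offers `Client.get_versions(check=True)`, which performs a comparable check against a live cluster.)

```python
import sys
from importlib import metadata


def environment_fingerprint(packages):
    """Return {name: version} for Python and each requested package."""
    info = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package missing from this environment
    return info


def diff_environments(left, right):
    """Return {name: (left_version, right_version)} for every mismatch."""
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


# Fingerprints would normally be gathered on each machine; these are made up.
client_env = {"python": "3.10.9", "dask": "2023.4.0", "cudf": "23.04.0"}
scheduler_env = {"python": "3.10.9", "dask": "2023.4.0"}
mismatches = diff_environments(client_env, scheduler_env)
print(mismatches)
```

Any entry in `mismatches` points at a package that is worth aligning before submitting work.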

If you are passing GPU objects between the client and workers, we now recommend that your scheduler has a GPU too. This recommendation is just so that GPU-backed objects contained in Dask graphs can be deserialized on the scheduler if necessary. Typically the GPU available to the scheduler doesn’t need to be as powerful, as long as it has similar CUDA compute capabilities. For example, for cost-optimization reasons you may want to use A100s on your client and workers and a T4 on your scheduler.

Users who do not have a GPU on the client and are leveraging GPU workers shouldn’t run into this as the GPU objects will only exist on the workers.

Why are we doing this?

The reason we now suggest that you have the same hardware/software capabilities on the scheduler is that we are giving the scheduler the ability to deserialize graphs before distributing them to the workers. This will allow the scheduler to make smarter scheduling decisions in the future by having a better understanding of the operation it is performing.

The downside to this is that graphs can contain complex Python objects created by any number of dependencies on the client side, so in order for the scheduler to deserialize them it needs to have the same libraries installed. Equally, if the client-side packages create GPU objects then the scheduler will also need one.
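
The dependency requirement follows directly from how pickle works: a pickled object stores a reference to its class, not the class itself, so the receiving process must be able to import the defining module. The following is a minimal standard-library sketch of that failure mode, not Dask's own serialization path; `client_only_lib` and `Payload` are hypothetical names standing in for a package installed only on the client.

```python
import pickle
import sys
import types

# Stand-in for a library that is installed on the client only.
# "client_only_lib" and "Payload" are hypothetical names.
module = types.ModuleType("client_only_lib")
exec(
    "class Payload:\n"
    "    def __init__(self, value):\n"
    "        self.value = value\n",
    module.__dict__,
)
sys.modules["client_only_lib"] = module

# Client side: serialize an object that would be embedded in a graph.
blob = pickle.dumps(module.Payload(42))

# An environment that has the library can deserialize the object.
restored = pickle.loads(blob)

# Simulate an environment without the library (e.g. a cut-down scheduler).
del sys.modules["client_only_lib"]
try:
    pickle.loads(blob)
    failure = None
except ModuleNotFoundError as exc:
    failure = str(exc)

print(restored.value, failure)
```

The second `pickle.loads` call fails because the module can no longer be imported, which is the same class of problem a scheduler with a reduced environment would hit.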

We are sure you’ll agree that this breakage for a small percentage of users will be worth it for the long-term improvements to Dask.

Deep Dive into creating a Dask DataFrame Collection with from_map

2023-04-12T00:00:00+00:00

Dask DataFrame provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g. read_csv) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, from_map now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, from_map now makes custom collection creation both easier and more performant.

The from_map API was added to Dask DataFrame in v2022.05.1 with the intention of replacing from_delayed as the recommended means of custom DataFrame creation. At its core, from_map simply converts each element of an iterable object (inputs) into a distinct Dask DataFrame partition, using a common function (func):

dd.from_map(func: Callable, iterable: Iterable) -> dd.DataFrame

The overall behavior is essentially the Dask DataFrame equivalent of the standard-Python map function:

map(func: Callable, iterable: Iterable) -> Iterator

Note that both from_map and map actually support an arbitrary number of iterable inputs. However, we will only focus on the use of a single iterable argument in this post.
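
For reference, this is how the built-in map behaves with more than one iterable: the function is called once per position, taking one element from each input. The sketch is plain Python; the `describe` function and the `row_counts` list are made up for illustration. Passing several iterables to from_map is expected to pair elements up in the same way, with each call producing one partition.

```python
def describe(path, n_rows):
    """Combine one element from each iterable into a single result."""
    return f"{path}: {n_rows} rows"


paths = ["./data.0.feather", "./data.1.feather"]
row_counts = [3, 3]

# One call per position: describe(paths[0], row_counts[0]), and so on.
results = list(map(describe, paths, row_counts))
print(results)
```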

A simple example

To better understand the behavior of from_map, let’s consider the simple case that we want to interact with Feather-formatted data created with the following Pandas code:

import pandas as pd

size = 3
paths = ["./data.0.feather", "./data.1.feather"]
for i, path in enumerate(paths):
    index = range(i * size, i * size + size)
    a = [i] * size
    b = list("xyz")
    df = pd.DataFrame({"A": a, "B": b, "index": index})
    df.to_feather(path)

Since Dask does not yet offer a dedicated read_feather function (as of dask-2023.3.1), most users would assume that the only option to create a Dask DataFrame collection is to use dask.delayed. The “best practice” for creating a collection in this case, however, is to wrap pd.read_feather or cudf.read_feather in a from_map call like so:

>>> import dask.dataframe as dd
>>> ddf = dd.from_map(pd.read_feather, paths)
>>> ddf
Dask DataFrame Structure:
                   A       B  index
npartitions=2
               int64  object  int64
                 ...     ...    ...
                 ...     ...    ...
Dask Name: read_feather, 1 graph layer

Which produces the following Pandas (or cuDF) object after computation:

>>> ddf.compute()
   A  B  index
0  0  x      0
1  0  y      1
2  0  z      2
0  1  x      3
1  1  y      4
2  1  z      5

Although the same output can be achieved using the conventional dd.from_delayed strategy, using from_map will improve the available opportunities for task-graph optimization within Dask.

Performance considerations: Specifying `meta` and `divisions`

Although func and iterable are the only required arguments to from_map, one can significantly improve the overall performance of a workflow by specifying optional arguments like meta and divisions.

Due to the lazy nature of Dask DataFrame, each collection is required to carry around a schema (column name and dtype information) in the form of an empty Pandas (or cuDF) object. If meta is not directly provided to the from_map function, the schema will need to be populated by eagerly materializing the first partition, which can increase the apparent latency of the from_map API call itself. For this reason, it is always recommended to specify an explicit meta argument if the expected column names and dtypes are known a priori.

While passing in a meta argument is likely to reduce the from_map API call latency, passing in a divisions argument makes it possible to reduce the end-to-end compute time. This is because, by specifying divisions, we are allowing Dask DataFrame to track useful per-partition min/max statistics. Therefore, if the overall workflow involves grouping or joining on the index, Dask can avoid the need to perform unnecessary shuffling operations.
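
To see why known divisions help, recall that divisions list the minimum index value of every partition followed by the maximum index value of the last one. With that information, locating the partition that holds a given index value is a simple bisection, with no need to inspect the data. The sketch below is an illustration of the idea using the standard library, not Dask's actual implementation; `partition_for` is a hypothetical helper, and the divisions assume the "index" column of the example data were set as the index.

```python
from bisect import bisect_right


def partition_for(divisions, value):
    """Return the position of the partition whose index range holds `value`.

    `divisions` has one more entry than there are partitions: the minimum
    index of every partition, followed by the maximum index of the last one.
    """
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    position = bisect_right(divisions, value) - 1
    # The final division is inclusive, so clamp it to the last partition.
    return min(position, len(divisions) - 2)


# Two partitions covering index values 0-2 and 3-5.
divisions = (0, 3, 5)
located = [partition_for(divisions, v) for v in range(6)]
print(located)
```

Without divisions, answering the same question would require reading every partition.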

Using `from_map` to implement a custom API

Although it is currently difficult to automatically extract division information from the metadata of an arbitrary Feather dataset, from_map makes it relatively easy to implement your own highly-functional read_feather API using PyArrow. For example, the following code is all that one needs to enable lazy Feather IO with both column projection and index selection:

def from_arrow(table):
    """(Optional) Utility to enforce 'backend' configuration"""
    from dask import config

    if config.get("dataframe.backend") == "cudf":
        import cudf

        return cudf.DataFrame.from_arrow(table)
    else:
        return table.to_pandas()


def read_feather(paths, columns=None, index=None):
    """Create a Dask DataFrame from Feather files

    Example of a "custom" `from_map` IO function

    Parameters
    ----------
    paths: list
        List of Feather-formatted paths. Each path will
        be mapped to a distinct DataFrame partition.
    columns: list or None, default None
        Optional list of columns to select from each file.
    index: str or None, default None
        Optional column name to set as the DataFrame index.

    Returns
    -------
    dask.dataframe.DataFrame
    """
    import dask.dataframe as dd
    import pyarrow.dataset as ds

    # Step 1: Extract `meta` from the dataset
    dataset = ds.dataset(paths, format="feather")
    meta = from_arrow(dataset.schema.empty_table())
    meta = meta.set_index(index) if index else meta
    columns = columns or list(meta.columns)
    meta = meta[columns]

    # Step 2: Define the `func` argument
    def func(frag, columns=None, index=None):
        # Create a Pandas DataFrame from a dataset fragment
        # NOTE: In practice, this function should
        # always be defined outside `read_feather`
        assert columns is not None
        read_columns = columns
        if index and index not in columns:
            read_columns = columns + [index]
        df = from_arrow(frag.to_table(columns=read_columns))
        df = df.set_index(index) if index else df
        return df[columns] if columns else df

    # Step 3: Define the `iterable` argument
    iterable = dataset.get_fragments()

    # Step 4: Call `from_map`
    return dd.from_map(
        func,
        iterable,
        meta=meta,
        index=index,  # `func` kwarg
        columns=columns,  # `func` kwarg
    )

Here we see that using from_map to enable completely-lazy collection creation only requires four steps. First, we use pyarrow.dataset to define a meta argument for from_map, so that we can avoid the unnecessary overhead of an eager read operation. For some file formats and/or applications, it may also be possible to calculate divisions at this point. However, as explained above, such information is not readily available for this particular example.

The second step is to define the underlying function (func) that we will use to produce each of our final DataFrame partitions. Third, we define one or more iterable objects containing the unique information needed to produce each partition (iterable). In this case, the only iterable object corresponds to a generator of pyarrow.dataset fragments, which is essentially a wrapper around the input path list.

The fourth and final step is to use the final func, iterable, and meta information to call the from_map API. Note that we also use this opportunity to specify additional keyword arguments, like columns and index. In contrast to the iterable positional arguments, whose elements are always mapped to func one at a time, these keyword arguments are broadcast to every call.
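
The difference between mapped and broadcast arguments can be sketched in a few lines of plain Python. `toy_from_map` below is a hypothetical, eager stand-in for the mapping step only; it does not build a lazy collection and is not how Dask implements from_map.

```python
def toy_from_map(func, *iterables, **kwargs):
    """Eager stand-in for the mapping step: one result per position."""
    # Positional iterables are zipped and mapped element by element;
    # keyword arguments are passed unchanged to every call.
    return [func(*args, **kwargs) for args in zip(*iterables)]


def select(record, columns=None):
    """Keep only the requested keys of a dict-based 'partition'."""
    return {name: record[name] for name in columns}


records = [{"A": 0, "B": "x"}, {"A": 1, "B": "y"}]
partitions = toy_from_map(select, records, columns=["A"])
print(partitions)
```

Each element of `records` reaches exactly one call, while `columns=["A"]` is seen by all of them.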

Using the read_feather implementation above, it becomes both easy and efficient to convert an arbitrary Feather dataset into a lazy Dask DataFrame collection:

>>> ddf = read_feather(paths, columns=["A"], index="index")
>>> ddf
Dask DataFrame Structure:
                   A
npartitions=2
               int64
                 ...
                 ...
Dask Name: func, 1 graph layer
>>> ddf.compute()
       A
index
0      0
1      0
2      0
3      1
4      1
5      1

Advanced: Enhancing column projection

Although a read_feather implementation like the one above is likely to meet the basic needs of most applications, users will often leave out the columns argument in practice. For example:

a = read_feather(paths)["A"]

For code like this, as the implementation currently stands, each IO task would be forced to read in an entire Feather file, and then select the "A" column from a Pandas/cuDF DataFrame only after it had already been read into memory. The additional overhead is insignificant for the toy dataset used here. However, avoiding this kind of unnecessary IO can lead to dramatic performance improvements in real-world applications.

So, how can we modify our read_feather implementation to take advantage of external column-projection operations (like ddf["A"])? The good news is that from_map is already equipped with the necessary graph-optimization hooks to handle this, so long as the func object satisfies the DataFrameIOFunction protocol:

from typing import Protocol, runtime_checkable


@runtime_checkable
class DataFrameIOFunction(Protocol):
    """DataFrame IO function with projectable columns
    Enables column projection in ``DataFrameIOLayer``.
    """

    @property
    def columns(self):
        """Return the current column projection"""
        raise NotImplementedError

    def project_columns(self, columns):
        """Return a new DataFrameIOFunction object
        with a new column projection
        """
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        """Return a new DataFrame partition"""
        raise NotImplementedError

That is, all we need to do is change “Step 2” of our implementation to use the following code instead:

    from dask.dataframe.io.utils import DataFrameIOFunction

    class ReadFeather(DataFrameIOFunction):
        """Create a Pandas/cuDF DataFrame from a dataset fragment"""
        def __init__(self, columns, index):
            self._columns = columns
            self.index = index

        @property
        def columns(self):
            return self._columns

        def project_columns(self, columns):
            # Replace this object with one that will only read `columns`
            if columns != self.columns:
                return ReadFeather(columns, self.index)
            return self

        def __call__(self, frag):
            # Same logic as original `func`
            read_columns = self.columns
            if self.index and self.index not in self.columns:
                read_columns = self.columns + [self.index]
            df = from_arrow(frag.to_table(columns=read_columns))
            df = df.set_index(self.index) if self.index else df
            return df[self.columns] if self.columns else df

    func = ReadFeather(columns, index)
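
To make the effect of project_columns concrete, here is a dependency-free sketch of the same pattern operating on plain dictionaries. `ToyReader` is a hypothetical stand-in for `ReadFeather`, and the final lines only approximate what a graph optimizer could do when it encounters a selection like ddf["A"]; they are not Dask's optimization code.

```python
class ToyReader:
    """Hypothetical stand-in for `ReadFeather` that works on plain dicts."""

    def __init__(self, columns):
        self._columns = list(columns)

    @property
    def columns(self):
        return self._columns

    def project_columns(self, columns):
        # Return a new reader rather than mutating this one, so tasks
        # that still need every column are unaffected.
        if columns != self.columns:
            return ToyReader(columns)
        return self

    def __call__(self, record):
        # Only the projected columns are ever "read".
        return {name: record[name] for name in self.columns}


record = {"A": [0, 0, 0], "B": ["x", "y", "z"]}
func = ToyReader(["A", "B"])

# Roughly what an optimizer could do on seeing `ddf["A"]`.
projected = func.project_columns(["A"])
print(projected(record), func.columns)
```

The original `func` keeps its full column list, while the projected copy never touches column "B".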

Conclusion

It is now easier than ever to create a Dask DataFrame collection from an arbitrary data source. Although the dask.delayed API has already enabled similar functionality for many years, from_map now makes it possible to implement a custom IO function without sacrificing any of the high-level graph optimizations leveraged by the rest of the Dask DataFrame API.

Start experimenting with from_map today, and let us know how it goes!

Extracting fsspec from Dask

2019-07-23T00:00:00+00:00

fsspec, the new base for file system operations in Dask, Intake, s3fs, gcsfs and others, is now available as a stand-alone interface and central place to develop new backends and file operations. Although it was developed as part of Dask, you no longer need Dask to use this functionality.

Introduction

Over the past few years, Dask’s IO capability has grown gradually and organically, to include a number of file-formats, and the ability to access data seamlessly on various remote/cloud data systems. This has been achieved through a number of sister packages for viewing cloud resources as file systems, and dedicated code in dask.bytes. Some of the storage backends, particularly s3fs, became immediately useful outside of Dask too, and were picked up as optional dependencies by pandas, xarray and others.

For the sake of consolidating the behaviours of the various backends, providing a single reference specification for any new backends, and to make this set of file system operations available even without Dask, I created fsspec. This last week, Dask changed to use fsspec directly for its IO needs, and I would like to describe in detail here the benefits of this change.

Although this was done initially to ease the maintenance burden, the important takeaway is that we want to make file system operations easily available to the whole pydata ecosystem, with or without Dask.

History

The first file system I wrote was hdfs3, a thin wrapper around the libhdfs3 C library. At the time, Dask had acquired the ability to run on a distributed cluster, and HDFS was the most popular storage solution for these (in the commercial world, at least), so a solution was required. The python API closely matched the C one, which in turn followed the Java API and posix standards. Fortunately, python already has a file-like standard, so providing objects that implemented that was enough to make remote bytes available to many packages.

Pretty soon, it became apparent that cloud resources would be at least as important as in-cluster file systems, and so followed s3fs, adlfs, and gcsfs. Each followed the same pattern, but with some specific code for the given interface, and improvements based on the experience of the previous interfaces. During this time, Dask’s needs also evolved, due to more complex file formats such as parquet. Code to interface to the different backends and adapt their methods ended up in the Dask repository.

In the meantime, other file system interfaces arrived, particularly pyarrow’s, which had its own HDFS implementation and direct parquet reading. But we would like all of the tools in the ecosystem to work together well, so that Dask can read parquet using either engine from any of the storage backends.

Code duplication

Copying an interface, adapting it and releasing it, as I did with each iteration of the file system, is certainly a quick way to get a job done. However, when you then want to change the behaviour, or add new functionality, it turns out you need to repeat the work in each place (violating the DRY principle) or have the interfaces diverge slowly. Good examples of this were glob and walk: the former supported different options in each backend, and the latter returned different things (a list versus a dir/files iterator).

>>> fs = dask.bytes.local.LocalFileSystem()
>>> fs.walk('/home/path/')
<iterator of tuples>


>>> fs = s3fs.S3FileSystem()
>>> fs.walk('bucket/path')
[list of filenames]

We found that, for Dask’s needs, we needed to build small wrapper classes to ensure compatible APIs to all backends, as well as a class for operating on the local file system with the same interface, and finally a registry for all of these with various helper functions. Very little of this was specific to Dask, with only a couple of functions concerning themselves with building graphs and deferred execution. It did, however, raise the important issue that file systems should be serializable and that there should be a way to specify a file to be opened, which is also serializable (and ideally supports transparent text and compression).

New file systems

I already mentioned the effort to make a local file system class which met the same interface as the other ones which already existed. But there are more options that Dask users (and others) might want, such as ssh, ftp, http, in-memory, and so on. Following requests from users to handle these options, we started to write more file system interfaces, which all lived within dask.bytes; but it was unclear whether they should only support very minimal functionality, just enough to get something done from Dask, or a full set of file operations.

The in-memory file system, in particular, existed in an extremely long-lived PR - it’s not clear how useful such a thing is to Dask, when each worker has its own memory, and so sees a different state of the “file system”.

Consolidation

File system Spec, later fsspec, was born out of a desire to codify and consolidate the behaviours of the storage backends, reduce duplication, and provide the same functionality to all backends. In the process, it became much easier to write new implementation classes: see the implementations, which include interesting and highly experimental options such as the CachingFileSystem, which makes local copies of every remote read, for faster access the second time around. However, more important mainstream implementations also took shape, such as FTP, SSH, Memory and webHDFS (the latter being the best bet for accessing HDFS from outside the cluster, following all the problems building and authenticating with hdfs3).

Furthermore, the new repository gave the opportunity to implement new features, which would then have further-reaching applicability than if they had been done in just selected repositories. Examples include FUSE mounting, dictionary-style key-value views on file systems (such as used by zarr), and transactional writing of files. All file systems are serializable and pyarrow-compliant.
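
As an illustration of what a dictionary-style key-value view means, the sketch below wraps a local directory in a `MutableMapping`, so that relative paths act as keys and file contents as byte values. It is a standard-library toy under the hypothetical name `DirectoryMap`, not fsspec's implementation; in fsspec itself, the comparable entry point is `fsspec.get_mapper`.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class DirectoryMap(MutableMapping):
    """Expose the files under `root` as a dict of {relative path: bytes}."""

    def __init__(self, root):
        self.root = Path(root)

    def __getitem__(self, key):
        try:
            return (self.root / key).read_bytes()
        except (FileNotFoundError, IsADirectoryError):
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def __delitem__(self, key):
        try:
            (self.root / key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        for path in sorted(self.root.rglob("*")):
            if path.is_file():
                yield path.relative_to(self.root).as_posix()

    def __len__(self):
        return sum(1 for _ in self)


with tempfile.TemporaryDirectory() as tmp:
    store = DirectoryMap(tmp)
    store["group/chunk.0"] = b"abc"
    store["group/chunk.1"] = b"def"
    keys = list(store)
    value = store["group/chunk.0"]
    del store["group/chunk.1"]
    remaining = len(store)

print(keys, value, remaining)
```

A library that expects a mutable mapping of chunk names to bytes can use such an object without knowing where the bytes are stored.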

Usefulness

Eventually it dawned on me that the operations offered by the file system classes are very useful for people not using Dask too. Indeed, s3fs, for example, sees plenty of use stand-alone, or in conjunction with pandas or with something like fastparquet, which can accept file system functions in its methods.

So it seemed to make sense to have a particular repo to write out the spec that a Dask-compliant file system should adhere to, and I found that I could factor out a lot of common behaviour from the existing implementations, provide functionality that had existed in only some to all, and generally improve every implementation along the way.

However, it was when considering fsspec in conjunction with Intake that I realised how generally useful a stand-alone file system package can be: the PR implemented a generalised file selector that can browse files in any file system that we have available, even being able, for instance, to view a remote zip-file on S3 as a browseable file system. Note that, similar to the general thrust of this blog, the file selector itself need not live in the Intake repo and will eventually become either its own thing, or an optional feature of fsspec. You shouldn’t need Intake either just to get generalised file system operations.

Final Thoughts

This work is not quite on the level of “protocol standards” such as the well-known python buffer protocol, but I think it is a useful step in making data in various storage services available to people, since you can operate on each with the same API, expect the same behaviour, and create real python file-like objects to pass to other functions. Having a single central repo like this offers an obvious place to discuss and amend the spec, and build extra functionality onto it.

Many improvements remain to be done, such as support for globstrings in more functions, or a single file system which can dispatch to the various backends depending on the form of the URL provided; but there is now an obvious place for all of this to happen.
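
The URL-based dispatch idea can be sketched with a small registry keyed by protocol. Everything here is hypothetical (the `register` decorator, the two backend classes and the `dispatch` function) and is meant only to show the shape of such a mechanism, not how fsspec implements or will implement it.

```python
from urllib.parse import urlsplit

# Hypothetical registry: protocol name -> backend class.
registry = {}


def register(protocol):
    """Class decorator that records a backend under `protocol`."""
    def decorator(cls):
        registry[protocol] = cls
        return cls
    return decorator


@register("file")
class LocalBackend:
    def describe(self, path):
        return f"local file {path}"


@register("memory")
class MemoryBackend:
    def describe(self, path):
        return f"in-memory object {path}"


def dispatch(url):
    """Pick a backend from the URL's protocol; bare paths count as local."""
    parts = urlsplit(url)
    protocol = parts.scheme or "file"
    path = parts.netloc + parts.path
    return registry[protocol](), path


backend, path = dispatch("memory://bucket/key.bin")
print(backend.describe(path))
```

Adding a new backend then only requires registering one more class, and callers keep using the same `dispatch` entry point.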