Dask Working Notes - Posted in 2016

Dask Development Log

2016-12-24T00:00:00+00:00

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

Cleanup of load balancing
Found cause of worker lag
Initial Spark/Dask Dataframe comparisons
Benchmarks with asv

The last two weeks saw several disruptive changes to the scheduler and workers. This resulted in an overall performance degradation on messy workloads when compared to the most recent release, which stopped bleeding-edge users from using recent dev builds. This has been resolved, and bleeding-edge git-master is back up to the old speed and then some.

As a visual aid, this is what bad (or in this case random) load balancing looks like:

Identified and removed worker lag

For a while there have been significant gaps of 100ms or more between successive tasks in workers, especially when using Pandas. This was particularly odd because the workers had lots of backed up work to keep them busy (thanks to the nice load balancing from before). The culprit here was the calculation of the size of the intermediate on object dtype dataframes.

Explaining this in greater depth, recall that to schedule intelligently, the workers calculate the size in bytes of every intermediate result they produce. Often this is quite fast; for example, for numpy arrays we can just multiply the number of elements by the dtype itemsize. However, for object dtype arrays or dataframes (which are commonly used for text) it can take a long while to calculate an accurate result. Now we no longer calculate an accurate result, but instead take a fairly pessimistic guess. The gaps between tasks shrink considerably.
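The trade-off can be sketched in plain Python. This is only an illustration using a list of strings rather than Pandas objects; the function names and the 100-byte-per-element guess are assumptions made for this example, not Dask's actual size calculation.

```python
import sys


def exact_nbytes(objs):
    """Accurate size: visits every element, so cost grows with len(objs)."""
    return sys.getsizeof(objs) + sum(sys.getsizeof(o) for o in objs)


def guessed_nbytes(objs, per_item_guess=100):
    """Cheap pessimistic guess: assume a fixed cost per element (hypothetical value)."""
    return sys.getsizeof(objs) + per_item_guess * len(objs)


words = ["dask", "spark", "pandas"] * 1000
exact = exact_nbytes(words)      # walks all 3000 strings
guess = guessed_nbytes(words)    # never touches the strings themselves
```

The guess overshoots for short strings, which is the safe direction for a scheduler deciding whether moving a result is expensive.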

There is still a significant bit of lag, around 10ms long, between tasks on these workloads (see zoomed version on the right). On other workloads we’re able to get inter-task lag down to the tens of microseconds scale. While 10ms may not sound like a long time, when we perform very many very short tasks this can quickly become a bottleneck.
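To see why 10ms matters, here is a back-of-the-envelope sketch. The task durations are made-up values for illustration; only the 10ms lag comes from the measurements above.

```python
def utilization(task_ms, lag_ms):
    """Fraction of wall-clock time a worker spends actually running tasks."""
    return task_ms / (task_ms + lag_ms)


long_tasks = utilization(task_ms=1000, lag_ms=10)   # about 0.99
short_tasks = utilization(task_ms=1, lag_ms=10)     # about 0.09
```

With one-second tasks the lag is noise; with one-millisecond tasks the worker sits idle more than 90% of the time.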

Anyway, this change reduced shuffle overhead by a factor of two. Things are starting to look pretty snappy for many-small-task workloads.

Initial Spark/Dask Dataframe Comparisons

I would like to run a small benchmark comparing Dask and Spark DataFrames. I spent a bit of the last couple of days using Spark locally on the NYC Taxi data and futzing with cluster deployment tools to set up Spark clusters on EC2 for basic benchmarking. I ran across flintrock, which has been highly recommended to me a few times.

I’ve been thinking about how to do benchmarks in an unbiased way. Comparative benchmarks are useful to have around to motivate projects to grow and learn from each other. However in today’s climate where open source software developers have a vested interest, benchmarks often focus on a project’s strengths and hide its deficiencies. Even with the best of intentions and practices, a developer is likely to correct for deficiencies on the fly. They’re much more able to do this for their own project than for others’. Benchmarks end up looking more like sales documents than trustworthy research.

My tentative plan is to reach out to a few Spark devs and see if we can collaborate on a problem set and hardware before running computations and comparing results.

Benchmarks with airspeed velocity

Rich Postelnik is building on work from Tom Augspurger to build out benchmarks for Dask using airspeed velocity at dask-benchmarks. Building out benchmarks is a great way to get involved if anyone is interested.

Pre-pre-release

I intend to publish a pre-release for a 0.X.0 version bump of dask/dask and dask/distributed sometime next week.

Dask Development Log

2016-12-18T00:00:00+00:00

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Themes of last week:

Benchmarking new scheduler and worker on larger systems
Kubernetes and Google Container Engine
Fastparquet on S3

In the last two weeks we rewrote a significant fraction of the worker and scheduler. This enables future growth, but also resulted in a loss of our load balancing and work stealing algorithms (the old ones no longer made sense in the context of the new system). Careful dynamic load balancing is essential to running atypical workloads (which are surprisingly typical among Dask users) so rebuilding this has been all-consuming this week for me personally.

Briefly, Dask initially assigns tasks to workers taking into account the expected runtime of the task, the size and location of the data that the task needs, the duration of other tasks on every worker, and where each piece of data sits on all of the workers. Because the number of tasks can grow into the millions and the number of workers can grow into the thousands, Dask needs to figure out a near-optimal placement in near-constant time, which is hard. Furthermore, after the system runs for a while, uncertainties in our estimates build, and we need to rebalance work from saturated workers to idle workers relatively frequently. Load balancing intelligently and responsively is essential to a satisfying user experience.
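A stripped-down version of this kind of placement decision might look like the following. This is a sketch under stated assumptions, not the scheduler's real algorithm: the function name, the shape of the `workers` dictionary, and the 100 MB/s bandwidth are all hypothetical, and it scans every worker, which the real scheduler cannot afford to do at scale.

```python
def choose_worker(dependencies, workers, bandwidth=100e6):
    """Pick the worker where the task is expected to start soonest.

    dependencies: {key: size_in_bytes} for the data the task needs
    workers: {name: {"occupancy": seconds_of_queued_work, "has": set_of_keys}}
    bandwidth: assumed network bandwidth in bytes per second
    """
    def expected_start(name):
        state = workers[name]
        # Bytes that would have to be transferred to this worker first
        missing = sum(size for key, size in dependencies.items()
                      if key not in state["has"])
        return state["occupancy"] + missing / bandwidth

    return min(workers, key=expected_start)


workers = {
    "a": {"occupancy": 0.5, "has": {"x"}},   # busy, but already holds x
    "b": {"occupancy": 0.0, "has": set()},   # idle, would need to fetch x
}
big = choose_worker({"x": 1e9}, workers)     # moving 1 GB costs 10 s, so stay with "a"
small = choose_worker({"x": 1e6}, workers)   # moving 1 MB costs 0.01 s, so use idle "b"
```

The same task lands on different workers depending on how large its input is, which is why size estimates and data locations both feed into the decision.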

We have a decently strong test suite around these behaviors, but it’s hard to be comprehensive on performance-based metrics like this, so there has also been a lot of benchmarking against real systems to identify new failure modes. We’re doing what we can to create isolated tests for every failure mode that we find to make future rewrites retain good behavior.

Generally working on the Dask distributed scheduler has taught me the brittleness of unit tests. As we have repeatedly rewritten internals while maintaining the same external API our testing strategy has evolved considerably away from fine-grained unit tests to a mixture of behavioral integration tests and a very strict runtime validation system.

Rebuilding the load balancing algorithms has been high priority for me personally because these performance issues inhibit current power-users from using the development version on their problems as effectively as with the latest release. I’m looking forward to seeing load-balancing humming nicely again so that users can return to git-master and so that I can return to handling a broader base of issues. (Sorry to everyone I’ve been ignoring the last couple of weeks).

Test deployments on Google Container Engine

I’ve personally started switching over my development cluster from Amazon’s EC2 to Google’s Container Engine. Here are some pros and cons from my particular perspective. Many of these probably have more to do with how I use each particular tool rather than intrinsic limitations of the service itself.

In Google’s Favor

Native and immediate support for Kubernetes and Docker, the combination of which allows me to more quickly and dynamically create and scale clusters for different experiments.
Dynamic scaling from a single node to a hundred nodes and back ten minutes later allows me to more easily run a much larger range of scales.
I like being charged by the minute rather than by the hour, especially given the ability to dynamically scale up
Authentication and billing feel simpler

In Amazon’s Favor

I already have tools to launch Dask on EC2
All of my data is on Amazon’s S3
I have nice data acquisition tools, s3fs, for S3 based on boto3. Google doesn’t seem to have a nice Python 3 library for accessing Google Cloud Storage :(

I’m working from Olivier Grisel’s repository docker-distributed although updating to newer versions and trying to use as few modifications from naive deployment as possible. My current branch is here. I hope to have something more stable for next week.

Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on Friday. I was surprised that everything seemed to work out of the box. Martin Durant, who built both fastparquet and s3fs, has done some nice work to make sure that all of the pieces play nicely together. We ran into some performance issues pulling bytes from S3 itself. I expect that there will be some tweaking over the next few weeks.

Dask Development Log

2016-12-12T00:00:00+00:00

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-05 and 2016-12-12. Nothing here is stable or ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

Dask.array without known chunk sizes
Import time
Fastparquet blogpost and feedback
Scheduler improvements for 1000+ worker clusters
Channels and inter-client communication
New dependencies?

Dask arrays can now work even in situations where we don’t know the exact chunk size. This is particularly important because it allows us to convert dask.dataframes to dask.arrays in a standard analysis cycle that includes both data preparation and statistical or machine learning algorithms.

x = df.values

x = df.to_records()

This work was motivated by the work of Christopher White on building scalable solvers for problems like logistic regression and generalized linear models over at dask-glm.

As a pleasant side effect we can now also index dask.arrays with dask.arrays (a previous limitation)

x[x > 0]

and mutate dask.arrays in certain cases with setitem

x[x > 0] = 0

Both of these are frequently requested.

However, there are still holes in this implementation and many operations (like slicing) generally don’t work on arrays without known chunk sizes. We’re increasing capability here but blurring the lines of what is possible and what is not possible, which used to be very clear.
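The reason slicing is hard without known chunk sizes can be shown with plain Python lists standing in for array blocks. This is an illustrative sketch only; `locate` is a hypothetical helper, not a dask.array function.

```python
from bisect import bisect_right
from itertools import accumulate

chunks = [[3, -1, 4], [-1, -5, 9], [2, -6, 5]]   # an "array" stored as three blocks

# Known chunk sizes: element i can be located without touching any data.
sizes = [len(c) for c in chunks]
offsets = list(accumulate(sizes))                # [3, 6, 9]


def locate(i):
    """Return (block number, position within block) for global index i."""
    block = bisect_right(offsets, i)
    start = offsets[block - 1] if block else 0
    return block, i - start


# Boolean indexing, x[x > 0]: each block is filtered independently, and the
# length of each result depends on the values, so it is unknown until computed.
filtered = [[v for v in c if v > 0] for c in chunks]
filtered_sizes = [len(c) for c in filtered]      # only available after the work is done
```

In a lazy system the filtered blocks have not been computed yet, so there are no offsets to search and an index like `y[4]` cannot be mapped to a block ahead of time.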

Import time

Import times had been steadily climbing for a while, rising above one second at times. These were reduced by Antoine Pitrou down to a more reasonable 300ms.

FastParquet blogpost and feedback

Martin Durant has built a nice Python Parquet library here: http://fastparquet.readthedocs.io/en/latest/ and released a blogpost about it last week here: https://www.continuum.io/blog/developer-blog/introducing-fastparquet

Since then we’ve gotten some good feedback and error reports (non-string column names, etc.). Martin has been optimizing performance and recently added append support.

Scheduler optimizations for 1000+ worker clusters

The recent refactoring of the scheduler and worker exposed new opportunities for performance and for measurement. One of the 1000+ worker deployments here in NYC was kind enough to volunteer some compute time to run some experiments. It was very fun having all of the Dask/Bokeh dashboards up at once (there are now half a dozen of these things) giving live monitoring information on a thousand-worker deployment. It’s stunning how clearly performance issues present themselves when you have the right monitoring system.

Anyway, this led to better sequentialization when handling messages, greatly reduced open file handle requirements, and the use of cytoolz over toolz in a few critical areas.

I intend to try this experiment again this week, now with new diagnostics. To aid in that we’ve made it very easy to turn timings and counters automatically into live Bokeh plots. It now takes literally one line of code to add a new plot to these pages (left: scheduler right: worker)

Already we can see that the time it takes to connect between workers is absurdly high in the 10ms to 100ms range, highlighting an important performance flaw.

This depends on an experimental project, crick, by Jim Crist that provides a fast T-Digest implemented in C (see also Ted Dunning’s implementation).

Channels and inter-client communication

I’m starting to experiment with mechanisms for inter-client communication of futures. This enables both collaborative workflows (two researchers sharing the same cluster) and also complex workflows in which tasks start other tasks in a more streaming setting.

We added a simple mechanism to share a rolling buffer of futures between clients:

# Client 1
c = Client('scheduler:8786')
x = c.channel('x')

future = c.submit(inc, 1)
x.put(future)

# Client 2
c = Client('scheduler:8786')
x = c.channel('x')

future = next(iter(x))

Additionally, this relatively simple mechanism was built external to the scheduler and client, establishing a pattern we can repeat in the future for more complex inter-client communication systems. Generally I’m on the lookout for other ways to make the system more extensible. This range of extension requests for the scheduler is somewhat large these days and we’d like to find ways to keep these expansions maintainable going forward.

dask/distributed#729
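The "rolling buffer" idea behind the channel can be sketched with the standard library. This is a single-process toy, assuming that "rolling" means old entries are dropped once the buffer is full; the class name and `maxlen` default are hypothetical, and it does not model the network communication or any blocking behavior of the real channels.

```python
from collections import deque


class RollingBuffer:
    """Keep only the most recent ``maxlen`` items put by any producer."""

    def __init__(self, maxlen=3):
        self._items = deque(maxlen=maxlen)   # old items fall off the left end

    def put(self, item):
        self._items.append(item)

    def __iter__(self):
        return iter(list(self._items))       # snapshot, oldest first


buf = RollingBuffer(maxlen=3)
for name in ["future-0", "future-1", "future-2", "future-3"]:
    buf.put(name)                            # one client puts
latest = list(buf)                           # another client reads
```

Because the buffer is bounded, a slow reader sees only the most recent entries rather than causing unbounded growth.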

New dependency: Sorted collections

The scheduler is now using the sortedcollections module, which is based on sortedcontainers, a pure-Python library offering sorted containers SortedList, SortedSet, ValueSortedDict, etc. at C-extension speeds.

So far I’m pretty sold on these libraries. I encourage other library maintainers to consider them.

Dask Development Log

2016-12-05T00:00:00+00:00

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation

Dask has been active lately due to a combination of increased adoption and funded feature development by private companies. This increased activity is great, however an unintended side effect is that I have spent less time writing about development and engaging with the broader community. To address this I hope to write one blogpost a week about general development. These will not be particularly polished, nor will they announce ready-to-use features for users, however they should increase transparency and hopefully better engage the developer community.

So themes of last week

Embedded Bokeh servers for the Workers
Smarter workers
An overhauled scheduler that is slightly simpler overall (thanks to the smarter workers) but with more clever work stealing
Fastparquet

The distributed scheduler’s web diagnostic page is one of Dask’s more flashy features. It shows the passage of every computation on the cluster in real time. These diagnostics are invaluable for understanding performance both for users and for core developers.

I intend to focus on worker performance soon, so I decided to attach a Bokeh server to every worker to serve web diagnostics about that worker. To make this easier, I also learned how to embed Bokeh servers inside of other Tornado applications. This has reduced the effort to create new visuals and expose real time information considerably and I can now create a full live visualization in around 30 minutes. It is now faster for me to build a new diagnostic than to grep through logs. It’s pretty useful.

Here are some screenshots. Nothing too flashy, but this information is highly valuable to me as I measure bandwidths, delays of various parts of the code, how workers send data between each other, etc.

To be clear, these diagnostic pages aren’t polished in any way. There’s lots missing, it’s just what I could get done in a day. Still, everyone running a Tornado application should have an embedded Bokeh server running. They’re great for rapidly pushing out visually rich diagnostics.

Smarter Workers and a Simpler Scheduler

Previously the scheduler knew everything and the workers were fairly simple-minded. Now we’ve moved some of the knowledge and responsibility over to the workers. Previously the scheduler would give just enough work to the workers to keep them occupied. This allowed the scheduler to make better decisions about the state of the entire cluster. By delaying committing a task to a worker until the last moment we made sure that we were making the right decision. However, this also means that the worker sometimes has idle resources, particularly network bandwidth, when it could be speculatively preparing for future work.

Now we commit all ready-to-run tasks to a worker immediately and that worker has the ability to pipeline those tasks as it sees fit. This is better locally but slightly worse globally. To counterbalance this we’re now being much more aggressive about work stealing and, because the workers have more information, they can manage some of the administrative costs of work stealing themselves. Because this isn’t bound to run on just the scheduler we can use more expensive algorithms than when we did everything on the scheduler.
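The basic shape of work stealing can be sketched as follows. This is a deliberately naive illustration, not the algorithm used in Dask: the `steal` function and the queue layout are hypothetical, and it ignores task duration and data transfer costs, which a real implementation has to weigh.

```python
def steal(queues):
    """Move queued tasks from the busiest worker to idle workers.

    queues: {worker_name: list_of_queued_tasks}; mutated in place.
    Returns a list of (task, victim, thief) moves.
    """
    moves = []
    idle = [w for w, q in queues.items() if not q]
    for thief in idle:
        victim = max(queues, key=lambda w: len(queues[w]))
        if len(queues[victim]) < 2:
            break                     # nothing worth stealing
        task = queues[victim].pop()   # take the task that would have run last
        queues[thief].append(task)
        moves.append((task, victim, thief))
    return moves


queues = {"a": ["t1", "t2", "t3", "t4"], "b": [], "c": []}
moves = steal(queues)
```

Committing everything to worker "a" up front is what creates the imbalance; stealing is the correction applied afterwards.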

There were a few motivations for this change:

Dataframe performance was bound by keeping the worker hardware fully occupied, which we weren’t doing. I expect that these changes will eventually yield something like a 30% speedup.
Users on traditional job scheduler machines (SGE, SLURM, TORQUE) and users who like GPUs both wanted the ability to tag tasks with specific resource constraints like “This consumes one GPU” or “This task requires 5GB of RAM while running” and ensure that workers would respect those constraints when running tasks. The old workers weren’t complex enough to reason about these constraints. With the new workers, adding this feature was trivial.
By moving logic from the scheduler to the worker we’ve actually made them both easier to reason about. This should lower barriers for contributors to get into the core project.
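The resource constraints mentioned above amount to simple bookkeeping on the worker. Here is a minimal sketch, assuming resources are named quantities that tasks acquire and release; the class and method names are hypothetical and not part of the Dask worker API.

```python
class WorkerResources:
    """Track abstract resources (e.g. GPUs, memory) available on one worker."""

    def __init__(self, **total):
        self.available = dict(total)

    def can_run(self, required):
        return all(self.available.get(name, 0) >= amount
                   for name, amount in required.items())

    def acquire(self, required):
        """Reserve resources for a task; return False if it must wait."""
        if not self.can_run(required):
            return False
        for name, amount in required.items():
            self.available[name] -= amount
        return True

    def release(self, required):
        """Give resources back once the task has finished."""
        for name, amount in required.items():
            self.available[name] += amount


worker = WorkerResources(GPU=1, MEMORY=8e9)
first = worker.acquire({"GPU": 1})                  # the single GPU is now taken
second = worker.acquire({"GPU": 1})                 # refused: must wait for the GPU
worker.release({"GPU": 1})
third = worker.acquire({"GPU": 1, "MEMORY": 5e9})   # succeeds once the GPU is free
```

A worker that already decides the order in which to run its own tasks can consult a table like this before starting each one.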

Dataframe algorithms

Approximate nunique and multiple-output-partition groupbys landed in master last week. These arose because some power-users had very large dataframes that were running into scalability limits. Thanks to Mike Graham for the approximate nunique algorithm. This has also pushed hashing changes upstream to Pandas.
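For readers unfamiliar with approximate distinct counting, here is a small sketch of HyperLogLog, one widely used technique for this problem. The post does not say which algorithm was implemented, so treat this as background rather than a description of the Dask code. It omits the small-range correction, so it is only reasonable for cardinalities well above the number of registers.

```python
import hashlib


def approx_nunique(values, p=12):
    """Estimate the number of distinct values with 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(repr(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)             # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1      # position of the first 1-bit
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)                 # standard bias correction
    return alpha * m * m / sum(2.0 ** -r for r in registers)


values = list(range(50000)) * 2      # 100,000 items, 50,000 distinct
estimate = approx_nunique(values)    # memory use is fixed at 4096 registers
```

The appeal for very large dataframes is that registers from different partitions can be merged with an elementwise maximum, so no partition ever needs to hold the full set of distinct values.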

Fast Parquet

Martin Durant has been working on a Parquet reader/writer for Python using Numba. It’s pretty slick. He’s been using it on internal Continuum projects for a little while and has seen both good performance and a very Pythonic experience for what was previously a format that was pretty inaccessible.

He’s planning to write about this in the near future so I won’t steal his thunder. Here is a link to the documentation: fastparquet.readthedocs.io

Dask Cluster Deployments

2016-09-22T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

All code in this post is experimental. It should not be relied upon. For people looking to deploy dask.distributed on a cluster please refer instead to the documentation.

Dask is deployed today on the following systems in the wild:

SGE
SLURM
Torque
Condor
LSF
Mesos
Marathon
Kubernetes
SSH and custom scripts
… there may be more. This is what I know of first-hand.

These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment.

The people deploying Dask on these cluster resource managers are power-users; they know how their resource managers work and they read the documentation on how to set up Dask clusters. Generally these users are pretty happy; however we should reduce this barrier so that non-power-users with access to a cluster resource manager can use Dask on their cluster just as easily.

Unfortunately, there are a few challenges:

Several cluster resource managers exist, each with significant adoption. Finite developer time stops us from supporting all of them.
Policies for scaling out vary widely. For example we might want a fixed number of workers, or we might want workers that scale out based on current use. Different groups will want different solutions.
Individual cluster deployments are highly configurable. Dask needs to get out of the way quickly and let existing technologies configure themselves.

This post talks about some of these issues. It does not contain a definitive solution.

For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O’Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes.

These instructions are well organized. They include Dockerfiles, published images, Kubernetes config files, and instructions on how to interact with cloud providers’ infrastructure. Olivier and Tim both obviously know what they’re doing and care about helping others to do the same.

Tim (who came second) wasn’t aware of Olivier’s solution and wrote up his own. Tim was capable of doing this but many beginners wouldn’t be.

One solution would be to include a prominent registry of solutions like these within Dask documentation so that people can find quality references to use as starting points. I’ve started a list of resources here: dask/distributed #547. Comments pointing to other resources would be most welcome.

However, even if Tim did find Olivier’s solution I suspect he would still need to change it. Tim has different software and scalability needs than Olivier. This raises the question of “What should Dask provide and what should it leave to administrators?” It may be that the best we can do is to support copy-paste-edit workflows.

What is Dask-specific, resource-manager specific, and what needs to be configured by hand each time?

Adaptive Deployments

In order to explore this topic of separable solutions I built a small adaptive deployment system for Dask.distributed on Marathon, an orchestration platform on top of Mesos.

This solution does two things:

It scales a Dask cluster dynamically based on the current use. If there are more tasks in the scheduler then it asks for more workers.
It deploys those workers using Marathon.

To encourage replication, these two different aspects are solved in two different pieces of code with a clean API boundary.

A backend-agnostic piece for adaptivity that says when to scale workers up and how to scale them down safely
A Marathon-specific piece that deploys or destroys dask-workers using the Marathon HTTP API

This combines a policy, adaptive scaling, with a backend, Marathon, such that either can be replaced easily. For example we could replace the adaptive policy with a fixed one to always keep N workers online, or we could replace Marathon with Kubernetes or Yarn.

My hope is that this demonstration encourages others to develop third party packages. The rest of this post will be about diving into this particular solution.

Adaptivity

The distributed.deploy.Adaptive class wraps around a Scheduler and determines when we should scale up and by how many nodes, and when we should scale down specifying which idle workers to release.

The current policy is fairly straightforward:

If there are unassigned tasks or any stealable tasks and no idle workers, or if the average memory use is over 50%, then increase the number of workers by a fixed factor (defaults to two).
If there are idle workers and the average memory use is below 50% then reclaim the idle workers with the least data on them (after moving data to nearby workers) until we’re near 50%
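The two rules above can be written down as a small decision function. This is a paraphrase of the policy as described in this post, not the code in distributed.deploy.Adaptive; the function name, arguments, and return values are invented for illustration, and it does not model stopping the scale-down once memory use nears 50%.

```python
def recommend(unassigned, stealable, idle, n_workers, memory_fraction, factor=2):
    """Decide what the cluster should do next.

    unassigned, stealable: counts of tasks in each state
    idle: {worker_name: bytes_stored} for workers with nothing to do
    n_workers: current number of workers
    memory_fraction: average memory use across workers, between 0 and 1
    """
    starved = bool(unassigned or stealable) and not idle
    if starved or memory_fraction > 0.5:
        return "up", max(n_workers, 1) * factor
    if idle and memory_fraction < 0.5:
        # Release idle workers holding the least data first
        return "down", sorted(idle, key=idle.get)
    return "hold", None


busy = recommend(10, 0, {}, n_workers=4, memory_fraction=0.2)
quiet = recommend(0, 0, {"a": 5e6, "b": 1e3}, n_workers=4, memory_fraction=0.2)
```

Here `busy` asks for the cluster to double to eight workers, while `quiet` suggests releasing worker "b" before "a" because it holds less data.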

Think this policy could be improved or have other thoughts? Great. It was easy to implement and entirely separable from the main code so you should be able to edit it easily or create your own. The current implementation is about 80 lines (source).

However, this Adaptive class doesn’t actually know how to perform the scaling. Instead it depends on being handed a separate object, with two methods, scale_up and scale_down:

class MyCluster(object):
    def scale_up(self, n):
        """
        Bring the total count of workers up to ``n``

        This function/coroutine should bring the total number of workers up to
        the number ``n``.
        """
        raise NotImplementedError()

    def scale_down(self, workers):
        """
        Remove ``workers`` from the cluster

        Given a list of worker addresses this function should remove those
        workers from the cluster.
        """
        raise NotImplementedError()

This cluster object contains the backend-specific bits of how to scale up and down, but none of the adaptive logic of when to scale up and down. The single-machine LocalCluster object serves as reference implementation.
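To make the interface concrete, here is a toy backend that satisfies it, paired with the simplest possible policy. Both are hypothetical sketches written for this post: `InMemoryCluster` only tracks worker names in a set, and `keep_fixed` stands in for a policy object.

```python
class InMemoryCluster(object):
    """Toy backend: 'workers' are just names held in a set."""

    def __init__(self):
        self.workers = set()
        self._next_id = 0

    def scale_up(self, n):
        # Bring the total count of workers up to n
        while len(self.workers) < n:
            self.workers.add("worker-%d" % self._next_id)
            self._next_id += 1

    def scale_down(self, workers):
        # Remove the named workers from the cluster
        self.workers.difference_update(workers)


def keep_fixed(cluster, n):
    """A trivial policy: always restore the cluster to n workers."""
    cluster.scale_up(n)


cluster = InMemoryCluster()
keep_fixed(cluster, 3)                 # starts worker-0, worker-1, worker-2
cluster.scale_down(["worker-1"])       # simulate releasing one worker
keep_fixed(cluster, 3)                 # policy notices the gap and refills it
```

The policy never needs to know how workers are created, and the backend never needs to know why it is being asked to scale.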

So we combine this adaptive scheme with a deployment scheme. We’ll use a tiny Dask-Marathon deployment library available here

from dask_marathon import MarathonCluster
from distributed import Scheduler
from distributed.deploy import Adaptive

s = Scheduler()
mc = MarathonCluster(s, cpus=1, mem=4000,
                     docker_image='mrocklin/dask-distributed')
ac = Adaptive(s, mc)

This combines a policy, Adaptive, with a deployment scheme, Marathon, in a composable way. The Adaptive cluster watches the scheduler and calls the scale_up/down methods on the MarathonCluster as necessary.

Marathon code

Because we’ve isolated all of the “when” logic to the Adaptive code, the Marathon specific code is blissfully short and specific. We include a slightly simplified version below. There is a fair amount of Marathon-specific setup in the constructor and then simple scale_up/down methods below:

import uuid

from marathon import MarathonClient, MarathonApp
from marathon.models.container import MarathonContainer


class MarathonCluster(object):
    def __init__(self, scheduler,
                 executable='dask-worker',
                 docker_image='mrocklin/dask-distributed',
                 marathon_address='http://localhost:8080',
                 name=None, cpus=1, mem=4000, **kwargs):
        self.scheduler = scheduler

        # Create Marathon App to run dask-worker
        args = [
            executable,
            scheduler.address,
            '--nthreads', str(cpus),
            '--name', '$MESOS_TASK_ID',  # use Mesos task ID as worker name
            '--worker-port', '$PORT_WORKER',
            '--nanny-port', '$PORT_NANNY',
            '--http-port', '$PORT_HTTP'
        ]

        ports = [{'port': 0,
                  'protocol': 'tcp',
                  'name': port_name}
                 for port_name in ['worker', 'nanny', 'http']]

        args.extend(['--memory-limit',
                     str(int(mem * 0.6 * 1e6))])

        kwargs['cmd'] = ' '.join(args)
        container = MarathonContainer({'image': docker_image})

        app = MarathonApp(instances=0,
                          container=container,
                          port_definitions=ports,
                          cpus=cpus, mem=mem, **kwargs)

        # Connect and register app
        self.client = MarathonClient(marathon_address)
        self.app = self.client.create_app(name or 'dask-%s' % uuid.uuid4(), app)

    def scale_up(self, instances):
        self.client.scale_app(self.app.id, instances=instances)

    def scale_down(self, workers):
        for w in workers:
            self.client.kill_task(self.app.id,
                                  self.scheduler.worker_info[w]['name'],
                                  scale=True)

This isn’t trivial, you need to know about Marathon for this to make sense, but fortunately you don’t need to know much else. My hope is that people familiar with other cluster resource managers will be able to write similar objects and will publish them as third party libraries as I have with this Marathon solution here: mrocklin/dask-marathon (thanks goes to Ben Zaitlen for setting up a great testing harness for this and getting everything started.)

Adaptive Policies

Similarly, we can design new policies for deployment. You can read more about the policies for the Adaptive class in the documentation or the source (about eighty lines long). I encourage people to implement and use other policies and contribute back those policies that are useful in practice.

Final thoughts

We laid out a problem

How does a distributed system support a variety of cluster resource managers and a variety of scheduling policies while remaining sensible?

We proposed two solutions:

Maintain a registry of links to solutions, supporting copy-paste-edit practices
Develop an API boundary that encourages separable development of third party libraries.

It’s not clear that either solution is sufficient, or that the current implementation of either solution is any good. This is an important problem though as Dask.distributed is, today, still mostly used by super-users. I would like to engage community creativity here as we search for a good solution.

Dask and Celery

2016-09-13T00:00:00+00:00

This post compares two Python distributed task processing systems, Dask.distributed and Celery.

Disclaimer: technical comparisons are hard to do well. I am biased towards Dask and ignorant of correct Celery practices. Please keep this in mind. Critical feedback by Celery experts is welcome.

Celery is a distributed task queue built in Python and heavily used by the Python community for task-based workloads.

Dask is a parallel computing library popular within the PyData community that has grown a fairly sophisticated distributed task scheduler. This post explores if Dask.distributed can be useful for Celery-style problems.

Comparing technical projects is hard both because authors have bias, and also because the scope of each project can be quite large. This allows authors to gravitate towards the features that show off their own strengths. Fortunately a Celery user asked how Dask compares on GitHub and they listed a few concrete features:

Handling multiple queues
Canvas (celery’s workflow)
Rate limiting
Retrying

These provide an opportunity to explore the Dask/Celery comparison from the bias of a Celery user rather than from the bias of a Dask developer.

In this post I’ll point out a couple of large differences, then go through the Celery hello world in both projects, and then address how these requested features are implemented or not within Dask. This anecdotal comparison over a few features should give us a general comparison.

First, the biggest difference (from my perspective) is that Dask workers hold onto intermediate results and communicate data between each other while in Celery all results flow back to a central authority. This difference was critical when building out large parallel arrays and dataframes (Dask’s original purpose) where we needed to engage our worker processes’ memory and inter-worker communication bandwidths. Computational systems like Dask do this, more data-engineering systems like Celery/Airflow/Luigi don’t. This is the main reason why Dask wasn’t built on top of Celery/Airflow/Luigi originally.

That’s not a knock against Celery/Airflow/Luigi by any means. Typically they’re used in settings where this doesn’t matter and they’ve focused their energies on several features that Dask similarly doesn’t care about or do well. Tasks usually read data from some globally accessible store like a database or S3 and either return very small results, or place larger results back in the global store.

The question on my mind now is: Can Dask be a useful solution in more traditional loose task scheduling problems where projects like Celery are typically used? What are the benefits and drawbacks?

Hello World

To start we do the First steps with Celery walk-through both in Celery and Dask and compare the two:

Celery

I follow the Celery quickstart, using Redis instead of RabbitMQ because it’s what I happen to have handy.

# tasks.py

from celery import Celery

app = Celery('tasks', broker='redis://localhost', backend='redis')

@app.task
def add(x, y):
    return x + y

redis-server
celery -A tasks worker --loglevel=info

In [1]: from tasks import add

In [2]: %time add.delay(1, 1).get()  # submit and retrieve roundtrip
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 567 ms
Out[2]: 2

In [3]: %%time
...: futures = [add.delay(i, i) for i in range(1000)]
...: results = [f.get() for f in futures]
...:
CPU times: user 888 ms, sys: 72 ms, total: 960 ms
Wall time: 1.7 s

Dask

We do the same workload with dask.distributed’s concurrent.futures interface, using the default single-machine deployment.

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: from operator import add

In [4]: %time c.submit(add, 1, 1).result()
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 20.7 ms
Out[4]: 2

In [5]: %%time
...: futures = [c.submit(add, i, i) for i in range(1000)]
...: results = c.gather(futures)
...:
CPU times: user 328 ms, sys: 12 ms, total: 340 ms
Wall time: 369 ms

Comparison

Functions: In Celery you register computations ahead of time on the server. This is good if you know what you want to run ahead of time (such as is often the case in data engineering workloads) and don’t want the security risk of allowing users to run arbitrary code on your cluster. It’s less pleasant for users who want to experiment. In Dask we choose the functions to run on the user side, not on the server side. This ends up being pretty critical in data exploration but may be a hindrance in more conservative/secure compute settings.
Setup: In Celery we depend on other widely deployed systems like RabbitMQ or Redis. Dask depends on lower-level Tornado TCP IOStreams and Dask’s own custom routing logic. This makes Dask trivial to set up, but also probably less durable. Redis and RabbitMQ have both solved lots of problems that come up in the wild and leaning on them inspires confidence.
Performance: They both operate with sub-second latencies and millisecond-ish overheads. Dask is marginally lower-overhead but for data engineering workloads differences at this level are rarely significant. Dask is an order of magnitude lower-latency, which might be a big deal depending on your application. For example if you’re firing off tasks from a user clicking a button on a website 20ms is generally within interactive budget while 500ms feels a bit slower.

Simple Dependencies

The question asked about Canvas, Celery’s dependency management system.

Often tasks depend on the results of other tasks. Both systems have ways to help users express these dependencies.

Celery

The apply_async method has a link= parameter that can be used to call tasks after other tasks have run. For example we can compute (1 + 2) + 3 in Celery as follows:

add.apply_async((1, 2), link=add.s(3))

Dask.distributed

With the Dask concurrent.futures API, futures can be used within submit calls and dependencies are implicit.

x = c.submit(add, 1, 2)
y = c.submit(add, x, 3)

We could also use the dask.delayed decorator to annotate arbitrary functions and then use normal-ish Python.

@dask.delayed
def add(x, y):
    return x + y

x = add(1, 2)
y = add(x, 3)
y.compute()

Comparison

I prefer the Dask solution, but that’s subjective.

Complex Dependencies

Celery

Celery includes a rich vocabulary of terms to connect tasks in more complex ways including groups, chains, chords, maps, starmaps, etc. More detail is available in their docs for Canvas, the system they use to construct complex workflows: http://docs.celeryproject.org/en/master/userguide/canvas.html

For example here we chord many adds and then follow them with a sum.

In [1]: from tasks import add, tsum  # I had to add a sum method to tasks.py

In [2]: from celery import chord

In [3]: %time chord(add.s(i, i) for i in range(100))(tsum.s()).get()
CPU times: user 172 ms, sys: 12 ms, total: 184 ms
Wall time: 1.21 s
Out[3]: 9900

Dask

Dask’s trick of allowing futures in submit calls actually goes pretty far. Dask doesn’t really need any additional primitives. It can do all of the patterns expressed in Canvas fairly naturally with normal submit calls.

In [4]: %%time
...: futures = [c.submit(add, i, i) for i in range(100)]
...: total = c.submit(sum, futures)
...: total.result()
...:
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 60.8 ms

Or with Dask.delayed

futures = [add(i, i) for i in range(100)]
total = dask.delayed(sum)(futures)
total.compute()

Multiple Queues

In Celery there is a notion of queues to which tasks can be submitted and that workers can subscribe. An example use case is having “high priority” workers that only process “high priority” tasks. Every worker can subscribe to the high-priority queue but certain workers will subscribe to that queue exclusively:

celery -A my-project worker -Q high-priority  # only subscribe to high priority
celery -A my-project worker -Q celery,high-priority  # subscribe to both
celery -A my-project worker -Q celery,high-priority
celery -A my-project worker -Q celery,high-priority

This is like the TSA pre-check line or the express lane in the grocery store.

Dask has a couple of topics that are similar or could fit this need in a pinch, but nothing that is strictly analogous.

First, for the common case above, tasks have priorities. These are typically set by the scheduler to minimize memory use but can be overridden directly by users to give certain tasks precedence over others.

Second, you can restrict tasks to run on subsets of workers. This was originally designed for data-local storage systems like the Hadoop FileSystem (HDFS) or clusters with special hardware like GPUs but can be used in the queues case as well. It’s not quite the same abstraction but could be used to achieve the same results in a pinch. For each task you can restrict the pool of workers on which it can run.

The relevant docs for this are here: http://distributed.readthedocs.io/en/latest/locality.html#user-control
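As a rough sketch of how worker restrictions could stand in for queues, the helper below maps queue names to worker addresses and passes them through the workers= keyword of submit. The queue table, the addresses, and the submit_to_queue helper are all made up for illustration, and a small recording stub replaces the real Client so that the snippet runs without a cluster.

```python
# Hypothetical mapping from queue names to worker addresses.
# It mirrors the Celery setup above: one worker only takes high-priority
# work, while the other three take both kinds.
QUEUE_WORKERS = {
    "high-priority": ["10.0.0.1:40000", "10.0.0.2:40000",
                      "10.0.0.3:40000", "10.0.0.4:40000"],
    "celery": ["10.0.0.2:40000", "10.0.0.3:40000", "10.0.0.4:40000"],
}


def submit_to_queue(client, queue, func, *args, **kwargs):
    """Submit func, restricted to the workers that 'subscribe' to queue."""
    return client.submit(func, *args, workers=QUEUE_WORKERS[queue], **kwargs)


class RecordingClient:
    """Stand-in for distributed.Client so the sketch runs without a cluster.

    Unlike the real Client it runs the function immediately and returns
    the value rather than a Future.
    """

    def __init__(self):
        self.calls = []

    def submit(self, func, *args, workers=None, **kwargs):
        self.calls.append((func.__name__, args, workers))
        return func(*args, **kwargs)


client = RecordingClient()
result = submit_to_queue(client, "high-priority", pow, 2, 10)
```

With a real Client the same helper would return a Future, and the scheduler would only consider the listed workers for that task.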

Retrying Tasks

Celery allows tasks to retry themselves on a failure.

@app.task(bind=True)
def send_twitter_status(self, oauth, tweet):
    try:
        twitter = Twitter(oauth)
        twitter.update_status(tweet)
    except (Twitter.FailWhaleError, Twitter.LoginError) as exc:
        raise self.retry(exc=exc)

# Example from http://docs.celeryproject.org/en/latest/userguide/tasks.html#retrying

Sadly Dask currently has no support for this (see open issue). All functions are considered pure and final. If a task errs the exception is considered to be the true result. This could change though; it has been requested a couple of times now.

Until then users need to implement retry logic within the function (which isn’t a terrible idea regardless).

def send_twitter_status(oauth, tweet, n_retries=5):
    # Plain function for Dask: no Celery decorator or bound self needed
    for i in range(n_retries):
        try:
            twitter = Twitter(oauth)
            twitter.update_status(tweet)
            return
        except (Twitter.FailWhaleError, Twitter.LoginError) as exc:
            if i == n_retries - 1:
                raise  # give up after the final attempt
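The same idea can be factored out into a reusable wrapper, so that retry logic doesn't need to be rewritten inside every function. This is only a minimal sketch: with_retries and its parameters are names invented here, not part of Dask, and the flaky function exists purely to exercise it.

```python
import functools
import time


def with_retries(func, n_retries=5, exceptions=(Exception,), delay=0.0):
    """Return a version of func that retries when it raises one of exceptions."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(n_retries):
            try:
                return func(*args, **kwargs)
            except exceptions:
                if attempt == n_retries - 1:
                    raise  # out of attempts: the error becomes the result
                time.sleep(delay)

    return wrapper


# A deliberately flaky function for demonstration: fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retries(flaky, n_retries=5, exceptions=(ConnectionError,))()
```

One would then hand the wrapped function to submit in place of the original, for example c.submit(with_retries(send_twitter_status), oauth, tweet).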

Rate Limiting

Celery lets you specify rate limits on tasks, presumably to help you avoid getting blocked from hammering external APIs.

@app.task(rate_limit='1000/h')
def query_external_api(...):
    ...

Dask definitely has nothing built in for this, nor is it planned. However, this could be done externally to Dask fairly easily. For example, Dask supports mapping functions over arbitrary Python Queues. If you send in a queue then all current and future elements in that queue will be mapped over. You could easily handle rate limiting in Pure Python on the client side by rate limiting your input queues. The low latency and overhead of Dask makes it fairly easy to manage logic like this on the client-side. It’s not as convenient, but it’s still straightforward.

>>> from queue import Queue

>>> q = Queue()

>>> out = c.map(query_external_api, q)
>>> type(out)
Queue
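To make the client-side approach a bit more concrete, here is one way the input queue might be rate limited in pure Python. The feed_rate_limited function and the limit of 100 per second are assumptions for illustration. In practice the feeder would probably run in a background thread while the queue is handed to c.map as above.

```python
import time
from queue import Queue


def feed_rate_limited(items, q, max_per_second):
    """Put each item on q, spacing the puts roughly 1/max_per_second apart."""
    interval = 1.0 / max_per_second
    next_slot = time.monotonic()
    for item in items:
        wait = next_slot - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        q.put(item)
        # Schedule the next put one interval after this one, never earlier.
        next_slot = max(next_slot, time.monotonic()) + interval


q = Queue()
start = time.monotonic()
feed_rate_limited(range(5), q, max_per_second=100)  # hypothetical limit
elapsed = time.monotonic() - start
```

Because downstream tasks only see elements as they arrive in the queue, throttling the producer throttles the calls to the external API as well.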

Final Thoughts

Based on this very shallow exploration of Celery, I’ll foolishly claim that Dask can handle Celery workloads, if you’re not diving into deep API. However all of that deep API is actually really important. Celery evolved in this domain and developed tons of features that solve problems that arise over and over again. This history saves users an enormous amount of time. Dask evolved in a very different space and has developed a very different set of tricks. Many of Dask’s tricks are general enough that they can solve Celery problems with a small bit of effort, but there’s still that extra step. I’m seeing people applying that effort to problems now and I think it’ll be interesting to see what comes out of it.

Going through the Celery API was a good experience for me personally. I think that there are some good concepts from Celery that can inform future Dask development.

Dask Distributed Release 1.13.0

2016-09-12T00:00:00+00:00

I’m pleased to announce a release of Dask’s distributed scheduler, dask.distributed, version 1.13.0.

conda install dask distributed -c conda-forge
or
pip install dask distributed --upgrade

The last few months have seen a number of important user-facing features:

Executor is renamed to Client
Workers can spill excess data to disk when they run out of memory
The Client.compute and Client.persist methods for dealing with dask collections (like dask.dataframe or dask.delayed) gain the ability to restrict sub-components of the computation to different parts of the cluster with a workers= keyword argument.
IPython kernels can be deployed on the worker and schedulers for interactive debugging.
The Bokeh web interface has gained new plots and improved the visual styling of old ones.

Additionally there are beta features in current development. These features are available now, but may change without warning in future versions. Experimentation and feedback by users comfortable with living on the bleeding edge is most welcome:

Clients can publish named datasets on the scheduler to share between them
Tasks can launch other tasks
Workers can restart themselves in new software environments provided by the user

There have also been significant internal changes. Other than increased performance these changes should not be directly apparent.

The scheduler was refactored to a more state-machine like architecture. Doc page
Short-lived connections are now managed by a connection pool
Work stealing has changed and grown more responsive: Doc page
General resilience improvements

The rest of this post will contain very brief explanations of the topics above. Some of these topics may become blogposts of their own at some point. Until then I encourage people to look at the distributed scheduler’s documentation which is separate from dask’s normal documentation and so may contain new information for some readers (Google Analytics reports about 5-10x the readership on http://dask.readthedocs.org than on http://distributed.readthedocs.org).

Rename Executor to Client

http://distributed.readthedocs.io/en/latest/api.html

The term Executor was originally chosen to coincide with the concurrent.futures Executor interface, which is what defines the behavior for the .submit, .map, .result methods and Future object used as the primary interface.

Unfortunately, this is the same term used by projects like Spark and Mesos for “the low-level thing that executes tasks on each of the workers” causing significant confusion when communicating with other communities or for transitioning users.

In response we rename Executor to a somewhat more generic term, Client, to designate its role as the thing users interact with to control their computations.

>>> from distributed import Executor  # Old
>>> e = Executor()                    # Old

>>> from distributed import Client    # New
>>> c = Client()                      # New

Executor remains an alias for Client and will continue to be valid for some time, but there may be some backwards incompatible changes for internal use of executor= keywords within methods. Newer examples and materials will all use the term Client.
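For readers who haven't used the concurrent.futures interface mentioned above, here is what submit, map, and result look like with the standard library's own thread pool. This snippet involves no Dask at all; it only shows the interface that the Client mirrors. The inc function is just a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    return x + 1


with ThreadPoolExecutor(max_workers=4) as executor:
    future = executor.submit(inc, 1)          # returns a Future immediately
    single = future.result()                  # block until the value is ready
    many = list(executor.map(inc, range(5)))  # apply inc to each input
```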

Workers Spill Excess Data to Disk

http://distributed.readthedocs.io/en/latest/worker.html#spill-excess-data-to-disk

When workers get close to running out of memory they can send excess data to disk. This is not on by default and instead requires adding the --memory-limit=auto option to dask-worker.

dask-worker scheduler:8786                      # Old
dask-worker scheduler:8786 --memory-limit=auto  # New

This will eventually become the default (and is now when using LocalCluster) but we’d like to see how things progress and phase it in slowly.

Generally this feature should improve robustness and allow the solution of larger problems on smaller clusters, although with a performance cost. Dask’s policies to reduce memory use through clever scheduling remain in place, so in the common case you should never need this feature, but it’s nice to have as a failsafe.
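To give a feel for the idea, here is a toy store that keeps values in memory and pickles the oldest ones to disk once a byte limit is exceeded. This is emphatically not the worker's implementation; the SpillingStore class, its eviction order, and the size estimate are all assumptions made for illustration.

```python
import os
import pickle
import sys
import tempfile
from collections import OrderedDict


class SpillingStore:
    """Toy key-value store: keeps values in memory, spills the oldest to disk."""

    def __init__(self, memory_limit, sizeof=sys.getsizeof):
        self.memory_limit = memory_limit      # bytes allowed in memory
        self.sizeof = sizeof                  # how to estimate a value's size
        self.directory = tempfile.mkdtemp()   # where spilled values go
        self.fast = OrderedDict()             # key -> value, oldest first
        self.slow = {}                        # key -> path of pickled value
        self.n_spilled = 0                    # used to build unique filenames

    def __setitem__(self, key, value):
        self.slow.pop(key, None)              # a new value replaces a spilled one
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Spill the oldest values until we are back under the limit.
        while self._memory_used() > self.memory_limit and len(self.fast) > 1:
            old_key, old_value = self.fast.popitem(last=False)
            path = os.path.join(self.directory, "%d.pkl" % self.n_spilled)
            self.n_spilled += 1
            with open(path, "wb") as f:
                pickle.dump(old_value, f)
            self.slow[old_key] = path

    def __getitem__(self, key):
        if key in self.fast:
            return self.fast[key]
        with open(self.slow[key], "rb") as f:  # slower path: read from disk
            return pickle.load(f)

    def _memory_used(self):
        return sum(self.sizeof(v) for v in self.fast.values())


store = SpillingStore(memory_limit=2500, sizeof=len)
for i in range(4):
    store[i] = bytes(1000)   # four 1000-byte values, only two fit in memory
```

Reads of spilled keys still succeed, just more slowly, which is the robustness-for-performance trade described above.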

Enable restriction of valid workers for compute and persist methods

http://distributed.readthedocs.io/en/latest/locality.html#user-control

Expert users of the distributed scheduler will be aware of the ability to restrict certain tasks to run only on certain computers. This tends to be useful when dealing with GPUs or with special databases or instruments only available on some machines.

Previously this option was available only on the submit, map, and scatter methods, forcing people to use the more immediate interface. Now the dask collection interface functions compute and persist support this keyword as well.

IPython Integration

http://distributed.readthedocs.io/en/latest/ipython.html

You can start IPython kernels on the workers or scheduler and then access them directly using either IPython magics or the QTConsole. This tends to be valuable when things go wrong and you want to interactively debug on the worker nodes themselves.

Start IPython on the Scheduler

>>> client.start_ipython_scheduler()  # Start IPython kernel on the scheduler
>>> %scheduler scheduler.processing   # Use IPython magics to inspect scheduler
{'127.0.0.1:3595': ['inc-1', 'inc-2'],
 '127.0.0.1:53589': ['inc-2', 'add-5']}

Start IPython on the Workers

>>> info = client.start_ipython_workers()  # Start IPython kernels on all workers
>>> list(info)
['127.0.0.1:3595', '127.0.0.1:53589']
>>> %remote info['127.0.0.1:3595'] worker.active  # Use IPython magics
{'inc-1', 'inc-2'}

Bokeh Interface

http://distributed.readthedocs.io/en/latest/web.html

The Bokeh web interface to the cluster continues to evolve both by improving existing plots and by adding new plots and new pages.

For example the progress bars have become more compact and shrink down dynamically to respond to additional bars.

And we’ve added in extra tables and plots to monitor workers, such as their memory use and current backlog of tasks.

Experimental Features

The features described below are experimental and may change without warning. Please do not depend on them in stable code.

Publish Datasets

http://distributed.readthedocs.io/en/latest/publish.html

You can now save collections on the scheduler, allowing you to come back to the same computations later or allow collaborators to see and work off of your results. This can be useful in the following cases:

There is a dataset from which you frequently base all computations, and you want that dataset always in memory and easy to access without having to recompute it each time you start work, even if you disconnect.
You want to send results to a colleague working on the same Dask cluster and have them get immediate access to your computations without having to send them a script and without them having to repeat the work on the cluster.

Example: Client One

from dask.distributed import Client
client = Client('scheduler-address:8786')

import dask.dataframe as dd
df = dd.read_csv('s3://my-bucket/*.csv')
df2 = df[df.balance < 0]
df2 = client.persist(df2)

>>> df2.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

client.publish_dataset(accounts=df2)

Example: Client Two

>>> from dask.distributed import Client
>>> client = Client('scheduler-address:8786')

>>> client.list_datasets()
['accounts']

>>> df = client.get_dataset('accounts')
>>> df.head()
      name  balance
0    Alice     -100
1      Bob     -200
2  Charlie     -300
3   Dennis     -400
4    Edith     -500

Launch Tasks from tasks

http://distributed.readthedocs.io/en/latest/task-launch.html

You can now submit tasks to the cluster that themselves submit more tasks. This allows the submission of highly dynamic workloads that can shape themselves depending on future computed values without ever checking back in with the original client.

This is accomplished by starting new local Clients within the task that can interact with the scheduler.

def func():
    from distributed import local_client
    with local_client() as c2:
        future = c2.submit(...)

c = Client(...)
future = c.submit(func)

There are a few straightforward use cases for this, like iterative algorithms with stopping criteria, but also many novel use cases including streaming and monitoring systems.

Restart Workers in Redeployable Python Environments

You can now zip up and distribute full Conda environments, and ask dask-workers to restart themselves, live, in that environment. This involves the following:

Create a conda environment locally (or any redeployable directory including a python executable)
Zip up that environment and use the existing dask.distributed network to copy it to all of the workers
Shut down all of the workers and restart them within the new environment

This helps users to experiment with different software environments with a much faster turnaround time (typically tens of seconds) than asking IT to install libraries or building and deploying Docker containers (which is also a fine solution). Note that the typical solution of uploading individual python scripts or egg files has been around for a while; see the API docs for upload_file.
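The zip-and-unpack step can be sketched with the standard library alone. This leaves out the parts Dask actually handles (copying the archive over the network and restarting the workers), and the helper names and the placeholder directory are invented for illustration.

```python
import os
import shutil
import tempfile


def pack_environment(env_dir, out_dir):
    """Zip env_dir into out_dir and return the path of the archive."""
    name = os.path.basename(os.path.normpath(env_dir))
    return shutil.make_archive(os.path.join(out_dir, name), "zip", root_dir=env_dir)


def unpack_environment(archive, dest_dir):
    """Extract the archive on the receiving side and return dest_dir."""
    shutil.unpack_archive(archive, dest_dir)
    return dest_dir


# Demonstration with a placeholder directory standing in for a real environment.
env_dir = os.path.join(tempfile.mkdtemp(), "my-env")
os.makedirs(os.path.join(env_dir, "bin"))
with open(os.path.join(env_dir, "bin", "python"), "w") as f:
    f.write("placeholder for the python executable\n")

archive = pack_environment(env_dir, tempfile.mkdtemp())
restored = unpack_environment(archive, tempfile.mkdtemp())
```

One caveat with this naive version: extracting a zip this way does not restore executable permission bits, so a real deployment would need to handle that separately.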

Acknowledgements

Since version 1.12.0 on August 18th the following people have contributed commits to the dask/distributed repository

Dave Hirschfeld
dsidi
Jim Crist
Joseph Crail
Loïc Estève
Martin Durant
Matthew Rocklin
Min RK
Scott Sievert

Dask for Institutions

2016-08-16T00:00:00+00:00

This work is supported by Continuum Analytics

Institutions use software differently than individuals. Over the last few months I’ve had dozens of conversations about using Dask within larger organizations like universities, research labs, private companies, and non-profit learning systems. This post provides a very coarse summary of those conversations and extracts common questions. I’ll then try to answer those questions.

Note: some of this post will be necessarily vague at points. Some companies prefer privacy. All details here are either in public Dask issues or have come up with enough institutions (say at least five) that I’m comfortable listing the problem here.

Common story

Institution X, a university/research lab/company/… has many scientists/analysts/modelers who develop models and analyze data with Python, the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code. These models/data sometimes grow to be large enough to need a moderately large amount of parallel computing.

Fortunately, Institution X has an in-house cluster acquired for exactly this purpose of accelerating modeling and analysis of large computations and datasets. Users can submit jobs to the cluster using a job scheduler like SGE/LSF/Mesos/Other.

However the cluster is still under-utilized and the users are still asking for help with parallel computing. Either users aren’t comfortable using the SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic workloads, or the interaction times aren’t good enough for the interactive use that users appreciate.

There was an internal effort to build a more complex/interactive/Pythonic system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and definitely isn’t something that Institution X wants to pursue. It turned out to be a harder problem than expected to design/build/maintain such a system in-house. They’d love to find an open source solution that was well featured and maintained by a community.

The Dask.distributed scheduler looks like it’s 90% of the system that Institution X needs. However there are a few open questions:

How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job scheduler?
How can we grow and shrink the cluster dynamically based on use?
How do users manage software environments on the workers?
How secure is the distributed scheduler?
Dask is resilient to worker failure, how about scheduler failure?
What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?
How do we handle multiple concurrent users and priorities?
How does this compare with Spark?

So for the rest of this post I’m going to answer these questions. As usual, few of the answers will be of the form “Yes Dask can solve all of your problems.” These are open questions, not the questions that were easy to answer. We’ll get into what’s possible today and how we might solve these problems in the future.

How do we integrate dask.distributed with SGE/LSF/Mesos/Other?

It’s not difficult to deploy dask.distributed at scale within an existing cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already a researcher within the institution doing this manually by running dask-scheduler on some static node in the cluster and launching dask-worker a few hundred times with their job scheduler and a small job script.

The goal now is how to formalize this process for the individual version of SGE/LSF/Mesos/Other used within the institution while also developing and maintaining a standard Pythonic interface so that all of these tools can be maintained cheaply by Dask developers into the foreseeable future. In some cases Institution X is happy to pay for the development of a convenient “start dask on my job scheduler” tool, but they are less excited about paying to maintain it forever.

We want Python users to be able to say something like the following:

from dask.distributed import Executor, SGECluster

c = SGECluster(nworkers=200, **options)
e = Executor(c)

… and have this same interface be standardized across different job schedulers.

How can we grow and shrink the cluster dynamically based on use?

Alternatively, we could have a single dask.distributed deployment running 24/7 that scales itself up and down dynamically based on current load. Again, this is entirely possible today if you want to do it manually (you can add and remove workers on the fly) but we should add some signals to the scheduler like the following:

“I’m under duress, please add workers”
“I’ve been idling for a while, please reclaim workers”

and connect these signals to a manager that talks to the job scheduler. This removes an element of control from the users and places it in the hands of a policy that IT can tune to play more nicely with their other services on the same network.
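As a purely hypothetical sketch of what such a policy might look like, the function below turns a few scheduler measurements into a number of workers to add or reclaim. None of these names or thresholds exist in Dask; they are the kind of knobs IT might tune.

```python
def scaling_decision(n_workers, n_queued_tasks, idle_seconds,
                     tasks_per_worker=10, idle_timeout=300,
                     min_workers=1, max_workers=200):
    """Return how many workers to add (positive) or reclaim (negative)."""
    if n_queued_tasks > tasks_per_worker * n_workers:
        # "I'm under duress, please add workers"
        wanted = -(-n_queued_tasks // tasks_per_worker)  # ceiling division
        return min(wanted, max_workers) - n_workers
    if n_queued_tasks == 0 and idle_seconds > idle_timeout:
        # "I've been idling for a while, please reclaim workers"
        return min_workers - n_workers
    return 0


grow = scaling_decision(n_workers=10, n_queued_tasks=250, idle_seconds=0)
shrink = scaling_decision(n_workers=10, n_queued_tasks=0, idle_seconds=600)
steady = scaling_decision(n_workers=10, n_queued_tasks=50, idle_seconds=0)
```

A manager process would then translate the returned number into requests to SGE/LSF/Mesos/Other.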

How do users manage software environments on the workers?

Today Dask assumes that all users and workers share the exact same software environment. There are some small tools to send updated .py and .egg files to the workers but that’s it.

Generally Dask trusts that the full software environment will be handled by something else. This might be a network file system (NFS) mount on traditional cluster setups, or it might be handled by moving docker or conda environments around by some other tool like knit for YARN deployments or something more custom. For example Continuum sells proprietary software that does this.

Getting the standard software environment setup generally isn’t such a big deal for institutions. They typically have some system in place to handle this already. Where things become interesting is when users want to use drastically different environments from the system environment, like using Python 2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may also want to change the software environment many times in a single session.

The best solution I can think of here is to pass around fully downloaded conda environments using the dask.distributed network (it’s good at moving large binary blobs throughout the network) and then teaching the dask-workers to bootstrap themselves within this environment. We should be able to tear everything down and restart things within a small number of seconds. This requires some work; first to make relocatable conda binaries (which is usually fine but is not always fool-proof due to links) and then to help the dask-workers learn to bootstrap themselves.

Somewhat related, Hussain Sultan of Capital One recently contributed a dask-submit command to run scripts on the cluster: http://distributed.readthedocs.io/en/latest/submitting-applications.html

How secure is the distributed scheduler?

Dask.distributed is incredibly insecure. It allows anyone with network access to the scheduler to execute arbitrary code in an unprotected environment. Data is sent in the clear. Any malicious actor can both steal your secrets and then cripple your cluster.

This is entirely the norm however. Security is usually handled by other services that manage computational frameworks like Dask.

For example we might rely on Docker to isolate workers from destroying their surrounding environment and rely on network access controls to protect data access.

Because Dask runs on Tornado, a serious networking library and web framework, there are some things we can do easily like enabling SSL, authentication, etc. However I hesitate to jump into providing “just a little bit of security” without going all the way for fear of providing a false sense of security. In short, I have no plans to work on this without a lot of encouragement. Even then I would strongly recommend that institutions couple Dask with tools intended for security. I believe that is common practice for distributed computational systems generally.

Dask is resilient to worker failure, how about scheduler failure?

Workers can come and go. Clients can come and go. The state in the scheduler is currently irreplaceable and no attempt is made to back it up. There are a few things you could imagine here:

Backup state and recent events to some persistent storage so that state can be recovered in case of catastrophic loss
Have a hot failover node that gets a copy of every action that the scheduler takes
Have multiple peer schedulers operate simultaneously in a way that they can pick up slack from lost peers
Have clients remember what they have submitted and resubmit when a scheduler comes back online

Option 4 is currently the most feasible and gets us most of the way there. However options 2 or 3 would probably be necessary if Dask were to ever run as critical infrastructure in a giant institution. We’re not there yet.

As of recent work spurred on by Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back and everyone will reconnect. The state for computations in flight is entirely lost but the computational infrastructure remains intact so that people can resubmit jobs without significant loss of service.
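The "clients remember and resubmit" option might look something like the following on the client side. This is a sketch under loose assumptions: RememberingClient is an invented wrapper, and a standard-library thread pool stands in for the scheduler so the snippet runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


class RememberingClient:
    """Wrap anything with a submit method and remember what was submitted."""

    def __init__(self, client):
        self.client = client
        self.submitted = []   # (func, args) pairs, in submission order

    def submit(self, func, *args):
        self.submitted.append((func, args))
        return self.client.submit(func, *args)

    def reconnect(self, new_client):
        """Point at a fresh scheduler and resubmit everything seen so far."""
        self.client = new_client
        return [new_client.submit(func, *args) for func, args in self.submitted]


# A thread pool stands in for the scheduler in this demonstration.
first = ThreadPoolExecutor(max_workers=2)
client = RememberingClient(first)
client.submit(pow, 2, 5)
client.submit(pow, 3, 2)
first.shutdown()                      # simulate the scheduler going away

second = ThreadPoolExecutor(max_workers=2)
futures = client.reconnect(second)    # resubmit against the replacement
results = [f.result() for f in futures]
second.shutdown()
```

Since tasks are treated as pure functions, replaying them against a fresh scheduler recomputes the same results, at the cost of redoing the work.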

Dask has a bit of a harder time with this topic because it offers a persistent stateful interface. This problem is much easier for distributed database projects that run ephemeral queries off of persistent storage, return the results, and then clear out state.

What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?

The short answer is no. Other than number of cores and available RAM all workers are considered equal to each other (except when the user explicitly specifies otherwise).

However this problem and problems like it have come up a lot lately. Here are a few examples of similar cases:

Multiple data centers geographically distributed around the country
Multiple racks within a single data center
Multiple workers that have GPUs that can move data between each other easily
Multiple processes on a single machine

Having some notion of hierarchical worker group membership or inter-worker preferred relationships is probably inevitable long term. As with all distributed scheduling questions the hard part isn’t deciding that this is useful, or even coming up with a sensible design, but rather figuring out how to make decisions on the sensible design that are foolproof and operate in constant time. I don’t personally see a good approach here yet but expect one to arise as more high priority use cases come in.

How do we handle multiple concurrent users and priorities?

There are several sub-questions here:

Can multiple users use Dask on my cluster at the same time?

Yes, either by spinning up separate scheduler/worker sets or by sharing the same set.

If they’re sharing the same workers then won’t they clobber each other’s data?

This is very unlikely. Dask is careful about naming tasks, so it’s very unlikely that the two users will submit conflicting computations that compute to different values but occupy the same key in memory. However if they both submit computations that overlap somewhat then the scheduler will nicely avoid recomputation. This can be very nice when you have many people doing slightly different computations on the same hardware. This works in the same way that Git works.
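The naming idea can be illustrated with a small sketch that derives a key from the function and a hash of its arguments, so that identical computations from different users land on the same key. This is not Dask's actual naming scheme; task_key and run_once are stand-ins written for this example.

```python
import hashlib
import pickle


def task_key(func, *args):
    """Derive a key from the function's name and a hash of its arguments."""
    payload = pickle.dumps((func.__module__, func.__qualname__, args))
    return "%s-%s" % (func.__name__, hashlib.sha256(payload).hexdigest())


results = {}   # shared store, playing the role of cluster memory


def run_once(func, *args):
    """Compute func(*args) only if no one has stored that key already."""
    key = task_key(func, *args)
    if key not in results:
        results[key] = func(*args)
    return key


alice_key = run_once(sum, (1, 2, 3))   # first user computes the value
bob_key = run_once(sum, (1, 2, 3))     # second user reuses it
other_key = run_once(sum, (1, 2, 4))   # different inputs, different key
```

Same inputs give the same key and are computed once; different inputs give different keys and so never overwrite each other.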

If they’re sharing the same workers then won’t they clobber each other’s resources?

Yes, this is definitely possible. If you’re concerned about this then you should give everyone their own scheduler/workers (which is easy and standard practice). There is not currently much user management built into Dask.

How does this compare with Spark?

At an institutional level Spark seems to primarily target ETL + Database-like computations. While Dask modules like Dask.bag and Dask.dataframe can happily play in this space this doesn’t seem to be the focus of recent conversations.

Recent conversations are almost entirely around supporting interactive custom parallelism (lots of small tasks with complex dependencies between them) rather than the big Map->Filter->Groupby->Join abstractions you often find in a database or Spark. That’s not to say that these operations aren’t hugely important; there is a lot of selection bias here. The people I talk to are people for whom Spark/Databases are clearly not an appropriate fit. They are tackling problems that are way more complex, more heterogeneous, and with a broader variety of users.

I usually describe this situation with an analogy comparing “Big data” systems to human transportation mechanisms in a city. Here we go:

A Database is like a train: it goes between a set of well defined points with great efficiency, speed, and predictability. These are popular and profitable routes that many people travel between (e.g. business analytics). You do have to get from home to the train station on your own (ETL), but once you’re in the database/train you’re quite comfortable.
Spark is like an automobile: it takes you door-to-door from your home to your destination with a single tool. While this may not be as fast as the train for the long-distance portion, it can be extremely convenient to do ETL, Database work, and some machine learning all from the comfort of a single system.
Dask is like an all-terrain-vehicle: it takes you out of town on rough ground that hasn’t been properly explored before. This is a good match for the Python community, which typically does a lot of exploration into new approaches. You can also drive your ATV around town and you’ll be just fine, but if you want to do thousands of SQL queries then you should probably invest in a proper database or in Spark.

Again, there is a lot of selection bias here. If what you want is a database then you should probably get a database. Dask is not a database.

This is also wildly over-simplifying things. Databases like Oracle have lots of ETL and analytics tools, Spark is known to go off road, etc. I obviously have a bias towards Dask. You really should never trust an author of a project to give a fair and unbiased view of the capabilities of the tools in the surrounding landscape.

Conclusion

That’s a rough sketch of current conversations and open problems for “How Dask might evolve to support institutional use cases.” It’s really quite surprising just how prevalent this story is among the full spectrum from universities to hedge funds.

The problems listed above are by no means halting adoption. I’m not listing the 100 or so questions that are answered with “yes, that’s already supported quite well”. Right now I’m seeing Dask being adopted by individuals and small groups within various institutions. Those individuals and small groups are pushing that interest up the stack. It’s still several months before any 1000+ person organization adopts Dask as infrastructure, but the speed at which momentum is building is quite encouraging.

I’d also like to thank the several nameless people who exercise Dask on various infrastructures at various scales on interesting problems and have reported serious bugs. These people don’t show up on the GitHub issue tracker but their utility in flushing out bugs is invaluable.

As interest in Dask grows it’s interesting to see how it will evolve. Culturally Dask has managed to simultaneously cater to both the open science crowd as well as the private-sector crowd. The project gets both financial support and open source contributions from each side. So far there hasn’t been any conflict of interest (everyone is pushing in roughly the same direction) which has been a really fruitful experience for all involved I think.

Dask and Scikit-Learn -- Model Parallelism

2016-07-12T00:00:00+00:00

This post was written by Jim Crist. The original post lives at http://jcrist.github.io/dask-sklearn-part-1.html (with better styling)

This is the first of a series of posts discussing some recent experiments combining dask and scikit-learn. A small (and extremely alpha) library has been built up from these experiments, and can be found here.

Before we start, I would like to make the following caveats:

I am not a machine learning expert. Do not consider this a guide on how to do machine learning; the usage of scikit-learn below is probably naive.
All of the code discussed here is in flux, and shouldn’t be considered stable or robust. That said, if you know something about machine learning and want to help out, I’d be more than happy to receive issues or pull requests :).

There are several ways of parallelizing algorithms in machine learning. Some algorithms can be made to be data-parallel (either across features or across samples). In this post we’ll look instead at model-parallelism (use same data across different models), and dive into a daskified implementation of GridSearchCV.

Many machine learning algorithms have hyperparameters which can be tuned to improve the performance of the resulting estimator. A grid search is one way of optimizing these parameters — it works by doing a parameter sweep across a cartesian product of a subset of these parameters (the “grid”), and then choosing the best resulting estimator. Since this is fitting many independent estimators across the same set of data, it can be fairly easily parallelized.

Grid search with scikit-learn

In scikit-learn, a grid search is performed using the GridSearchCV class, and can (optionally) be automatically parallelized using joblib.

This is best illustrated with an example. First we’ll make an example dataset for doing classification against:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000,
                           n_features=500,
                           n_classes=2,
                           n_redundant=250,
                           random_state=42)

To solve this classification problem, we’ll create a pipeline of a PCA and a LogisticRegression:

from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline

logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca),
                       ('logistic', logistic)])

Both of these classes take several hyperparameters; we’ll do a grid search across only a few of them:

#Parameters of pipelines can be set using ‘__’ separated parameter names:
grid = dict(pca__n_components=[50, 100, 250],
            logistic__C=[1e-4, 1.0, 1e4],
            logistic__penalty=['l1', 'l2'])
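
To get a feel for how much work this grid implies, here is a minimal sketch that expands it into explicit parameter combinations using only the standard library. The variable names and the fold count of 3 (scikit-learn's default at the time, as noted later in this post) are assumptions made for illustration; this is not how scikit-learn enumerates the grid internally.

```python
from itertools import product

grid = dict(pca__n_components=[50, 100, 250],
            logistic__C=[1e-4, 1.0, 1e4],
            logistic__penalty=['l1', 'l2'])

# Sort the names so the ordering of combinations is deterministic
names = sorted(grid)

# Cartesian product of all the parameter values: one dict per combination
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

n_folds = 3                              # assumed number of cross-validation folds
n_fits = len(combinations) * n_folds     # one pipeline fit per (combination, fold)
```

With three values for two parameters and two values for the third, this gives 18 combinations, or 54 pipeline fits once each is repeated over three folds.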

Finally, we can create an instance of GridSearchCV, and perform the grid search. The parameter n_jobs=-1 tells joblib to use as many processes as I have cores (8).

>>> from sklearn.grid_search import GridSearchCV
>>> estimator = GridSearchCV(pipe, grid, n_jobs=-1)
>>> %time estimator.fit(X, y)
CPU times: user 5.3 s, sys: 243 ms, total: 5.54 s
Wall time: 21.6 s

What happened here was:

An estimator was created for each parameter combination and test-train set (scikit-learn’s grid search also does cross validation across 3-folds by default).
Each estimator was fit on its corresponding set of training data
Each estimator was then scored on its corresponding set of testing data
The best set of parameters was chosen based on these scores
A new estimator was then fit on all of the data, using the best parameters

The corresponding best score, parameters, and estimator can all be found as attributes on the resulting object:

>>> estimator.best_score_
0.89290000000000003

>>> estimator.best_params_
{'logistic__C': 0.0001, 'logistic__penalty': 'l2', 'pca__n_components': 50}

>>> estimator.best_estimator_
Pipeline(steps=[('pca', PCA(copy=True, n_components=50, whiten=False)), ('logistic', LogisticRegression(C=0.0001, class_weight=None, dual=False,
        fit_intercept=True, intercept_scaling=1, max_iter=100,
        multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
        solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])

Grid search with dask-learn

Here we’ll repeat the same fit using dask-learn. I’ve tried to match the scikit-learn interface as much as possible, although not everything is implemented. Here the only thing that really changes is the GridSearchCV import. We don’t need the n_jobs keyword, as this will be parallelized across all cores by default.

>>> from dklearn.grid_search import GridSearchCV as DaskGridSearchCV
>>> destimator = DaskGridSearchCV(pipe, grid)
>>> %time destimator.fit(X, y)

CPU times: user 16.3 s, sys: 1.89 s, total: 18.2 s
Wall time: 5.63 s

As before, the best score, parameters, and estimator can all be found as attributes on the object. Here we’ll just show that they’re equivalent:

>>> destimator.best_score_ == estimator.best_score_
True

>>> destimator.best_params_ == estimator.best_params_
True

>>> destimator.best_estimator_
Pipeline(steps=[('pca', PCA(copy=True, n_components=50, whiten=False)), ('logistic', LogisticRegression(C=0.0001, class_weight=None, dual=False,
        fit_intercept=True, intercept_scaling=1, max_iter=100,
        multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
        solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])

Why is the dask version faster?

If you look at the times above, you’ll note that the dask version was ~4X faster than the scikit-learn version. This is not because we have optimized any of the pieces of the Pipeline, or that there’s a significant amount of overhead to joblib (on the contrary, joblib does some pretty amazing things, and I had to construct a contrived example to beat it this badly). The reason is simply that the dask version is doing less work.

This may be best explained in pseudocode. The scikit-learn version of the above (in serial) looks something like:

for X_train, X_test, y_train, y_test in cv:
    for n in grid['pca__n_components']:
        for C in grid['logistic__C']:
            for penalty in grid['logistic__penalty']:
                # Create and fit a PCA on the input data
                pca = PCA(n_components=n).fit(X_train, y_train)
                # Transform both the train and test data
                X_train2 = pca.transform(X_train)
                X_test2 = pca.transform(X_test)
                # Create and fit a LogisticRegression on the transformed data
                logistic = LogisticRegression(C=C, penalty=penalty)
                logistic.fit(X_train2, y_train)
                # Score the total pipeline
                score = logistic.score(X_test2, y_test)
                # Save the score and parameters
                scores_and_params.append((score, n, C, penalty))

# Find the best set of parameters (for some definition of best)
find_best_parameters(scores_and_params)

This is looping through a cartesian product of the cross-validation sets and all the parameter combinations, and then creating and fitting a new estimator for each combination. While embarrassingly parallel, this can also result in repeated work, as earlier stages in the pipeline are refit multiple times on the same parameter + data combinations.

In contrast, the dask version hashes all inputs (forming a sort of Merkle DAG), resulting in the intermediate results being shared. Keeping with the pseudocode above, the dask version might look like:

for X_train, X_test, y_train, y_test in cv:
    for n in grid['pca__n_components']:
        # Create and fit a PCA on the input data
        pca = PCA(n_components=n).fit(X_train, y_train)
        # Transform both the train and test data
        X_train2 = pca.transform(X_train)
        X_test2 = pca.transform(X_test)
        for C in grid['logistic__C']:
            for penalty in grid['logistic__penalty']:
                # Create and fit a LogisticRegression on the transformed data
                logistic = LogisticRegression(C=C, penalty=penalty)
                logistic.fit(X_train2, y_train)
                # Score the total pipeline
                score = logistic.score(X_test2, y_test)
                # Save the score and parameters
                scores_and_params.append((score, n, C, penalty))

# Find the best set of parameters (for some definition of best)
find_best_parameters(scores_and_params)
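
The effect of hashing inputs can be sketched without dask at all. The toy below names every task by a hash of its inputs and stores tasks in a dict, so identical PCA fits collapse into a single entry. The `token` helper, the fold labels, and the task tuples are all hypothetical stand-ins; dask's real tokenization and graph format are more involved than this.

```python
import hashlib
from itertools import product

folds = ['fold-0', 'fold-1', 'fold-2']   # stand-ins for the three CV splits
grid = dict(pca__n_components=[50, 100, 250],
            logistic__C=[1e-4, 1.0, 1e4],
            logistic__penalty=['l1', 'l2'])

def token(*args):
    """Deterministic key built by hashing a task's name and inputs."""
    return hashlib.sha1(repr(args).encode()).hexdigest()

naive_pca_fits = 0   # how many PCA fits the plain nested loop would run
tasks = {}           # key -> task description; identical tasks share one key

for fold, n, C, penalty in product(folds,
                                   grid['pca__n_components'],
                                   grid['logistic__C'],
                                   grid['logistic__penalty']):
    naive_pca_fits += 1
    # The PCA task depends only on the fold and n, so its key repeats
    pca_key = token('fit-pca', fold, n)
    tasks[pca_key] = ('fit-pca', fold, n)
    # The logistic task depends on the PCA result plus its own parameters
    logistic_key = token('fit-logistic', pca_key, C, penalty)
    tasks[logistic_key] = ('fit-logistic', pca_key, C, penalty)

shared_pca_fits = sum(1 for task in tasks.values() if task[0] == 'fit-pca')
logistic_fits = sum(1 for task in tasks.values() if task[0] == 'fit-logistic')
```

Under these assumptions the naive loop runs 54 PCA fits while the deduplicated graph holds only 9, with the 54 logistic regression fits unchanged. Since PCA is the expensive step in this pipeline, that is where the savings come from.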

This can still be parallelized, but in a less straightforward manner - the graph is a bit more complicated than just a simple map-reduce pattern. Thankfully the dask schedulers are well equipped to handle arbitrary graph topologies. Below is a GIF showing how the dask scheduler (the threaded scheduler specifically) executed the grid search performed above. Each rectangle represents data, and each circle represents a task. Each is categorized by color:

Red means actively taking up resources. These are tasks executing in a thread, or intermediate results occupying memory
Blue means finished or released. These are already finished tasks, or data that’s been released from memory because it’s no longer needed

Looking at the trace, a few things stand out:

We do a good job sharing intermediates. Each step in a pipeline is only fit once given the same parameters/data, resulting in some intermediates having many dependent tasks.
The scheduler does a decent job of quickly finishing up tasks required to release data. This doesn’t matter as much here (none of the intermediates take up much memory), but for other workloads this is very useful. See Matt Rocklin’s excellent blogpost here for more discussion on this.

Distributed grid search using dask-learn

The schedulers used in dask are configurable. The default (used above) is the threaded scheduler, but we can just as easily swap it out for the distributed scheduler. Here I’ve just spun up two local workers to demonstrate, but this works equally well across multiple machines.

>>> from distributed import Executor

>>> # Create an Executor, and set it as the default scheduler
>>> exc = Executor('10.0.0.3:8786', set_as_default=True)
>>> exc
<Executor: scheduler="10.0.0.3:8786" processes=2 cores=8>

>>> %time destimator.fit(X, y)
CPU times: user 1.69 s, sys: 433 ms, total: 2.12 s
Wall time: 7.66 s

>>> (destimator.best_score_ == estimator.best_score_ and
...  destimator.best_params_ == estimator.best_params_)
True

Note that this is slightly slower than the threaded execution, so it doesn’t make sense for this workload, but for others it might.

What worked well

The code for doing this is quite short. There’s also an implementation of RandomizedSearchCV, which is only a few extra lines (hooray for good class hierarchies!). Instead of working with dask graphs directly, both implementations use dask.delayed wherever possible, which also makes the code easy to read.
Due to the internal hashing used in dask (which is extensible!), duplicate computations are avoided.
Since the graphs are separated from the scheduler, this works both locally and distributed with only a few extra lines.

Caveats and what could be better

The scikit-learn api makes use of mutation (est.fit(X, y) mutates est), while dask collections are mostly immutable. After playing around with a few different ideas, I settled on dask-learn estimators being immutable (except for grid-search, more on this in a bit). This made the code easier to reason about, but does mean that you need to do est = est.fit(X, y) when working with dask-learn estimators.
GridSearchCV posed a different problem. Due to the refit keyword, the implementation can’t be done in a single pass over the data. This means that we can’t build a single graph describing both the grid search and the refit, which prevents it from being done lazily. I debated removing this keyword, but decided in the end to make fit execute immediately. This means that there’s a bit of a disconnect between GridSearchCV and the other classes in the library, which I don’t like. On the other hand, it does mean that this version of GridSearchCV could be a drop-in for the scikit-learn one.
The approach presented here is nice, but is really only beneficial when there’s duplicate work to be avoided, and that duplicate work is expensive. Repeating the above with only a single estimator (instead of a pipeline) results in performance identical to (or slightly worse than) joblib. Similarly, if the repeated steps are cheap the difference in performance is much smaller (try the above using SelectKBest instead of PCA).
The ability to swap easily from local to distributed execution is nice, but distributed also contains a joblib frontend that can do this just as easily.
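
To make the first caveat above concrete, here is a toy illustration of the immutable style, where fit returns a new fitted object rather than mutating the existing one. `MeanEstimator` is a made-up class for this sketch and is not part of dask-learn or scikit-learn.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class MeanEstimator:
    """Toy immutable estimator that always predicts the training mean."""
    mean_: Optional[float] = None

    def fit(self, X, y):
        # Return a new, fitted copy instead of mutating self
        return replace(self, mean_=sum(y) / len(y))

    def predict(self, X):
        return [self.mean_ for _ in X]

est = MeanEstimator()
fitted = est.fit([[0], [1], [2]], [1.0, 2.0, 3.0])   # est itself is unchanged
```

Here `est.mean_` is still None after the call, which is why the `est = est.fit(X, y)` idiom is needed with this design.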

Help

I am not a machine learning expert. Is any of this useful? Do you have suggestions for improvements (or better yet PRs for improvements :))? Please feel free to reach out in the comments below, or on github.

This work is supported by Continuum Analytics and the XDATA program as part of the Blaze Project.

Ad Hoc Distributed Random Forests

2016-04-20T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

A screencast version of this post is available here: https://www.youtube.com/watch?v=FkPlEqB8AnE

Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with Scikit Learn to train and run a distributed random forest on distributed tabular NYC Taxi data.

Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily.

Motivation

In the past few posts we analyzed data on a cluster with Dask collections such as bags, dataframes, and arrays.

Often our computations don’t fit neatly into the bag, dataframe, or array abstractions. In these cases we want the flexibility of normal code with for loops, but still with the computational power of a cluster. With the dask.distributed task interface, we achieve something close to this.

Application: Naive Distributed Random Forest Algorithm

As a motivating application we build a random forest algorithm from the ground up using the single-machine Scikit Learn library, and dask.distributed’s ability to quickly submit individual tasks to run on the cluster. Our algorithm will look like the following:

Pull data from some external source (S3) into several dataframes on the cluster
For each dataframe, create and train one RandomForestClassifier
Scatter single testing dataframe to all machines
For each RandomForestClassifier predict output on test dataframe
Aggregate independent predictions from each classifier together by a majority vote. To avoid bringing too much data to any one machine, perform this majority vote as a tree reduction.

Data: NYC Taxi 2015

As in our blogpost on distributed dataframes we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and 60GB in RAM.

We predict the number of passengers in each cab given the other numeric columns like pickup and destination location, fare breakdown, distance, etc.

We do this first on a small bit of data on a single machine and then on the entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4 cores, 15GB RAM each).

Disclaimer and Spoiler Alert: I am not an expert in machine learning. Our algorithm will perform very poorly. If you’re excited about machine learning you can stop reading here. However, if you’re interested in how to build distributed algorithms with Dask then you may want to read on, especially if you happen to know enough machine learning to improve upon my naive solution.

API: submit, map, gather

We use a small number of dask.distributed functions to build our computation:

futures = executor.scatter(data)                     # scatter data
future = executor.submit(function, *args, **kwargs)  # submit single task
futures = executor.map(function, sequence)           # submit many tasks
results = executor.gather(futures)                   # gather results
executor.replicate(futures, n=number_of_replications)

In particular, functions like executor.submit(function, *args) let us send individual functions out to our cluster thousands of times a second. Because these functions consume their own results we can create complex workflows that stay entirely on the cluster and trust the distributed scheduler to move data around intelligently.
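
For readers without a cluster handy, the same submit-then-collect pattern can be tried with the standard library’s concurrent.futures, which uses a similar vocabulary. This is only a local stand-in and not dask.distributed itself: `inc` and `add` are made-up functions, and unlike distributed futures, these futures have to be resolved with .result() before their values can be fed to another task.

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

def add(x, y):
    return x + y

with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(add, 1, 2)                      # submit a single task
    futures = [pool.submit(inc, i) for i in range(5)]    # submit many tasks
    single = future.result()                             # block for one result
    results = [f.result() for f in futures]              # collect them all
```

The appeal of the distributed version is that intermediate values never need to come back to the local process the way they do here.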

Load Pandas from S3

First we load data from Amazon S3. We use the s3.read_csv(..., collection=False) function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We get back a list of Future objects that refer to these remote dataframes. The use of collection=False gives us this list of futures rather than a single cohesive Dask.dataframe object.

from distributed import Executor, s3
e = Executor('52.91.1.177:8786')

dfs = s3.read_csv('dask-data/nyc-taxi/2015',
                  parse_dates=['tpep_pickup_datetime',
                               'tpep_dropoff_datetime'],
                  collection=False)
dfs = e.compute(dfs)

Each of these is a lightweight Future pointing to a pandas.DataFrame on the cluster.

>>> dfs[:5]
[<Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b>,
 <Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a>,
 <Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da>,
 <Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04>,
 <Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142>]

If we’re willing to wait a bit then we can pull data from any future back to our local process using the .result() method. We don’t want to do this too much though, because data transfer can be expensive and we can’t hold the entire dataset in the memory of a single machine. Here we just bring back one of the dataframes:

>>> df = dfs[0].result()
>>> df.head()

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	RateCodeID	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount
0	2	2015-01-15 19:05:39	2015-01-15 19:23:42	1	1.59	-73.993896	40.750111	1	N	-73.974785	40.750618	1	12.0	1.0	0.5	3.25	0.3	17.05
1	1	2015-01-10 20:33:38	2015-01-10 20:53:28	1	3.30	-74.001648	40.724243	1	N	-73.994415	40.759109	1	14.5	0.5	0.5	2.00	0.3	17.80
2	1	2015-01-10 20:33:38	2015-01-10 20:43:41	1	1.80	-73.963341	40.802788	1	N	-73.951820	40.824413	2	9.5	0.5	0.5	0.00	0.3	10.80
3	1	2015-01-10 20:33:39	2015-01-10 20:35:31	1	0.50	-74.009087	40.713818	1	N	-74.004326	40.719986	2	3.5	0.5	0.5	0.00	0.3	4.80
4	1	2015-01-10 20:33:39	2015-01-10 20:52:58	1	3.00	-73.971176	40.762428	1	N	-74.004181	40.742653	2	15.0	0.5	0.5	0.00	0.3	16.30

Train on a single machine

To start, let’s go through the standard Scikit Learn fit/predict/score cycle with this small bit of data on a single machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

df_train, df_test = train_test_split(df)

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude',
           'dropoff_longitude', 'dropoff_latitude', 'payment_type',
           'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)

This builds a RandomForestClassifier with four decision trees and then trains it against the numeric columns in the data, trying to predict the passenger_count column. It takes around 10 seconds to train on a single core. We now see how well we do on the holdout testing data:

>>> est.score(df_test[columns], df_test.passenger_count)
0.65808188654721012

This 65% accuracy is actually pretty poor. About 70% of the rides in NYC have a single passenger, so the model of “always guess one” would out-perform our fancy random forest.

>>> from sklearn.metrics import accuracy_score
>>> import numpy as np
>>> accuracy_score(df_test.passenger_count,
...                np.ones_like(df_test.passenger_count))
0.70669390028780987

This is where my ignorance in machine learning really kills us. There is likely a simple way to improve this. However, because I’m more interested in showing how to build distributed computations with Dask than in actually doing machine learning I’m going to go ahead with this naive approach. Spoiler alert: we’re going to do a lot of computation and still not beat the “always guess one” strategy.

Fit across the cluster with executor.map

First we build a function that does just what we did before: it builds a random forest and then trains it on a dataframe.

def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est

Second we call this function on all of our training dataframes on the cluster using the standard e.map(function, sequence) function. This sends out many small tasks for the cluster to run. We use all but the last dataframe for training data and hold out the last dataframe for testing. There are more principled ways to do this, but again we’re going to charge ahead here.

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)

This takes around two minutes to train on all of the 177 dataframes and now we have 177 independent estimators, each capable of guessing how many passengers a particular ride had. There is relatively little overhead in this computation.

Predict on testing data

Recall that we kept separate a future, test, that points to a Pandas dataframe on the cluster that was not used to train any of our 177 estimators. We’re going to replicate this dataframe across all workers on the cluster and then ask each estimator to predict the number of passengers for each ride in this dataset.

e.replicate([test], n=48)

def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]

Here we used the executor.submit(function, *args, **kwargs) function in a list comprehension to individually launch many tasks. The scheduler determines when and where to run these tasks for optimal computation time and minimal data transfer. As with all functions, this returns futures that we can use to collect data if we want in the future.

Developer’s note: we explicitly replicate here in order to take advantage of efficient tree-broadcasting algorithms. This is purely a performance consideration. Everything would have worked fine without this, but the explicit broadcast turns a 30s communication+computation into a 2s communication+computation.

Aggregate predictions by majority vote

For each estimator we now have an independent prediction of the passenger counts for all of the rides in our test data. In other words for each ride we have 177 different opinions on how many passengers were in the cab. By averaging these opinions together we hope to achieve a more accurate consensus opinion.

For example, consider the first four prediction arrays:

>>> a_few_predictions = e.gather(predictions[:4])  # remote futures -> local arrays
>>> a_few_predictions
[array([1, 2, 1, ..., 2, 2, 1]),
 array([1, 1, 1, ..., 1, 1, 1]),
 array([2, 1, 1, ..., 1, 1, 1]),
 array([1, 1, 1, ..., 1, 1, 1])]

For the first ride/column we see that three of the four predictions are for a single passenger while one prediction disagrees and is for two passengers. We create a consensus opinion by taking the mode of the stacked arrays:

from scipy.stats import mode
import numpy as np

def mymode(*arrays):
    array = np.stack(arrays, axis=0)
    return mode(array)[0][0]

>>> mymode(*a_few_predictions)
array([1, 1, 1, ..., 1, 1, 1])

And so when we average these four prediction arrays together we see that the majority opinion of one passenger dominates for all of the six rides visible here.
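
The same column-wise vote can be written in plain Python, which may make the logic easier to follow. This is a sketch on made-up toy data (the real prediction arrays are far longer), and breaking ties toward the smallest value is a choice made here for determinism.

```python
from collections import Counter

def majority_vote(*predictions):
    """Column-wise mode of several equal-length prediction sequences."""
    consensus = []
    for votes in zip(*predictions):      # one tuple of votes per ride
        counts = Counter(votes)
        top = max(counts.values())
        # Break ties toward the smallest value
        consensus.append(min(v for v, c in counts.items() if c == top))
    return consensus

# Four hypothetical estimators voting on four rides
toy_predictions = [[1, 2, 1, 2],
                   [1, 1, 1, 1],
                   [2, 1, 1, 1],
                   [1, 1, 1, 2]]
consensus = majority_vote(*toy_predictions)
```

The last ride is a two-against-two tie, which resolves to one passenger under the tie-breaking rule above.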

Tree Reduction

We could call our mymode function on all of our predictions like this:

>>> mode_prediction = e.submit(mymode, *predictions)  # this doesn't scale well

Unfortunately this would move all of our results to a single machine to compute the mode there. This might swamp that single machine.

Instead we batch our predictions into groups of size 10, average each group, and then repeat the process with the smaller set of predictions until we have only one left. This sort of multi-step reduction is called a tree reduction. We can write it up with a couple nested loops and executor.submit. This is only an approximation of the mode, but it’s a much more scalable computation. This finishes in about 1.5 seconds.

from toolz import partition_all

while len(predictions) > 1:
    predictions = [e.submit(mymode, *chunk)
                   for chunk in partition_all(10, predictions)]

result = e.gather(predictions)[0]

>>> result
array([1, 1, 1, ..., 1, 1, 1])
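
To see why the tree reduction only approximates the mode, here is a small local sketch with no cluster involved. It votes on a single ride using made-up data and a group size of 3 instead of 10. The helper names are hypothetical, and `partition_all` is reimplemented here for lists so the example has no dependencies.

```python
from collections import Counter

def mode(values):
    """Most common value, breaking ties toward the smallest."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

def partition_all(n, seq):
    """Split a list into chunks of at most n items."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def tree_mode(votes, group_size):
    """Repeatedly replace each group of votes by its mode (group_size >= 2)."""
    while len(votes) > 1:
        votes = [mode(chunk) for chunk in partition_all(group_size, votes)]
    return votes[0]

votes = [1, 1, 2, 1, 1, 2, 2, 2, 2]      # nine opinions about one ride
exact = mode(votes)                       # global majority
approx = tree_mode(votes, group_size=3)   # majority of group majorities
```

In this contrived case five of the nine votes say two passengers, so the exact mode is 2, but two of the three groups vote for one passenger and the tree reduction returns 1. With 177 estimators that mostly agree this is unlikely to matter much, but it is the trade-off being made for scalability.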

Final Score

Finally, after completing all of our work on our cluster we can see how well our distributed random forest algorithm does.

>>> accuracy_score(result, test.result().passenger_count)
0.67061974451423045

Still worse than the naive “always guess one” strategy. This just goes to show that, no matter how sophisticated your Big Data solution is, there is no substitute for common sense and a little bit of domain expertise.

What didn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

Clearly this would have benefited from more machine learning knowledge. What would have been a good approach for this problem?
I’ve been thinking a bit about memory management of replicated data on the cluster. In this exercise we specifically replicated out the test data. Everything would have worked fine without this step but it would have been much slower as every worker gathered data from the single worker that originally had the test dataframe. Replicating data is great until you start filling up distributed RAM. It will be interesting to think of policies about when to start cleaning up redundant data and when to keep it around.
Several people, both open source users and Continuum customers, have asked about a general Dask library for machine learning, something akin to Spark’s MLlib. Ideally a future Dask.learn module would leverage Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s not clear how to cleanly break up and parallelize Scikit-Learn algorithms.

Conclusion

This blogpost gives a concrete example using basic task submission with executor.map and executor.submit to build a non-trivial computation. This approach is straightforward and not restrictive. Personally this interface excites me more than collections like Dask.dataframe; there is a lot of freedom in arbitrary task submission.

Fast Message Serialization

2016-04-14T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.

This week I optimized the inter-node communication protocol used by dask.distributed. It was a fun exercise in optimization that involved several different and unexpected components. I separately had to deal with Pickle, NumPy, Tornado, MsgPack, and compression libraries.

This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.

We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.

This came about as I was preparing a demo using dask.array on a distributed cluster for a Continuum webinar. I noticed that my computations were taking much longer than expected. The Web UI quickly pointed me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks of numpy array data between them. This is very strange because I was on a 100MB/s network, and so I expected these transfers to happen in more like 0.3s than 15s.

The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be for surfacing performance issues. Thanks here go to the Bokeh developers who helped with the development of the Dask real-time Web UI.

Problem 1: Tornado’s sentinels

Dask’s networking is built off of Tornado’s TCP IOStreams.

There are two common ways to delineate messages on a socket: sentinel values that signal the end of a message, and prefixing a length before every message. Early on we tried both in Dask but found that prefixing a length before every message was slow. It turns out that this was because TCP sockets try to batch small messages to increase bandwidth. Turning this optimization off ended up being an effective and easy solution; see the TCP_NODELAY parameter.

However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.

Switching back to prefixing messages with lengths and turning off TCP’s small-message batching (by setting TCP_NODELAY) moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks go to Ben Darnell (main Tornado developer) for helping us to track this down.
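
Length-prefixed framing is simple enough to sketch with the standard library. The 8-byte big-endian header and the helper names below are assumptions made for illustration and are not Dask’s actual wire format. The reader works on any file-like stream, so an in-memory buffer stands in for a socket here.

```python
import io
import socket
import struct

HEADER = struct.Struct('!Q')   # assumed header: 8-byte unsigned length

def frame(payload):
    """Prefix a payload with its length."""
    return HEADER.pack(len(payload)) + payload

def read_exactly(stream, n):
    """Read exactly n bytes, looping because reads may return fewer."""
    chunks = []
    while n:
        chunk = stream.read(n)
        if not chunk:
            raise EOFError('stream closed mid-message')
        chunks.append(chunk)
        n -= len(chunk)
    return b''.join(chunks)

def read_frame(stream):
    """Read one length-prefixed message; no scanning for a sentinel."""
    (length,) = HEADER.unpack(read_exactly(stream, HEADER.size))
    return read_exactly(stream, length)

def disable_nagle(sock):
    """Ask TCP to send small writes immediately rather than batching them."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Two messages back to back, as they might arrive on a connection
stream = io.BytesIO(frame(b'hello') + frame(b'x' * 100000))
first = read_frame(stream)
second = read_frame(stream)
```

Because the receiver knows the length up front, it never has to re-scan buffered bytes looking for an end marker, which is the cost that hurt with sentinels on large messages.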

Problem 2: Memory Copies

A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:

socket.send(header + payload)

This code concatenates two bytestrings, header and payload before sending the result down a socket. If we cared deeply about avoiding memory copies then we might instead send these two separately:

socket.send(header)
socket.send(payload)

But who cares, right? At 5 GB/s copying memory is cheap!

Unfortunately this breaks down under either of the following conditions:

You are sloppy enough to do this multiple times
You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.

Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort.
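As a small illustration of where hidden copies come from, the sketch below contrasts concatenation with separate sends, and shows memoryview slicing as one way to hand pieces of a large payload to a socket without copying each piece. The function names and the 128kB piece size are hypothetical, and this is not a claim about which specific copies were removed in Dask.

```python
import socket

def send_with_copy(sock, header, payload):
    # header + payload allocates a new bytes object and copies both inputs
    sock.sendall(header + payload)

def send_without_copy(sock, header, payload):
    # Two sends, no intermediate concatenation
    sock.sendall(header)
    sock.sendall(payload)

def send_in_pieces(sock, payload, piece_size=128 * 1024):
    # Slicing a memoryview yields a window onto the same buffer,
    # whereas slicing a bytes object would copy each piece
    view = memoryview(payload)
    for start in range(0, len(view), piece_size):
        sock.sendall(view[start:start + piece_size])
```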

Problem 3: Unwanted Compression

Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unfortunate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow.

Now we compress with the following policy:

If the message is less than 10kB, don’t bother
Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
Compress the full payload; if it doesn’t compress well then just send along the original to spare the receiver from decompressing.

In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.
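The three-step policy above can be sketched as follows. This is an illustration only: it uses zlib from the standard library as a stand-in for LZ4 or Snappy, the 0.9 "compresses well" threshold and the evenly spaced sample positions are assumptions, and maybe_compress is a hypothetical name.

```python
import os
import zlib

def maybe_compress(payload, min_size=10_000, sample_size=10_000,
                   nsamples=5, max_ratio=0.9):
    """Return (name, data); name is None when the payload is sent as-is."""
    # 1. Small messages: don't bother
    if len(payload) < min_size:
        return None, payload

    # 2. Compress a few evenly spaced samples as a cheap check
    view = memoryview(payload)
    span = max(len(payload) - sample_size, 0)
    starts = [i * span // max(nsamples - 1, 1) for i in range(nsamples)]
    sample = b"".join(view[s:s + sample_size] for s in starts)
    if len(zlib.compress(sample)) > max_ratio * len(sample):
        return None, payload

    # 3. Compress everything, but keep the original if it didn't help,
    #    so the receiver never has to decompress for nothing
    compressed = zlib.compress(payload)
    if len(compressed) > max_ratio * len(payload):
        return None, payload
    return "zlib", compressed
```

Random bytes fail the sample check in step 2 and are returned untouched, while a payload of zeros passes both checks and comes back compressed.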

Problem 4: Cloudpickle is not as fast as Pickle

This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.

In [1]: import numpy as np

In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)

In [3]: import pickle, cloudpickle

In [4]: %time len(pickle.dumps(data, protocol=-1))
CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
Wall time: 16.9 ms
Out[4]: 10000161

In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
Wall time: 44.4 ms
Out[5]: 10000161

But it turns out that cloudpickle is using the Python implementation, while pickle itself (or cPickle in Python 2) is using the compiled C implementation. Fortunately this is easy to correct, and a quick typecheck on common large data formats in Python (NumPy and Pandas) gets us this speed boost.
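A typecheck of this kind might look like the sketch below. The names dumps and FAST_MODULES are hypothetical, and checking the top-level module of the object's type is just one possible way to recognise NumPy and Pandas objects; the fallback to plain pickle when cloudpickle is missing is only there so the sketch runs anywhere.

```python
import pickle

try:
    import cloudpickle   # third-party; handles lambdas, closures, etc.
except ImportError:      # fall back so the sketch still runs without it
    cloudpickle = None

# Assumption: packages whose objects are plain data that pickle handles well
FAST_MODULES = ("numpy", "pandas")

def dumps(obj):
    """Use the C pickle implementation for plain data, cloudpickle otherwise."""
    top_level = type(obj).__module__.split(".")[0]
    if cloudpickle is None or top_level in FAST_MODULES:
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
```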

Problem 5: Pickle is still slower than you’d expect

Pickle runs at about half the speed of memcopy. You’d expect memcopy speed from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”, so there must be an extraneous memory copy in there.

See issue 7544

Problem 6: MsgPack is bad at large bytestrings

Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).
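The idea of stripping large bytestrings out of a message can be sketched with plain dictionaries. This is a simplified illustration, not Dask's implementation: it only handles flat dicts, the "__frame__" placeholder key is made up, and extract_frames and restore_frames are hypothetical names. The small header would then go through MsgPack while the frames go to the socket directly.

```python
def extract_frames(msg, threshold=1_000_000):
    """Split a flat dict into a small header and a list of large byte frames."""
    header = {}
    frames = []
    for key, value in msg.items():
        if isinstance(value, (bytes, bytearray)) and len(value) > threshold:
            header[key] = {"__frame__": len(frames)}  # placeholder: frame index
            frames.append(value)                      # same object, no copy
        else:
            header[key] = value
    return header, frames

def restore_frames(header, frames):
    """Inverse of extract_frames: put the large bytestrings back in place."""
    msg = {}
    for key, value in header.items():
        if isinstance(value, dict) and set(value) == {"__frame__"}:
            msg[key] = frames[value["__frame__"]]
        else:
            msg[key] = value
    return msg
```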

Problem 7: Tornado makes a copy

Sockets on Windows don’t accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderately near future.

Results

We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around at the cost of three memory copies (about 1-1.5 GB/s), which is generally faster than most networks in use.

Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)

         381360 function calls (381323 primitive calls) in 1.451 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.366    0.366    0.366    0.366 {built-in method dumps}
        8    0.289    0.036    0.291    0.036 iostream.py:360(write)
    15353    0.228    0.000    0.228    0.000 {method 'join' of 'bytes' objects}
    15355    0.166    0.000    0.166    0.000 {method 'recv' of '_socket.socket' objects}
    15362    0.156    0.000    0.398    0.000 iostream.py:1510(_merge_prefix)
     7759    0.101    0.000    0.101    0.000 {method 'send' of '_socket.socket' objects}
    17/14    0.026    0.002    0.686    0.049 gen.py:990(run)
    15355    0.021    0.000    0.198    0.000 iostream.py:721(_read_to_buffer)
        8    0.018    0.002    0.203    0.025 iostream.py:876(_consume)
       91    0.017    0.000    0.335    0.004 iostream.py:827(_handle_write)
       89    0.015    0.000    0.217    0.002 iostream.py:585(_read_to_buffer_loop)
   122567    0.009    0.000    0.009    0.000 {built-in method len}
    15355    0.008    0.000    0.173    0.000 iostream.py:1010(read_from_fd)
    38369    0.004    0.000    0.004    0.000 {method 'append' of 'list' objects}
     7759    0.004    0.000    0.104    0.000 iostream.py:1023(write_to_fd)
        1    0.003    0.003    1.451    1.451 ioloop.py:746(start)

Dominant unwanted costs include the following:

400ms: Pickling the NumPy array
400ms: Bytestring handling within Tornado

After this we’re just bound by pushing bytes down a wire.
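For readers who want to produce a similar table, a profile sorted by internal time can be collected with the standard library. This is a minimal sketch; profile_call is a hypothetical helper, and the exact harness used for the profile above is not shown in this post.

```python
import cProfile
import io
import pickle
import pstats

def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" is what the report header calls "internal time"
    stats.sort_stats("tottime").print_stats(top)
    return result, out.getvalue()

result, report = profile_call(pickle.dumps, bytes(1000))
print(report)
```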

Conclusion

Writing fast code isn’t about writing any one thing particularly well; it’s about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.

Distributed Dask Arrays

2016-02-26T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we analyze weather data across a cluster using NumPy in parallel with dask.array. We focus on the following:

How to set up the distributed scheduler with a job scheduler like Sun GridEngine.
How to load NetCDF data from a network file system (NFS) into distributed RAM
How to manipulate data with dask.arrays
How to interact with distributed data using IPython widgets

This blogpost has an accompanying screencast which might be a bit more fun than this text version.

This is the third in a sequence of blogposts about dask.distributed.

We wanted to emulate the typical academic cluster setup using a job scheduler like SunGridEngine (similar to SLURM, Torque, PBS scripts and other technologies), a shared network file system, and typical binary stored arrays in NetCDF files (similar to HDF5).

To this end we used Starcluster, a quick way to set up such a cluster on EC2 with SGE and NFS, and we downloaded data from the European Centre for Medium-Range Weather Forecasts (ECMWF).

To deploy dask’s distributed scheduler with SGE we made a scheduler on the master node:

sgeadmin@master:~$ dscheduler
distributed.scheduler - INFO - Start Scheduler at:  172.31.7.88:8786

And then used the qsub command to start four dask workers, pointing to the scheduler address:

sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 1 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 2 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 3 ("dworker") has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 4 ("dworker") has been submitted

After a few seconds these workers start on various nodes in the cluster and connect to the scheduler.

Load sample data on a single machine

On the shared NFS drive we’ve downloaded several NetCDF3 files, each holding the global temperature every six hours for a single day:

>>> from glob import glob
>>> filenames = sorted(glob('*.nc3'))
>>> filenames[:5]
['2014-01-01.nc3',
 '2014-01-02.nc3',
 '2014-01-03.nc3',
 '2014-01-04.nc3',
 '2014-01-05.nc3']

We use conda to install the netCDF4 library and make a small function to read the t2m variable for “temperature at two meters elevation” from a single filename:

conda install netcdf4

import netCDF4
def load_temperature(fn):
    with netCDF4.Dataset(fn) as f:
        return f.variables['t2m'][:]

This converts a single file into a single numpy array in memory. We could call this on an individual file locally as follows:

>>> load_temperature(filenames[0])
array([[[ 253.96238624,  253.96238624,  253.96238624, ...,  253.96238624,
          253.96238624,  253.96238624],
        [ 252.80590921,  252.81070124,  252.81389593, ...,  252.79792249,
          252.80111718,  252.80271452],
          ...
>>> load_temperature(filenames[0]).shape
(4, 721, 1440)

Our dataset has dimensions of (time, latitude, longitude). Note above that each day has four time entries (measurements every six hours).

The NFS set up by Starcluster is unfortunately quite small. We were only able to fit around five months of data (136 days) in shared disk.

Load data across cluster

We want to call the load_temperature function on each of our filenames, spread across our four workers. We connect a dask Executor to our scheduler address and then map our function on our filenames:

>>> from distributed import Executor, progress
>>> e = Executor('172.31.7.88:8786')
>>> e
<Executor: scheduler=172.31.7.88:8786 workers=4 threads=32>

>>> futures = e.map(load_temperature, filenames)
>>> progress(futures)

After this completes we have several numpy arrays scattered about the memory of each of our four workers.

Coordinate with dask.array

We coordinate these many numpy arrays into a single logical dask array as follows:

>>> from distributed.collections import futures_to_dask_arrays
>>> xs = futures_to_dask_arrays(futures)  # many small dask arrays

>>> import dask.array as da
>>> x = da.concatenate(xs, axis=0)        # one large dask array, joined by time
>>> x
dask.array<concate..., shape=(544, 721, 1440), dtype=float64, chunksize=(4, 721, 1440)>

This single logical dask array is comprised of 136 numpy arrays spread across our cluster. Operations on the single dask array will trigger many operations on each of our numpy arrays.
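As a quick sanity check on these numbers, the arithmetic below works out the size of each chunk and of the full array from the shapes shown above, assuming 8 bytes per float64 element.

```python
ITEMSIZE = 8                      # bytes per float64 element
chunk_shape = (4, 721, 1440)      # one file: (time, latitude, longitude)
n_files = 136

elements_per_chunk = chunk_shape[0] * chunk_shape[1] * chunk_shape[2]
chunk_bytes = elements_per_chunk * ITEMSIZE
total_shape = (n_files * chunk_shape[0],) + chunk_shape[1:]
total_bytes = chunk_bytes * n_files

print(total_shape)                # (544, 721, 1440)
print(chunk_bytes / 1e6)          # about 33 MB per chunk
print(total_bytes / 1e9)          # about 4.5 GB in total
```

So each NumPy chunk is roughly 33 MB and the whole dataset is roughly 4.5 GB of distributed memory.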

Interact with Distributed Data

We can now interact with our dataset using standard NumPy syntax and other PyData libraries. Below we pull out a single time slice and render it to the screen with matplotlib.

from matplotlib import pyplot as plt
plt.imshow(x[100, :, :].compute(), cmap='viridis')
plt.colorbar()

In the screencast version of this post we hook this up to an IPython slider widget and scroll around time, which is fun.

Speed

We benchmark a few representative operations to look at the strengths and weaknesses of the distributed system.

Single element

This single element computation accesses a single number from a single NumPy array of our dataset. It is bound by a network roundtrip from client to scheduler, to worker, and back.

>>> %time x[0, 0, 0].compute()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 9.72 ms

Single time slice

This time slice computation pulls around 8 MB from a single NumPy array on a single worker. It is likely bound by network bandwidth.

>>> %time x[0].compute()
CPU times: user 24 ms, sys: 24 ms, total: 48 ms
Wall time: 274 ms

Mean computation

This mean computation touches every number in every NumPy array across all of our workers. Computing means is quite fast, so this is likely bound by scheduler overhead.

>>> %time x.mean().compute()
CPU times: user 88 ms, sys: 0 ns, total: 88 ms
Wall time: 422 ms

Interactive Widgets

To make these times feel more visceral we hook up these computations to IPython Widgets.

This first example looks fairly fluid. This only touches a single worker and returns a small result. It is cheap because it indexes in a way that is well aligned with how our NumPy arrays are split up by time.

from ipywidgets import interact

# A (min, max) tuple asks interact for an integer slider
@interact(time=(0, x.shape[0] - 1))
def f(time):
    return x[time, :, :].mean().compute()

This second example is less fluid because we index across our NumPy chunks. Each computation touches all of our data. It’s still not bad though and quite acceptable by today’s standards of interactive distributed data science.

@interact(lat=(0, x.shape[1] - 1))
def f(lat):
    return x[:, lat, :].mean().compute()
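The difference between the two widgets comes down to how many chunks each index touches along the time axis. The pure-Python sketch below counts them for our layout (544 time steps in chunks of 4); chunks_touched is a hypothetical helper written for this illustration, not a dask function.

```python
def chunks_touched(index, dim_size, chunk_size):
    """Return the set of chunk numbers along one axis hit by an int or slice."""
    if isinstance(index, slice):
        positions = range(*index.indices(dim_size))
    else:
        positions = [index]
    return {p // chunk_size for p in positions}

# x[time, :, :] with time=100: a single chunk along the time axis
one_time = chunks_touched(100, 544, 4)

# x[:, lat, :]: the full time axis, so every chunk is involved
all_times = chunks_touched(slice(None), 544, 4)

print(one_time, len(all_times))
```

Indexing a single time step lands in one chunk on one worker, while indexing a single latitude needs a piece of all 136 chunks.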

Normalize Data

Until now we’ve only performed simple calculations on our data, usually grabbing out means. The image of the temperature above looks unsurprising. The image is dominated by the facts that land is warmer than oceans and that the equator is warmer than the poles. No surprises there.

To make things more interesting we subtract off the mean and divide by the standard deviation over time. This will tell us how unexpectedly hot or cold a particular point was, relative to all measurements of that point over time. This gives us something like a geo-located Z-Score.

z = (x - x.mean(axis=0)) / x.std(axis=0)
z = e.persist(z)
progress(z)

time_index = 100  # assumed: any valid time index works; 100 matches the earlier plot
plt.imshow(z[time_index].compute(), cmap='RdBu_r')
plt.colorbar()

We can now see much more fine structure of the currents of the day. In the screencast version we hook this dataset up to a slider as well and inspect various times.
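To see what the normalization does on a small scale, here is the same expression applied to a tiny in-memory NumPy array. The array shape and the synthetic values are made up for illustration; the formula is the one used for z above.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic "temperatures" with shape (time, latitude, longitude)
x_small = 280 + 10 * rng.standard_normal(size=(8, 3, 5))

# Subtract the per-location mean over time, divide by the per-location std
z_small = (x_small - x_small.mean(axis=0)) / x_small.std(axis=0)

print(z_small.mean(axis=0).round(6))   # every location is centred on zero
print(z_small.std(axis=0).round(6))    # and has unit spread over time
```

After normalization every location has mean zero and standard deviation one over time, so warm land and cold poles no longer dominate the picture.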

I’ve avoided displaying GIFs of full images changing in this post to keep the size down; however, we can easily render a plot of average temperature by latitude changing over time here:

import numpy as np
# Latitude of each row: 721 rows spaced 0.25 degrees apart, from 90 down to -90
latitude = 90 - np.arange(z.shape[1]) / 4.0

@interact(time=(0, z.shape[0] - 1))
def f(time):
    plt.figure(figsize=(10, 4))
    plt.plot(latitude, z[time].mean(axis=1).compute())
    plt.ylabel("Normalized Temperature")
    plt.xlabel("Latitude (degrees)")

Conclusion

We showed how to use distributed dask.arrays on a typical academic cluster. I’ve had several conversations with different groups about this topic; it seems to be a common case. I hope that the instructions at the beginning of this post prove to be helpful to others.

It is really satisfying to me to couple interactive widgets with data on a cluster in an intuitive way. This sort of fluid interaction on larger datasets is a core problem in modern data science.

What didn’t work

As always I’ll include a section like this on what didn’t work well or what I would have done with more time:

No high-level read_netcdf function: We had to use the mid-level API of executor.map to construct our dask array. This is a bit of a pain for novice users. We should probably adapt existing high-level functions in dask.array to robustly handle the distributed data case.
Need a larger problem: Our dataset could have fit into a Macbook Pro. A larger dataset that could not have been efficiently investigated from a single machine would have really cemented the need for this technology.
Easier deployment: The solution above with qsub was straightforward but not always accessible to novice users. Additionally while SGE is common there are several other systems that are just as common. We need to think of nice ways to automate this for the user.
XArray integration: Many people use dask.array on single machines through XArray, an excellent library for the analysis of labeled nd-arrays especially common in climate science. It would be good to integrate this new distributed work into the XArray project. I suspect that doing this mostly involves handling the data ingest problem described above.
Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.

Pandas on HDFS with Dask Dataframes

2016-02-22T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these computations with dask.dataframe. A screencast version of this blogpost is available here and the previous post in this series is available here.

To start, we connect to our scheduler, import the hdfs module from the distributed library, and read our CSV data from HDFS.

>>> from distributed import Executor, hdfs, progress
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>

>>> nyc2014 = hdfs.read_csv('/nyctaxi/2014/*.csv',
...               parse_dates=['pickup_datetime', 'dropoff_datetime'],
...               skipinitialspace=True)

>>> nyc2015 = hdfs.read_csv('/nyctaxi/2015/*.csv',
...               parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

>>> nyc2014, nyc2015 = e.persist([nyc2014, nyc2015])
>>> progress(nyc2014, nyc2015)

Our data comes from the New York City Taxi and Limousine Commission which publishes all yellow cab taxi rides in NYC for various years. This is a nice model dataset for computational tabular data because it’s large enough to be annoying while also deep enough to be broadly appealing. Each year is about 25GB on disk and about 60GB in memory as a Pandas DataFrame.

HDFS breaks up our CSV files into 128MB chunks on various hard drives spread throughout the cluster. The dask.distributed workers each read the chunks of bytes local to them and call the pandas.read_csv function on these bytes, producing 391 separate Pandas DataFrame objects spread throughout the memory of our eight worker nodes. The returned objects, nyc2014 and nyc2015, are dask.dataframe objects which present a subset of the Pandas API to the user, but farm out all of the work to the many Pandas dataframes they control across the network.

If we wait for the data to load fully into memory then we can perform pandas-style analysis at interactive speeds.

>>> nyc2015.head()

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	RateCodeID	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount
0	2	2015-01-15 19:05:39	2015-01-15 19:23:42	1	1.59	-73.993896	40.750111	1	N	-73.974785	40.750618	1	12.0	1.0	0.5	3.25	0.3	17.05
1	1	2015-01-10 20:33:38	2015-01-10 20:53:28	1	3.30	-74.001648	40.724243	1	N	-73.994415	40.759109	1	14.5	0.5	0.5	2.00	0.3	17.80
2	1	2015-01-10 20:33:38	2015-01-10 20:43:41	1	1.80	-73.963341	40.802788	1	N	-73.951820	40.824413	2	9.5	0.5	0.5	0.00	0.3	10.80
3	1	2015-01-10 20:33:39	2015-01-10 20:35:31	1	0.50	-74.009087	40.713818	1	N	-74.004326	40.719986	2	3.5	0.5	0.5	0.00	0.3	4.80
4	1	2015-01-10 20:33:39	2015-01-10 20:52:58	1	3.00	-73.971176	40.762428	1	N	-74.004181	40.742653	2	15.0	0.5	0.5	0.00	0.3	16.30

>>> len(nyc2014)
165114373

>>> len(nyc2015)
146112989

Interestingly it appears that the NYC cab industry has contracted a bit in the last year. There are fewer cab rides in 2015 than in 2014.

When we ask for something like the length of the full dask.dataframe we actually ask for the length of all of the hundreds of Pandas dataframes and then sum them up. This process of reaching out to all of the workers completes in around 200-300 ms, which is generally fast enough to feel snappy in an interactive session.
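The pattern behind len is a map over partitions followed by a sum. The sketch below mimics it on one machine, with plain lists standing in for the Pandas dataframes and a standard-library thread pool standing in for the cluster; it illustrates the idea and is not how dask.dataframe is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the Pandas dataframes held on the workers
partitions = [list(range(n)) for n in (3, 5, 2, 7)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # One small task per partition...
    partition_lengths = list(pool.map(len, partitions))

# ...then a single cheap reduction on the results
total = sum(partition_lengths)
print(partition_lengths, total)
```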

The dask.dataframe API looks just like the Pandas API, except that we call .compute() when we want an actual result.

>>> nyc2014.passenger_count.sum().compute()
279997507.0

>>> nyc2015.passenger_count.sum().compute()
245566747

Dask.dataframes build a plan to get your result and the distributed scheduler coordinates that plan on all of the little Pandas dataframes on the workers that make up our dataset.

Pandas for Metadata

Let’s appreciate for a moment all the work we didn’t have to do around CSV handling because Pandas magically handled it for us.

>>> nyc2015.dtypes
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount\r                  float64
dtype: object

We didn’t have to find columns or specify data-types. We didn’t have to parse each value with an int or float function as appropriate. We didn’t have to parse the datetimes, but instead just specified a parse_dates= keyword. The CSV parsing happened about as quickly as can be expected for this format, clocking in at a network total of a bit under 1 GB/s.

Pandas is well loved because it removes all of these little hurdles from the life of the analyst. If we tried to reinvent a new “Big-Data-Frame” we would have to reimplement all of the work already well done inside of Pandas. Instead, dask.dataframe just coordinates and reuses the code within the Pandas library. It is successful largely due to work from core Pandas developers, notably Masaaki Horikoshi (@sinhrks), who have done tremendous work to align the API precisely with the Pandas core library.

Analyze Tips and Payment Types

In an effort to demonstrate the abilities of dask.dataframe we ask a simple question of our data, “how do New Yorkers tip?”. The 2015 NYCTaxi data is quite good about breaking down the total cost of each ride into the fare amount, tip amount, and various taxes and fees. In particular this lets us measure the percentage that each rider decided to pay in tip.

>>> nyc2015[['fare_amount', 'tip_amount', 'payment_type']].head()

	fare_amount	tip_amount	payment_type
0	12.0	3.25	1
1	14.5	2.00	1
2	9.5	0.00	2
3	3.5	0.00	2
4	15.0	0.00	2

In the first two lines we see evidence supporting the 15-20% tip standard common in the US. The following three lines interestingly show zero tip. Judging only by these first five lines (a very small sample) we see a strong correlation here with the payment type. We analyze this a bit more by counting occurrences in the payment_type column both for the full dataset, and filtered by zero tip:

>>> %time nyc2015.payment_type.value_counts().compute()
CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 558 ms

1    91574644
2    53864648
3      503070
4      170599
5          28
Name: payment_type, dtype: int64

>>> %time nyc2015[nyc2015.tip_amount == 0].payment_type.value_counts().compute()
CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 1.69 s

2    53862557
1     3365668
3      502025
4      170234
5          26
Name: payment_type, dtype: int64

We find that almost all zero-tip rides correspond to payment type 2, and that almost all payment type 2 rides don’t tip. My un-scientific hypothesis here is that payment type 2 corresponds to cash fares and that we’re observing a tendency of drivers not to record cash tips. However we would need more domain knowledge about our data to actually make this claim with any degree of authority.
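The two claims can be checked directly from the counts printed above. The dictionaries below simply copy those value_counts outputs.

```python
# Rides per payment type, copied from the two value_counts() results above
all_rides = {1: 91574644, 2: 53864648, 3: 503070, 4: 170599, 5: 28}
zero_tip = {1: 3365668, 2: 53862557, 3: 502025, 4: 170234, 5: 26}

# Share of zero-tip rides that used payment type 2
share_of_zero_tip = zero_tip[2] / sum(zero_tip.values())

# Fraction of rides of each payment type that recorded no tip
zero_tip_rate = {k: zero_tip[k] / all_rides[k] for k in all_rides}

print(round(share_of_zero_tip, 3))     # roughly 0.93
print(round(zero_tip_rate[2], 5))      # roughly 0.99996
print(round(zero_tip_rate[1], 3))      # roughly 0.037
```

About 93% of zero-tip rides are payment type 2, and all but a few thousand of the type 2 rides record no tip, compared with under 4% of type 1 rides.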

Analyze Tips Fractions

Let’s make a new column, tip_fraction, and then look at the average of this column grouped by day of week and grouped by hour of day.

First, we need to filter out bad rows, both rows with this odd payment type, and rows with zero fare (there are a surprising number of free cab rides in NYC.) Second we create a new column equal to the ratio of tip_amount / fare_amount.

>>> df = nyc2015[(nyc2015.fare_amount > 0) & (nyc2015.payment_type != 2)]
>>> df = df.assign(tip_fraction=(df.tip_amount / df.fare_amount))

Next we choose to groupby the pickup datetime column in order to see how the average tip fraction changes by day of week and by hour. The groupby and datetime handling of Pandas makes these operations trivial.

>>> dayofweek = df.groupby(df.tpep_pickup_datetime.dt.dayofweek).tip_fraction.mean()
>>> hour = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()

>>> dayofweek, hour = e.persist([dayofweek, hour])
>>> progress(dayofweek, hour)

Grouping by day-of-week doesn’t show anything too striking to my eye. However I would like to note how generous NYC cab riders seem to be. A 23-25% tip can be quite nice:

>>> dayofweek.compute()
tpep_pickup_datetime
0    0.237510
1    0.236494
2    0.236073
3    0.246007
4    0.242081
5    0.232415
6    0.259974
Name: tip_fraction, dtype: float64

But grouping by hour shows that late night and early morning riders are more likely to tip extravagantly:

>>> hour.compute()
tpep_pickup_datetime
0     0.263602
1     0.278828
2     0.293536
3     0.276784
4     0.348649
5     0.248618
6     0.233257
7     0.216003
8     0.221508
9     0.217018
10    0.225618
11    0.231396
12    0.225186
13    0.235662
14    0.237636
15    0.228832
16    0.234086
17    0.240635
18    0.237488
19    0.272792
20    0.235866
21    0.242157
22    0.243244
23    0.244586
Name: tip_fraction, dtype: float64

We plot this with matplotlib and see a nice trough during business hours and a surge in the early morning, with an astonishing peak of nearly 35% at 4am.

Performance

Let’s dive into a few operations that run at different time scales. This gives a good understanding of the strengths and limits of the scheduler.

>>> %time nyc2015.head()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 20.9 ms

This head computation is about as fast as a film projector. You could perform this roundtrip computation between every consecutive frame of a movie; to a human eye this appears fluid. In the last post we asked about how low we could bring latency. In that post we were running computations from my laptop in California and so were bound by transcontinental latencies of 200ms. This time, because we’re operating from the cluster, we can get down to 20ms. We’re only able to be this fast because we touch only a single data element, the first partition. Things change when we need to touch the entire dataset.

>>> %time len(nyc2015)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 271 ms

The length computation takes 200-300 ms. This computation takes longer because we touch every individual partition of the data, of which there are 178. The scheduler incurs about 1ms of overhead per task; add a bit of latency and you get the ~200ms total. This means that the scheduler will likely be the bottleneck whenever computations are very fast, such as is the case for computing len. Really, this is good news; it means that by improving the scheduler we can reduce these durations even further.

If you look at the groupby computations above you can add the numbers in the progress bars to show that we computed around 3000 tasks in around 7s. It looks like this computation is about half scheduler overhead and about half bound by actual computation.
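The back-of-the-envelope estimates in this section follow from the rough figure of 1ms of scheduler overhead per task. The sketch below spells out the arithmetic; the task counts and timings are the approximate ones quoted in the text, and scheduler_overhead is a hypothetical helper.

```python
OVERHEAD_PER_TASK = 0.001   # seconds; the "about 1ms" figure from the text

def scheduler_overhead(n_tasks):
    """Estimated scheduler time, in seconds, for a computation of n_tasks."""
    return n_tasks * OVERHEAD_PER_TASK

# len(nyc2015): one task per partition, 178 partitions
len_overhead = scheduler_overhead(178)

# The groupby computations: around 3000 tasks in around 7 seconds
groupby_overhead = scheduler_overhead(3000)
groupby_fraction = groupby_overhead / 7.0

print(len_overhead)                  # about 0.18 s of the ~0.27 s measured
print(round(groupby_fraction, 2))    # a bit under half of the total
```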

Conclusion

We used dask+distributed on a cluster to read CSV data from HDFS into a dask dataframe. We then used dask.dataframe, which looks identical to the Pandas dataframe, to manipulate our distributed dataset intuitively and efficiently.

We looked a bit at the performance characteristics of simple computations.

What doesn’t work

As always I’ll have a section like this that honestly says what doesn’t work well and what I would have done with more time.

Dask dataframe implements a commonly used subset of Pandas functionality, not all of it. It’s surprisingly hard to communicate the exact bounds of this subset to users. Notably, in the distributed setting we don’t have a shuffle algorithm, so groupby(...).apply(...) and some joins are not yet possible.
If you want to use threads, you’ll need Pandas 0.18.0 which, at the time of this writing, was still in release candidate stage. This Pandas release fixes some important GIL related issues.
The 1ms overhead per task limit is significant. While we can still scale out to clusters far larger than what we have here, we probably won’t be able to strongly accelerate very quick operations until we reduce this number.
We use the hdfs3 library to read data from HDFS. This library seems to work great but is new and could use more active users to flush out bug reports.

Setup and Data

You can obtain public data from the New York City Taxi and Limousine Commission here. I downloaded this onto the head node and dumped it into HDFS with commands like the following:

wget https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-{01..12}.csv
hdfs dfs -mkdir /nyctaxi
hdfs dfs -mkdir /nyctaxi/2015
hdfs dfs -put yellow*.csv /nyctaxi/2015/

The cluster was hosted on EC2 and was comprised of nine m3.2xlarges with 8 cores and 30GB of RAM each. Eight of these nodes were used as workers; they used processes for parallelism, not threads.

Introducing Dask distributed

2016-02-17T00:00:00+00:00

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We analyze JSON data on a cluster using pure Python projects.

Dask, a Python library for parallel computing, now works on clusters. During the past few months I and others have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across 10s to 100s of nodes, and extends a subset of PyData to distributed computing. Over the next few weeks I and others will write about this system. Please note that dask+distributed is developing quickly and so the API is likely to shift around a bit.

Today we start simple with the typical cluster computing problem, parsing JSON records, filtering, and counting events using dask.bag and the new distributed scheduler. We’ll dive into more advanced problems in future posts.

A video version of this blogpost is available here.

GitHub releases data dumps of their public event stream as gzip-compressed, line-delimited JSON. This data is too large to fit comfortably into memory, even on a sizable workstation. We could stream it from disk but, due to the compression and JSON encoding, this takes a while and so bogs down interactive use. For an interactive experience with data like this we need a distributed cluster.

Setup and Data

We provision nine m3.2xlarge nodes on EC2. These have eight cores and 30GB of RAM each. On this cluster we provision one scheduler and nine workers (see setup docs). (More on launching in later posts.) We have five months of data, from 2015-01-01 to 2015-05-31 on the githubarchive-data bucket in S3. This data is publicly available if you want to play with it on EC2. You can download the full dataset at https://www.githubarchive.org/ .

The first record looks like the following:

 {'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
   'gravatar_id': '',
   'id': 9152315,
   'login': 'davidjhulse',
   'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
   'commits': [{'author': {'email': 'david.hulse@live.com',
      'name': 'davidjhulse'},
     'distinct': True,
     'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
     'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
     'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
   'distinct_size': 1,
   'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
   'push_id': 536740396,
   'ref': 'refs/heads/master',
   'size': 1},
  'public': True,
  'repo': {'id': 28635890,
   'name': 'davidjhulse/davesbingrewardsbot',
   'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'}

So we have a large dataset on S3 and a moderate sized play cluster on EC2, which has access to S3 data at about 100MB/s per node. We’re ready to play.

Play

We start an ipython interpreter on our local laptop and connect to the dask scheduler running on the cluster. For the purposes of timing, the cluster is on the East Coast while the local machine is in California on commercial broadband internet.

>>> from distributed import Executor, s3
>>> e = Executor('54.173.84.107:8786')
>>> e
<Executor: scheduler=54.173.84.107:8786 workers=72 threads=72>

Our seventy-two worker processes come from nine workers with eight processes each. We chose processes rather than threads for this task because computations will be bound by the GIL. We will change this to threads in later examples.

We start by loading a single month of data into distributed memory.

import json
text = s3.read_text('githubarchive-data', '2015-01', compression='gzip')
records = text.map(json.loads)
records = e.persist(records)

The data lives in S3 in hourly files as gzip-compressed, line-delimited JSON. The s3.read_text and text.map functions produce dask.bag objects which track our operations in a lazily built task graph. When we ask the executor to persist this collection we ship those tasks off to the scheduler to run on all of the workers in parallel. The persist function gives us back another dask.bag pointing to these remotely running results. It returns immediately, and the computation happens asynchronously on the cluster in the background. We regain control of our interpreter right away while the cluster hums along.
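The "returns immediately, runs in the background" behaviour can be illustrated with a standard-library analogy. This sketch uses `concurrent.futures` on one machine and is only an analogy, not dask's implementation; the partitions and the `parse_partition` helper are hypothetical stand-ins for hourly files and the `json.loads` mapping.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for hourly files: each partition is a list of JSON lines.
partitions = [
    ['{"type": "PushEvent"}', '{"type": "WatchEvent"}'],
    ['{"type": "PushEvent"}'],
]


def parse_partition(lines):
    """Parse every line of one partition."""
    return [json.loads(line) for line in lines]


pool = ThreadPoolExecutor(max_workers=2)

# submit() hands work to the pool and returns a future right away;
# the parsing itself proceeds in the background.
futures = [pool.submit(parse_partition, part) for part in partitions]

# Only when we actually need the values do we block on the results.
parsed = [f.result() for f in futures]
pool.shutdown()
```

The futures play the role of the persisted collection here: handles to results that may still be computing.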

The cluster takes around 40 seconds to download, decompress, and parse this data. If you watch the video embedded above you’ll see fancy progress-bars.

We ask for a single record. This returns in around 200ms, which is fast enough that it feels instantaneous to a human.

>>> records.take(1)
({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/9152315?',
   'gravatar_id': '',
   'id': 9152315,
   'login': 'davidjhulse',
   'url': 'https://api.github.com/users/davidjhulse'},
  'created_at': '2015-01-01T00:00:00Z',
  'id': '2489368070',
  'payload': {'before': '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
   'commits': [{'author': {'email': 'david.hulse@live.com',
      'name': 'davidjhulse'},
     'distinct': True,
     'message': 'Altered BingBot.jar\n\nFixed issue with multiple account support',
     'sha': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
     'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
   'distinct_size': 1,
   'head': 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
   'push_id': 536740396,
   'ref': 'refs/heads/master',
   'size': 1},
  'public': True,
  'repo': {'id': 28635890,
   'name': 'davidjhulse/davesbingrewardsbot',
   'url': 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'},
  'type': 'PushEvent'},)

This particular event is a 'PushEvent'. Let’s quickly see all the kinds of events. For fun, we’ll also time the interaction:

>>> %time records.pluck('type').frequencies().compute()
CPU times: user 112 ms, sys: 0 ns, total: 112 ms
Wall time: 2.41 s

[('ReleaseEvent', 44312),
 ('MemberEvent', 69757),
 ('IssuesEvent', 693363),
 ('PublicEvent', 14614),
 ('CreateEvent', 1651300),
 ('PullRequestReviewCommentEvent', 214288),
 ('PullRequestEvent', 680879),
 ('ForkEvent', 491256),
 ('DeleteEvent', 256987),
 ('PushEvent', 7028566),
 ('IssueCommentEvent', 1322509),
 ('GollumEvent', 150861),
 ('CommitCommentEvent', 96468),
 ('WatchEvent', 1321546)]
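A frequency count like this parallelizes well because each partition can be counted independently and the small per-partition counts merged afterwards. The following is a plain-Python sketch of that idea with made-up records; it is an assumption about the general shape of the computation, not dask's actual code.

```python
from collections import Counter

# Made-up records split across two partitions.
partitions = [
    [{"type": "PushEvent"}, {"type": "WatchEvent"}, {"type": "PushEvent"}],
    [{"type": "ForkEvent"}, {"type": "PushEvent"}],
]


def partition_frequencies(records):
    """Count event types within a single partition."""
    return Counter(record["type"] for record in records)


# Step 1: count each partition on its own (this part can run in parallel).
partial_counts = [partition_frequencies(part) for part in partitions]

# Step 2: merge the small per-partition counters into one result.
total = sum(partial_counts, Counter())
```

Only the small counters need to move between machines, never the records themselves.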

And we compute the total count of all events for this month.

>>> %time records.count().compute()
CPU times: user 134 ms, sys: 133 µs, total: 134 ms
Wall time: 1.49 s

14036706

We see that it takes a few seconds to walk through the data (and incur all of the scheduling overhead). The scheduler adds about a millisecond of overhead per task, and there are about 1000 partitions/files here (the GitHub data is split by hour and there are roughly 730 hours in a month), so most of the cost here is overhead.
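As a rough back-of-the-envelope check of that claim, we can multiply the per-task overhead by the number of tasks and compare with the wall time of the count above. This assumes one task per partition, which is a simplification; the real graph may contain more.

```python
# Figures quoted in the text above.
per_task_overhead_s = 0.001   # about one millisecond per task
n_tasks = 1000                # about 1000 partitions, assuming one task each
observed_wall_s = 1.49        # wall time of records.count().compute()

estimated_overhead_s = per_task_overhead_s * n_tasks
overhead_fraction = estimated_overhead_s / observed_wall_s

print(f"estimated overhead: {estimated_overhead_s:.2f} s")
print(f"share of wall time: {overhead_fraction:.0%}")
```

Under these assumptions roughly two thirds of the wall time is scheduling overhead rather than computation.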

Investigate Jupyter

We investigate the activities of Project Jupyter. We chose this project because it’s sizable and because we understand the players involved and so can check our accuracy. This will require us to filter our data to a much smaller subset, then find popular repositories and members.

>>> jupyter = (records.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
                      .repartition(10))
>>> jupyter = e.persist(jupyter)

All records, regardless of event type, have a repository which has a name like 'organization/repository' in typical GitHub fashion. We keep only the records whose repository name starts with 'jupyter/'. Additionally, because this dataset is likely much smaller, we push all of these records into just ten partitions. This dramatically reduces scheduling overhead. The persist call hands this computation off to the scheduler and then gives us back a collection that points to the computing result. Filtering this month for Jupyter events takes about 7.5 seconds. Afterwards computations on this subset feel snappy.
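Why repartition after filtering? A filter keeps the original number of partitions even though most of them end up nearly empty. The plain-Python sketch below shows the idea on made-up data; `filter_partitions` and `repartition` are hypothetical helpers written for this illustration, and dask's own repartitioning may work differently.

```python
def filter_partitions(partitions, predicate):
    """Apply a filter partition by partition; the partition count is unchanged."""
    return [[r for r in part if predicate(r)] for part in partitions]


def repartition(partitions, npartitions):
    """Concatenate all records and split them into npartitions contiguous chunks."""
    records = [r for part in partitions for r in part]
    size, extra = divmod(len(records), npartitions)
    out, start = [], 0
    for i in range(npartitions):
        stop = start + size + (1 if i < extra else 0)
        out.append(records[start:stop])
        start = stop
    return out


# Made-up records spread over four partitions.
partitions = [
    [{"repo": {"name": "jupyter/nbviewer"}}, {"repo": {"name": "dask/dask"}}],
    [{"repo": {"name": "pandas-dev/pandas"}}],
    [{"repo": {"name": "jupyter/jupyterhub"}}, {"repo": {"name": "jupyter/tmpnb"}}],
    [],
]

filtered = filter_partitions(
    partitions, lambda d: d["repo"]["name"].startswith("jupyter/")
)
small = repartition(filtered, 2)
```

After filtering there are still four partitions, two of them empty; after repartitioning the three surviving records sit in two partitions, so later operations schedule two tasks instead of four.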

>>> %time jupyter.count().compute()
CPU times: user 5.19 ms, sys: 97 µs, total: 5.28 ms
Wall time: 199 ms

747

>>> %time jupyter.take(1)
CPU times: user 7.01 ms, sys: 259 µs, total: 7.27 ms
Wall time: 182 ms

({'actor': {'avatar_url': 'https://avatars.githubusercontent.com/u/26679?',
   'gravatar_id': '',
   'id': 26679,
   'login': 'marksteve',
   'url': 'https://api.github.com/users/marksteve'},
  'created_at': '2015-01-01T13:25:44Z',
  'id': '2489612400',
  'org': {'avatar_url': 'https://avatars.githubusercontent.com/u/7388996?',
   'gravatar_id': '',
   'id': 7388996,
   'login': 'jupyter',
   'url': 'https://api.github.com/orgs/jupyter'},
  'payload': {'action': 'started'},
  'public': True,
  'repo': {'id': 5303123,
   'name': 'jupyter/nbviewer',
   'url': 'https://api.github.com/repos/jupyter/nbviewer'},
  'type': 'WatchEvent'},)

So the first Jupyter event of the year was by 'marksteve', who decided to watch the 'nbviewer' repository on New Year’s Day.

Notice that these computations take around 200ms. I can’t get below this from my local machine, so we’re likely bound by communicating to such a remote location. A 200ms latency is not great if you’re playing a video game, but it’s decent for interactive computing.

Here are all of the Jupyter repositories touched in the month of January:

>>> %time jupyter.pluck('repo').pluck('name').distinct().compute()
CPU times: user 2.84 ms, sys: 4.03 ms, total: 6.86 ms
Wall time: 204 ms

['jupyter/dockerspawner',
 'jupyter/design',
 'jupyter/docker-demo-images',
 'jupyter/jupyterhub',
 'jupyter/configurable-http-proxy',
 'jupyter/nbshot',
 'jupyter/sudospawner',
 'jupyter/colaboratory',
 'jupyter/strata-sv-2015-tutorial',
 'jupyter/tmpnb-deploy',
 'jupyter/nature-demo',
 'jupyter/nbcache',
 'jupyter/jupyter.github.io',
 'jupyter/try.jupyter.org',
 'jupyter/jupyter-drive',
 'jupyter/tmpnb',
 'jupyter/tmpnb-redirector',
 'jupyter/nbgrader',
 'jupyter/nbindex',
 'jupyter/nbviewer',
 'jupyter/oauthenticator']

And the ten most active people in these Jupyter repositories:

>>> %time (jupyter.pluck('actor')
                  .pluck('login')
                  .frequencies()
                  .topk(10, lambda kv: kv[1])
                  .compute())
CPU times: user 8.03 ms, sys: 90 µs, total: 8.12 ms
Wall time: 226 ms

[('rgbkrk', 156),
 ('minrk', 87),
 ('Carreau', 87),
 ('KesterTong', 74),
 ('jhamrick', 70),
 ('bollwyvl', 25),
 ('pkt', 18),
 ('ssanderson', 13),
 ('smashwilson', 13),
 ('ellisonbg', 13)]

Nothing too surprising here if you know these folks.
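For readers unfamiliar with the frequencies-then-topk pattern, here is a single-machine equivalent using only the standard library. The logins and events are made up; `Counter` and `heapq.nlargest` stand in for the bag's `frequencies` and `topk` methods.

```python
import heapq
from collections import Counter

# Made-up events; only the actor login matters here.
events = [
    {"actor": {"login": "alice"}},
    {"actor": {"login": "bob"}},
    {"actor": {"login": "alice"}},
    {"actor": {"login": "carol"}},
    {"actor": {"login": "bob"}},
    {"actor": {"login": "alice"}},
]

# frequencies: count how often each login appears.
counts = Counter(event["actor"]["login"] for event in events)

# topk: keep the k (login, count) pairs with the largest counts.
top2 = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
```

The `key=lambda kv: kv[1]` argument plays the same role as in the dask call above: it ranks the (login, count) pairs by their count.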

Full Dataset

The full five months of data is too large to fit in memory, even for this cluster. When we represent semi-structured data like this with dynamic data structures like lists and dictionaries there is quite a bit of memory bloat. Some careful attention to efficient semi-structured storage here could save us from having to switch to such a large cluster, but that will have to be the topic of another post.

Instead, we operate efficiently on this dataset by flowing it through memory, persisting only the records we care about. The distributed dask scheduler descends from the single-machine dask scheduler, which was quite good at flowing through a computation and intelligently removing intermediate results.
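The idea of flowing data through memory and keeping only what matters can be sketched on one machine with a generator. This is an analogy under simplified assumptions, not the scheduler's mechanism; `stream_matching` and the sample lines are hypothetical.

```python
import json


def stream_matching(line_iter, predicate):
    """Parse lines one at a time and yield only the records we want to keep."""
    for line in line_iter:
        record = json.loads(line)
        if predicate(record):
            yield record  # everything else is dropped immediately


names = ["dask/dask", "jupyter/nbviewer", "numpy/numpy", "jupyter/tmpnb"]

# A generator of made-up JSON lines: the full input is never held in memory.
lines = (json.dumps({"repo": {"name": name}}) for name in names)

kept = list(
    stream_matching(lines, lambda d: d["repo"]["name"].startswith("jupyter/"))
)
```

At any moment only one parsed record plus the small list of matches is alive, which is the property that lets a dataset larger than memory pass through.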

From a user API perspective, we call persist only on the jupyter dataset, and not the full records dataset.

>>> full = (s3.read_text('githubarchive-data', '2015', compression='gzip')
              .map(json.loads))

>>> jupyter = (full.filter(lambda d: d['repo']['name'].startswith('jupyter/'))
                   .repartition(10))

>>> jupyter = e.persist(jupyter)

It takes 2m36s to download, decompress, parse, and filter five months of publicly available GitHub events down to the Jupyter events on nine m3.2xlarges.

There were about seven thousand such events.

>>> jupyter.count().compute()
7065

We find which repositories saw the most activity during that time:

>>> %time (jupyter.pluck('repo')
                  .pluck('name')
                  .frequencies()
                  .topk(20, lambda kv: kv[1])
                  .compute())
CPU times: user 6.98 ms, sys: 474 µs, total: 7.46 ms
Wall time: 219 ms

[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

We see that projects like jupyterhub were quite active during that time while, surprisingly, nbconvert saw relatively little action.

Local Data

The Jupyter data is quite small and easily fits in a single machine. Let’s bring the data to our local machine so that we can compare times:

>>> %time L = jupyter.compute()
CPU times: user 4.74 s, sys: 10.9 s, total: 15.7 s
Wall time: 30.2 s

It takes surprisingly long to download the data, but once it’s here, we can iterate far more quickly with basic Python.

>>> from toolz.curried import pluck, frequencies, topk, pipe
>>> %time pipe(L, pluck('repo'), pluck('name'), frequencies,
               dict.items, topk(20, key=lambda kv: kv[1]), list)
CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 11.5 ms

[('jupyter/jupyterhub', 1262),
 ('jupyter/nbgrader', 1235),
 ('jupyter/nbviewer', 846),
 ('jupyter/jupyter_notebook', 507),
 ('jupyter/jupyter-drive', 505),
 ('jupyter/notebook', 451),
 ('jupyter/docker-demo-images', 363),
 ('jupyter/tmpnb', 284),
 ('jupyter/jupyter_client', 162),
 ('jupyter/dockerspawner', 149),
 ('jupyter/colaboratory', 134),
 ('jupyter/jupyter_core', 127),
 ('jupyter/strata-sv-2015-tutorial', 108),
 ('jupyter/jupyter_nbconvert', 103),
 ('jupyter/configurable-http-proxy', 89),
 ('jupyter/hubpress.io', 85),
 ('jupyter/jupyter.github.io', 84),
 ('jupyter/tmpnb-deploy', 76),
 ('jupyter/nbconvert', 66),
 ('jupyter/jupyter_qtconsole', 59)]

The difference here is roughly 20x, which is a good reminder that once you no longer have a large problem, you should probably eschew distributed systems and act locally.
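The ratio comes straight from the two wall times reported above for the same top-20 query, as this small check shows.

```python
# Wall times reported above for the same top-20 repositories query.
cluster_wall_ms = 219.0   # computed remotely on the cluster
local_wall_ms = 11.5      # computed locally with toolz

speedup = cluster_wall_ms / local_wall_ms
print(f"local is about {speedup:.0f}x faster")
```

Nearly all of the 219 ms is the round trip to the remote cluster, not computation.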

Conclusion

Downloading, decompressing, parsing, filtering, and counting JSON records is the new wordcount. It’s the first problem anyone sees. Fortunately it’s both easy to solve and the common case. Woo hoo!

Here we saw that dask+distributed handle the common case decently well and with a Pure Python stack. Typically Python users rely on a JVM technology like Hadoop/Spark/Storm to distribute their computations. Here we have Python distributing Python; there are some usability gains to be had here like nice stack traces, a bit less serialization overhead, and attention to other Pythonic style choices.

Over the next few posts I intend to deviate from this common case. Most “Big Data” technologies were designed to solve typical data munging problems found in web companies or with simple database operations in mind. Python users care about these things too, but they also reach out to a wide variety of fields. In dask+distributed development we care about the common case, but also support less traditional workflows that are commonly found in the life, physical, and algorithmic sciences.

By designing to support these more extreme cases we’ve nailed some common pain points in current distributed systems. Today we’ve seen low latency and remote control; in the future we’ll see others.

What doesn’t work

I’ll have an honest section like this at the end of each upcoming post describing what doesn’t work, what still feels broken, or what I would have done differently with more time.

The imports for dask and distributed are still strange. They’re two separate codebases that play very nicely together. Unfortunately the functionality you need is sometimes in one or in the other and it’s not immediately clear to the novice user where to go. For example dask.bag, the collection we’re using for records, jupyter, etc. is in dask but the s3 module is within the distributed library. We’ll have to merge things at some point in the near-to-moderate future. Ditto for the API: there are compute methods both on the dask collections (records.compute()) and on the distributed executor (e.compute(records)) that behave slightly differently.
We lack an efficient distributed shuffle algorithm. This is very important if you want to use operations like .groupby (which you should avoid anyway). The user API here doesn’t even cleanly warn users that this is missing in the distributed case which is kind of a mess. (It works fine on a single machine.) Efficient alternatives like foldby are available.
I would have liked to run this experiment directly on the cluster to see how low we could have gone below the 200ms barrier we ran into here.
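To show why foldby avoids a shuffle where groupby does not, here is a plain-Python sketch of the idea: fold each partition into a small per-key summary, then combine the summaries. The function below is a simplified, hypothetical version written for this post; its signature is not dask's, and the records are made up.

```python
from operator import add


def foldby(key, binop, initial, partitions, combine):
    """Fold each partition by key, then combine the small per-key results."""
    partials = []
    for part in partitions:
        acc = {}
        for item in part:
            k = key(item)
            acc[k] = binop(acc.get(k, initial), item)
        partials.append(acc)

    # Only these small dictionaries need to move between workers.
    result = {}
    for acc in partials:
        for k, v in acc.items():
            result[k] = combine(result[k], v) if k in result else v
    return result


# Made-up records in two partitions.
partitions = [
    [{"repo": "jupyter/nbviewer"}, {"repo": "jupyter/tmpnb"}],
    [{"repo": "jupyter/nbviewer"}, {"repo": "jupyter/nbviewer"}],
]

events_per_repo = foldby(
    key=lambda d: d["repo"],
    binop=lambda total, _: total + 1,  # count one more event for this key
    initial=0,
    partitions=partitions,
    combine=add,                       # merge counts from different partitions
)
```

A groupby would have to bring every record with the same key onto the same machine, whereas this approach only ever transfers one small value per key per partition.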