<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posted in 2016</title>
  <updated>2026-03-05T15:05:22.510051+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/2016/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2016/12/24/dask-dev-4/</id>
    <title>Dask Development Log</title>
    <updated>2016-12-24T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
and the Data Driven Discovery Initiative from the &lt;a class="reference external" href="https://www.moore.org/"&gt;Moore
Foundation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m blogging weekly about the work done on Dask and
related projects during the previous week. This log covers work done between
2016-12-11 and 2016-12-18. Nothing here is ready for production. This
blogpost is written in haste, so refined polish should not be expected.&lt;/p&gt;
&lt;p&gt;Themes of last week:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Cleanup of load balancing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Found cause of worker lag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Initial Spark/Dask Dataframe comparisons&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benchmarks with asv&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="load-balancing-cleanup"&gt;
&lt;h1&gt;Load balancing cleanup&lt;/h1&gt;
&lt;p&gt;The last two weeks saw several disruptive changes to the scheduler and workers.
This resulted in an overall performance degradation on messy workloads when
compared to the most recent release, which stopped bleeding-edge users from
using recent dev builds. This has been resolved, and bleeding-edge git-master
is back up to the old speed and then some.&lt;/p&gt;
&lt;p&gt;As a visual aid, this is what bad (or in this case random) load balancing looks
like:&lt;/p&gt;
&lt;a href="/images/bad-work-stealing.png"&gt;
    &lt;img src="/images/bad-work-stealing.png"
         alt="bad work stealing"
         width="70%"&gt;&lt;/a&gt;
&lt;/section&gt;
&lt;section id="identified-and-removed-worker-lag"&gt;
&lt;h1&gt;Identified and removed worker lag&lt;/h1&gt;
&lt;p&gt;For a while there have been significant gaps of 100ms or more between successive
tasks in workers, especially when using Pandas. This was particularly odd
because the workers had lots of backed up work to keep them busy (thanks to the
nice load balancing from before). The culprit here was calculating the
size of intermediate results for object-dtype dataframes.&lt;/p&gt;
&lt;a href="/images/task-stream-pandas-lag.png"&gt;
    &lt;img src="/images/task-stream-pandas-lag.png"
         alt="lag between tasks"
         width="70%"&gt;&lt;/a&gt;
&lt;p&gt;Explaining this in greater depth, recall that to schedule intelligently, the
workers calculate the size in bytes of every intermediate result they produce.
Often this is quite fast, for example for numpy arrays we can just multiply the
number of elements by the dtype itemsize. However for object dtype arrays or
dataframes (which are commonly used for text) it can take a long while to
calculate an accurate result here. Now we no longer calculate an accurate
result, but instead take a fairly pessimistic guess. The gaps between tasks
shrink considerably.&lt;/p&gt;
&lt;a href="/images/task-stream-pandas-no-lag.png"&gt;
    &lt;img src="/images/task-stream-pandas-no-lag.png"
         alt="no lag between tasks"
         width="40%"&gt;&lt;/a&gt;
&lt;a href="/images/task-stream-pandas-no-lag-zoomed.png"&gt;
    &lt;img src="/images/task-stream-pandas-no-lag-zoomed.png"
         alt="no lag between tasks zoomed"
         width="40%"&gt;&lt;/a&gt;
&lt;p&gt;There is still a significant bit of lag, around 10ms long, between tasks
on these workloads (see zoomed version on the right). On other workloads we’re
able to get inter-task lag down to the tens of microseconds scale. While 10ms
may not sound like a long time, when we perform very many very short tasks this
can quickly become a bottleneck.&lt;/p&gt;
&lt;p&gt;Anyway, this change reduced shuffle overhead by a factor of two. Things are
starting to look pretty snappy for many-small-task workloads.&lt;/p&gt;
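&lt;p&gt;As a rough illustration of the trade-off described above (a sketch, not the
actual Dask implementation), the function below returns an exact size for
fixed-width NumPy dtypes and a deliberately inflated, sample-based guess for
object dtype. The name estimate_nbytes, the sampling strategy, and the
pessimism factor are all assumptions made for this example.&lt;/p&gt;

```python
import sys

import numpy as np


def estimate_nbytes(arr, sample_size=100, pessimism=2.0):
    """Estimate the memory footprint of a NumPy array in bytes.

    Hypothetical helper for illustration only; the sampling approach and
    the pessimism factor are assumptions, not Dask's actual heuristic.
    """
    if arr.dtype != object:
        # Fixed-width dtypes: number of elements times bytes per element.
        return arr.size * arr.dtype.itemsize
    if arr.size == 0:
        return 0
    # Object dtype: measuring every element is slow, so look at an evenly
    # spaced sample of roughly sample_size elements instead.
    flat = arr.ravel()
    step = max(1, flat.size // sample_size)
    sample = flat[::step]
    mean_size = sum(sys.getsizeof(value) for value in sample) / len(sample)
    # Pointer storage plus an inflated guess at the size of the payload.
    return int(arr.nbytes + pessimism * mean_size * arr.size)
```

&lt;p&gt;The cost of the object-dtype branch depends on the sample size rather than
on the length of the array, which is the property that matters when size
estimation sits between every pair of tasks.&lt;/p&gt;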
&lt;/section&gt;
&lt;section id="initial-spark-dask-dataframe-comparisons"&gt;
&lt;h1&gt;Initial Spark/Dask Dataframe Comparisons&lt;/h1&gt;
&lt;p&gt;I would like to run a small benchmark comparing Dask and Spark DataFrames. I
spent a bit of the last couple of days using Spark locally on the NYC Taxi data
and futzing with cluster deployment tools to set up Spark clusters on EC2 for
basic benchmarking. I ran across
&lt;a class="reference external" href="https://github.com/nchammas/flintrock"&gt;flintrock&lt;/a&gt;, which has been highly
recommended to me a few times.&lt;/p&gt;
&lt;p&gt;I’ve been thinking about how to do benchmarks in an unbiased way. Comparative
benchmarks are useful to have around to motivate projects to grow and learn
from each other. However, in today’s climate where open source software
developers have a vested interest, benchmarks often focus on a project’s
strengths and hide its deficiencies. Even with the best of intentions and
practices, a developer is likely to correct for deficiencies on the fly.
They’re much more able to do this for their own project than for others’.
Benchmarks end up looking more like sales documents than trustworthy research.&lt;/p&gt;
&lt;p&gt;My tentative plan is to reach out to a few Spark devs and see if we can
collaborate on a problem set and hardware before running computations and
comparing results.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="benchmarks-with-airspeed-velocity"&gt;
&lt;h1&gt;Benchmarks with airspeed velocity&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/postelrich"&gt;Rich Postelnik&lt;/a&gt; is building on work from
&lt;a class="reference external" href="https://github.com/TomAugspurger"&gt;Tom Augspurger&lt;/a&gt; to build out benchmarks for
Dask using &lt;a class="reference external" href="https://github.com/spacetelescope/asv"&gt;airspeed velocity&lt;/a&gt; at
&lt;a class="reference external" href="https://github.com/dask/dask-benchmarks"&gt;dask-benchmarks&lt;/a&gt;. Building out
benchmarks is a great way to get involved if anyone is interested.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="pre-pre-release"&gt;
&lt;h1&gt;Pre-pre-release&lt;/h1&gt;
&lt;p&gt;I intend to publish a pre-release for a 0.X.0 version bump of dask/dask and
dask/distributed sometime next week.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/12/24/dask-dev-4/"/>
    <summary>This work is supported by Continuum Analytics
the XDATA Program
and the Data Driven Discovery Initiative from the Moore
Foundation</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2016-12-24T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/12/18/dask-dev-3/</id>
    <title>Dask Development Log</title>
    <updated>2016-12-18T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
and the Data Driven Discovery Initiative from the &lt;a class="reference external" href="https://www.moore.org/"&gt;Moore
Foundation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m blogging weekly about the work done on Dask and
related projects during the previous week. This log covers work done between
2016-12-11 and 2016-12-18. Nothing here is ready for production. This
blogpost is written in haste, so refined polish should not be expected.&lt;/p&gt;
&lt;p&gt;Themes of last week:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Benchmarking new scheduler and worker on larger systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes and Google Container Engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fastparquet on S3&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="rewriting-load-balancing"&gt;
&lt;h1&gt;Rewriting load balancing&lt;/h1&gt;
&lt;p&gt;In the last two weeks we rewrote a significant fraction of the worker and
scheduler. This enables future growth, but also resulted in a loss of our load
balancing and work stealing algorithms (the old ones no longer made sense in the
context of the new system). Careful dynamic load balancing is essential to
running atypical workloads (which are surprisingly typical among Dask users) so
rebuilding this has been all-consuming this week for me personally.&lt;/p&gt;
&lt;p&gt;Briefly, Dask initially assigns tasks to workers taking into account the
expected runtime of the task, the size and location of the data that the task
needs, the duration of other tasks on every worker, and where each piece of data
sits on all of the workers. Because the number of tasks can grow into the
millions and the number of workers can grow into the thousands, Dask needs to
figure out a near-optimal placement in near-constant time, which is hard.
Furthermore, after the system runs for a while, uncertainties in our estimates
build, and we need to rebalance work from saturated workers to idle workers
relatively frequently. Load balancing intelligently and responsively is
essential to a satisfying user experience.&lt;/p&gt;
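&lt;p&gt;To make the placement criteria above concrete, here is a minimal sketch
(not the scheduler’s actual code) that picks the worker on which a task is
expected to start soonest: the worker’s backlog of queued work plus the time to
transfer any inputs it does not already hold. The function name, the argument
names, and the fixed bandwidth figure are all hypothetical. Note that this
version scans every worker, whereas the real scheduler needs something closer
to constant time.&lt;/p&gt;

```python
def choose_worker(task_deps, nbytes, who_has, occupancy, bandwidth=100e6):
    """Return the worker where the task is expected to start soonest.

    Illustrative only; every name here is an assumption for the example.

    task_deps: keys of the data the task needs
    nbytes:    mapping of data key to size in bytes
    who_has:   mapping of data key to the set of workers holding it
    occupancy: mapping of worker to seconds of work already queued on it
    bandwidth: assumed network throughput in bytes per second
    """

    def expected_start(worker):
        # Bytes that would have to be moved to this worker first.
        missing = sum(
            nbytes[dep] for dep in task_deps if worker not in who_has[dep]
        )
        return occupancy[worker] + missing / bandwidth

    # Linear scan over all workers; fine for a sketch, too slow at scale.
    return min(occupancy, key=expected_start)
```

&lt;p&gt;With a large input held only on a busy worker, the sketch keeps the task
next to its data; with a small input it prefers the idle worker instead.&lt;/p&gt;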
&lt;p&gt;We have a decently strong test suite around these behaviors, but it’s hard to
be comprehensive on performance-based metrics like this, so there has also been
a lot of benchmarking against real systems to identify new failure modes.
We’re doing what we can to create isolated tests for every failure mode that we
find to make future rewrites retain good behavior.&lt;/p&gt;
&lt;p&gt;Generally, working on the Dask distributed scheduler has taught me the
brittleness of unit tests. As we have repeatedly rewritten internals while
maintaining the same external API, our testing strategy has evolved considerably
away from fine-grained unit tests to a mixture of behavioral integration tests
and a very strict runtime validation system.&lt;/p&gt;
&lt;p&gt;Rebuilding the load balancing algorithms has been high priority for me
personally because these performance issues inhibit current power-users from
using the development version on their problems as effectively as with the
latest release. I’m looking forward to seeing load-balancing humming nicely
again so that users can return to git-master and so that I can return to
handling a broader base of issues. (Sorry to everyone I’ve been ignoring the
last couple of weeks).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="test-deployments-on-google-container-engine"&gt;
&lt;h1&gt;Test deployments on Google Container Engine&lt;/h1&gt;
&lt;p&gt;I’ve personally started switching over my development cluster from Amazon’s EC2
to Google’s Container Engine. Here are some pros and cons from my particular
perspective. Many of these probably have more to do with how I use each
particular tool rather than intrinsic limitations of the service itself.&lt;/p&gt;
&lt;p&gt;In Google’s Favor&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Native and immediate support for Kubernetes and Docker, the combination of
which allows me to more quickly and dynamically create and scale clusters
for different experiments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dynamic scaling from a single node to a hundred nodes and back ten minutes
later allows me to more easily run a much larger range of scales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I like being charged by the minute rather than by the hour, especially
given the ability to dynamically scale up&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Authentication and billing feel simpler&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In Amazon’s Favor&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;I already have tools to launch Dask on EC2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All of my data is on Amazon’s S3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I have nice data acquisition tools,
&lt;a class="reference external" href="http://s3fs.readthedocs.io/en/latest/"&gt;s3fs&lt;/a&gt;, for S3 based on boto3.
Google doesn’t seem to have a nice Python 3 library for accessing Google
Cloud Storage :(&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I’m working from Olivier Grisel’s repository
&lt;a class="reference external" href="https://github.com/ogrisel/docker-distributed"&gt;docker-distributed&lt;/a&gt;, while
updating to newer versions and trying to deviate as little as possible from a
naive deployment. My current branch is
&lt;a class="reference external" href="https://github.com/mrocklin/docker-distributed/tree/update"&gt;here&lt;/a&gt;. I hope to
have something more stable for next week.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="fastparquet-on-s3"&gt;
&lt;h1&gt;Fastparquet on S3&lt;/h1&gt;
&lt;p&gt;We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on
Friday. I was surprised that everything seemed to work out of the box. Martin
Durant, who built both fastparquet and s3fs, has done some nice work to make
sure that all of the pieces play nicely together. We ran into some performance
issues pulling bytes from S3 itself. I expect that there will be some tweaking
over the next few weeks.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/12/18/dask-dev-3/"/>
    <summary>This work is supported by Continuum Analytics
the XDATA Program
and the Data Driven Discovery Initiative from the Moore
Foundation</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2016-12-18T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/12/12/dask-dev-2/</id>
    <title>Dask Development Log</title>
    <updated>2016-12-12T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
and the Data Driven Discovery Initiative from the &lt;a class="reference external" href="https://www.moore.org/"&gt;Moore
Foundation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m blogging weekly about the work done on Dask and
related projects during the previous week. This log covers work done between
2016-12-05 and 2016-12-12. Nothing here is stable or ready for production.
This blogpost is written in haste, so refined polish should not be expected.&lt;/p&gt;
&lt;p&gt;Themes of last week:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Dask.array without known chunk sizes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Import time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fastparquet blogpost and feedback&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scheduler improvements for 1000+ worker clusters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Channels and inter-client communication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;New dependencies?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="dask-array-without-known-chunk-sizes"&gt;
&lt;h1&gt;Dask.array without known chunk sizes&lt;/h1&gt;
&lt;p&gt;Dask arrays can now work even in situations where we don’t know the exact chunk
size. This is particularly important because it allows us to convert
dask.dataframes to dask.arrays in a standard analysis cycle that includes both
data preparation and statistical or machine learning algorithms.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_records&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This work was motivated by the work of Christopher White on building scalable
solvers for problems like logistic regression and generalized linear models
over at &lt;a class="reference external" href="https://github.com/moody-marlin/dask-glm"&gt;dask-glm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As a pleasant side effect we can now also index dask.arrays with dask.arrays (a
previous limitation):&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;and mutate dask.arrays in certain cases with setitem:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Both of these are frequently requested.&lt;/p&gt;
&lt;p&gt;However, there are still holes in this implementation and many operations (like
slicing) generally don’t work on arrays without known chunk sizes. We’re
increasing capability here but blurring the lines of what is possible and what
is not possible, which used to be very clear.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/dask/pull/1838"&gt;dask/dask#1838&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/dask/pull/1840"&gt;dask/dask#1840&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="import-time"&gt;
&lt;h1&gt;Import time&lt;/h1&gt;
&lt;p&gt;Import times had been steadily climbing for a while, rising above one second at
times. These were reduced by Antoine Pitrou down to a more reasonable 300ms.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/dask/pull/1833"&gt;dask/dask#1833&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/distributed/pull/718"&gt;dask/distributed#718&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="fastparquet-blogpost-and-feedback"&gt;
&lt;h1&gt;FastParquet blogpost and feedback&lt;/h1&gt;
&lt;p&gt;Martin Durant has built a nice Python Parquet library here: &lt;a class="reference external" href="http://fastparquet.readthedocs.io/en/latest/"&gt;http://fastparquet.readthedocs.io/en/latest/&lt;/a&gt;
and released a blogpost about it last week here: &lt;a class="reference external" href="https://www.continuum.io/blog/developer-blog/introducing-fastparquet"&gt;https://www.continuum.io/blog/developer-blog/introducing-fastparquet&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since then we’ve gotten some good feedback and error reports (non-string column
names, etc.). Martin has been optimizing performance and recently added append
support.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/fastparquet/pull/39"&gt;dask/fastparquet#39&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/fastparquet/pull/43"&gt;dask/fastparquet#43&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="scheduler-optimizations-for-1000-worker-clusters"&gt;
&lt;h1&gt;Scheduler optimizations for 1000+ worker clusters&lt;/h1&gt;
&lt;p&gt;The recent refactoring of the scheduler and worker exposed new opportunities
for performance and for measurement. One of the 1000+ worker deployments here
in NYC was kind enough to volunteer some compute time to run some experiments.
It was very fun having all of the Dask/Bokeh dashboards up at once (there are
now half a dozen of these things) giving live monitoring information on a
thousand-worker deployment. It’s stunning how clearly performance issues
present themselves when you have the right monitoring system.&lt;/p&gt;
&lt;p&gt;Anyway, this led to better sequentialization when handling messages, greatly
reduced open file handle requirements, and the use of cytoolz over toolz in a
few critical areas.&lt;/p&gt;
&lt;p&gt;I intend to try this experiment again this week, now with new diagnostics. To
aid in that we’ve made it very easy to turn timings and counters automatically
into live Bokeh plots. It now takes literally one line of code to add a new
plot to these pages (left: scheduler, right: worker).&lt;/p&gt;
&lt;a href="/images/bokeh-counters.gif"&gt;
  &lt;img src="/images/bokeh-counters.gif"
       alt="Dask Bokeh counters page"
       width="100%"&gt;&lt;/a&gt;
&lt;p&gt;Already we can see that the time it takes to connect between workers is
absurdly high, in the 10ms to 100ms range, highlighting an important performance
flaw.&lt;/p&gt;
&lt;p&gt;This depends on an experimental project,
&lt;a class="reference external" href="https://github.com/jcrist/crick"&gt;crick&lt;/a&gt;, by Jim Crist that provides a fast
T-Digest implemented in C (see also &lt;a class="reference external" href="https://github.com/tdunning/t-digest"&gt;Ted Dunning’s
implementation&lt;/a&gt;).&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/jcrist/crick"&gt;jcrist/crick&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/distributed/pull/738"&gt;dask/distributed#738&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="channels-and-inter-worker-communication"&gt;
&lt;h1&gt;Channels and inter-client communication&lt;/h1&gt;
&lt;p&gt;I’m starting to experiment with mechanisms for inter-client communication of
futures. This enables both collaborative workflows (two researchers sharing
the same cluster) and also complex workflows in which tasks start other tasks
in a more streaming setting.&lt;/p&gt;
&lt;p&gt;We added a simple mechanism to share a rolling buffer of futures between
clients:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Client 1&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scheduler:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Client 2&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scheduler:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Additionally, this relatively simple mechanism was built external to the
scheduler and client, establishing a pattern we can repeat in the future for
more complex inter-client communication systems. Generally I’m on the lookout
for other ways to make the system more extensible. The range of extension
requests for the scheduler is somewhat large these days and we’d like to find
ways to keep these expansions maintainable going forward.&lt;/p&gt;
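&lt;p&gt;The “rolling buffer” idea can be sketched in a few lines of plain Python.
This is only an in-process stand-in, assuming a fixed maximum length; the class
name RollingChannel is hypothetical, and unlike the real channel it neither
talks to a scheduler nor waits for new items to arrive.&lt;/p&gt;

```python
from collections import deque


class RollingChannel:
    """Keep only the most recent items put into the channel.

    Hypothetical, single-process illustration of a rolling buffer; it is
    not the channel implementation in dask/distributed.
    """

    def __init__(self, maxlen=100):
        # A bounded deque silently drops the oldest item once it is full.
        self._items = deque(maxlen=maxlen)

    def put(self, item):
        self._items.append(item)

    def __iter__(self):
        # Iterate over a snapshot so that concurrent puts do not
        # invalidate the iterator.
        return iter(list(self._items))

    def __len__(self):
        return len(self._items)
```

&lt;p&gt;Putting five items into a channel with a maximum length of three leaves
only the last three, oldest first, which is the behavior a late-joining client
would see.&lt;/p&gt;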
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/distributed/pull/729"&gt;dask/distributed#729&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="new-dependency-sorted-collections"&gt;
&lt;h1&gt;New dependency: Sorted collections&lt;/h1&gt;
&lt;p&gt;The scheduler is now using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sortedcollections&lt;/span&gt;&lt;/code&gt; module, which is based on
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sortedcontainers&lt;/span&gt;&lt;/code&gt;, a pure-Python library offering sorted containers
(&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SortedList&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SortedSet&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ValueSortedDict&lt;/span&gt;&lt;/code&gt;, etc.) at C-extension speeds.&lt;/p&gt;
&lt;p&gt;So far I’m pretty sold on these libraries. I encourage other library
maintainers to consider them.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=7z2Ki44Vs4E"&gt;https://www.youtube.com/watch?v=7z2Ki44Vs4E&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://www.grantjenks.com/docs/sortedcontainers/introduction.html"&gt;http://www.grantjenks.com/docs/sortedcontainers/introduction.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://www.grantjenks.com/docs/sortedcollections/"&gt;http://www.grantjenks.com/docs/sortedcollections/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/12/12/dask-dev-2/"/>
    <summary>This work is supported by Continuum Analytics
the XDATA Program
and the Data Driven Discovery Initiative from the Moore
Foundation</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2016-12-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/12/05/dask-dev-1/</id>
    <title>Dask Development Log</title>
    <updated>2016-12-05T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
and the Data Driven Discovery Initiative from the &lt;a class="reference external" href="https://www.moore.org/"&gt;Moore
Foundation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Dask has been active lately due to a combination of increased adoption and
funded feature development by private companies. This increased activity
is great, however an unintended side effect is that I have spent less time
writing about development and engaging with the broader community. To address
this I hope to write one blogpost a week about general development. These will
not be particularly polished, nor will they announce ready-to-use features for
users, however they should increase transparency and hopefully better engage
the developer community.&lt;/p&gt;
&lt;p&gt;Themes of last week:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Embedded Bokeh servers for the Workers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smarter workers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An overhauled scheduler that is slightly simpler overall (thanks to the
smarter workers) but with more clever work stealing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fastparquet&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="embedded-bokeh-servers-in-dask-workers"&gt;

&lt;p&gt;The distributed scheduler’s &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/web.html"&gt;web diagnostic
page&lt;/a&gt; is one of Dask’s
more flashy features. It shows the passage of every computation on the cluster
in real time. These diagnostics are invaluable for understanding performance
both for users and for core developers.&lt;/p&gt;
&lt;p&gt;I intend to focus on worker performance soon, so I decided to attach a Bokeh
server to every worker to serve web diagnostics about that worker. To make
this easier, I also learned how to &lt;em&gt;embed&lt;/em&gt; Bokeh servers inside of other
Tornado applications. This has reduced the effort to create new visuals and
expose real time information considerably and I can now create a full live
visualization in around 30 minutes. It is now &lt;em&gt;faster&lt;/em&gt; for me to build
a new diagnostic than to grep through logs. It’s pretty useful.&lt;/p&gt;
&lt;p&gt;Here are some screenshots. Nothing too flashy, but this information is highly
valuable to me as I measure bandwidths, delays of various parts of the code,
how workers send data between each other, etc.&lt;/p&gt;
&lt;a href="/images/bokeh-worker-system.png"&gt;
  &lt;img src="/images/bokeh-worker-system.png"
       alt="Dask Bokeh Worker system page"
       width="30%"&gt;&lt;/a&gt;
&lt;a href="/images/bokeh-worker-main.png"&gt;
  &lt;img src="/images/bokeh-worker-main.png"
       alt="Dask Bokeh Worker main page"
       width="30%"&gt;&lt;/a&gt;
&lt;a href="/images/bokeh-worker-crossfilter.png"&gt;
  &lt;img src="/images/bokeh-worker-crossfilter.png"
       alt="Dask Bokeh Worker crossfilter page"
       width="30%"&gt;&lt;/a&gt;
&lt;p&gt;To be clear, these diagnostic pages aren’t polished in any way. There’s lots
missing; it’s just what I could get done in a day. Still, everyone running a
Tornado application should have an embedded Bokeh server running. They’re
great for rapidly pushing out visually rich diagnostics.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="smarter-workers-and-a-simpler-scheduler"&gt;
&lt;h1&gt;Smarter Workers and a Simpler Scheduler&lt;/h1&gt;
&lt;p&gt;Previously the scheduler knew everything and the workers were fairly
simple-minded. Now we’ve moved some of the knowledge and responsibility over
to the workers. Previously the scheduler would give just enough work to the
workers to keep them occupied. This allowed the scheduler to make better
decisions about the state of the entire cluster. By delaying committing a task
to a worker until the last moment we made sure that we were making the right
decision. However, this also means that the worker sometimes has idle
resources, particularly network bandwidth, when it could be speculatively
preparing for future work.&lt;/p&gt;
&lt;p&gt;Now we commit all ready-to-run tasks to a worker immediately and that worker
has the ability to pipeline those tasks as it sees fit. This is better locally
but slightly worse globally. To counterbalance this we’re now being much more
aggressive about work stealing and, because the workers have more information,
they can manage some of the administrative costs of work stealing themselves.
Because this isn’t bound to run on just the scheduler we can use more expensive
algorithms than when we did everything on the scheduler.&lt;/p&gt;
&lt;p&gt;There were a few motivations for this change:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Dataframe performance was bound by keeping the worker hardware fully
occupied, which we weren’t doing. I expect that these changes will
eventually yield something like a 30% speedup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Users on traditional job scheduler machines (SGE, SLURM, TORQUE) and users
who like GPUs both wanted the ability to tag tasks with specific resource
constraints like “This consumes one GPU” or “This task requires 5GB of RAM
while running” and ensure that workers would respect those constraints when
running tasks. The old workers weren’t complex enough to reason about these
constraints. With the new workers, adding this feature was trivial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By moving logic from the scheduler to the worker we’ve actually made them
both easier to reason about. This should lower barriers for contributors
to get into the core project.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
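&lt;p&gt;As a rough illustration of the second point, here is a minimal sketch of how a worker might reason about resource constraints. It is not the Dask implementation: the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ResourceWorker&lt;/span&gt;&lt;/code&gt; class, its methods, and the resource names are all hypothetical.&lt;/p&gt;

```python
class ResourceWorker:
    """Hypothetical worker that tracks abstract resources (not Dask's API)."""

    def __init__(self, resources):
        # Total capacity, e.g. {"GPU": 1, "RAM": 8e9}
        self.available = dict(resources)

    def can_run(self, required):
        # A task fits only if every required amount is currently available
        return all(self.available.get(name, 0) >= amount
                   for name, amount in required.items())

    def start(self, required):
        # Reserve resources; refuse rather than oversubscribe
        if not self.can_run(required):
            return False
        for name, amount in required.items():
            self.available[name] -= amount
        return True

    def finish(self, required):
        # Release resources when the task completes
        for name, amount in required.items():
            self.available[name] += amount


worker = ResourceWorker({"GPU": 1, "RAM": 8e9})
gpu_task = {"GPU": 1}            # "This consumes one GPU"
ram_task = {"RAM": 5e9}          # "This task requires 5GB of RAM while running"

first = worker.start(gpu_task)   # reserves the only GPU
second = worker.start(gpu_task)  # refused: no GPU left
third = worker.start(ram_task)   # different resource, so it can run
worker.finish(gpu_task)
fourth = worker.start(gpu_task)  # the GPU is free again
```

&lt;p&gt;The point of the sketch is that the bookkeeping lives on the worker, which already knows what it is running, so the scheduler does not have to track it.&lt;/p&gt;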
&lt;/section&gt;
&lt;section id="dataframe-algorithms"&gt;
&lt;h1&gt;Dataframe algorithms&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/1807"&gt;Approximate nunique&lt;/a&gt; and
multiple-output-partition groupbys landed in master last week. These arose
because some power-users had very large dataframes that were running into
scalability limits. Thanks to Mike Graham for the approximate nunique
algorithm. This has also pushed &lt;a class="reference external" href="https://github.com/pandas-dev/pandas/pull/14729"&gt;hashing
changes&lt;/a&gt; upstream to Pandas.&lt;/p&gt;
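&lt;p&gt;This post does not describe the algorithm, so the sketch below assumes a HyperLogLog-style estimator, a common choice for approximate distinct counts. It is not the code from the linked pull request; the class and function names are hypothetical and the small-range correction is omitted for brevity. It shows why hashing matters and why such sketches suit partitioned data: per-partition summaries merge with an elementwise max.&lt;/p&gt;

```python
import hashlib


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, no small-range correction)."""

    def __init__(self, p=10):
        self.p = p                      # number of index bits
        self.m = 2 ** p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item's string form
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.p
        index = h // 2 ** rest_bits     # top p bits pick a register
        rest = h % 2 ** rest_bits       # remaining bits
        # Rank = position of the first 1 bit in the remaining bits
        rank = rest_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Sketches combine with an elementwise max, so each partition can be
        # summarized independently and the summaries reduced afterwards
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in
                            zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m ** 2 / harmonic


def sketch(values):
    hll = HyperLogLog()
    for v in values:
        hll.add(v)
    return hll


# Two "partitions" with overlapping values: 10000 distinct values in total
left = sketch(range(0, 6000))
right = sketch(range(4000, 10000))
combined = left.merge(right)
whole = sketch(range(0, 10000))
approx = combined.estimate()
```

&lt;p&gt;With 1024 registers the expected relative error is a few percent, and the memory used is fixed regardless of how many rows each partition holds.&lt;/p&gt;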
&lt;/section&gt;
&lt;section id="fast-parquet"&gt;
&lt;h1&gt;Fast Parquet&lt;/h1&gt;
&lt;p&gt;Martin Durant has been working on a Parquet reader/writer for Python using
Numba. It’s pretty slick. He’s been using it on internal Continuum projects
for a little while and has seen both good performance and a very Pythonic
experience for what was previously a format that was pretty inaccessible.&lt;/p&gt;
&lt;p&gt;He’s planning to write about this in the near future so I won’t steal his
thunder. Here is a link to the documentation:
&lt;a class="reference external" href="https://fastparquet.readthedocs.io/en/latest/"&gt;fastparquet.readthedocs.io&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/12/05/dask-dev-1/"/>
    <summary>This work is supported by Continuum Analytics
the XDATA Program
and the Data Driven Discovery Initiative from the Moore
Foundation</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2016-12-05T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/09/22/cluster-deployments/</id>
    <title>Dask Cluster Deployments</title>
    <updated>2016-09-22T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
and the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
as part of the &lt;a class="reference external" href="http://blaze.pydata.org"&gt;Blaze Project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;All code in this post is experimental. It should not be relied upon. For
people looking to deploy dask.distributed on a cluster please refer instead to
the &lt;a class="reference external" href="https://distributed.readthedocs.org"&gt;documentation&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Dask is deployed today on the following systems in the wild:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;SGE&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SLURM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Torque&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Condor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LSF&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mesos&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marathon&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSH and custom scripts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;… there may be more. This is what I know of first-hand.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These systems provide users access to cluster resources and ensure that
many distributed services / users play nicely together. They’re essential for
any modern cluster deployment.&lt;/p&gt;
&lt;p&gt;The people deploying Dask on these cluster resource managers are power-users;
they know how their resource managers work and they read the &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/setup.html"&gt;documentation on
how to set up Dask
clusters&lt;/a&gt;. Generally
these users are pretty happy; however we should reduce this barrier so that
non-power-users with access to a cluster resource manager can use Dask on their
cluster just as easily.&lt;/p&gt;
&lt;p&gt;Unfortunately, there are a few challenges:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Several cluster resource managers exist, each with significant adoption.
Finite developer time stops us from supporting all of them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Policies for scaling out vary widely.
For example we might want a fixed number of workers, or we might want
workers that scale out based on current use. Different groups will want
different solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Individual cluster deployments are highly configurable. Dask needs to get
out of the way quickly and let existing technologies configure themselves.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post talks about some of these issues. It does not contain a definitive
solution.&lt;/p&gt;
&lt;section id="example-kubernetes"&gt;

&lt;p&gt;For example, both &lt;a class="reference external" href="http://ogrisel.com/"&gt;Olivier Grisel&lt;/a&gt; (INRIA, scikit-learn)
and &lt;a class="reference external" href="https://github.com/timodonnell"&gt;Tim O’Donnell&lt;/a&gt; (Mount Sinai, Hammer lab)
publish instructions on how to deploy Dask.distributed on
&lt;a class="reference external" href="http://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/ogrisel/docker-distributed"&gt;Olivier’s repository&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/hammerlab/dask-distributed-on-kubernetes/"&gt;Tim’s repository&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These instructions are well organized. They include Dockerfiles, published
images, Kubernetes config files, and instructions on how to interact with cloud
providers’ infrastructure. Olivier and Tim both obviously know what they’re
doing and care about helping others to do the same.&lt;/p&gt;
&lt;p&gt;Tim (who came second) wasn’t aware of Olivier’s solution and wrote up his own.
Tim was capable of doing this but many beginners wouldn’t be.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One solution&lt;/strong&gt; would be to include a prominent registry of solutions like
these within Dask documentation so that people can find quality references to
use as starting points. I’ve started a list of resources here:
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/547"&gt;dask/distributed #547&lt;/a&gt;. Comments
pointing to other resources would be most welcome.&lt;/p&gt;
&lt;p&gt;However, even if Tim did find Olivier’s solution I suspect he would still need
to change it. Tim has different software and scalability needs than Olivier.
This raises the question of &lt;em&gt;“What should Dask provide and what should it leave
to administrators?”&lt;/em&gt; It may be that the &lt;em&gt;best&lt;/em&gt; we can do is to support
copy-paste-edit workflows.&lt;/p&gt;
&lt;p&gt;What is Dask-specific, what is resource-manager-specific, and what needs to be
configured by hand each time?&lt;/p&gt;
&lt;/section&gt;
&lt;section id="adaptive-deployments"&gt;
&lt;h1&gt;Adaptive Deployments&lt;/h1&gt;
&lt;p&gt;In order to explore this topic of separable solutions I built a small adaptive
deployment system for Dask.distributed on
&lt;a class="reference external" href="https://mesosphere.github.io/marathon/"&gt;Marathon&lt;/a&gt;, an orchestration platform
on top of Mesos.&lt;/p&gt;
&lt;p&gt;This solution does two things:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;It scales a Dask cluster dynamically based on the current use. If there
are more tasks in the scheduler then it asks for more workers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It deploys those workers using Marathon.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To encourage replication, these two different aspects are solved in two different pieces of code with a clean API boundary.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A backend-agnostic piece for adaptivity that says when to scale workers up
and how to scale them down safely&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Marathon-specific piece that deploys or destroys dask-workers using the
Marathon HTTP API&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This combines a policy, &lt;em&gt;adaptive scaling&lt;/em&gt;, with a backend, &lt;em&gt;Marathon&lt;/em&gt;, such
that either can be replaced easily. For example we could replace the adaptive
policy with a fixed one to always keep N workers online, or we could replace
Marathon with Kubernetes or Yarn.&lt;/p&gt;
&lt;p&gt;My hope is that this demonstration encourages others to develop third party
packages. The rest of this post will be about diving into this particular
solution.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="adaptivity"&gt;
&lt;h1&gt;Adaptivity&lt;/h1&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed.deploy.Adaptive&lt;/span&gt;&lt;/code&gt; class wraps around a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Scheduler&lt;/span&gt;&lt;/code&gt; and
determines when we should scale up and by how many nodes, and when we should
scale down, specifying which idle workers to release.&lt;/p&gt;
&lt;p&gt;The current policy is fairly straightforward:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;If there are unassigned tasks or any stealable tasks and no idle workers,
or if the average memory use is over 50%, then increase the number of
workers by a fixed factor (defaults to two).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If there are idle workers and the average memory use is below 50% then
reclaim the idle workers with the least data on them (after moving data to
nearby workers) until we’re near 50%.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
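&lt;p&gt;The two rules above can be sketched as a single decision function. This is a simplified illustration, not the code in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;adaptive.py&lt;/span&gt;&lt;/code&gt;: the function name, its arguments, and the shape of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;workers&lt;/span&gt;&lt;/code&gt; mapping are assumptions, and the step that moves data off a worker before releasing it is left out.&lt;/p&gt;

```python
def recommend(workers, unassigned, stealable, memory_fraction, scale_factor=2):
    """Suggest a scaling action (hypothetical helper, not Dask's API).

    workers maps address -> {"idle": bool, "nbytes": int};
    memory_fraction is the average memory use across workers, from 0 to 1.
    """
    idle = [addr for addr, w in workers.items() if w["idle"]]
    starved = (unassigned > 0 or stealable > 0) and not idle
    if starved or memory_fraction > 0.5:
        # Rule 1: grow the cluster by a fixed factor (two by default)
        return {"scale_up": max(1, len(workers) * scale_factor)}
    if idle and 0.5 > memory_fraction:
        # Rule 2: candidates for release, least data first; a real policy
        # would stop releasing once memory use approaches 50%
        idle.sort(key=lambda addr: workers[addr]["nbytes"])
        return {"scale_down": idle}
    return {}


busy = {"a": {"idle": False, "nbytes": 100},
        "b": {"idle": False, "nbytes": 50}}
quiet = {"a": {"idle": True, "nbytes": 100},
         "b": {"idle": True, "nbytes": 10},
         "c": {"idle": False, "nbytes": 500}}

grow = recommend(busy, unassigned=3, stealable=0, memory_fraction=0.2)
shrink = recommend(quiet, unassigned=0, stealable=0, memory_fraction=0.2)
```

&lt;p&gt;Here &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;grow&lt;/span&gt;&lt;/code&gt; asks for four workers in total, while &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;shrink&lt;/span&gt;&lt;/code&gt; lists the idle workers with the least data first.&lt;/p&gt;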
&lt;p&gt;Think this policy could be improved or have other thoughts? Great. It was
easy to implement and entirely separable from the main code so you should be
able to edit it easily or create your own. The current implementation is about
80 lines
(&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/distributed/deploy/adaptive.py"&gt;source&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;However, this &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Adaptive&lt;/span&gt;&lt;/code&gt; class doesn’t actually know how to perform the
scaling. Instead it depends on being handed a separate object, with two
methods, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scale_up&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scale_down&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;MyCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scale_up&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;        Bring the total count of workers up to ``n``&lt;/span&gt;

&lt;span class="sd"&gt;        This function/coroutine should bring the total number of workers up to&lt;/span&gt;
&lt;span class="sd"&gt;        the number ``n``.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scale_down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;        Remove ``workers`` from the cluster&lt;/span&gt;

&lt;span class="sd"&gt;        Given a list of worker addresses this function should remove those&lt;/span&gt;
&lt;span class="sd"&gt;        workers from the cluster.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This cluster object contains the backend-specific bits of &lt;em&gt;how&lt;/em&gt; to scale up and
down, but none of the adaptive logic of &lt;em&gt;when&lt;/em&gt; to scale up and down. The
single-machine
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/local-cluster.html"&gt;LocalCluster&lt;/a&gt;
object serves as a reference implementation.&lt;/p&gt;
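&lt;p&gt;To make the interface concrete, here is a toy backend that satisfies it while only keeping a list of names in memory. The class name and the worker addresses are made up for illustration; a real backend would start and stop processes or containers in these two methods.&lt;/p&gt;

```python
class InMemoryCluster:
    """Toy backend that only records which workers exist (hypothetical)."""

    def __init__(self):
        self.workers = []
        self._counter = 0

    def scale_up(self, n):
        # Bring the total count of workers up to n; never removes workers
        while n > len(self.workers):
            self.workers.append("worker-%d" % self._counter)
            self._counter += 1

    def scale_down(self, workers):
        # Remove the given worker addresses from the cluster
        doomed = set(workers)
        self.workers = [w for w in self.workers if w not in doomed]


cluster = InMemoryCluster()
cluster.scale_up(3)                # worker-0, worker-1, worker-2
cluster.scale_down(["worker-1"])   # release one specific worker
cluster.scale_up(3)                # back to three workers in total
```

&lt;p&gt;Note the asymmetry in the interface: scaling up takes a target count, while scaling down takes explicit worker addresses, because the policy decides which workers are safe to release.&lt;/p&gt;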
&lt;p&gt;We now combine this adaptive scheme with a deployment scheme. We’ll use a tiny
Dask-Marathon deployment library available
&lt;a class="reference external" href="https://github.com/mrocklin/dask-marathon"&gt;here&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_marathon&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarathonCluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.deploy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Adaptive&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarathonCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;docker_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mrocklin/dask-distributed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ac&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Adaptive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This combines a policy, Adaptive, with a deployment scheme, Marathon, in a
composable way. The Adaptive cluster watches the scheduler and calls the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scale_up/down&lt;/span&gt;&lt;/code&gt; methods on the MarathonCluster as necessary.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="marathon-code"&gt;
&lt;h1&gt;Marathon code&lt;/h1&gt;
&lt;p&gt;Because we’ve isolated all of the “when” logic to the Adaptive code, the
Marathon-specific code is blissfully short and specific. We include a slightly
simplified version below. There is a fair amount of Marathon-specific setup in
the constructor and then simple scale_up/down methods below:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;uuid&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;marathon&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarathonClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MarathonApp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;marathon.models.container&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarathonContainer&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;MarathonCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dask-worker&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;docker_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mrocklin/dask-distributed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;marathon_address&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;http://localhost:8080&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scheduler&lt;/span&gt;

        &lt;span class="c1"&gt;# Create Marathon App to run dask-worker&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;--nthreads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;--name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;$MESOS_TASK_ID&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# use Mesos task ID as worker name&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;--worker-port&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;$PORT_WORKER&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;--nanny-port&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;$PORT_NANNY&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;--http-port&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;$PORT_HTTP&amp;#39;&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;ports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;port&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="s1"&gt;&amp;#39;protocol&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tcp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;worker&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nanny&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;http&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--memory-limit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;

        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cmd&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarathonContainer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;image&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;docker_image&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarathonApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;port_definitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Connect and register app&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarathonClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marathon_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;dask-&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scale_up&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scale_down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kill_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;worker_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                  &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This isn’t trivial; you need to know about Marathon for this to make sense, but
fortunately you don’t need to know much else. My hope is that people familiar
with other cluster resource managers will be able to write similar objects and
publish them as third-party libraries, as I have with this Marathon
solution:
&lt;a class="github reference external" href="https://github.com/mrocklin/dask-marathon"&gt;mrocklin/dask-marathon&lt;/a&gt;
(thanks go to Ben Zaitlen for setting up a great testing harness for this and
getting everything started).&lt;/p&gt;
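&lt;p&gt;To show how little surface area such an object needs, here is a minimal sketch of a cluster class with the same two methods as the Marathon example above. Everything in it is hypothetical: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;InMemoryCluster&lt;/span&gt;&lt;/code&gt; and its dict of workers stand in for a real resource manager, and I’m assuming that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scale_up&lt;/span&gt;&lt;/code&gt; receives a target total and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scale_down&lt;/span&gt;&lt;/code&gt; receives the workers to retire, which is how I read the Marathon code above.&lt;/p&gt;

```python
import itertools


class InMemoryCluster:
    """Hypothetical cluster object that tracks workers in a dict.

    A real implementation would call a resource manager here instead.
    """

    def __init__(self):
        self.workers = {}               # worker name mapped to metadata
        self._ids = itertools.count()   # source of unique worker names

    def scale_up(self, instances):
        # Assumption: `instances` is the desired total, not an increment.
        missing = max(0, instances - len(self.workers))
        for _ in range(missing):
            name = "worker-%d" % next(self._ids)
            self.workers[name] = {"name": name}

    def scale_down(self, workers):
        # Retire exactly the workers requested; unknown names are ignored.
        for name in workers:
            self.workers.pop(name, None)


cluster = InMemoryCluster()
cluster.scale_up(3)
cluster.scale_down(["worker-1"])
remaining = sorted(cluster.workers)
```

&lt;p&gt;After these calls &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;remaining&lt;/span&gt;&lt;/code&gt; holds the two surviving worker names. A real class would replace the dict updates with calls to the resource manager.&lt;/p&gt;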
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/09/22/cluster-deployments.md&lt;/span&gt;, line 258)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="adaptive-policies"&gt;
&lt;h1&gt;Adaptive Policies&lt;/h1&gt;
&lt;p&gt;Similarly, we can design new policies for deployment. You can read more about
the policies for the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Adaptive&lt;/span&gt;&lt;/code&gt; class in the
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/adaptive.html"&gt;documentation&lt;/a&gt; or
the
&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/distributed/deploy/adaptive.py"&gt;source&lt;/a&gt;
(about eighty lines long). I encourage people to implement and use other
policies and contribute back those policies that are useful in practice.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/09/22/cluster-deployments.md&lt;/span&gt;, line 268)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final thoughts&lt;/h1&gt;
&lt;p&gt;We laid out a problem:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;How does a distributed system support a variety of cluster resource managers
and a variety of scheduling policies while remaining sensible?&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We proposed two solutions:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Maintain a registry of links to solutions, supporting copy-paste-edit practices&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Develop an API boundary that encourages separable development of third party libraries.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It’s not clear that either solution is sufficient, or that the current
implementation of either solution is any good. This is an important problem,
though, as Dask.distributed is, today, still mostly used by super-users. I
would like to engage community creativity here as we search for a good
solution.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/09/22/cluster-deployments/"/>
    <summary>This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2016-09-22T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/09/13/dask-and-celery/</id>
    <title>Dask and Celery</title>
    <updated>2016-09-13T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;This post compares two Python distributed task processing systems,
Dask.distributed and Celery.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: technical comparisons are hard to do well. I am biased towards
Dask and ignorant of correct Celery practices. Please keep this in mind.
Critical feedback by Celery experts is welcome.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.celeryproject.org/"&gt;Celery&lt;/a&gt; is a distributed task queue built in
Python and heavily used by the Python community for task-based workloads.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://dask.pydata.org/en/latest/"&gt;Dask&lt;/a&gt; is a parallel computing library
popular within the PyData community that has grown a fairly sophisticated
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/"&gt;distributed task scheduler&lt;/a&gt;.
This post explores whether Dask.distributed can be useful for Celery-style problems.&lt;/p&gt;
&lt;p&gt;Comparing technical projects is hard, both because authors have bias and
because the scope of each project can be quite large. This allows authors to
gravitate towards the features that show off their own strengths. Fortunately &lt;a class="reference external" href="https://github.com/dask/dask/issues/1537"&gt;a
Celery user asked how Dask compares on
GitHub&lt;/a&gt; and they listed a few
concrete features:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Handling multiple queues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Canvas (Celery’s workflow)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rate limiting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrying&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These provide an opportunity to explore the Dask/Celery comparison from the
bias of a Celery user rather than from the bias of a Dask developer.&lt;/p&gt;
&lt;p&gt;In this post I’ll point out a couple of large differences, then go through the
Celery hello world in both projects, and then address how these requested
features are implemented or not within Dask. This anecdotal comparison over a
few features should give us a general sense of how the two projects compare.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/09/13/dask-and-celery.md&lt;/span&gt;, line 43)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="biggest-difference-worker-state-and-communication"&gt;

&lt;p&gt;First, the biggest difference (from my perspective) is that Dask workers hold
onto intermediate results and communicate data between each other while in
Celery all results flow back to a central authority. This difference was
critical when building out large parallel arrays and dataframes (Dask’s
original purpose) where we needed to engage our worker processes’ memory and
inter-worker communication bandwidths. Computational systems like Dask do
this; more data-engineering-oriented systems like Celery/Airflow/Luigi don’t. This is
the main reason why Dask wasn’t built on top of Celery/Airflow/Luigi originally.&lt;/p&gt;
&lt;p&gt;That’s not a knock against Celery/Airflow/Luigi by any means. Typically
they’re used in settings where this doesn’t matter and they’ve focused their
energies on several features that Dask similarly doesn’t care about or do well.
Tasks usually read data from some globally accessible store like a database or
S3 and either return very small results, or place larger results back in the
global store.&lt;/p&gt;
&lt;p&gt;The question on my mind now is: &lt;em&gt;Can Dask be a useful solution in more
traditional loose task scheduling problems where projects like Celery are
typically used? What are the benefits and drawbacks?&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/09/13/dask-and-celery.md&lt;/span&gt;, line 65)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hello-world"&gt;
&lt;h1&gt;Hello World&lt;/h1&gt;
&lt;p&gt;To start we do the &lt;a class="reference external" href="http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html"&gt;First steps with
Celery&lt;/a&gt;
walk-through both in Celery and Dask and compare the two:&lt;/p&gt;
&lt;section id="celery"&gt;
&lt;h2&gt;Celery&lt;/h2&gt;
&lt;p&gt;I follow the Celery quickstart, using Redis instead of RabbitMQ because it’s
what I happen to have handy.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# tasks.py&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;celery&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Celery&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Celery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tasks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;broker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;redis://localhost&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;redis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;redis-server
celery -A tasks worker --loglevel=info
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tasks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# submit and retrieve roundtrip&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;68&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;567&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;888&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;960&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.7&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="dask"&gt;
&lt;h2&gt;Dask&lt;/h2&gt;
&lt;p&gt;We do the same workload with dask.distributed’s concurrent.futures interface,
using the default single-machine deployment.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;operator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;20.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;328&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;340&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;369&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="comparison"&gt;
&lt;h2&gt;Comparison&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Functions&lt;/strong&gt;: In Celery you register computations ahead of time on the
server. This is good if you know what you want to run ahead of time (such
as is often the case in data engineering workloads) and don’t want the
security risk of allowing users to run arbitrary code on your cluster. It’s
less pleasant for users who want to experiment. In Dask we choose the
functions to run on the user side, not on the server side. This ends up
being pretty critical in data exploration but may be a hindrance in more
conservative/secure compute settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: In Celery we depend on other widely deployed systems like
RabbitMQ or Redis. Dask depends on lower-level Tornado TCP IOStreams and
Dask’s own custom routing logic. This makes Dask trivial to set up, but
also probably less durable. Redis and RabbitMQ have both solved lots of
problems that come up in the wild and leaning on them inspires confidence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: They both operate with sub-second latencies and
millisecond-ish overheads. Dask is marginally lower-overhead but for data
engineering workloads differences at this level are rarely significant.
Dask is an order of magnitude lower-latency, which might be a big deal
depending on your application. For example, if you’re firing off tasks from
a user clicking a button on a website, 20ms is generally within an interactive
budget while 500ms feels a bit slower.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
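&lt;p&gt;A quick back-of-the-envelope calculation from the timings in the two sessions above makes these numbers concrete. The variable names are mine, and the figures come from a single run on one machine, so treat the results as rough.&lt;/p&gt;

```python
# Wall times copied from the IPython sessions above, in milliseconds.
celery_roundtrip_ms = 567.0   # one add.delay(1, 1).get()
dask_roundtrip_ms = 20.7      # one c.submit(add, 1, 1).result()
celery_batch_ms = 1700.0      # 1000 tasks submitted, then collected
dask_batch_ms = 369.0         # 1000 tasks submitted, then gathered
n_tasks = 1000

# Amortized overhead per task when many tasks are in flight at once.
celery_per_task_ms = celery_batch_ms / n_tasks
dask_per_task_ms = dask_batch_ms / n_tasks

# Ratio of single-task round-trip latencies.
latency_ratio = celery_roundtrip_ms / dask_roundtrip_ms
```

&lt;p&gt;This gives roughly 1.7 ms per task for Celery and 0.37 ms for Dask in the batch case, and a single-task latency ratio of about 27.&lt;/p&gt;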
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/09/13/dask-and-celery.md&lt;/span&gt;, line 155)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="simple-dependencies"&gt;
&lt;h1&gt;Simple Dependencies&lt;/h1&gt;
&lt;p&gt;The question asked about
&lt;a class="reference external" href="http://docs.celeryproject.org/en/master/userguide/canvas.html"&gt;Canvas&lt;/a&gt;,
Celery’s dependency management system.&lt;/p&gt;
&lt;p&gt;Often tasks depend on the results of other tasks. Both systems have ways to
help users express these dependencies.&lt;/p&gt;
&lt;section id="id1"&gt;
&lt;h2&gt;Celery&lt;/h2&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: INFO/1 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/09/13/dask-and-celery.md&lt;/span&gt;, line 164); &lt;em&gt;&lt;a href="#id1"&gt;backlink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Duplicate implicit target name: “celery”.&lt;/p&gt;
&lt;/aside&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apply_async&lt;/span&gt;&lt;/code&gt; method has a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;link=&lt;/span&gt;&lt;/code&gt; parameter that can be used to call tasks
after other tasks have run. For example we can compute &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(1&lt;/span&gt; &lt;span class="pre"&gt;+&lt;/span&gt; &lt;span class="pre"&gt;2)&lt;/span&gt; &lt;span class="pre"&gt;+&lt;/span&gt; &lt;span class="pre"&gt;3&lt;/span&gt;&lt;/code&gt; in Celery
as follows:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply_async&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="dask-distributed"&gt;
&lt;h2&gt;Dask.distributed&lt;/h2&gt;
&lt;p&gt;With the Dask concurrent.futures API, futures can be used within submit calls
and dependencies are implicit.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We could also use the &lt;a class="reference external" href="http://dask.pydata.org/en/latest/delayed.html"&gt;dask.delayed&lt;/a&gt; decorator to annotate arbitrary functions and then use normal-ish Python.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="id2"&gt;
&lt;h2&gt;Comparison&lt;/h2&gt;
&lt;p&gt;I prefer the Dask solution, but that’s subjective.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="complex-dependencies"&gt;
&lt;h1&gt;Complex Dependencies&lt;/h1&gt;
&lt;section id="id3"&gt;
&lt;h2&gt;Celery&lt;/h2&gt;
&lt;p&gt;Celery includes a rich vocabulary of terms to connect tasks in more complex
ways, including &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groups&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chains&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chords&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;maps&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;starmaps&lt;/span&gt;&lt;/code&gt;, etc. More
detail is in their docs for Canvas, the system they use to construct complex
workflows: &lt;a class="reference external" href="http://docs.celeryproject.org/en/master/userguide/canvas.html"&gt;http://docs.celeryproject.org/en/master/userguide/canvas.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For example, here we chord many adds and then follow them with a sum.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tasks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tsum&lt;/span&gt;  &lt;span class="c1"&gt;# I had to add a sum method to tasks.py&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;celery&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chord&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;chord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;tsum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;172&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;184&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.21&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="mi"&gt;9900&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="id4"&gt;
&lt;h2&gt;Dask&lt;/h2&gt;
&lt;p&gt;Dask’s trick of allowing futures in submit calls actually goes pretty far.
Dask doesn’t really need any additional primitives. It can do all of the
patterns expressed in Canvas fairly naturally with normal submit calls.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;52&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;52&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;60.8&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Or with &lt;a class="reference external" href="http://dask.pydata.org/en/latest/delayed.html"&gt;dask.delayed&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="multiple-queues"&gt;
&lt;h1&gt;Multiple Queues&lt;/h1&gt;
&lt;p&gt;In Celery there is a notion of queues to which tasks can be submitted and to which
workers can subscribe. An example use case is having “high priority” workers
that only process “high priority” tasks. Every worker can subscribe to
the high-priority queue but certain workers will subscribe to that queue
exclusively:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;  &lt;span class="c1"&gt;# only subscribe to high priority&lt;/span&gt;
&lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;  &lt;span class="c1"&gt;# subscribe to both&lt;/span&gt;
&lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;
&lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is like the TSA pre-check line or the express lane in the grocery store.&lt;/p&gt;
&lt;p&gt;Dask has a couple of features that are similar or could fit this need in a pinch, but nothing that is strictly analogous.&lt;/p&gt;
&lt;p&gt;First, for the common case above, tasks have priorities. These are typically
set by the scheduler to minimize memory use but can be overridden directly by
users to give certain tasks precedence over others.&lt;/p&gt;
&lt;p&gt;Second, you can restrict tasks to run on subsets of workers. This was
originally designed for data-local storage systems like the Hadoop FileSystem
(HDFS) or clusters with special hardware like GPUs but can be used in the
queues case as well. It’s not quite the same abstraction but could be used to
achieve the same results in a pinch. For each task you can &lt;em&gt;restrict&lt;/em&gt; the pool
of workers on which it can run.&lt;/p&gt;
&lt;p&gt;The relevant docs for this are here:
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/locality.html#user-control"&gt;http://distributed.readthedocs.io/en/latest/locality.html#user-control&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="retrying-tasks"&gt;
&lt;h1&gt;Retrying Tasks&lt;/h1&gt;
&lt;p&gt;Celery allows tasks to retry themselves on a failure.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;send_twitter_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oauth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;twitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Twitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oauth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;twitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Twitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FailWhaleError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Twitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoginError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example from http://docs.celeryproject.org/en/latest/userguide/tasks.html#retrying&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Sadly Dask currently has no support for this (see &lt;a class="reference external" href="https://github.com/dask/distributed/issues/391"&gt;open
issue&lt;/a&gt;). All functions are
considered pure and final. If a task errs the exception is considered to be
the true result. This could change though; it has been requested a couple of
times now.&lt;/p&gt;
&lt;p&gt;Until then users need to implement retry logic within the function (which isn’t
a terrible idea regardless).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;send_twitter_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oauth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;twitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Twitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oauth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;twitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Twitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FailWhaleError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Twitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoginError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;n_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;  &lt;span class="c1"&gt;# out of retries: surface the last error&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
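&lt;p&gt;Retry logic of this kind can also be factored into a small helper so that it does not have to be repeated in every function. The following is only a sketch using the standard library; the names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;call_with_retries&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;flaky&lt;/span&gt;&lt;/code&gt; are hypothetical and are not part of Dask or Celery.&lt;/p&gt;

```python
import time


def call_with_retries(func, *args, n_retries=5, delay=0.0,
                      exceptions=(Exception,), **kwargs):
    """Call func(*args, **kwargs), trying at most n_retries times in total.

    Hypothetical helper: only the listed exception types trigger a retry,
    and the last failure is re-raised once the attempts are used up.
    """
    for attempt in range(n_retries):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == n_retries - 1:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)  # optional pause before the next attempt


# Hypothetical flaky function, used only to exercise the helper:
# it fails on the first two calls and succeeds on the third.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("temporary failure")
    return "ok"
```

&lt;p&gt;Because the helper is an ordinary function, it could in principle be submitted like any other, for example &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;c.submit(call_with_retries,&lt;/span&gt; &lt;span class="pre"&gt;flaky)&lt;/span&gt;&lt;/code&gt;, with the retries then happening inside the task on the worker.&lt;/p&gt;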
&lt;/section&gt;
&lt;section id="rate-limiting"&gt;
&lt;h1&gt;Rate Limiting&lt;/h1&gt;
&lt;p&gt;Celery lets you specify rate limits on tasks, presumably to help you avoid
getting blocked for hammering external APIs:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1000/h&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;query_external_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask definitely has nothing built in for this, nor is it planned. However,
this could be done externally to Dask fairly easily. For example, Dask
supports mapping functions over arbitrary Python Queues. If you send in a
queue then all current and future elements in that queue will be mapped over.
You could easily handle rate limiting in pure Python on the client side by
rate limiting your input queues. The low latency and overhead of Dask make it
fairly easy to manage logic like this on the client side. It’s not as
convenient, but it’s still straightforward.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;queue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_external_api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Queue&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
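&lt;p&gt;As a rough illustration of rate limiting an input queue on the client side, the sketch below fills a standard-library queue from a background thread at a bounded rate. The function name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;feed_at_rate&lt;/span&gt;&lt;/code&gt; and the numbers used are assumptions made for this example; nothing here is a Dask API.&lt;/p&gt;

```python
import threading
import time
from queue import Queue


def feed_at_rate(items, out_queue, per_second):
    """Put each item on out_queue, pausing between puts.

    Hypothetical helper: sleeping 1 / per_second after every put keeps the
    feed rate at or below per_second items each second.
    """
    interval = 1.0 / per_second
    for item in items:
        out_queue.put(item)
        time.sleep(interval)


# Feed five inputs at no more than 100 per second from a background thread,
# so the client stays free to do other work while the queue fills.
q = Queue()
feeder = threading.Thread(target=feed_at_rate, args=(range(5), q, 100), daemon=True)
feeder.start()
feeder.join()  # only so this example finishes deterministically
```

&lt;p&gt;A queue filled this way would then play the role of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;q&lt;/span&gt;&lt;/code&gt; in the mapping example, so tasks are created no faster than inputs arrive.&lt;/p&gt;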
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;Based on this very shallow exploration of Celery, I’ll foolishly claim that
Dask can handle Celery workloads, &lt;em&gt;if you’re not diving into the deep API&lt;/em&gt;.
However, all of that deep API is actually really important. Celery evolved in
this domain and developed tons of features that solve problems that arise over
and over again. This history saves users an enormous amount of time. Dask
evolved in a very different space and has developed a very different set of
tricks. Many of Dask’s tricks are general enough that they can solve Celery
problems with a small bit of effort, but there’s still that extra step. I’m
seeing people applying that effort to problems now and I think it’ll be
interesting to see what comes out of it.&lt;/p&gt;
&lt;p&gt;Going through the Celery API was a good experience for me personally. I think
that there are some good concepts from Celery that can inform future Dask
development.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/09/13/dask-and-celery/"/>
    <summary>This post compares two Python distributed task processing systems,
Dask.distributed and Celery.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2016-09-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/09/12/dask-distributed-release-1.13.0/</id>
    <title>Dask Distributed Release 1.13.0</title>
    <updated>2016-09-12T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;I’m pleased to announce a release of
&lt;a class="reference external" href="http://dask.readthedocs.io/en/latest/"&gt;Dask&lt;/a&gt;’s distributed scheduler,
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/"&gt;dask.distributed&lt;/a&gt;, version
1.13.0.&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask distributed -c conda-forge
or
pip install dask distributed --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The last few months have seen a number of important user-facing features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Executor is renamed to Client&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers can spill excess data to disk when they run out of memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Client.compute and Client.persist methods for dealing with dask
collections (like dask.dataframe or dask.delayed) gain the ability to
restrict sub-components of the computation to different parts of the
cluster with a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;workers=&lt;/span&gt;&lt;/code&gt; keyword argument.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IPython kernels can be deployed on the workers and scheduler for
interactive debugging.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Bokeh web interface has gained new plots and improved the visual styling
of old ones.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally there are beta features in current development. These features
are available now, but may change without warning in future versions.
Experimentation and feedback by users comfortable with living on the bleeding
edge is most welcome:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Clients can publish named datasets on the scheduler to share between them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tasks can launch other tasks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers can restart themselves in new software environments provided by the
user&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There have also been significant internal changes. Other than increased
performance these changes should not be directly apparent.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The scheduler was refactored to a more state-machine like architecture.
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/scheduling-state.html"&gt;Doc page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Short-lived connections are now managed by a connection pool&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Work stealing has changed and grown more responsive:
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/work-stealing.html"&gt;Doc page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;General resilience improvements&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rest of this post will contain very brief explanations of the topics above.
Some of these topics may become blogposts of their own at some point. Until
then I encourage people to look at the &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest"&gt;distributed scheduler’s
documentation&lt;/a&gt; which is separate
from &lt;a class="reference external" href="http://dask.readthedocs.io/en/latest/"&gt;dask’s normal documentation&lt;/a&gt; and
so may contain new information for some readers (Google Analytics reports about
5-10x the readership on
&lt;a class="reference external" href="http://dask.readthedocs.org"&gt;http://dask.readthedocs.org&lt;/a&gt; than on
&lt;a class="reference external" href="http://distributed.readthedocs.org"&gt;http://distributed.readthedocs.org&lt;/a&gt;).&lt;/p&gt;
&lt;section id="major-changes-and-features"&gt;

&lt;section id="rename-executor-to-client"&gt;
&lt;h2&gt;Rename Executor to Client&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/api.html"&gt;http://distributed.readthedocs.io/en/latest/api.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The term &lt;em&gt;Executor&lt;/em&gt; was originally chosen to coincide with the
&lt;a class="reference external" href="https://docs.python.org/3/library/concurrent.futures.html"&gt;concurrent.futures&lt;/a&gt;
Executor interface, which is what defines the behavior for the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.submit&lt;/span&gt;&lt;/code&gt;,
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.map&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.result&lt;/span&gt;&lt;/code&gt; methods and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Future&lt;/span&gt;&lt;/code&gt; object used as the primary interface.&lt;/p&gt;
&lt;p&gt;Unfortunately, this is the same term used by projects like Spark and Mesos for
“the low-level thing that executes tasks on each of the workers” causing
significant confusion when communicating with other communities or for
transitioning users.&lt;/p&gt;
&lt;p&gt;In response we have renamed &lt;em&gt;Executor&lt;/em&gt; to the somewhat more generic term &lt;em&gt;Client&lt;/em&gt;, to
designate its role as &lt;em&gt;the thing users interact with to control their
computations&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;  &lt;span class="c1"&gt;# Old&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                    &lt;span class="c1"&gt;# Old&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;    &lt;span class="c1"&gt;# New&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                      &lt;span class="c1"&gt;# New&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Executor remains an alias for Client and will continue to be valid for some
time, but there may be some backwards incompatible changes for internal use of
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor=&lt;/span&gt;&lt;/code&gt; keywords within methods. Newer examples and materials will all use
the term &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="workers-spill-excess-data-to-disk"&gt;
&lt;h2&gt;Workers Spill Excess Data to Disk&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/worker.html#spill-excess-data-to-disk"&gt;http://distributed.readthedocs.io/en/latest/worker.html#spill-excess-data-to-disk&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When workers get close to running out of memory they can send excess data to
disk. This is not on by default and instead requires adding the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;--memory-limit=auto&lt;/span&gt;&lt;/code&gt; option to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-worker&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8786&lt;/span&gt;                      &lt;span class="c1"&gt;# Old&lt;/span&gt;
&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8786&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;auto&lt;/span&gt;  &lt;span class="c1"&gt;# New&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This will eventually become the default (and already is when using
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/local-cluster.html"&gt;LocalCluster&lt;/a&gt;),
but we’d like to see how things progress and phase it in slowly.&lt;/p&gt;
&lt;p&gt;Generally this feature should improve robustness and allow the solution of
larger problems on smaller clusters, although with a performance cost. Dask’s
policies to reduce memory use through clever scheduling remain in place, so in
the common case you should never need this feature, but it’s nice to have as a
failsafe.&lt;/p&gt;
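&lt;p&gt;To make the idea concrete, here is a toy sketch of spilling, not the worker’s actual implementation. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SpillingStore&lt;/span&gt;&lt;/code&gt; class and its &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;max_items&lt;/span&gt;&lt;/code&gt; limit are hypothetical, and the limit counts items rather than bytes to keep the example short. It keeps the most recently stored values in memory and pickles older ones to disk, which is where the performance cost comes from.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillingStore:
    """Toy key-value store: recent values stay in memory, older ones go to disk."""

    def __init__(self, max_items, directory):
        self.max_items = max_items      # hypothetical limit, counted in items
        self.directory = directory
        self.fast = OrderedDict()       # in-memory values, oldest first
        self.on_disk = set()            # keys whose values live on disk

    def _path(self, key):
        return os.path.join(self.directory, str(key) + ".pkl")

    def __setitem__(self, key, value):
        self.on_disk.discard(key)
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Evict the oldest values until we are back under the limit.
        while len(self.fast) > self.max_items:
            old_key, old_value = self.fast.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_value, f)
            self.on_disk.add(old_key)

    def __getitem__(self, key):
        if key in self.fast:
            return self.fast[key]
        if key not in self.on_disk:
            raise KeyError(key)
        with open(self._path(key), "rb") as f:   # slow path: read back from disk
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    store = SpillingStore(max_items=2, directory=tmp)
    for name in ["a", "b", "c"]:
        store[name] = name * 3
    spilled = sorted(store.on_disk)   # keys that were pushed out of memory
    recovered = store["a"]            # transparently read back from disk
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;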
&lt;/section&gt;
&lt;section id="enable-restriction-of-valid-workers-for-compute-and-persist-methods"&gt;
&lt;h2&gt;Enable restriction of valid workers for compute and persist methods&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/locality.html#user-control"&gt;http://distributed.readthedocs.io/en/latest/locality.html#user-control&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Expert users of the distributed scheduler will be aware of the ability to
restrict certain tasks to run only on certain computers. This tends to be
useful when dealing with GPUs or with special databases or instruments only
available on some machines.&lt;/p&gt;
&lt;p&gt;Previously this option was available only on the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;submit&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scatter&lt;/span&gt;&lt;/code&gt;
methods, forcing people to use the more immediate interface. Now the dask
collection interface functions &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;persist&lt;/span&gt;&lt;/code&gt; support this keyword as
well.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="ipython-integration"&gt;
&lt;h2&gt;IPython Integration&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/ipython.html"&gt;http://distributed.readthedocs.io/en/latest/ipython.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can start IPython kernels on the workers or scheduler and then access them
directly using either IPython magics or the QTConsole. This tends to be
valuable when things go wrong and you want to interactively debug on the worker
nodes themselves.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start IPython on the Scheduler&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_ipython_scheduler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Start IPython kernel on the scheduler&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processing&lt;/span&gt;   &lt;span class="c1"&gt;# Use IPython magics to inspect scheduler&lt;/span&gt;
&lt;span class="go"&gt;{&amp;#39;127.0.0.1:3595&amp;#39;: [&amp;#39;inc-1&amp;#39;, &amp;#39;inc-2&amp;#39;],&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;127.0.0.1:53589&amp;#39;: [&amp;#39;inc-2&amp;#39;, &amp;#39;add-5&amp;#39;]}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Start IPython on the Workers&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_ipython_workers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Start IPython kernels on all workers&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;[&amp;#39;127.0.0.1:3595&amp;#39;, &amp;#39;127.0.0.1:53589&amp;#39;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;remote&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;127.0.0.1:3595&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;  &lt;span class="c1"&gt;# Use IPython magics&lt;/span&gt;
&lt;span class="go"&gt;{&amp;#39;inc-1&amp;#39;, &amp;#39;inc-2&amp;#39;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="bokeh-interface"&gt;
&lt;h2&gt;Bokeh Interface&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/web.html"&gt;http://distributed.readthedocs.io/en/latest/web.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Bokeh web interface to the cluster continues to evolve both by improving
existing plots and by adding new plots and new pages.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/dask/dask-org/master/images/bokeh-progress-large.gif"
     alt="dask progress bar"
     width="60%"
     align="right"&gt;&lt;/p&gt;
&lt;p&gt;For example the progress bars have become more compact and shrink down
dynamically to respond to additional bars.&lt;/p&gt;
&lt;p&gt;And we’ve added in extra tables and plots to monitor workers, such as their
memory use and current backlog of tasks.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="experimental-features"&gt;
&lt;h1&gt;Experimental Features&lt;/h1&gt;
&lt;p&gt;The features described below are experimental and may change without warning.
Please do not depend on them in stable code.&lt;/p&gt;
&lt;section id="publish-datasets"&gt;
&lt;h2&gt;Publish Datasets&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/publish.html"&gt;http://distributed.readthedocs.io/en/latest/publish.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can now save collections on the scheduler, allowing you to come back to the
same computations later or allow collaborators to see and work off of your
results. This can be useful in the following cases:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;There is a dataset from which you frequently base all computations, and you
want that dataset always in memory and easy to access without having to
recompute it each time you start work, even if you disconnect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You want to send results to a colleague working on the same Dask cluster and
have them get immediate access to your computations without having to send
them a script and without them having to repeat the work on the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Example: Client One&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scheduler-address:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://my-bucket/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="n"&gt;name&lt;/span&gt;  &lt;span class="n"&gt;balance&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="n"&gt;Alice&lt;/span&gt;     &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;      &lt;span class="n"&gt;Bob&lt;/span&gt;     &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;Charlie&lt;/span&gt;     &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;Dennis&lt;/span&gt;     &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;    &lt;span class="n"&gt;Edith&lt;/span&gt;     &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;publish_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Example: Client Two&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scheduler-address:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_datasets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;[&amp;#39;accounts&amp;#39;]&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;accounts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;      name  balance&lt;/span&gt;
&lt;span class="go"&gt;0    Alice     -100&lt;/span&gt;
&lt;span class="go"&gt;1      Bob     -200&lt;/span&gt;
&lt;span class="go"&gt;2  Charlie     -300&lt;/span&gt;
&lt;span class="go"&gt;3   Dennis     -400&lt;/span&gt;
&lt;span class="go"&gt;4    Edith     -500&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="launch-tasks-from-tasks"&gt;
&lt;h2&gt;Launch Tasks from Tasks&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/task-launch.html"&gt;http://distributed.readthedocs.io/en/latest/task-launch.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can now submit tasks to the cluster that themselves submit more tasks.
This allows the submission of highly dynamic workloads that can shape
themselves depending on future computed values without ever checking back in
with the original client.&lt;/p&gt;
&lt;p&gt;This is accomplished by starting new local &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Client&lt;/span&gt;&lt;/code&gt;s within the task that can
interact with the scheduler.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;local_client&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;local_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;c2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;There are a few straightforward use cases for this, like iterative algorithms
with stopping criteria, but also many novel use cases including streaming
and monitoring systems.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="restart-workers-in-redeployable-python-environments"&gt;
&lt;h2&gt;Restart Workers in Redeployable Python Environments&lt;/h2&gt;
&lt;p&gt;You can now zip up and distribute full Conda environments, and ask
dask-workers to restart themselves, live, in that environment. This involves
the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Create a conda environment locally (or any redeployable directory including
a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;python&lt;/span&gt;&lt;/code&gt; executable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zip up that environment and use the existing dask.distributed network
to copy it to all of the workers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shut down all of the workers and restart them within the new environment&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This helps users to experiment with different software environments with a much
faster turnaround time (typically tens of seconds) than asking IT to install
libraries or building and deploying Docker containers (which is also a fine
solution). Note that the typical solution of uploading individual Python
scripts or egg files has been around for a while; &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/api.html#distributed.client.Client.upload_file"&gt;see the API docs for
upload_file&lt;/a&gt;.&lt;/p&gt;
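&lt;p&gt;The zipping and unpacking in the steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the mechanism distributed uses: the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;myenv&lt;/span&gt;&lt;/code&gt; directory and its placeholder &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bin/python&lt;/span&gt;&lt;/code&gt; file are stand-ins for a real environment, the copy over the network is omitted, and details such as file permissions and relocatable paths are ignored.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real environment directory (hypothetical contents).
    env_dir = os.path.join(tmp, "myenv")
    os.makedirs(os.path.join(env_dir, "bin"))
    with open(os.path.join(env_dir, "bin", "python"), "w") as f:
        f.write("placeholder for the interpreter\n")

    # Zip the directory so it can be shipped around as a single binary blob.
    archive = shutil.make_archive(os.path.join(tmp, "myenv"), "zip", root_dir=env_dir)

    # On a worker: unpack the blob into a fresh location before restarting.
    target = os.path.join(tmp, "worker-env")
    shutil.unpack_archive(archive, target)
    unpacked_ok = os.path.exists(os.path.join(target, "bin", "python"))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;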
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;Since version 1.12.0 on August 18th the following people have contributed
commits to the &lt;a class="reference external" href="https://github.com/dask/distributed"&gt;dask/distributed repository&lt;/a&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Dave Hirschfeld&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dsidi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joseph Crail&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Min RK&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/09/12/dask-distributed-release-1.13.0/"/>
    <summary>I’m pleased to announce a release of
Dask’s distributed scheduler,
dask.distributed, version
1.13.0.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-09-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/08/16/dask-for-institutions/</id>
    <title>Dask for Institutions</title>
    <updated>2016-08-16T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="20%"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;section id="introduction"&gt;

&lt;p&gt;Institutions use software differently than individuals. Over the last few
months I’ve had dozens of conversations about using Dask within larger
organizations like universities, research labs, private companies, and
non-profit learning systems. This post provides a very coarse summary of those
conversations and extracts common questions. I’ll then try to answer those
questions.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: some of this post will be necessarily vague at points. Some companies
prefer privacy. All details here are either in public Dask issues or have come
up with enough institutions (say at least five) that I’m comfortable listing
the problem here.&lt;/em&gt;&lt;/p&gt;
&lt;section id="common-story"&gt;
&lt;h2&gt;Common story&lt;/h2&gt;
&lt;p&gt;Institution X, a university/research lab/company/… has many
scientists/analysts/modelers who develop models and analyze data with Python,
the PyData stack like NumPy/Pandas/SKLearn, and a large amount of custom code.
These models/data sometimes grow to be large enough to need a moderately large
amount of parallel computing.&lt;/p&gt;
&lt;p&gt;Fortunately, Institution X has an in-house cluster acquired for exactly this
purpose of accelerating modeling and analysis of large computations and
datasets. Users can submit jobs to the cluster using a job scheduler like
SGE/LSF/Mesos/Other.&lt;/p&gt;
&lt;p&gt;However the cluster is still under-utilized and the users are still asking for
help with parallel computing. Either users aren’t comfortable using the
SGE/LSF/Mesos/Other interface, it doesn’t support sufficiently complex/dynamic
workloads, or the interaction times aren’t good enough for the interactive use
that users appreciate.&lt;/p&gt;
&lt;p&gt;There was an internal effort to build a more complex/interactive/Pythonic
system on top of SGE/LSF/Mesos/Other but it’s not particularly mature and
definitely isn’t something that Institution X wants to pursue. It turned out
to be a harder problem than expected to design/build/maintain such a system
in-house. They’d love to find an open source solution that was well featured
and maintained by a community.&lt;/p&gt;
&lt;p&gt;The Dask.distributed scheduler looks like it’s 90% of the system that
Institution X needs. However there are a few open questions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;How do we integrate dask.distributed with the SGE/LSF/Mesos/Other job
scheduler?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How can we grow and shrink the cluster dynamically based on use?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do users manage software environments on the workers?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How secure is the distributed scheduler?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask is resilient to worker failure; how about scheduler failure?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What happens if &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-worker&lt;/span&gt;&lt;/code&gt;s are in two different data centers? Can we
scale in an asymmetric way?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do we handle multiple concurrent users and priorities?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How does this compare with Spark?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So for the rest of this post I’m going to answer these questions. As usual,
few of the answers will be of the form “Yes Dask can solve all of your problems.”
These are open questions, not the questions that were easy to answer. We’ll
get into what’s possible today and how we might solve these problems in the
future.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-do-we-integrate-dask-distributed-with-sge-lsf-mesos-other"&gt;
&lt;h2&gt;How do we integrate dask.distributed with SGE/LSF/Mesos/Other?&lt;/h2&gt;
&lt;p&gt;It’s not difficult to deploy dask.distributed at scale within an existing
cluster using a tool like SGE/LSF/Mesos/Other. In many cases there is already
a researcher within the institution doing this manually by running
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-scheduler&lt;/span&gt;&lt;/code&gt; on some static node in the cluster and launching &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-worker&lt;/span&gt;&lt;/code&gt;
a few hundred times with their job scheduler and a small job script.&lt;/p&gt;
&lt;p&gt;The goal now is to formalize this process for the individual version of
SGE/LSF/Mesos/Other used within the institution while also developing and
maintaining a standard Pythonic interface so that all of these tools can be
maintained cheaply by Dask developers into the foreseeable future. In some
cases Institution X is happy to pay for the development of a convenient “start
dask on my job scheduler” tool, but they are less excited about paying to
maintain it forever.&lt;/p&gt;
&lt;p&gt;We want Python users to be able to say something like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SGECluster&lt;/span&gt;

&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SGECluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nworkers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;… and have this same interface be standardized across different job
schedulers.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-can-we-grow-and-shrink-the-cluster-dynamically-based-on-use"&gt;
&lt;h2&gt;How can we grow and shrink the cluster dynamically based on use?&lt;/h2&gt;
&lt;p&gt;Alternatively, we could have a single dask.distributed deployment running 24/7
that scales itself up and down dynamically based on current load. Again, this
is entirely possible today if you want to do it manually (you can add and
remove workers on the fly) but we should add some signals to the scheduler like
the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;“I’m under duress, please add workers”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“I’ve been idling for a while, please reclaim workers”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and connect these signals to a manager that talks to the job scheduler. This
removes an element of control from the users and places it in the hands of a
policy that IT can tune to play more nicely with their other services on the
same network.&lt;/p&gt;
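&lt;p&gt;As a rough sketch of what such a policy might look like, the function below turns the two signals into a decision. Nothing here exists in Dask today: the function name, its arguments, and the thresholds (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tasks_per_worker&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;idle_timeout&lt;/span&gt;&lt;/code&gt;) are all hypothetical knobs that IT would tune.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
def scaling_decision(pending_tasks, n_workers, idle_seconds,
                     tasks_per_worker=10, idle_timeout=300):
    """Return "add", "remove", or "hold" for a hypothetical cluster manager."""
    if pending_tasks > n_workers * tasks_per_worker:
        return "add"       # the scheduler is under duress
    if pending_tasks == 0 and idle_seconds >= idle_timeout:
        return "remove"    # the scheduler has been idling for a while
    return "hold"


decisions = [
    scaling_decision(pending_tasks=5000, n_workers=10, idle_seconds=0),
    scaling_decision(pending_tasks=0, n_workers=10, idle_seconds=600),
    scaling_decision(pending_tasks=20, n_workers=10, idle_seconds=0),
]
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;A manager process would call something like this periodically and translate the answer into job-scheduler commands.&lt;/p&gt;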
&lt;/section&gt;
&lt;section id="how-do-users-manage-software-environments-on-the-workers"&gt;
&lt;h2&gt;How do users manage software environments on the workers?&lt;/h2&gt;
&lt;p&gt;Today Dask assumes that all users and workers share the exact same software
environment. There are some small tools to send updated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.py&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.egg&lt;/span&gt;&lt;/code&gt; files
to the workers but that’s it.&lt;/p&gt;
&lt;p&gt;Generally Dask trusts that the full software environment will be handled by
something else. This might be a network file system (NFS) mount on traditional
cluster setups, or it might be handled by moving docker or conda environments
around by some other tool like &lt;a class="reference external" href="http://knit.readthedocs.io/en/latest/"&gt;knit&lt;/a&gt;
for YARN deployments or something more custom. For example Continuum &lt;a class="reference external" href="https://docs.continuum.io/anaconda-cluster/"&gt;sells
proprietary software&lt;/a&gt; that
does this.&lt;/p&gt;
&lt;p&gt;Getting the standard software environment set up generally isn’t such a big deal
for institutions. They typically have some system in place to handle this
already. Where things become interesting is when users want to use
drastically different environments from the system environment, like using Python
2 vs Python 3 or installing a bleeding-edge scikit-learn version. They may
also want to change the software environment many times in a single session.&lt;/p&gt;
&lt;p&gt;The best solution I can think of here is to pass around fully downloaded conda
environments using the dask.distributed network (it’s good at moving large
binary blobs throughout the network) and then teach the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-worker&lt;/span&gt;&lt;/code&gt;s to
bootstrap themselves within this environment. We should be able to tear
everything down and restart things within a small number of seconds. This
requires some work; first to make relocatable conda binaries (which is usually
fine but is not always fool-proof due to links) and then to help the
dask-workers learn to bootstrap themselves.&lt;/p&gt;
&lt;p&gt;Somewhat related, Hussain Sultan of Capital One recently contributed a
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-submit&lt;/span&gt;&lt;/code&gt; command to run scripts on the cluster:
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/submitting-applications.html"&gt;http://distributed.readthedocs.io/en/latest/submitting-applications.html&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-secure-is-the-distributed-scheduler"&gt;
&lt;h2&gt;How secure is the distributed scheduler?&lt;/h2&gt;
&lt;p&gt;Dask.distributed is incredibly insecure. It allows anyone with network access
to the scheduler to execute arbitrary code in an unprotected environment. Data
is sent in the clear. Any malicious actor can both steal your secrets and then
cripple your cluster.&lt;/p&gt;
&lt;p&gt;This is entirely the norm however. Security is usually handled by other
services that manage computational frameworks like Dask.&lt;/p&gt;
&lt;p&gt;For example we might rely on Docker to isolate workers from destroying their
surrounding environment and rely on network access controls to protect data
access.&lt;/p&gt;
&lt;p&gt;Because Dask runs on Tornado, a serious networking library and web framework,
there are some things we can do easily like enabling SSL, authentication, etc.
However I hesitate to jump into providing “just a little bit of security”
without going all the way for fear of providing a false sense of security. In
short, I have no plans to work on this without a lot of encouragement. Even
then I would strongly recommend that institutions couple Dask with tools
intended for security. I believe that is common practice for distributed
computational systems generally.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-is-resilient-to-worker-failure-how-about-scheduler-failure"&gt;
&lt;h2&gt;Dask is resilient to worker failure, how about scheduler failure?&lt;/h2&gt;
&lt;p&gt;Workers can come and go. Clients can come and go. The state in the scheduler
is currently irreplaceable and no attempt is made to back it up. There are a
few things you could imagine here:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Backup state and recent events to some persistent storage so that state can
be recovered in case of catastrophic loss&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have a hot failover node that gets a copy of every action that the
scheduler takes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have multiple peer schedulers operate simultaneously in a way that they can
pick up slack from lost peers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have clients remember what they have submitted and resubmit when a
scheduler comes back online&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Option 4 is currently the most feasible and gets us most of the way
there. However options 2 or 3 would probably be necessary if Dask were to ever
run as critical infrastructure in a giant institution. We’re not there yet.&lt;/p&gt;
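&lt;p&gt;To make option 4 concrete, here is a minimal sketch of a client that remembers what it has submitted. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ToyScheduler&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RememberingClient&lt;/span&gt;&lt;/code&gt; classes are hypothetical stand-ins written for this illustration. They are not part of the dask.distributed API and they ignore networking entirely.&lt;/p&gt;

```python
class ToyScheduler:
    """Hypothetical scheduler that keeps all of its state in memory."""

    def __init__(self):
        self.results = {}

    def submit(self, key, func, args):
        self.results[key] = func(*args)

    def restart(self):
        # Simulate a crash: everything the scheduler knew is gone.
        self.results = {}


class RememberingClient:
    """Hypothetical client that logs what it submitted so it can resubmit."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.submitted = {}  # key: (func, args)

    def submit(self, key, func, *args):
        self.submitted[key] = (func, args)
        self.scheduler.submit(key, func, args)

    def resubmit_all(self):
        # Replay every remembered task against the restarted scheduler.
        for key, (func, args) in self.submitted.items():
            self.scheduler.submit(key, func, args)


scheduler = ToyScheduler()
client = RememberingClient(scheduler)
client.submit("total", sum, [1, 2, 3])
client.submit("biggest", max, [1, 2, 3])

scheduler.restart()             # in-flight state is lost
lost = dict(scheduler.results)  # snapshot of the empty state
client.resubmit_all()           # the client's log restores it
```

&lt;p&gt;The cost of this approach is that every lost task is recomputed from scratch, which is why it only gets us most of the way there.&lt;/p&gt;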
&lt;p&gt;As of &lt;a class="reference external" href="https://github.com/dask/distributed/pull/413"&gt;recent work&lt;/a&gt; spurred on by
Stefan van der Walt at UC Berkeley/BIDS the scheduler can now die and come back
and everyone will reconnect. The state for computations in flight is entirely
lost but the computational infrastructure remains intact so that people can
resubmit jobs without significant loss of service.&lt;/p&gt;
&lt;p&gt;Dask has a bit of a harder time with this topic because it offers a persistent
stateful interface. This problem is much easier for distributed database
projects that run ephemeral queries off of persistent storage, return the
results, and then clear out state.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-happens-if-dask-workers-are-in-two-different-data-centers-can-we-scale-in-an-asymmetric-way"&gt;
&lt;h2&gt;What happens if dask-workers are in two different data centers? Can we scale in an asymmetric way?&lt;/h2&gt;
&lt;p&gt;The short answer is no. Other than number of cores and available RAM all
workers are considered equal to each other (except when the user &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/locality.html#user-control"&gt;explicitly
specifies
otherwise&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;However this problem and problems like it have come up a lot lately. Here are a
few examples of similar cases:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Multiple data centers geographically distributed around the country&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple racks within a single data center&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple workers that have GPUs that can move data between each other easily&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple processes on a single machine&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Having some notion of hierarchical worker group membership or inter-worker
preferred relationships is probably inevitable long term. As with all
distributed scheduling questions the hard part isn’t deciding that this is
useful, or even coming up with a sensible design, but rather figuring out how
to make decisions on the sensible design that are foolproof and operate in
constant time. I don’t personally see a good approach here yet but expect one
to arise as more high priority use cases come in.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-do-we-handle-multiple-concurrent-users-and-priorities"&gt;
&lt;h2&gt;How do we handle multiple concurrent users and priorities?&lt;/h2&gt;
&lt;p&gt;There are several sub-questions here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Can multiple users use Dask on my cluster at the same time?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Yes, either by spinning up separate scheduler/worker sets or by sharing the same
set.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;If they’re sharing the same workers then won’t they clobber each other’s
data?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is very unlikely. Dask is careful about naming tasks, so two users are
unlikely to submit conflicting computations that produce different values but
occupy the same key in memory. If they both submit computations that overlap,
the scheduler avoids recomputing the shared parts. This can be very nice when
you have many people doing slightly different computations on the same
hardware. It is similar to the way Git identifies content by its hash.&lt;/p&gt;
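&lt;p&gt;The idea of naming a task after what it computes can be sketched in a few lines. This is only an illustration and not Dask’s actual naming code: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;task_key&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SharedStore&lt;/span&gt;&lt;/code&gt; are hypothetical names, and hashing a pickle is an assumption made to keep the example short.&lt;/p&gt;

```python
import hashlib
import pickle


def task_key(func, *args):
    """Build a deterministic key from a function's name and its arguments."""
    payload = pickle.dumps((func.__module__, func.__qualname__, args))
    return func.__name__ + "-" + hashlib.sha1(payload).hexdigest()


class SharedStore:
    """Toy stand-in for results held in shared worker memory, keyed by name."""

    def __init__(self):
        self.results = {}
        self.computed = 0  # how many tasks actually ran

    def submit(self, func, *args):
        key = task_key(func, *args)
        if key not in self.results:
            # Only run the function if nobody has produced this key yet.
            self.results[key] = func(*args)
            self.computed += 1
        return key, self.results[key]


def add(x, y):
    return x + y


store = SharedStore()
alice = store.submit(add, 1, 2)  # computed
bob = store.submit(add, 1, 2)    # same key, so the result is reused
carol = store.submit(add, 1, 3)  # different inputs, so a different key
```

&lt;p&gt;Two users who submit the same work land on the same key and share one result, while different inputs produce different keys and so do not collide.&lt;/p&gt;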
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;If they’re sharing the same workers then won’t they clobber each other’s
resources?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Yes, this is definitely possible. If you’re concerned about this then you
should give everyone their own scheduler/workers (which is easy and standard
practice). There is not currently much user management built into Dask.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-does-this-compare-with-spark"&gt;
&lt;h2&gt;How does this compare with Spark?&lt;/h2&gt;
&lt;p&gt;At an institutional level Spark seems to primarily target ETL + Database-like
computations. While Dask modules like Dask.bag and Dask.dataframe can happily
play in this space this doesn’t seem to be the focus of recent conversations.&lt;/p&gt;
&lt;p&gt;Recent conversations are almost entirely around supporting interactive custom
parallelism (lots of small tasks with complex dependencies between them) rather
than the big Map-&amp;gt;Filter-&amp;gt;Groupby-&amp;gt;Join abstractions you often find in a
database or Spark. That’s not to say that these operations aren’t hugely
important; there is a lot of selection bias here. The people I talk to are
people for whom Spark/Databases are clearly not an appropriate fit. They are
tackling problems that are way more complex, more heterogeneous, and with a
broader variety of users.&lt;/p&gt;
&lt;p&gt;I usually describe this situation with an analogy comparing “Big data” systems
to human transportation mechanisms in a city. Here we go:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;A Database is like a train&lt;/em&gt;: it goes between a set of well defined points
with great efficiency, speed, and predictability. These are popular and
profitable routes that many people travel between (e.g. business analytics).
You do have to get from home to the train station on your own (ETL), but once
you’re in the database/train you’re quite comfortable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Spark is like an automobile&lt;/em&gt;: it takes you door-to-door from your home to
your destination with a single tool. While this may not be as fast as the train for
the long-distance portion, it can be extremely convenient to do ETL, Database
work, and some machine learning all from the comfort of a single system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Dask is like an all-terrain-vehicle&lt;/em&gt;: it takes you out of town on rough
ground that hasn’t been properly explored before. This is a good match for
the Python community, which typically does a lot of exploration into new
approaches. You can also drive your ATV around town and you’ll be just fine,
but if you want to do thousands of SQL queries then you should probably
invest in a proper database or in Spark.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Again, there is a lot of selection bias here. If what you want is a database
then you should probably get a database. Dask is not a database.&lt;/p&gt;
&lt;p&gt;This is also wildly over-simplifying things. Databases like Oracle have lots
of ETL and analytics tools, Spark is known to go off road, etc. I obviously
have a bias towards Dask. You really should never trust an author of a project
to give a fair and unbiased view of the capabilities of the tools in the
surrounding landscape.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;That’s a rough sketch of current conversations and open problems for “How Dask
might evolve to support institutional use cases.” It’s really quite surprising
just how prevalent this story is among the full spectrum from universities to
hedge funds.&lt;/p&gt;
&lt;p&gt;The problems listed above are by no means halting adoption. I’m not listing
the 100 or so questions that are answered with “yes, that’s already supported
quite well”. Right now I’m seeing Dask being adopted by individuals and small
groups within various institutions. Those individuals and small groups are
pushing that interest up the stack. It’s still several months before any 1000+
person organization adopts Dask as infrastructure, but the speed at which
momentum is building is quite encouraging.&lt;/p&gt;
&lt;p&gt;I’d also like to thank the several nameless people who exercise Dask on various
infrastructures at various scales on interesting problems and have reported
serious bugs. These people don’t show up on the GitHub issue tracker but their
utility in flushing out bugs is invaluable.&lt;/p&gt;
&lt;p&gt;As interest in Dask grows it’s interesting to see how it will evolve.
Culturally Dask has managed to cater simultaneously to both the open science
crowd and the private-sector crowd. The project gets both financial
support and open source contributions from each side. So far there hasn’t been
any conflict of interest (everyone is pushing in roughly the same direction)
which has been a really fruitful experience for all involved I think.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/08/16/dask-for-institutions/"/>
    <summary>&lt;img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="20%"&gt;</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-08-16T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/07/12/dask-learn-part-1/</id>
    <title>Dask and Scikit-Learn -- Model Parallelism</title>
    <updated>2016-07-12T00:00:00+00:00</updated>
    <author>
      <name>Jim Crist</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This post was written by Jim Crist. The original post lives at
&lt;a class="reference external" href="http://jcrist.github.io/dask-sklearn-part-1.html"&gt;http://jcrist.github.io/dask-sklearn-part-1.html&lt;/a&gt;
(with better styling)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is the first of a series of posts discussing some recent experiments
combining &lt;a class="reference external" href="http://dask.pydata.org/en/latest/"&gt;dask&lt;/a&gt; and
&lt;a class="reference external" href="http://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt;. A small (and extremely alpha)
library has been built up from these experiments, and can be found
&lt;a class="reference external" href="https://github.com/jcrist/dask-learn"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before we start, I would like to make the following caveats:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;I am not a machine learning expert. Do not consider this a guide on how to do
machine learning; the usage of scikit-learn below is probably naive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All of the code discussed here is in flux, and shouldn’t be considered stable
or robust. That said, if you know something about machine learning and want
to help out, I’d be more than happy to receive issues or pull requests :).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are several ways of parallelizing algorithms in machine learning. Some
algorithms can be made to be data-parallel (either across features or across
samples). In this post we’ll look instead at model-parallelism (using the same data
across different models), and dive into a daskified implementation of
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html"&gt;GridSearchCV&lt;/a&gt;.&lt;/p&gt;
&lt;section id="what-is-grid-search"&gt;
&lt;h1&gt;What is grid search?&lt;/h1&gt;
&lt;p&gt;Many machine learning algorithms have &lt;em&gt;hyperparameters&lt;/em&gt; which can be tuned to
improve the performance of the resulting estimator. A &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search"&gt;grid
search&lt;/a&gt;
is one way of optimizing these parameters — it works by doing a parameter
sweep across a cartesian product of a subset of these parameters (the “grid”),
and then choosing the best resulting estimator. Since this is fitting many
independent estimators across the same set of data, it can be fairly easily
parallelized.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="grid-search-with-scikit-learn"&gt;
&lt;h1&gt;Grid search with scikit-learn&lt;/h1&gt;
&lt;p&gt;In scikit-learn, a grid search is performed using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; class, and
can (optionally) be automatically parallelized using
&lt;a class="reference external" href="https://pythonhosted.org/joblib/index.html"&gt;joblib&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is best illustrated with an example. First we’ll make an example dataset
for doing classification against:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;n_redundant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To solve this classification problem, we’ll create a pipeline of a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;PCA&lt;/span&gt;&lt;/code&gt; and a
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LogisticRegression&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decomposition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;

&lt;span class="n"&gt;logistic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decomposition&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pca&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;logistic&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logistic&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Both of these classes take several hyperparameters. We’ll do a grid search
across only a few of them:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Parameters of pipelines can be set using &amp;#39;__&amp;#39; separated parameter names:&lt;/span&gt;
&lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca__n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;logistic__C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;logistic__penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;l1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;l2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
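&lt;p&gt;To see how large this grid is, we can expand it by hand with the standard library. This is only a sketch of the cartesian product that the grid describes. It is not how scikit-learn builds its parameter list internally, and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;names&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;combinations&lt;/span&gt;&lt;/code&gt; variables are introduced here purely for illustration.&lt;/p&gt;

```python
from itertools import product

# The same grid as above, repeated so this snippet runs on its own.
grid = dict(pca__n_components=[50, 100, 250],
            logistic__C=[1e-4, 1.0, 1e4],
            logistic__penalty=['l1', 'l2'])

# Sort the names so the expansion order is reproducible.
names = sorted(grid)

# One dict per point in the cartesian product of all the value lists.
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

print(len(combinations))  # 3 * 3 * 2 = 18
print(combinations[0])
```

&lt;p&gt;Each of the 18 dictionaries is one candidate setting for the whole pipeline.&lt;/p&gt;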
&lt;p&gt;Finally, we can create an instance of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt;, and perform the grid
search. The parameter &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_jobs=-1&lt;/span&gt;&lt;/code&gt; tells joblib to use as many processes as I
have cores (8).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.grid_search&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 5.3 s, sys: 243 ms, total: 5.54 s&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 21.6 s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What happened here was:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;An estimator was created for each parameter combination and test-train set
(scikit-learn’s grid search also does 3-fold cross validation by
default).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each estimator was fit on its corresponding set of training data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each estimator was then scored on its corresponding set of testing data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The best set of parameters was chosen based on these scores&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new estimator was then fit on &lt;em&gt;all&lt;/em&gt; of the data, using the best parameters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The corresponding best score, parameters, and estimator can all be found as
attributes on the resulting object:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;span class="go"&gt;0.89290000000000003&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;span class="go"&gt;{&amp;#39;logistic__C&amp;#39;: 0.0001, &amp;#39;logistic__penalty&amp;#39;: &amp;#39;l2&amp;#39;, &amp;#39;pca__n_components&amp;#39;: 50}&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;
&lt;span class="go"&gt;Pipeline(steps=[(&amp;#39;pca&amp;#39;, PCA(copy=True, n_components=50, whiten=False)), (&amp;#39;logistic&amp;#39;, LogisticRegression(C=0.0001, class_weight=None, dual=False,&lt;/span&gt;
&lt;span class="go"&gt;        fit_intercept=True, intercept_scaling=1, max_iter=100,&lt;/span&gt;
&lt;span class="go"&gt;        multi_class=&amp;#39;ovr&amp;#39;, n_jobs=1, penalty=&amp;#39;l2&amp;#39;, random_state=None,&lt;/span&gt;
&lt;span class="go"&gt;        solver=&amp;#39;liblinear&amp;#39;, tol=0.0001, verbose=0, warm_start=False))])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
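&lt;p&gt;The fit, score, select, and refit procedure described above can be sketched without scikit-learn. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ShrunkMean&lt;/span&gt;&lt;/code&gt; estimator and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;grid_search&lt;/span&gt;&lt;/code&gt; function below are hypothetical toys written for this illustration. The way they split folds and score is an assumption chosen for brevity, not a description of what &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; does internally.&lt;/p&gt;

```python
from itertools import product
from statistics import mean


class ShrunkMean:
    """Toy estimator: predicts a scaled mean of the training targets."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink

    def fit(self, y_train):
        self.value_ = self.shrink * mean(y_train)
        return self

    def score(self, y_test):
        # Negative mean squared error, so that larger is better.
        return -mean((y - self.value_) ** 2 for y in y_test)


def grid_search(y, grid, n_folds=3):
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for fold in range(n_folds):
            # Hold out every n_folds-th sample for testing.
            test = y[fold::n_folds]
            train = [v for i, v in enumerate(y) if i % n_folds != fold]
            # Fit on the training part, score on the held-out part.
            model = ShrunkMean(**params).fit(train)
            fold_scores.append(model.score(test))
        results.append((mean(fold_scores), params))
    # Choose the parameters with the best average score.
    best_score, best_params = max(results, key=lambda r: r[0])
    # Refit a fresh estimator on all of the data with those parameters.
    best_estimator = ShrunkMean(**best_params).fit(y)
    return best_score, best_params, best_estimator


y = [9.0, 10.0, 11.0, 10.0, 9.5, 10.5]
best_score, best_params, best_estimator = grid_search(
    y, {"shrink": [0.0, 0.5, 1.0]})
```

&lt;p&gt;Every pass through the inner loop is independent of the others, which is what makes a grid search straightforward to run in parallel.&lt;/p&gt;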
&lt;/section&gt;
&lt;section id="grid-search-with-dask-learn"&gt;
&lt;h1&gt;Grid search with dask-learn&lt;/h1&gt;
&lt;p&gt;Here we’ll repeat the same fit using dask-learn. I’ve tried to match the
scikit-learn interface as much as possible, although not everything is
implemented. Here the only thing that really changes is the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt;
import. We don’t need the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_jobs&lt;/span&gt;&lt;/code&gt; keyword, as this will be parallelized across
all cores by default.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dklearn.grid_search&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;DaskGridSearchCV&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;destimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DaskGridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="go"&gt;CPU times: user 16.3 s, sys: 1.89 s, total: 18.2 s&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 5.63 s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;As before, the best score, parameters, and estimator can all be found as
attributes on the object. Here we’ll just show that they’re equivalent:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;span class="go"&gt;True&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;span class="go"&gt;True&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;
&lt;span class="go"&gt;Pipeline(steps=[(&amp;#39;pca&amp;#39;, PCA(copy=True, n_components=50, whiten=False)), (&amp;#39;logistic&amp;#39;, LogisticRegression(C=0.0001, class_weight=None, dual=False,&lt;/span&gt;
&lt;span class="go"&gt;        fit_intercept=True, intercept_scaling=1, max_iter=100,&lt;/span&gt;
&lt;span class="go"&gt;        multi_class=&amp;#39;ovr&amp;#39;, n_jobs=1, penalty=&amp;#39;l2&amp;#39;, random_state=None,&lt;/span&gt;
&lt;span class="go"&gt;        solver=&amp;#39;liblinear&amp;#39;, tol=0.0001, verbose=0, warm_start=False))])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="why-is-the-dask-version-faster"&gt;
&lt;h1&gt;Why is the dask version faster?&lt;/h1&gt;
&lt;p&gt;If you look at the times above, you’ll note that the dask version was &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;~4X&lt;/span&gt;&lt;/code&gt;
faster than the scikit-learn version. This is not because we have optimized any
of the pieces of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Pipeline&lt;/span&gt;&lt;/code&gt;, or because there’s a significant amount of
overhead to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;joblib&lt;/span&gt;&lt;/code&gt; (on the contrary, joblib does some pretty amazing things,
and I had to construct a contrived example to beat it this badly). The reason
is simply that the dask version is doing less work.&lt;/p&gt;
&lt;p&gt;This may be best explained in pseudocode. The scikit-learn version of the above
(in serial) looks something like:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pca__n_components&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;logistic__C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;logistic__penalty&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="c1"&gt;# Create and fit a PCA on the input data&lt;/span&gt;
                &lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Transform both the train and test data&lt;/span&gt;
                &lt;span class="n"&gt;X_train2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;X_test2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Create and fit a LogisticRegression on the transformed data&lt;/span&gt;
                &lt;span class="n"&gt;logistic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;logistic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Score the total pipeline&lt;/span&gt;
                &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logistic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Save the score and parameters&lt;/span&gt;
                &lt;span class="n"&gt;scores_and_params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Find the best set of parameters (for some definition of best)&lt;/span&gt;
&lt;span class="n"&gt;find_best_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_and_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is looping through a cartesian product of the cross-validation sets and
all the parameter combinations, and then creating and fitting a new estimator
for each combination. While embarrassingly parallel, this can also result in
repeated work, as earlier stages in the pipeline are refit multiple times on
the same parameter + data combinations.&lt;/p&gt;
&lt;p&gt;In contrast, the dask version hashes all inputs (forming a sort of &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Merkle_tree"&gt;Merkle
DAG&lt;/a&gt;), resulting in the intermediate
results being shared. Keeping with the pseudocode above, the dask version might
look like:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pca__n_components&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Create and fit a PCA on the input data&lt;/span&gt;
        &lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Transform both the train and test data&lt;/span&gt;
        &lt;span class="n"&gt;X_train2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;X_test2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;logistic__C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;logistic__penalty&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="c1"&gt;# Create and fit a LogisticRegression on the transformed data&lt;/span&gt;
                &lt;span class="n"&gt;logistic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;logistic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Score the total pipeline&lt;/span&gt;
                &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logistic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Save the score and parameters&lt;/span&gt;
                &lt;span class="n"&gt;scores_and_params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;penalty&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Find the best set of parameters (for some definition of best)&lt;/span&gt;
&lt;span class="n"&gt;find_best_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_and_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
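&lt;p&gt;The saving can be made concrete by counting fits. The sketch below is not the dask-learn implementation: it assumes a hypothetical grid and three stand-in folds, and it uses a plain dictionary keyed on each step's inputs (the helper &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;shared&lt;/span&gt;&lt;/code&gt; is invented for illustration) to mimic how hashing lets identical steps be computed only once.&lt;/p&gt;

```python
from itertools import product

# Hypothetical grid and folds, chosen only for illustration.
grid = {
    'pca__n_components': [50, 100, 250],
    'logistic__C': [1e-4, 1.0, 1e4],
    'logistic__penalty': ['l1', 'l2'],
}
folds = ['fold0', 'fold1', 'fold2']  # stand-ins for the CV splits

calls = {'pca': 0, 'logistic': 0}
cache = {}


def shared(name, *key):
    """Run a named step once per distinct input key and reuse the result."""
    token = (name,) + key
    if token not in cache:
        calls[name] += 1
        cache[token] = token  # placeholder for a fitted estimator
    return cache[token]


for fold, n, C, penalty in product(folds,
                                   grid['pca__n_components'],
                                   grid['logistic__C'],
                                   grid['logistic__penalty']):
    pca = shared('pca', fold, n)          # depends only on the fold and n
    shared('logistic', pca, C, penalty)   # depends on everything

print(calls)  # {'pca': 9, 'logistic': 54}
```

&lt;p&gt;With this grid the PCA step runs 9 times rather than 54, while the logistic step still runs once per combination.&lt;/p&gt;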
&lt;p&gt;This can still be parallelized, but in a less straightforward manner - the
graph is a bit more complicated than just a simple map-reduce pattern.
Thankfully the &lt;a class="reference external" href="http://dask.pydata.org/en/latest/scheduler-overview.html"&gt;dask
schedulers&lt;/a&gt; are well
equipped to handle arbitrary graph topologies. Below is a GIF showing how the
dask scheduler (the threaded scheduler specifically) executed the grid search
performed above. Each rectangle represents data, and each circle represents a
task. Each is categorized by color:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Red means actively taking up resources. These are tasks executing in a thread,
or intermediate results occupying memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blue means finished or released. These are already finished tasks, or data
that’s been released from memory because it’s no longer needed&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src="/images/grid_search_schedule.gif" alt="Dask Graph Execution" style="width:100%"&gt;
&lt;p&gt;Looking at the trace, a few things stand out:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We do a good job sharing intermediates. Each step in a pipeline is only fit
once given the same parameters/data, resulting in some intermediates having
many dependent tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The scheduler does a decent job of quickly finishing up tasks required to
release data. This doesn’t matter as much here (none of the intermediates
take up much memory), but for other workloads this is very useful. See Matt
Rocklin’s &lt;a class="reference internal" href="../../2015/01/06/Towards-OOC-Scheduling/"&gt;&lt;span class="doc std std-doc"&gt;excellent blogpost
here&lt;/span&gt;&lt;/a&gt;
for more discussion on this.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="distributed-grid-search-using-dask-learn"&gt;
&lt;h1&gt;Distributed grid search using dask-learn&lt;/h1&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://dask.pydata.org/en/latest/scheduler-overview.html"&gt;schedulers&lt;/a&gt; used
in dask are configurable. The default (used above) is the threaded scheduler,
but we can just as easily swap it out for the distributed scheduler. Here I’ve
just spun up two local workers to demonstrate, but this works equally well
across multiple machines.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="c1"&gt;# Create an Executor, and set it as the default scheduler&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;10.0.0.3:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_as_default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;Executor: scheduler=&amp;quot;10.0.0.3:8786&amp;quot; processes=2 cores=8&amp;gt;&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 1.69 s, sys: 433 ms, total: 2.12 s&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 7.66 s&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt; &lt;span class="n"&gt;destimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;True&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that this is slightly slower than the threaded execution, so it doesn’t
make sense for this workload, but for others it might.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-worked-well"&gt;
&lt;h1&gt;What worked well&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://github.com/jcrist/dask-learn/blob/master/dklearn/grid_search.py"&gt;code for doing
this&lt;/a&gt;
is quite short. There’s also an implementation of
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;,
which is only a few extra lines (hooray for good class hierarchies!).
Instead of working with dask graphs directly, both implementations use
&lt;a class="reference external" href="http://dask.pydata.org/en/latest/delayed.html"&gt;dask.delayed&lt;/a&gt; wherever
possible, which also makes the code easy to read.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Due to the internal hashing used in dask (which is extensible!), duplicate
computations are avoided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since the graphs are separated from the scheduler, this works both locally
and distributed with only a few extra lines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="caveats-and-what-could-be-better"&gt;
&lt;h1&gt;Caveats and what could be better&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The scikit-learn API makes use of mutation (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;est.fit(X,&lt;/span&gt; &lt;span class="pre"&gt;y)&lt;/span&gt;&lt;/code&gt; mutates &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;est&lt;/span&gt;&lt;/code&gt;),
while dask collections are mostly immutable. After playing around with a few
different ideas, I settled on dask-learn estimators being immutable (except
for grid-search, more on this in a bit). This made the code easier to reason
about, but does mean that you need to do &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;est&lt;/span&gt; &lt;span class="pre"&gt;=&lt;/span&gt; &lt;span class="pre"&gt;est.fit(X,&lt;/span&gt; &lt;span class="pre"&gt;y)&lt;/span&gt;&lt;/code&gt; when working
with dask-learn estimators.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; posed a different problem. Due to the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;refit&lt;/span&gt;&lt;/code&gt; keyword, the
implementation can’t be done in a single pass over the data. This means that
we can’t build a single graph describing both the grid search and the refit,
which prevents it from being done lazily. I debated removing this keyword,
but decided in the end to make &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fit&lt;/span&gt;&lt;/code&gt; execute immediately. This means that
there’s a bit of a disconnect between &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; and the other classes in
the library, which I don’t like. On the other hand, it does mean that this
version of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; could be a drop-in replacement for the scikit-learn one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The approach presented here is nice, but is really &lt;em&gt;only beneficial when
there’s duplicate work to be avoided, and that duplicate work is expensive&lt;/em&gt;.
Repeating the above with only a single estimator (instead of a pipeline)
results in performance identical to (or slightly worse than) joblib. Similarly,
if the repeated steps are cheap, the difference in performance is much smaller
(try the above using
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html"&gt;SelectKBest&lt;/a&gt;
instead of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;PCA&lt;/span&gt;&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The ability to swap easily from local to distributed execution is nice, but
&lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/joblib.html"&gt;distributed also contains a joblib
frontend&lt;/a&gt; that can
do this just as easily.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
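&lt;p&gt;The immutable-estimator convention from the first caveat can be sketched in a few lines. This is an illustration only, not dask-learn code: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;MeanEstimator&lt;/span&gt;&lt;/code&gt; is a hypothetical class whose &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fit&lt;/span&gt;&lt;/code&gt; returns a new fitted object instead of mutating itself, which is why the result has to be rebound.&lt;/p&gt;

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MeanEstimator:
    """Hypothetical estimator that predicts the mean of the training targets."""
    mean_: Optional[float] = None

    def fit(self, X, y):
        # Return a new fitted estimator; `self` is left untouched.
        return replace(self, mean_=sum(y) / len(y))


est = MeanEstimator()
fitted = est.fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(est.mean_, fitted.mean_)  # None 2.0
```

&lt;p&gt;Calling &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;est.fit(X,&lt;/span&gt; &lt;span class="pre"&gt;y)&lt;/span&gt;&lt;/code&gt; without keeping the return value would leave &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;est&lt;/span&gt;&lt;/code&gt; unfitted.&lt;/p&gt;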
&lt;/section&gt;
&lt;section id="help"&gt;
&lt;h1&gt;Help&lt;/h1&gt;
&lt;p&gt;I am not a machine learning expert. Is any of this useful? Do you have
suggestions for improvements (or better yet PRs for improvements :))? Please
feel free to reach out in the comments below, or &lt;a class="reference external" href="https://github.com/jcrist/dask-learn"&gt;on
GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io/"&gt;Continuum Analytics&lt;/a&gt; and the
&lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA&lt;/a&gt; program as part of the &lt;a class="reference external" href="http://blaze.pydata.org/"&gt;Blaze
Project&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/07/12/dask-learn-part-1/"/>
    <summary>This post was written by Jim Crist. The original post lives at
http://jcrist.github.io/dask-sklearn-part-1.html
(with better styling)</summary>
    <category term="Programming" label="Programming"/>
    <category term="dask" label="dask"/>
    <published>2016-07-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/04/20/dask-distributed-part-5/</id>
    <title>Ad Hoc Distributed Random Forests</title>
    <updated>2016-04-20T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
and the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
as part of the &lt;a class="reference external" href="http://blaze.pydata.org"&gt;Blaze Project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;A screencast version of this post is available here:
&lt;a class="reference external" href="https://www.youtube.com/watch?v=FkPlEqB8AnE"&gt;https://www.youtube.com/watch?v=FkPlEqB8AnE&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;section id="tl-dr"&gt;

&lt;p&gt;Dask.distributed lets you submit individual tasks to the cluster. We use this
ability combined with Scikit Learn to train and run a distributed random forest
on distributed tabular NYC Taxi data.&lt;/p&gt;
&lt;p&gt;Our machine learning model does not perform well, but we do learn how to
execute ad-hoc computations easily.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="motivation"&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;In the past few posts we analyzed data on a cluster with Dask collections:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="../../2016/02/17/dask-distributed-part1/"&gt;&lt;span class="doc std std-doc"&gt;Dask.bag on JSON records&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="../../2016/02/22/dask-distributed-part-2/"&gt;&lt;span class="doc std std-doc"&gt;Dask.dataframe on CSV data&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="../../2016/02/26/dask-distributed-part-3/"&gt;&lt;span class="doc std std-doc"&gt;Dask.array on HDF5 data&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Often our computations don’t fit neatly into the bag, dataframe, or array
abstractions. In these cases we want the flexibility of normal code with for
loops, but still with the computational power of a cluster. With the
dask.distributed task interface, we achieve something close to this.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="application-naive-distributed-random-forest-algorithm"&gt;
&lt;h1&gt;Application: Naive Distributed Random Forest Algorithm&lt;/h1&gt;
&lt;p&gt;As a motivating application we build a random forest algorithm from the ground
up using the single-machine Scikit Learn library, and dask.distributed’s
ability to quickly submit individual tasks to run on the cluster. Our
algorithm will look like the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Pull data from some external source (S3) into several dataframes on the
cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dataframe, create and train one &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomForestClassifier&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scatter a single testing dataframe to all machines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomForestClassifier&lt;/span&gt;&lt;/code&gt;, predict output on the test dataframe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregate independent predictions from each classifier together by a
majority vote. To avoid bringing too much data to any one machine, perform
this majority vote as a tree reduction.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
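&lt;p&gt;The last step, a majority vote performed as a tree reduction, can be sketched locally in plain Python. Everything here is an assumption made for illustration: the helper names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;combine&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tree_reduce&lt;/span&gt;&lt;/code&gt;, the use of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;collections.Counter&lt;/span&gt;&lt;/code&gt; to hold votes, and the toy predictions. On a cluster each pairwise merge would be its own task, so no single machine has to hold every classifier's output at once.&lt;/p&gt;

```python
from collections import Counter


def combine(a, b):
    """Merge two lists of per-row vote counters."""
    return [x + y for x, y in zip(a, b)]


def tree_reduce(items, op):
    """Combine items pairwise, level by level, until one remains."""
    if not items:
        raise ValueError('nothing to reduce')
    while len(items) != 1:
        paired = [op(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # odd item carries over to the next level
        items = paired
    return items[0]


# Hypothetical predictions from four classifiers on three test rows.
predictions = [[1, 2, 1], [1, 1, 1], [2, 2, 1], [1, 2, 3]]

# One single-vote counter per row, per classifier.
votes = [[Counter([p]) for p in preds] for preds in predictions]
totals = tree_reduce(votes, combine)

# The most common label in each row wins.
majority = [c.most_common(1)[0][0] for c in totals]
print(majority)  # [1, 2, 1]
```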
&lt;/section&gt;
&lt;section id="data-nyc-taxi-2015"&gt;
&lt;h1&gt;Data: NYC Taxi 2015&lt;/h1&gt;
&lt;p&gt;As in our &lt;a class="reference internal" href="../../2016/02/22/dask-distributed-part-2/"&gt;&lt;span class="doc std std-doc"&gt;blogpost on distributed
dataframes&lt;/span&gt;&lt;/a&gt;
we use the data on all NYC Taxi rides in 2015. This is around 20GB on disk and
60GB in RAM.&lt;/p&gt;
&lt;p&gt;We predict the number of passengers in each cab given the other
numeric columns like pickup and destination location, fare breakdown, distance,
etc.&lt;/p&gt;
&lt;p&gt;We do this first on a small bit of data on a single machine and then on the
entire dataset on the cluster. Our cluster is composed of twelve m4.xlarges (4
cores, 15GB RAM each).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer and Spoiler Alert&lt;/em&gt;: I am not an expert in machine learning. Our
algorithm will perform very poorly. If you’re excited about machine
learning you can stop reading here. However, if you’re interested in how to
&lt;em&gt;build&lt;/em&gt; distributed algorithms with Dask then you may want to read on,
especially if you happen to know enough machine learning to improve upon my
naive solution.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="api-submit-map-gather"&gt;
&lt;h1&gt;API: submit, map, gather&lt;/h1&gt;
&lt;p&gt;We use a small number of &lt;a class="reference external" href="http://distributed.readthedocs.org/en/latest/api.html"&gt;dask.distributed
functions&lt;/a&gt; to build our
computation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                     &lt;span class="c1"&gt;# scatter data&lt;/span&gt;
&lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# submit single task&lt;/span&gt;
&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# submit many tasks&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                   &lt;span class="c1"&gt;# gather results&lt;/span&gt;
&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;number_of_replications&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In particular, functions like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor.submit(function,&lt;/span&gt; &lt;span class="pre"&gt;*args)&lt;/span&gt;&lt;/code&gt; let us send
individual function calls out to our cluster thousands of times a second. Because
these functions accept the futures they return as arguments, we can create complex
workflows that stay entirely on the cluster and trust the distributed scheduler to
move data around intelligently.&lt;/p&gt;
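&lt;p&gt;The submit/gather pattern can be tried without a cluster. The sketch below is a local stand-in, not dask.distributed itself: it uses the standard library’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;concurrent.futures.ThreadPoolExecutor&lt;/span&gt;&lt;/code&gt;, whose &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;submit&lt;/span&gt;&lt;/code&gt; has a similar shape. The names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;inc&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;gather&lt;/span&gt;&lt;/code&gt; are made up for illustration. Unlike the distributed executor, the standard library pool does not resolve futures passed as arguments, so results are collected by hand.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    """Toy task standing in for real work."""
    return x + 1


def gather(futures):
    """Hypothetical helper: wait for every future and return results in order."""
    return [future.result() for future in futures]


with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit a single task; a future comes back immediately.
    single = pool.submit(inc, 10)

    # Submit many tasks, one future per input.
    many = [pool.submit(inc, x) for x in range(5)]

    # Pull the results back into the local process.
    single_result = single.result()
    many_results = gather(many)

print(single_result)  # 11
print(many_results)   # [1, 2, 3, 4, 5]
```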
&lt;/section&gt;
&lt;section id="load-pandas-from-s3"&gt;
&lt;h1&gt;Load Pandas from S3&lt;/h1&gt;
&lt;p&gt;First we load data from Amazon S3. We use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;s3.read_csv(...,&lt;/span&gt; &lt;span class="pre"&gt;collection=False)&lt;/span&gt;&lt;/code&gt;
function to load 178 Pandas DataFrames on our cluster from CSV data on S3. We
get back a list of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Future&lt;/span&gt;&lt;/code&gt; objects that refer to these remote dataframes. The
use of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;collection=False&lt;/span&gt;&lt;/code&gt; gives us this list of futures rather than a single
cohesive Dask.dataframe object.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;52.91.1.177:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dfs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dask-data/nyc-taxi/2015&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tpep_pickup_datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="s1"&gt;&amp;#39;tpep_dropoff_datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                  &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dfs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each of these is a lightweight &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Future&lt;/span&gt;&lt;/code&gt; pointing to a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pandas.DataFrame&lt;/span&gt;&lt;/code&gt; on the
cluster.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;[&amp;lt;Future: status: finished, type: DataFrame, key: finalize-a06c3dd25769f434978fa27d5a4cf24b&amp;gt;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;lt;Future: status: finished, type: DataFrame, key: finalize-7dcb27364a8701f45cb02d2fe034728a&amp;gt;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;lt;Future: status: finished, type: DataFrame, key: finalize-b0dfe075000bd59c3a90bfdf89a990da&amp;gt;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;lt;Future: status: finished, type: DataFrame, key: finalize-1c9bb25cefa1b892fac9b48c0aef7e04&amp;gt;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;lt;Future: status: finished, type: DataFrame, key: finalize-c8254256b09ae287badca3cf6d9e3142&amp;gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If we’re willing to wait a bit then we can pull data from any future back to
our local process using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.result()&lt;/span&gt;&lt;/code&gt; method. We don’t want to do this too
much, though: data transfer can be expensive and we can’t hold the entire
dataset in the memory of a single machine. Here we just bring back one of the
dataframes:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;VendorID&lt;/th&gt;
      &lt;th&gt;tpep_pickup_datetime&lt;/th&gt;
      &lt;th&gt;tpep_dropoff_datetime&lt;/th&gt;
      &lt;th&gt;passenger_count&lt;/th&gt;
      &lt;th&gt;trip_distance&lt;/th&gt;
      &lt;th&gt;pickup_longitude&lt;/th&gt;
      &lt;th&gt;pickup_latitude&lt;/th&gt;
      &lt;th&gt;RateCodeID&lt;/th&gt;
      &lt;th&gt;store_and_fwd_flag&lt;/th&gt;
      &lt;th&gt;dropoff_longitude&lt;/th&gt;
      &lt;th&gt;dropoff_latitude&lt;/th&gt;
      &lt;th&gt;payment_type&lt;/th&gt;
      &lt;th&gt;fare_amount&lt;/th&gt;
      &lt;th&gt;extra&lt;/th&gt;
      &lt;th&gt;mta_tax&lt;/th&gt;
      &lt;th&gt;tip_amount&lt;/th&gt;
      &lt;th&gt;tolls_amount&lt;/th&gt;
      &lt;th&gt;improvement_surcharge&lt;/th&gt;
      &lt;th&gt;total_amount&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2015-01-15 19:05:39&lt;/td&gt;
      &lt;td&gt;2015-01-15 19:23:42&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1.59&lt;/td&gt;
      &lt;td&gt;-73.993896&lt;/td&gt;
      &lt;td&gt;40.750111&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-73.974785&lt;/td&gt;
      &lt;td&gt;40.750618&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;12.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;3.25&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;17.05&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:38&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:53:28&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3.30&lt;/td&gt;
      &lt;td&gt;-74.001648&lt;/td&gt;
      &lt;td&gt;40.724243&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-73.994415&lt;/td&gt;
      &lt;td&gt;40.759109&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;14.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;2.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;17.80&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:38&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:43:41&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1.80&lt;/td&gt;
      &lt;td&gt;-73.963341&lt;/td&gt;
      &lt;td&gt;40.802788&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-73.951820&lt;/td&gt;
      &lt;td&gt;40.824413&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;9.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;10.80&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:39&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:35:31&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.50&lt;/td&gt;
      &lt;td&gt;-74.009087&lt;/td&gt;
      &lt;td&gt;40.713818&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-74.004326&lt;/td&gt;
      &lt;td&gt;40.719986&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;3.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;4.80&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:39&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:52:58&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3.00&lt;/td&gt;
      &lt;td&gt;-73.971176&lt;/td&gt;
      &lt;td&gt;40.762428&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-74.004181&lt;/td&gt;
      &lt;td&gt;40.742653&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;15.0&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;16.30&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="train-on-a-single-machine"&gt;
&lt;h1&gt;Train on a single machine&lt;/h1&gt;
&lt;p&gt;To start, let’s go through the standard scikit-learn fit/predict/score cycle with
this small bit of data on a single machine.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.cross_validation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;trip_distance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;pickup_longitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;pickup_latitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="s1"&gt;&amp;#39;dropoff_longitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;dropoff_latitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;payment_type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="s1"&gt;&amp;#39;fare_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;mta_tax&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tip_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tolls_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This builds a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomForestClassifier&lt;/span&gt;&lt;/code&gt; with four decision trees and then trains
it against the numeric columns in the data, trying to predict the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;passenger_count&lt;/span&gt;&lt;/code&gt; column. It takes around 10 seconds to train on a single
core. We now see how well we do on the holdout testing data:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;0.65808188654721012&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This accuracy of roughly 66% is actually pretty poor. About 70% of the rides in NYC have
a single passenger, so the model of “always guess one” would outperform our
fancy random forest.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;               &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="go"&gt;0.70669390028780987&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is where my ignorance in machine learning really
kills us. There is likely a simple way to improve this. However, because I’m
more interested in showing how to build distributed computations with Dask than
in actually doing machine learning I’m going to go ahead with this naive
approach. Spoiler alert: we’re going to do a lot of computation and still not
beat the “always guess one” strategy.&lt;/p&gt;
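&lt;p&gt;The “always guess one” baseline is simply the share of the most common label. A minimal sketch using only the standard library, with made-up passenger counts rather than the taxi data (the function name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;majority_baseline_accuracy&lt;/span&gt;&lt;/code&gt; is hypothetical):&lt;/p&gt;

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Made-up labels for illustration: 7 of 10 rides have one passenger.
passenger_counts = [1, 1, 2, 1, 1, 5, 1, 1, 3, 1]
baseline = majority_baseline_accuracy(passenger_counts)
print(baseline)  # 0.7
```

&lt;p&gt;A classifier’s accuracy only means much once it clears this number.&lt;/p&gt;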
&lt;/section&gt;
&lt;section id="fit-across-the-cluster-with-executor-map"&gt;
&lt;h1&gt;Fit across the cluster with executor.map&lt;/h1&gt;
&lt;p&gt;First we build a function that does just what we did before: it builds a random
forest and then trains it on a dataframe.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;est&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Second we call this function on all of our training dataframes on the cluster
using the standard &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;e.map(function,&lt;/span&gt; &lt;span class="pre"&gt;sequence)&lt;/span&gt;&lt;/code&gt; function. This sends out many
small tasks for the cluster to run. We use all but the last dataframe for
training data and hold out the last dataframe for testing. There are more
principled ways to do this, but again we’re going to charge ahead here.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This takes around two minutes to train on all of the 177 dataframes and now we
have 177 independent estimators, each capable of guessing how many passengers a
particular ride had. There is relatively little overhead in this computation.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="predict-on-testing-data"&gt;
&lt;h1&gt;Predict on testing data&lt;/h1&gt;
&lt;p&gt;Recall that we kept separate a future, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;test&lt;/span&gt;&lt;/code&gt;, that points to a Pandas dataframe on
the cluster that was not used to train any of our 177 estimators. We’re going
to replicate this dataframe across all workers on the cluster and then ask each
estimator to predict the number of passengers for each ride in this dataset.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replicate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here we used the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor.submit(function,&lt;/span&gt; &lt;span class="pre"&gt;*args,&lt;/span&gt; &lt;span class="pre"&gt;**kwargs)&lt;/span&gt;&lt;/code&gt; function in a
list comprehension to individually launch many tasks. The scheduler determines
when and where to run these tasks for optimal computation time and minimal data
transfer. Like the other executor functions, each call returns a future that we
can use to collect the result later if we want.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Developer’s note: we explicitly replicate here in order to take advantage of
efficient tree-broadcasting algorithms. This is purely a performance
consideration. Everything would have worked fine without it, but the explicit
broadcast turns a 30s communication+computation into a 2s
communication+computation.&lt;/em&gt;&lt;/p&gt;
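&lt;p&gt;Why a tree-shaped broadcast helps can be seen with a counting sketch. It assumes that, in each round, every worker already holding a copy sends it to one new worker. This illustrates the general idea only and is not a description of how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;replicate&lt;/span&gt;&lt;/code&gt; is implemented. The function name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;broadcast_rounds&lt;/span&gt;&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def broadcast_rounds(n_workers):
    """Rounds needed to reach n_workers (at least 1) when holders double each round."""
    holders = 1  # the worker that already has the data
    rounds = 0
    while holders != n_workers:
        # Every current holder sends one copy, capped at the cluster size.
        holders = min(2 * holders, n_workers)
        rounds += 1
    return rounds


tree_rounds = broadcast_rounds(48)
sequential_sends = 48 - 1  # one source sending to every other worker in turn

print(tree_rounds)       # 6
print(sequential_sends)  # 47
```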
&lt;/section&gt;
&lt;section id="aggregate-predictions-by-majority-vote"&gt;
&lt;h1&gt;Aggregate predictions by majority vote&lt;/h1&gt;
&lt;p&gt;For each estimator we now have an independent prediction of the passenger
counts for all of the rides in our test data. In other words for each ride we
have 177 different opinions on how many passengers were in the cab. By
combining these opinions in a majority vote we hope to achieve a more accurate
consensus opinion.&lt;/p&gt;
&lt;p&gt;For example, consider the first four prediction arrays:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;a_few_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# remote futures -&amp;gt; local arrays&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;a_few_predictions&lt;/span&gt;
&lt;span class="go"&gt;[array([1, 2, 1, ..., 2, 2, 1]),&lt;/span&gt;
&lt;span class="go"&gt; array([1, 1, 1, ..., 1, 1, 1]),&lt;/span&gt;
&lt;span class="go"&gt; array([2, 1, 1, ..., 1, 1, 1]),&lt;/span&gt;
&lt;span class="go"&gt; array([1, 1, 1, ..., 1, 1, 1])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For the first ride/column we see that three of the four predictions are for a
single passenger while one prediction disagrees and is for two passengers. We
create a consensus opinion by taking the mode of the stacked arrays:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;scipy.stats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mymode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;arrays&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrays&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mymode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a_few_predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So when we take the mode of these four prediction arrays we see that the
majority opinion of one passenger wins for all six of the rides visible
here.&lt;/p&gt;
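&lt;p&gt;The same vote can be reproduced by hand for the six visible entries of each array. This is a standard-library sketch rather than the SciPy call used above. The helper names are made up, and ties are broken toward the smallest value, which is an assumption of this sketch.&lt;/p&gt;

```python
from collections import Counter


def column_mode(values):
    """Most common value; ties go to the smallest value."""
    counts = Counter(values)
    return min(counts, key=lambda value: (-counts[value], value))


def majority_vote(predictions):
    """Per-ride mode across a list of equally long prediction lists."""
    return [column_mode(column) for column in zip(*predictions)]


# The six entries visible in each of the four printed arrays.
a_few = [
    [1, 2, 1, 2, 2, 1],
    [1, 1, 1, 1, 1, 1],
    [2, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
consensus = majority_vote(a_few)
print(consensus)  # [1, 1, 1, 1, 1, 1]
```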
&lt;/section&gt;
&lt;section id="tree-reduction"&gt;
&lt;h1&gt;Tree Reduction&lt;/h1&gt;
&lt;p&gt;We could call our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mymode&lt;/span&gt;&lt;/code&gt; function on all of our predictions like this:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;mode_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mymode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# this doesn&amp;#39;t scale well&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Unfortunately this would move all of our results to a single machine to compute
the mode there. This might swamp that single machine.&lt;/p&gt;
&lt;p&gt;Instead we batch our predictions into groups of size 10, take the mode of each
group, and then repeat the process with the smaller set of predictions until we
have only one left. This sort of multi-step reduction is called a tree reduction.
We can write it up with a couple of nested loops and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor.submit&lt;/span&gt;&lt;/code&gt;. This is
only an approximation of the mode, but it’s a much more scalable computation.
This finishes in about 1.5 seconds.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;toolz&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition_all&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mymode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;partition_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
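&lt;p&gt;To see why the tree reduction only approximates the mode, here is a minimal local sketch that runs without a cluster. The helper names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mode&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tree_mode&lt;/span&gt;&lt;/code&gt; and the sample votes are made up for illustration; they are not the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mymode&lt;/span&gt;&lt;/code&gt; function used above.&lt;/p&gt;

```python
from collections import Counter


def mode(values):
    """Return the most common value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


def tree_mode(values, group_size):
    """Reduce in rounds: take the mode of each group until one value remains."""
    assert group_size >= 2, "groups of one would never shrink the list"
    values = list(values)
    while len(values) > 1:
        values = [mode(values[i:i + group_size])
                  for i in range(0, len(values), group_size)]
    return values[0]


# Hypothetical votes: 2 is the true mode (4 of 9 votes) ...
votes = [1, 1, 2, 3, 3, 2, 2, 2, 4]
exact = mode(votes)
# ... but grouping by three lets 1 and 3 each win a group in the first round,
# so the second round sees [1, 3, 2] and the tie goes to 1.
approx = tree_mode(votes, group_size=3)
```

&lt;p&gt;With these votes the exact mode and the tree-reduced mode disagree, which is the price paid for never moving every prediction to one machine.&lt;/p&gt;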
&lt;/section&gt;
&lt;section id="final-score"&gt;
&lt;h1&gt;Final Score&lt;/h1&gt;
&lt;p&gt;Finally, after completing all of our work on our cluster we can see how well
our distributed random forest algorithm does.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;0.67061974451423045&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Still worse than the naive “always guess one” strategy. This just goes to show
that, no matter how sophisticated your Big Data solution is, there is no
substitute for common sense and a little bit of domain expertise.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-didn-t-work"&gt;
&lt;h1&gt;What didn’t work&lt;/h1&gt;
&lt;p&gt;As always I’ll have a section like this that honestly says what doesn’t work
well and what I would have done with more time.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Clearly this would have benefited from more machine learning knowledge.
What would have been a good approach for this problem?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I’ve been thinking a bit about memory management of replicated data on the
cluster. In this exercise we specifically replicated out the test data.
Everything would have worked fine without this step but it would have been
much slower as every worker gathered data from the single worker that
originally had the test dataframe. Replicating data is great until you
start filling up distributed RAM. It will be interesting to think of
policies about when to start cleaning up redundant data and when to keep it
around.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Several people, both open source users and Continuum customers, have
asked about a general Dask library for machine learning, something akin to
Spark’s MLlib. Ideally a future Dask.learn module would leverage
Scikit-Learn in the same way that Dask.dataframe leverages Pandas. It’s
not clear how to cleanly break up and parallelize Scikit-Learn algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This blogpost gives a concrete example using basic task submission with
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor.map&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor.submit&lt;/span&gt;&lt;/code&gt; to build a non-trivial computation. This
approach is straightforward and not restrictive. Personally this interface
excites me more than collections like Dask.dataframe; there is a lot of freedom
in arbitrary task submission.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="links"&gt;
&lt;h1&gt;Links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/9f5720d8658e5f2f66666815b1f03f00"&gt;Notebook&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=FkPlEqB8AnE&amp;amp;list=PLRtz5iA93T4PQvWuoMnIyEIz1fXiJ5Pri&amp;amp;index=11"&gt;Video&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.readthedocs.org/en/latest/"&gt;distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/04/20/dask-distributed-part-5/"/>
    <summary>This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-04-20T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol/</id>
    <title>Fast Message Serialization</title>
    <updated>2016-04-14T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
and the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
as part of the &lt;a class="reference external" href="http://blaze.pydata.org"&gt;Blaze Project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Very high performance isn’t about doing one thing well, it’s about doing
nothing poorly.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This week I optimized the inter-node communication protocol used by
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.distributed&lt;/span&gt;&lt;/code&gt;. It was a fun exercise in optimization that involved
several different and unexpected components. I separately had to deal with
Pickle, NumPy, Tornado, MsgPack, and compression libraries.&lt;/p&gt;
&lt;p&gt;This blogpost is not advertising any particular functionality, rather it’s a
story of the problems I ran into when designing and optimizing a protocol to
quickly send both very small and very large numeric data between machines on
the Python stack.&lt;/p&gt;
&lt;p&gt;We care very strongly about both the many small messages case (thousands of
100 byte messages per second) &lt;em&gt;and&lt;/em&gt; the very large messages case (100-1000 MB).
This spans an interesting range of performance space. We end up with a
protocol that costs around 5 microseconds in the small case and operates at
1-1.5 GB/s in the large case.&lt;/p&gt;
&lt;section id="identify-a-problem"&gt;
&lt;h1&gt;Identify a Problem&lt;/h1&gt;
&lt;p&gt;This came about as I was preparing a demo using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.array&lt;/span&gt;&lt;/code&gt; on a distributed
cluster for a Continuum webinar. I noticed that my computations were taking
much longer than expected. The
&lt;a class="reference external" href="http://distributed.readthedocs.org/en/latest/web.html"&gt;Web UI&lt;/a&gt; quickly pointed
me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks
of NumPy array data between them. This is very strange because I was on a
100 MB/s network, and so I expected these transfers to take more like 0.3s
than 15s.&lt;/p&gt;
&lt;p&gt;The Web UI made this glaringly apparent, so my first lesson was how valuable
visual profiling tools can be for spotting performance issues. Thanks here go
to the Bokeh developers who helped with the development of the Dask real-time
Web UI.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="problem-1-tornado-s-sentinels"&gt;
&lt;h1&gt;Problem 1: Tornado’s sentinels&lt;/h1&gt;
&lt;p&gt;Dask’s networking is built off of Tornado’s TCP IOStreams.&lt;/p&gt;
&lt;p&gt;There are two common ways to delimit messages on a socket: sentinel values
that signal the end of a message, and a length prefix before every message.
Early on we tried both in Dask but found that prefixing a length before every
message was slow. It turns out that this was because TCP sockets try to batch
small messages to increase bandwidth. Turning this optimization off ended up
being an effective and easy solution; see the &lt;a class="reference external" href="http://www.unixguide.net/network/socketfaq/2.16.shtml"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;TCP_NODELAY&lt;/span&gt;&lt;/code&gt; parameter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, before we figured that out we used sentinels for a long time.
Unfortunately Tornado does not handle sentinels well for large messages. At
the receipt of every new message it reads through all buffered data to see if
it can find the sentinel. This makes lots and lots of copies and reads through
lots and lots of bytes. This isn’t a problem if your messages are a few
kilobytes, as is common in web development, but it’s terrible if your messages
are millions or billions of bytes long.&lt;/p&gt;
&lt;p&gt;Switching back to prefixing messages with lengths and turning off TCP’s
small-message batching (by setting &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;TCP_NODELAY&lt;/span&gt;&lt;/code&gt;) moved our bandwidth up from
3MB/s to 20MB/s per node. Thanks go to Ben Darnell (main Tornado developer) for
helping us to track this down.&lt;/p&gt;
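&lt;p&gt;As a rough illustration of length-prefixed framing, here is a minimal sketch using only the standard library. The function names, the 8-byte header, and the use of a local socket pair are assumptions made for this example; this is not Dask’s or Tornado’s actual wire format. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;disable_nagle&lt;/span&gt;&lt;/code&gt; helper shows the standard &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;setsockopt&lt;/span&gt;&lt;/code&gt; call for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;TCP_NODELAY&lt;/span&gt;&lt;/code&gt; but is not exercised in the demo, because the demo does not open a TCP connection.&lt;/p&gt;

```python
import socket
import struct

HEADER = struct.Struct("!Q")  # 8-byte unsigned length, network byte order


def disable_nagle(tcp_sock):
    """Ask the kernel to send small writes immediately on a TCP socket."""
    tcp_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def send_msg(sock, payload):
    """Write the payload length first, then the payload itself."""
    sock.sendall(HEADER.pack(len(payload)))
    sock.sendall(payload)


def recv_exactly(sock, nbytes):
    """Read exactly nbytes from the socket, or raise if it closes early."""
    chunks = []
    remaining = nbytes
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Read the length prefix, then exactly that many payload bytes."""
    (nbytes,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, nbytes)


# Local demo over a connected socket pair (small messages only, so the
# single-threaded sends cannot fill the kernel buffer and block).
left, right = socket.socketpair()
send_msg(left, b"hello")
send_msg(left, b"")
send_msg(left, b"world" * 100)
received = [recv_msg(right) for _ in range(3)]
left.close()
right.close()
```

&lt;p&gt;The receiver never scans the payload for a marker; it reads a fixed-size header and then asks for exactly the number of bytes announced, which is what makes this approach cheap for very large messages.&lt;/p&gt;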
&lt;/section&gt;
&lt;section id="problem-2-memory-copies"&gt;
&lt;h1&gt;Problem 2: Memory Copies&lt;/h1&gt;
&lt;p&gt;A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s
then you can easily suffer several memory copies in your system without caring.
This leads to code that looks like the following:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;socket.send(header + payload)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This code concatenates two bytestrings, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;header&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;payload&lt;/span&gt;&lt;/code&gt;, before
sending the result down a socket. If we cared deeply about avoiding memory
copies then we might instead send these two separately:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;socket.send(header)
socket.send(payload)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But who cares, right? At 5 GB/s copying memory is cheap!&lt;/p&gt;
&lt;p&gt;Unfortunately this breaks down under either of the following conditions:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;You are sloppy enough to do this multiple times&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You find yourself on a machine with surprisingly low memory bandwidth,
like 10 times slower, as is the case on &lt;a class="reference external" href="http://stackoverflow.com/questions/36523142/why-is-copying-memory-on-ec2-machines-slow"&gt;some EC2 machines.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Both of these were true for me but fortunately it’s usually straightforward to
reduce the number of copies down to a small number (we got down to three),
with moderate effort.&lt;/p&gt;
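&lt;p&gt;One common way to avoid such copies in Python is to pass around &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;memoryview&lt;/span&gt;&lt;/code&gt; objects rather than concatenated bytestrings. The sketch below only demonstrates the difference between the two; the tiny buffer sizes and variable names are placeholders, and it is not taken from the Dask code base.&lt;/p&gt;

```python
header = b"\x00\x00\x00\x10"
payload = bytearray(16)  # stand-in for a large buffer

# Concatenation allocates a new object and copies both inputs into it.
combined = header + payload

# A memoryview refers to the original buffer without copying it.
view = memoryview(payload)
frames = [header, view]  # could be handed to the socket one frame at a time

# Evidence that the view shares memory: a write to the buffer shows through
# the view, while the concatenated copy keeps the old contents.
payload[0] = 255
view_sees_write = view[0] == 255
copy_sees_write = combined[len(header)] == 255
```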
&lt;/section&gt;
&lt;section id="problem-3-unwanted-compression"&gt;
&lt;h1&gt;Problem 3: Unwanted Compression&lt;/h1&gt;
&lt;p&gt;Dask compresses all large messages with LZ4 or Snappy if they’re available.
Unfortunately, if your data isn’t very compressible then this is mostly lost
time. Doubly unfortunate is that you also have to decompress the data on the
recipient side. Decompressing not-very-compressible data was surprisingly
slow.&lt;/p&gt;
&lt;p&gt;Now we compress with the following policy:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;If the message is less than 10kB, don’t bother&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pick out five 10kB samples of the data and compress those. If the result
isn’t well compressed then don’t bother compressing the full payload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compress the full payload. If it doesn’t compress well then just send along
the original to spare the receiver’s side from decompressing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In this case we use cheap checks to guard against unwanted compression. We
also avoid any cost at all for small messages, which we care about deeply.&lt;/p&gt;
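&lt;p&gt;The three-step policy above can be sketched roughly as follows. This version uses &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;zlib&lt;/span&gt;&lt;/code&gt; from the standard library as a stand-in for LZ4 or Snappy so that it runs anywhere. The function name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;maybe_compress&lt;/span&gt;&lt;/code&gt;, the evenly spaced sample positions, and the 0.9 ratio for “well compressed” are assumptions chosen for illustration.&lt;/p&gt;

```python
import os
import zlib


def maybe_compress(payload, min_size=10_000, sample_size=10_000,
                   nsamples=5, ratio=0.9):
    """Return (name, data); name is "zlib" or None if data is sent as-is."""
    # 1. Small messages: skip compression entirely.
    if min_size > len(payload):
        return None, payload

    # 2. Compress a few evenly spaced samples as a cheap estimate.
    last = max(0, len(payload) - sample_size)
    starts = [last * i // max(1, nsamples - 1) for i in range(nsamples)]
    sample = b"".join(payload[s:s + sample_size] for s in starts)
    if len(zlib.compress(sample)) > ratio * len(sample):
        return None, payload

    # 3. Compress everything, but keep the original if it did not shrink
    #    enough, so the receiver does not have to decompress for nothing.
    compressed = zlib.compress(payload)
    if len(compressed) > ratio * len(payload):
        return None, payload
    return "zlib", compressed


small = b"x" * 100               # below the size cutoff
text = b"abc" * 100_000          # highly repetitive, compresses well
noise = os.urandom(300_000)      # random bytes, essentially incompressible

small_result = maybe_compress(small)
text_name, text_data = maybe_compress(text)
noise_name, noise_data = maybe_compress(noise)
```

&lt;p&gt;The returned name would travel in the message header so that the receiver knows whether to decompress at all.&lt;/p&gt;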
&lt;/section&gt;
&lt;section id="problem-4-cloudpickle-is-not-as-fast-as-pickle"&gt;
&lt;h1&gt;Problem 4: Cloudpickle is not as fast as Pickle&lt;/h1&gt;
&lt;p&gt;This was surprising, because cloudpickle mostly defers to Pickle for the easy
stuff, like NumPy arrays.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;u1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pickle&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cloudpickle&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;8.65&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.42&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;17.1&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;16.9&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="mi"&gt;10000161&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cloudpickle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;20.6&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;24.5&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;45.1&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;44.4&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="mi"&gt;10000161&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But it turns out that cloudpickle is using the Python implementation, while
pickle itself (or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cPickle&lt;/span&gt;&lt;/code&gt; in Python 2) is using the compiled C implementation.
Fortunately this is easy to correct, and a quick typecheck on common large
data formats in Python (NumPy and Pandas) gets us this speed boost.&lt;/p&gt;
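&lt;p&gt;A typecheck of this kind might look roughly like the sketch below. The function name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dumps&lt;/span&gt;&lt;/code&gt;, the module-name test, and the fallback to plain pickle when cloudpickle is not installed are all assumptions made so that the example runs anywhere; the check actually used in Dask may differ.&lt;/p&gt;

```python
import pickle

try:
    import cloudpickle as fallback  # handles lambdas, closures, and so on
except ImportError:
    fallback = pickle  # stand-in so the sketch runs without cloudpickle

# Top-level packages whose objects plain pickle is known to handle well.
FAST_MODULES = ("numpy", "pandas")


def dumps(obj):
    """Use plain pickle for NumPy/Pandas objects, else the general fallback."""
    top_level = type(obj).__module__.split(".")[0]
    if top_level in FAST_MODULES:
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return fallback.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


# Either path produces bytes that plain pickle can load again.
roundtrip = pickle.loads(dumps({"a": [1, 2, 3]}))
```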
&lt;/section&gt;
&lt;section id="problem-5-pickle-is-still-slower-than-you-d-expect"&gt;
&lt;h1&gt;Problem 5: Pickle is still slower than you’d expect&lt;/h1&gt;
&lt;p&gt;Pickle runs at about half the speed of memcopy, whereas full memcopy speed is
what you’d expect from a protocol that is mostly just “serialize the dtype,
strides, then tack on the data bytes”. There must be an extraneous memory copy
in there.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/numpy/numpy/issues/7544"&gt;issue 7544&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="problem-6-msgpack-is-bad-at-large-bytestrings"&gt;
&lt;h1&gt;Problem 6: MsgPack is bad at large bytestrings&lt;/h1&gt;
&lt;p&gt;Dask serializes most messages with MsgPack, which is ordinarily very fast.
Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB
(which do come up for us) and the Python implementations don’t pass through
large bytestrings very efficiently. So we had to handle large bytestrings
separately. Any message that contains bytestrings over 1MB in size will have
them stripped out and sent along in a separate frame. This both avoids the
MsgPack overhead and avoids a memory copy (we can send the bytes directly to
the socket).&lt;/p&gt;
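&lt;p&gt;The idea of stripping large bytestrings out of a message can be sketched as below for a flat dictionary. The function names and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__frame__&lt;/span&gt;&lt;/code&gt; placeholder are hypothetical and chosen for this example only; the real protocol is more involved. The small message that remains would go through MsgPack, while the frames would be written to the socket untouched.&lt;/p&gt;

```python
def extract_large_bytes(msg, threshold=1_000_000):
    """Split a flat dict into (small_msg, frames).

    Values that are bytes longer than threshold are moved into frames and
    replaced by a placeholder recording their position in the frame list.
    """
    small, frames = {}, []
    for key, value in msg.items():
        if isinstance(value, bytes) and len(value) > threshold:
            small[key] = {"__frame__": len(frames)}
            frames.append(value)  # same object, no copy
        else:
            small[key] = value
    return small, frames


def restore_large_bytes(small, frames):
    """Inverse of extract_large_bytes, run on the receiving side."""
    restored = {}
    for key, value in small.items():
        if isinstance(value, dict) and set(value) == {"__frame__"}:
            restored[key] = frames[value["__frame__"]]
        else:
            restored[key] = value
    return restored


msg = {"op": "update", "data": bytes(2_000_000), "note": b"tiny"}
small, frames = extract_large_bytes(msg)
restored = restore_large_bytes(small, frames)
```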
&lt;/section&gt;
&lt;section id="problem-7-tornado-makes-a-copy"&gt;
&lt;h1&gt;Problem 7: Tornado makes a copy&lt;/h1&gt;
&lt;p&gt;Sockets on Windows don’t accept payloads greater than 128kB in size. As a
result Tornado chops up large messages into many small ones. On Linux this
memory copy is extraneous. It can be removed with a bit of logic within
Tornado. I might do this in the moderate future.&lt;/p&gt;
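&lt;p&gt;Chunking by itself does not have to copy. As a minimal sketch (not Tornado’s code; the function name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;iter_chunks&lt;/span&gt;&lt;/code&gt; is made up here), slicing a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;memoryview&lt;/span&gt;&lt;/code&gt; yields 128kB pieces that all refer back to the original buffer:&lt;/p&gt;

```python
def iter_chunks(buffer, chunk_size=128 * 1024):
    """Yield zero-copy slices of a one-dimensional bytes-like buffer."""
    view = memoryview(buffer)
    for start in range(0, len(view), chunk_size):
        yield view[start:start + chunk_size]


payload = bytes(300 * 1024)
chunks = list(iter_chunks(payload))
sizes = [len(c) for c in chunks]  # two full chunks and one remainder
```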
&lt;/section&gt;
&lt;section id="results"&gt;
&lt;h1&gt;Results&lt;/h1&gt;
&lt;p&gt;We serialize small messages in about 5 microseconds (thanks msgpack!) and move
large bytestrings around at the cost of three memory copies (about 1-1.5 GB/s),
which is generally faster than most networks in use.&lt;/p&gt;
&lt;p&gt;Here is a profile of sending and receiving a gigabyte-sized NumPy array of
random values through to the same process over localhost (500 MB/s on my
machine.)&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;         &lt;span class="mi"&gt;381360&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;381323&lt;/span&gt; &lt;span class="n"&gt;primitive&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="mf"&gt;1.451&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;

   &lt;span class="n"&gt;Ordered&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;internal&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

   &lt;span class="n"&gt;ncalls&lt;/span&gt;  &lt;span class="n"&gt;tottime&lt;/span&gt;  &lt;span class="n"&gt;percall&lt;/span&gt;  &lt;span class="n"&gt;cumtime&lt;/span&gt;  &lt;span class="n"&gt;percall&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;lineno&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="mf"&gt;0.366&lt;/span&gt;    &lt;span class="mf"&gt;0.366&lt;/span&gt;    &lt;span class="mf"&gt;0.366&lt;/span&gt;    &lt;span class="mf"&gt;0.366&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;built&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="mi"&gt;8&lt;/span&gt;    &lt;span class="mf"&gt;0.289&lt;/span&gt;    &lt;span class="mf"&gt;0.036&lt;/span&gt;    &lt;span class="mf"&gt;0.291&lt;/span&gt;    &lt;span class="mf"&gt;0.036&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="mi"&gt;15353&lt;/span&gt;    &lt;span class="mf"&gt;0.228&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.228&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;join&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;bytes&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="mi"&gt;15355&lt;/span&gt;    &lt;span class="mf"&gt;0.166&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.166&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;recv&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;_socket.socket&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="mi"&gt;15362&lt;/span&gt;    &lt;span class="mf"&gt;0.156&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.398&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1510&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_merge_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="mi"&gt;7759&lt;/span&gt;    &lt;span class="mf"&gt;0.101&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.101&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;send&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;_socket.socket&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;    &lt;span class="mf"&gt;0.026&lt;/span&gt;    &lt;span class="mf"&gt;0.002&lt;/span&gt;    &lt;span class="mf"&gt;0.686&lt;/span&gt;    &lt;span class="mf"&gt;0.049&lt;/span&gt; &lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;990&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="mi"&gt;15355&lt;/span&gt;    &lt;span class="mf"&gt;0.021&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.198&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;721&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_read_to_buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="mi"&gt;8&lt;/span&gt;    &lt;span class="mf"&gt;0.018&lt;/span&gt;    &lt;span class="mf"&gt;0.002&lt;/span&gt;    &lt;span class="mf"&gt;0.203&lt;/span&gt;    &lt;span class="mf"&gt;0.025&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;876&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_consume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="mi"&gt;91&lt;/span&gt;    &lt;span class="mf"&gt;0.017&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.335&lt;/span&gt;    &lt;span class="mf"&gt;0.004&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;827&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_handle_write&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="mi"&gt;89&lt;/span&gt;    &lt;span class="mf"&gt;0.015&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.217&lt;/span&gt;    &lt;span class="mf"&gt;0.002&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;585&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_read_to_buffer_loop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="mi"&gt;122567&lt;/span&gt;    &lt;span class="mf"&gt;0.009&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.009&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;built&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="mi"&gt;15355&lt;/span&gt;    &lt;span class="mf"&gt;0.008&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.173&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1010&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_from_fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="mi"&gt;38369&lt;/span&gt;    &lt;span class="mf"&gt;0.004&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.004&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;append&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;list&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="mi"&gt;7759&lt;/span&gt;    &lt;span class="mf"&gt;0.004&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.104&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="n"&gt;iostream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1023&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write_to_fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="mf"&gt;0.003&lt;/span&gt;    &lt;span class="mf"&gt;0.003&lt;/span&gt;    &lt;span class="mf"&gt;1.451&lt;/span&gt;    &lt;span class="mf"&gt;1.451&lt;/span&gt; &lt;span class="n"&gt;ioloop&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;746&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dominant unwanted costs include the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;400ms: Pickling the NumPy array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;400ms: Bytestring handling within Tornado&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After this we’re just bound by pushing bytes down a wire.&lt;/p&gt;
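&lt;p&gt;As a rough illustration of the first cost, the sketch below times pickling a large buffer against wrapping the same memory in a zero-copy &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;memoryview&lt;/span&gt;&lt;/code&gt;. The 8 MB &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bytearray&lt;/span&gt;&lt;/code&gt; is an assumed stand-in for the buffer of a NumPy array, and the timings will vary by machine; this is not the benchmark profiled above.&lt;/p&gt;

```python
import pickle
import time

# Assumed stand-in payload: 8 MB of raw bytes playing the role of a NumPy buffer.
payload = bytearray(8 * 1024 * 1024)

start = time.perf_counter()
# In-band pickling copies the payload into a new bytes object.
pickled = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
pickle_seconds = time.perf_counter() - start

start = time.perf_counter()
# A memoryview refers to the same memory without copying it.
view = memoryview(payload)
view_seconds = time.perf_counter() - start

print(f"pickle.dumps: {pickle_seconds * 1e3:.3f} ms, {len(pickled)} bytes")
print(f"memoryview:   {view_seconds * 1e3:.3f} ms, {view.nbytes} bytes")
```

&lt;p&gt;The pickled result is a separate copy of the data plus a small header, while the view only points at the original memory, which is why avoiding the copy matters for large messages.&lt;/p&gt;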
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Writing fast code isn’t about writing any one thing particularly well; it’s
about mitigating everything that can get in your way. As you approach peak
performance, previously minor flaws suddenly become your dominant bottleneck.
Success here depends on frequent profiling and keeping your mind open to
unexpected and surprising costs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="links"&gt;
&lt;h1&gt;Links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://stackoverflow.com/questions/36523142/why-is-copying-memory-on-ec2-machines-slow"&gt;EC2 slow memory copy StackOverflow question.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/tornadoweb/tornado/issues/1685"&gt;Tornado issue for sending large messages&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Nagle%27s_algorithm"&gt;Wikipedia page on Nagle’s algorithm for TCP protocol for small packets&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/numpy/numpy/issues/7544"&gt;NumPy issue for double memory copy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/cloudpipe/cloudpickle/issues/59"&gt;Cloudpickle issue for memoryview support&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.readthedocs.org/en/latest/"&gt;dask.distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol/"/>
    <summary>This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-04-14T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/02/26/dask-distributed-part-3/</id>
    <title>Distributed Dask Arrays</title>
    <updated>2016-02-26T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
and the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
as part of the &lt;a class="reference external" href="http://blaze.pydata.org"&gt;Blaze Project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this post we analyze weather data across a cluster using NumPy in
parallel with dask.array. We focus on the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;How to set up the distributed scheduler with a job scheduler like Sun
GridEngine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to load NetCDF data from a network file system (NFS) into distributed
RAM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to manipulate data with dask.arrays&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to interact with distributed data using IPython widgets&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This blogpost has an accompanying
&lt;a class="reference external" href="https://www.youtube.com/watch?v=ZpMXEVp-iaY"&gt;screencast&lt;/a&gt; which might be a bit
more fun than this text version.&lt;/p&gt;
&lt;p&gt;This is the third in a sequence of blogposts about dask.distributed:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="../../2016/02/17/dask-distributed-part1/"&gt;&lt;span class="doc std std-doc"&gt;Dask Bags on GitHub Data&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference internal" href="../../2016/02/22/dask-distributed-part-2/"&gt;&lt;span class="doc std std-doc"&gt;Dask DataFrames on HDFS&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask Arrays on NetCDF data&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="setup"&gt;
&lt;h1&gt;Setup&lt;/h1&gt;
&lt;p&gt;We wanted to emulate the typical academic cluster setup using a job scheduler
like Sun GridEngine (similar to SLURM, Torque, PBS scripts and other
technologies), a shared network file system, and binary arrays stored in
NetCDF files (similar to HDF5).&lt;/p&gt;
&lt;p&gt;To this end we used &lt;a class="reference external" href="http://star.mit.edu/cluster/"&gt;Starcluster&lt;/a&gt;, a quick way to
set up such a cluster on EC2 with SGE and NFS, and we downloaded data from the
&lt;a class="reference external" href="http://www.ecmwf.int/en/research/climate-reanalysis/era-interim"&gt;European Centre for Medium-Range Weather
Forecasts&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To deploy dask’s distributed scheduler with SGE we started a scheduler on the
master node:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;sgeadmin@master:~$ dscheduler
distributed.scheduler - INFO - Start Scheduler at:  172.31.7.88:8786
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And then used the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;qsub&lt;/span&gt;&lt;/code&gt; command to start four dask workers, pointing to the
scheduler address:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 1 (&amp;quot;dworker&amp;quot;) has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 2 (&amp;quot;dworker&amp;quot;) has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 3 (&amp;quot;dworker&amp;quot;) has been submitted
sgeadmin@master:~$ qsub -b y -V dworker 172.31.7.88:8786
Your job 4 (&amp;quot;dworker&amp;quot;) has been submitted
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;After a few seconds these workers start on various nodes in the cluster and
connect to the scheduler.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="load-sample-data-on-a-single-machine"&gt;
&lt;h1&gt;Load sample data on a single machine&lt;/h1&gt;
&lt;p&gt;On the shared NFS drive we’ve downloaded several NetCDF3 files, each holding
the global temperature every six hours for a single day:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;glob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;*.nc3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;[&amp;#39;2014-01-01.nc3&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;2014-01-02.nc3&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;2014-01-03.nc3&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;2014-01-04.nc3&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;2014-01-05.nc3&amp;#39;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We use conda to install the netCDF4 library and make a small function to
read the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;t2m&lt;/span&gt;&lt;/code&gt; variable for “temperature at two meters elevation” from a single
filename:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install netcdf4
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;netCDF4&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;load_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;netCDF4&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;t2m&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][:]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This converts a single file into a single numpy array in memory. We could call
this on an individual file locally as follows:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;load_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="go"&gt;array([[[ 253.96238624,  253.96238624,  253.96238624, ...,  253.96238624,&lt;/span&gt;
&lt;span class="go"&gt;          253.96238624,  253.96238624],&lt;/span&gt;
&lt;span class="go"&gt;        [ 252.80590921,  252.81070124,  252.81389593, ...,  252.79792249,&lt;/span&gt;
&lt;span class="go"&gt;          252.80111718,  252.80271452],&lt;/span&gt;
&lt;span class="go"&gt;          ...&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;load_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;span class="go"&gt;(4, 721, 1440)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Our dataset has dimensions of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(time,&lt;/span&gt; &lt;span class="pre"&gt;latitude,&lt;/span&gt; &lt;span class="pre"&gt;longitude)&lt;/span&gt;&lt;/code&gt;. Note above that
each day has four time entries (measurements every six hours).&lt;/p&gt;
&lt;p&gt;The NFS set up by Starcluster is unfortunately quite small. We were only able
to fit around five months of data (136 days) on the shared disk.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="load-data-across-cluster"&gt;
&lt;h1&gt;Load data across cluster&lt;/h1&gt;
&lt;p&gt;We want to call the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;load_temperature&lt;/span&gt;&lt;/code&gt; function on every file in our list &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;filenames&lt;/span&gt;&lt;/code&gt;, spreading
the work across our four workers. We connect a dask Executor to our scheduler address and
then map our function on our filenames:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;172.31.7.88:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;Executor: scheduler=172.31.7.88:8786 workers=4 threads=32&amp;gt;&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/load-netcdf.gif"&gt;
&lt;p&gt;After this completes we have several numpy arrays scattered about the memory of
each of our four workers.&lt;/p&gt;
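&lt;p&gt;The map-then-wait pattern above follows the same futures interface as the standard library &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;concurrent.futures&lt;/span&gt;&lt;/code&gt; module. As a minimal single-machine sketch of that pattern, assuming no cluster and no NetCDF files, the hypothetical &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;fake_load_temperature&lt;/span&gt;&lt;/code&gt; function below stands in for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;load_temperature&lt;/span&gt;&lt;/code&gt; and a local thread pool stands in for the distributed workers:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def fake_load_temperature(fn):
    # Hypothetical placeholder: returns the filename and a tiny fake payload
    # instead of reading a real NetCDF file.
    return fn, [0.0] * 4

# Hypothetical filenames following the naming pattern shown earlier.
filenames = [f"2014-01-{day:02d}.nc3" for day in range(1, 6)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit returns one future per file; the work runs in the background
    futures = [pool.submit(fake_load_temperature, fn) for fn in filenames]
    # result() blocks until that particular future has finished
    results = [future.result() for future in futures]

print(len(results), results[0][0])  # 5 2014-01-01.nc3
```

&lt;p&gt;The important difference in the distributed case is that the results stay in the memory of the workers until they are explicitly gathered, so the futures act as handles to remote data.&lt;/p&gt;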
&lt;/section&gt;
&lt;section id="coordinate-with-dask-array"&gt;
&lt;h1&gt;Coordinate with dask.array&lt;/h1&gt;
&lt;p&gt;We coordinate these many numpy arrays into a single logical dask array as
follows:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.collections&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;futures_to_dask_arrays&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;futures_to_dask_arrays&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# many small dask arrays&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# one large dask array, joined by time&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="go"&gt;dask.array&amp;lt;concate..., shape=(544, 721, 1440), dtype=float64, chunksize=(4, 721, 1440)&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This single logical dask array comprises 136 numpy arrays spread across
our cluster. Operations on the single dask array will trigger many operations
on each of our numpy arrays.&lt;/p&gt;
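&lt;p&gt;The shape and dtype in the repr above are enough to estimate how much memory is involved. The following is a back-of-the-envelope sketch in plain Python; the variable names are illustrative, and it assumes 8 bytes per &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;float64&lt;/span&gt;&lt;/code&gt; value with no per-array overhead:&lt;/p&gt;

```python
# Shape of the concatenated array reported above: (time, latitude, longitude)
shape = (544, 721, 1440)
itemsize = 8            # bytes per float64 value
steps_per_file = 4      # each file holds four six-hourly measurements

n_files = shape[0] // steps_per_file
bytes_per_time_slice = shape[1] * shape[2] * itemsize
bytes_per_file = steps_per_file * bytes_per_time_slice
total_bytes = n_files * bytes_per_file

print(n_files)                                  # 136
print(f"{bytes_per_time_slice / 1e6:.2f} MB")   # 8.31 MB
print(f"{bytes_per_file / 1e6:.2f} MB")         # 33.22 MB
print(f"{total_bytes / 1e9:.2f} GB")            # 4.52 GB
```

&lt;p&gt;So each chunk is roughly 33 MB and the whole dataset is roughly 4.5 GB spread over the four workers.&lt;/p&gt;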
&lt;/section&gt;
&lt;section id="interact-with-distributed-data"&gt;
&lt;h1&gt;Interact with Distributed Data&lt;/h1&gt;
&lt;p&gt;We can now interact with our dataset using standard NumPy syntax and other
PyData libraries. Below we pull out a single time slice and render it to the
screen with matplotlib.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;matplotlib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/temperature-viridis.png"&gt;
&lt;p&gt;In the &lt;a class="reference external" href="https://www.youtube.com/watch?v=ZpMXEVp-iaY"&gt;screencast version of this
post&lt;/a&gt; we hook this up to an
IPython slider widget and scroll around time, which is fun.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="speed"&gt;
&lt;h1&gt;Speed&lt;/h1&gt;
&lt;p&gt;We benchmark a few representative operations to look at the strengths and
weaknesses of the distributed system.&lt;/p&gt;
&lt;section id="single-element"&gt;
&lt;h2&gt;Single element&lt;/h2&gt;
&lt;p&gt;This single element computation accesses a single number from a single NumPy
array of our dataset. It is bound by a network roundtrip from client to
scheduler, to worker, and back.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 4 ms, sys: 0 ns, total: 4 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 9.72 ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="single-time-slice"&gt;
&lt;h2&gt;Single time slice&lt;/h2&gt;
&lt;p&gt;This time slice computation pulls around 8 MB from a single NumPy array on a
single worker. It is likely bound by network bandwidth.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 24 ms, sys: 24 ms, total: 48 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 274 ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="mean-computation"&gt;
&lt;h2&gt;Mean computation&lt;/h2&gt;
&lt;p&gt;This mean computation touches every number in every NumPy array across all of
our workers. Computing means is quite fast, so this is likely bound by
scheduler overhead.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 88 ms, sys: 0 ns, total: 88 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 422 ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
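&lt;p&gt;A mean over a chunked array can be assembled from small per-chunk results: each chunk reports a sum and a count, and those pairs are combined at the end. The sketch below shows that general pattern in plain Python, with short lists as assumed stand-ins for the NumPy chunks; it is an illustration of the idea, not dask’s actual implementation.&lt;/p&gt;

```python
# Three small lists stand in for NumPy chunks held on different workers.
chunks = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0, 9.0],
]

def partial_stats(chunk):
    # Work done independently on each chunk: a (sum, count) pair.
    return sum(chunk), len(chunk)

def combine(partials):
    # Final step: add up the partial sums and counts, then divide once.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

partials = [partial_stats(chunk) for chunk in chunks]
chunked_mean = combine(partials)
print(chunked_mean)  # 5.0
```

&lt;p&gt;Only the small (sum, count) pairs need to move between machines, so for a cheap reduction like this the cost of coordinating many tiny tasks can outweigh the arithmetic itself.&lt;/p&gt;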
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="interactive-widgets"&gt;
&lt;h1&gt;Interactive Widgets&lt;/h1&gt;
&lt;p&gt;To make these times feel more visceral we hook up these computations to IPython
Widgets.&lt;/p&gt;
&lt;p&gt;This first example looks fairly fluid. This only touches a single worker and
returns a small result. It is cheap because it indexes in a way that is well
aligned with how our NumPy arrays are split up by time.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;ipywidgets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interact&lt;/span&gt;

&lt;span class="c1"&gt;# a (min, max) tuple gives an integer slider over the time axis&lt;/span&gt;
&lt;span class="nd"&gt;@interact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/mean-time.gif"&gt;
&lt;p&gt;This second example is less fluid because we index across our NumPy chunks.
Each computation touches all of our data. It’s still not bad though and quite
acceptable by today’s standards of interactive distributed data
science.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@interact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/mean-latitude.gif"&gt;
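&lt;p&gt;One way to see why the latitude slider is slower is to count chunks. The
sketch below is a toy model in plain Python, not dask code. It assumes the array
is chunked only along the time axis, and the shape and chunk size are
hypothetical values chosen for illustration.&lt;/p&gt;

```python
# Toy model: a (time, lat, lon) array stored as one chunk per block of time steps.
# The shape and chunk size are hypothetical, chosen only for illustration.
n_time, n_lat, n_lon = 120, 721, 1440
time_chunk = 4  # time steps per chunk (assumed)

n_chunks = -(-n_time // time_chunk)  # ceiling division


def chunks_touched_by_time(t):
    """x[t, :, :] lives entirely inside the one chunk that holds time step t."""
    return {t // time_chunk}


def chunks_touched_by_lat(lat):
    """x[:, lat, :] needs one row from every chunk along the time axis.

    The answer does not depend on which latitude is requested.
    """
    return set(range(n_chunks))


print(len(chunks_touched_by_time(17)), "of", n_chunks, "chunks for x[17, :, :]")
print(len(chunks_touched_by_lat(300)), "of", n_chunks, "chunks for x[:, 300, :]")
```

&lt;p&gt;Under these assumptions a time slice reads a single chunk while a latitude
slice reads a little from every chunk, which is consistent with the difference in
responsiveness seen above.&lt;/p&gt;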
&lt;/section&gt;
&lt;section id="normalize-data"&gt;
&lt;h1&gt;Normalize Data&lt;/h1&gt;
&lt;p&gt;Until now we’ve only performed simple calculations on our data, mostly
computing means. The image of the temperature above is unsurprising: it
is dominated by the facts that land is warmer than ocean and that the equator
is warmer than the poles.&lt;/p&gt;
&lt;p&gt;To make things more interesting we subtract the mean over time from each
point and divide by that point’s standard deviation over time. This tells us how
unexpectedly hot or cold a particular point was, relative to all measurements of
that point over time. This gives us something like a geo-located Z-Score.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/normalize.gif"&gt;
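&lt;p&gt;To make the normalization concrete, here is a small sketch of the same
per-point Z-Score in plain Python on a made-up dataset of three time steps and two
grid points. The names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;temps&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;z_toy&lt;/span&gt;&lt;/code&gt; are illustrative and not part of the
analysis above.&lt;/p&gt;

```python
from statistics import mean, pstdev

# Hypothetical temperatures: one inner list per time step, one value per grid point.
temps = [
    [10.0, 25.0],
    [12.0, 25.0],
    [14.0, 28.0],
]

# Group the readings by grid point, i.e. reduce along the time axis (axis=0).
by_point = list(zip(*temps))
means = [mean(series) for series in by_point]
stds = [pstdev(series) for series in by_point]  # population std, like NumPy's default

# z_toy[t][p] is how many standard deviations point p was from its own mean at time t.
z_toy = [
    [(value - m) / s for value, m, s in zip(row, means, stds)]
    for row in temps
]

for row in z_toy:
    print([round(v, 2) for v in row])
```

&lt;p&gt;Each grid point is compared only against its own history, so a point that is
always warm is not flagged as unusual.&lt;/p&gt;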
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RdBu_r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/temperature-denormalized.png"&gt;
&lt;p&gt;We can now see much finer structure in the currents of the day. In the
&lt;a class="reference external" href="https://www.youtube.com/watch?v=ZpMXEVp-iaY"&gt;screencast version&lt;/a&gt; we hook this
dataset up to a slider as well and inspect various times.&lt;/p&gt;
&lt;p&gt;I’ve avoided displaying GIFs of full images changing in this post to keep the
size down. However, we can easily render a plot of average temperature by
latitude changing over time here:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;xrange&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="nd"&gt;@interact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xrange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Normalized Temperature&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Latitude (degrees)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/latitude-plot.gif"&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;We showed how to use distributed dask.arrays on a typical academic cluster.
I’ve had several conversations with different groups about this topic; it seems
to be a common case. I hope that the instructions at the beginning of this
post prove to be helpful to others.&lt;/p&gt;
&lt;p&gt;It is really satisfying to me to couple interactive widgets with data on a
cluster in an intuitive way. This sort of fluid interaction on larger datasets
is a core problem in modern data science.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-didn-t-work"&gt;
&lt;h1&gt;What didn’t work&lt;/h1&gt;
&lt;p&gt;As always I’ll include a section like this on what didn’t work well or what I
would have done with more time:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;No high-level &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_netcdf&lt;/span&gt;&lt;/code&gt; function: We had to use the mid-level API of
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;executor.map&lt;/span&gt;&lt;/code&gt; to construct our dask array. This is a bit of a pain for
novice users. We should probably adapt existing high-level functions in
dask.array to robustly handle the distributed data case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Need a larger problem: Our dataset could have fit into a MacBook Pro.
A larger dataset that could not have been efficiently investigated from a
single machine would have really cemented the need for this technology.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easier deployment: The solution above with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;qsub&lt;/span&gt;&lt;/code&gt; was straightforward but
not always accessible to novice users. Additionally, while SGE is common,
there are several other systems that are just as common. We need to think
of nice ways to automate this for the user.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;XArray integration: Many people use dask.array on single machines through
&lt;a class="reference external" href="http://xarray.pydata.org/en/stable/"&gt;XArray&lt;/a&gt;, an excellent library for the
analysis of labeled nd-arrays especially common in climate science. It
would be good to integrate this new distributed work into the XArray
project. I suspect that doing this mostly involves handling the data
ingest problem described above.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduction speed: The computation of normalized temperature, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;z&lt;/span&gt;&lt;/code&gt;, took a
surprisingly long time. I’d like to look into what is holding up that
computation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="links"&gt;
&lt;h1&gt;Links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org/en/latest/array.html"&gt;dask.array&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.readthedocs.org/en/latest/"&gt;dask.distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/02/26/dask-distributed-part-3/"/>
    <summary>This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-02-26T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/02/22/dask-distributed-part-2/</id>
    <title>Pandas on HDFS with Dask Dataframes</title>
    <updated>2016-02-22T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
and the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
as part of the &lt;a class="reference external" href="http://blaze.pydata.org"&gt;Blaze Project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this post we use Pandas in parallel across an HDFS cluster to read CSV data.
We coordinate these computations with dask.dataframe. A screencast version of
this blogpost is available &lt;a class="reference external" href="https://www.youtube.com/watch?v=LioaeHsZDBQ"&gt;here&lt;/a&gt;
and the previous post in this series is available
&lt;a class="reference internal" href="../../2016/02/17/dask-distributed-part1/"&gt;&lt;span class="doc std std-doc"&gt;here&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To start, we connect to our scheduler, import the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hdfs&lt;/span&gt;&lt;/code&gt; module from the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed&lt;/span&gt;&lt;/code&gt; library, and read our CSV data from HDFS.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hdfs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;127.0.0.1:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;Executor: scheduler=127.0.0.1:8786 workers=64 threads=64&amp;gt;&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2014&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hdfs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/nyctaxi/2014/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;              &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pickup_datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;dropoff_datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;              &lt;span class="n"&gt;skipinitialspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hdfs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/nyctaxi/2015/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;              &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tpep_pickup_datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tpep_dropoff_datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2014&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;nyc2014&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nyc2014&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/distributed-hdfs-read-csv.gif"&gt;
&lt;p&gt;Our data comes from the New York City Taxi and Limousine Commission which
publishes &lt;a class="reference external" href="http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml"&gt;all yellow cab taxi rides in
NYC&lt;/a&gt; for various
years. This is a nice model dataset for computational tabular data because
it’s large enough to be annoying while also deep enough to be broadly
appealing. Each year is about 25GB on disk and about 60GB in memory as a
Pandas DataFrame.&lt;/p&gt;
&lt;p&gt;HDFS breaks up our CSV files into 128MB chunks on various hard drives spread
throughout the cluster. The dask.distributed workers each read the chunks of
bytes local to them and call the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pandas.read_csv&lt;/span&gt;&lt;/code&gt; function on these bytes,
producing 391 separate Pandas DataFrame objects spread throughout the memory of
our eight worker nodes. The returned objects, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nyc2014&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nyc2015&lt;/span&gt;&lt;/code&gt;, are
&lt;a class="reference external" href="http://dask.pydata.org/en/latest/dataframe.html"&gt;dask.dataframe&lt;/a&gt; objects which
present a subset of the Pandas API to the user, but farm out all of the work to
the many Pandas dataframes they control across the network.&lt;/p&gt;
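&lt;p&gt;The idea of parsing each block of bytes independently can be sketched with
the standard library alone. This is an illustration under stated assumptions, not
how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hdfs.read_csv&lt;/span&gt;&lt;/code&gt; is implemented: the helper names are hypothetical, blocks are
assumed to be cut at line boundaries, the header is assumed to be re-attached to
every block, and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;csv&lt;/span&gt;&lt;/code&gt; module stands in for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pandas.read_csv&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;

```python
import csv
import io

# A tiny stand-in for one CSV file. The values are illustrative only.
raw = (
    b"VendorID,passenger_count,trip_distance\n"
    b"2,1,1.59\n"
    b"1,1,3.30\n"
    b"1,2,1.80\n"
    b"1,1,0.50\n"
    b"1,3,3.00\n"
)


def split_on_lines(data, blocksize):
    """Cut data into blocks of roughly blocksize bytes, ending each at a newline."""
    blocks, start = [], 0
    while start != len(data):
        end = min(start + blocksize, len(data))
        newline = data.find(b"\n", end - 1)  # extend to the end of the current row
        end = len(data) if newline == -1 else newline + 1
        blocks.append(data[start:end])
        start = end
    return blocks


header, _, body = raw.partition(b"\n")
blocks = split_on_lines(body, blocksize=20)


def parse_block(block):
    """Parse one block on its own, re-attaching the header so columns are named."""
    text = (header + b"\n" + block).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Each block becomes its own small table, analogous to one Pandas DataFrame.
partitions = [parse_block(block) for block in blocks]
print([len(part) for part in partitions])
```

&lt;p&gt;Because every block is parsed without reference to the others, the blocks
can be handled by whichever worker already holds those bytes.&lt;/p&gt;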
&lt;section id="play-with-distributed-data"&gt;
&lt;h1&gt;Play with Distributed Data&lt;/h1&gt;

&lt;p&gt;If we wait for the data to load fully into memory then we can perform
pandas-style analysis at interactive speeds.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;VendorID&lt;/th&gt;
      &lt;th&gt;tpep_pickup_datetime&lt;/th&gt;
      &lt;th&gt;tpep_dropoff_datetime&lt;/th&gt;
      &lt;th&gt;passenger_count&lt;/th&gt;
      &lt;th&gt;trip_distance&lt;/th&gt;
      &lt;th&gt;pickup_longitude&lt;/th&gt;
      &lt;th&gt;pickup_latitude&lt;/th&gt;
      &lt;th&gt;RateCodeID&lt;/th&gt;
      &lt;th&gt;store_and_fwd_flag&lt;/th&gt;
      &lt;th&gt;dropoff_longitude&lt;/th&gt;
      &lt;th&gt;dropoff_latitude&lt;/th&gt;
      &lt;th&gt;payment_type&lt;/th&gt;
      &lt;th&gt;fare_amount&lt;/th&gt;
      &lt;th&gt;extra&lt;/th&gt;
      &lt;th&gt;mta_tax&lt;/th&gt;
      &lt;th&gt;tip_amount&lt;/th&gt;
      &lt;th&gt;tolls_amount&lt;/th&gt;
      &lt;th&gt;improvement_surcharge&lt;/th&gt;
      &lt;th&gt;total_amount&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2015-01-15 19:05:39&lt;/td&gt;
      &lt;td&gt;2015-01-15 19:23:42&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1.59&lt;/td&gt;
      &lt;td&gt;-73.993896&lt;/td&gt;
      &lt;td&gt;40.750111&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-73.974785&lt;/td&gt;
      &lt;td&gt;40.750618&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;12.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;3.25&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;17.05&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:38&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:53:28&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3.30&lt;/td&gt;
      &lt;td&gt;-74.001648&lt;/td&gt;
      &lt;td&gt;40.724243&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-73.994415&lt;/td&gt;
      &lt;td&gt;40.759109&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;14.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;2.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;17.80&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:38&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:43:41&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1.80&lt;/td&gt;
      &lt;td&gt;-73.963341&lt;/td&gt;
      &lt;td&gt;40.802788&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-73.951820&lt;/td&gt;
      &lt;td&gt;40.824413&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;9.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;10.80&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:39&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:35:31&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.50&lt;/td&gt;
      &lt;td&gt;-74.009087&lt;/td&gt;
      &lt;td&gt;40.713818&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-74.004326&lt;/td&gt;
      &lt;td&gt;40.719986&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;3.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;4.80&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:33:39&lt;/td&gt;
      &lt;td&gt;2015-01-10 20:52:58&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;3.00&lt;/td&gt;
      &lt;td&gt;-73.971176&lt;/td&gt;
      &lt;td&gt;40.762428&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;N&lt;/td&gt;
      &lt;td&gt;-74.004181&lt;/td&gt;
      &lt;td&gt;40.742653&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;15.0&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;16.30&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nyc2014&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;165114373&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;146112989&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Interestingly it appears that the NYC cab industry has contracted a bit in the
last year. There are &lt;em&gt;fewer&lt;/em&gt; cab rides in 2015 than in 2014.&lt;/p&gt;
&lt;p&gt;When we ask for something like the length of the full dask.dataframe we
actually ask for the length of all of the hundreds of Pandas dataframes and
then sum them up. This process of reaching out to all of the workers completes
in around 200-300 ms, which is generally fast enough to feel snappy in an
interactive session.&lt;/p&gt;
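&lt;p&gt;The pattern of asking each piece for its length and summing the results can
be sketched as follows. This is a stand-in using plain lists and a local thread
pool rather than Pandas dataframes on remote workers, and the partition sizes
are made up.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the Pandas dataframes held by the workers: plain lists of rows.
# The partition sizes are made up.
row_partitions = [list(range(n)) for n in (400, 350, 425, 390)]

# Ask every partition for its length in parallel, then sum the small results.
with ThreadPoolExecutor(max_workers=4) as pool:
    lengths = list(pool.map(len, row_partitions))

total = sum(lengths)
print(lengths, total)
```

&lt;p&gt;Only one small integer per partition travels back to the caller, which is
why this kind of query can stay fast even when the data itself is large.&lt;/p&gt;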
&lt;p&gt;The dask.dataframe API looks just like the Pandas API, except that we call
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.compute()&lt;/span&gt;&lt;/code&gt; when we want an actual result.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2014&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;279997507.0&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;245566747&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask.dataframes build a plan to get your result and the distributed scheduler
coordinates that plan on all of the little Pandas dataframes on the workers
that make up our dataset.&lt;/p&gt;
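&lt;p&gt;A rough picture of such a plan is a dictionary that maps names to either
data or tasks. The sketch below is loosely modeled on the dictionary-of-tasks
format described in the dask documentation, with hypothetical key names and
made-up passenger counts. Its &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;get&lt;/span&gt;&lt;/code&gt; function is a deliberately simple, serial
evaluator and not the distributed scheduler.&lt;/p&gt;

```python
# Hypothetical passenger counts held in three partitions, plus the tasks
# needed to sum them. A task is a tuple of (function, argument, ...).
graph = {
    "part-0": [1, 2, 1],
    "part-1": [3, 1],
    "part-2": [1, 1, 4],
    "sum-0": (sum, "part-0"),
    "sum-1": (sum, "part-1"),
    "sum-2": (sum, "part-2"),
    "total": (sum, ["sum-0", "sum-1", "sum-2"]),
}


def get(graph, key):
    """Evaluate key by recursively evaluating everything it depends on."""
    return _resolve(graph, key)


def _resolve(graph, value):
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value  # a task: call func on its resolved arguments
        return func(*[_resolve(graph, arg) for arg in args])
    if isinstance(value, list):
        return [_resolve(graph, item) for item in value]
    if isinstance(value, str) and value in graph:
        return _resolve(graph, graph[value])  # a reference to another key
    return value  # plain data


print(get(graph, "total"))
```

&lt;p&gt;The per-partition sums do not depend on each other, so a real scheduler is
free to run them on different workers and combine only the small results.&lt;/p&gt;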
&lt;/section&gt;
&lt;section id="pandas-for-metadata"&gt;
&lt;h1&gt;Pandas for Metadata&lt;/h1&gt;
&lt;p&gt;Let’s appreciate for a moment all the work we didn’t have to do around CSV
handling because Pandas magically handled it for us.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;
&lt;span class="go"&gt;VendorID                          int64&lt;/span&gt;
&lt;span class="go"&gt;tpep_pickup_datetime     datetime64[ns]&lt;/span&gt;
&lt;span class="go"&gt;tpep_dropoff_datetime    datetime64[ns]&lt;/span&gt;
&lt;span class="go"&gt;passenger_count                   int64&lt;/span&gt;
&lt;span class="go"&gt;trip_distance                   float64&lt;/span&gt;
&lt;span class="go"&gt;pickup_longitude                float64&lt;/span&gt;
&lt;span class="go"&gt;pickup_latitude                 float64&lt;/span&gt;
&lt;span class="go"&gt;RateCodeID                        int64&lt;/span&gt;
&lt;span class="go"&gt;store_and_fwd_flag               object&lt;/span&gt;
&lt;span class="go"&gt;dropoff_longitude               float64&lt;/span&gt;
&lt;span class="go"&gt;dropoff_latitude                float64&lt;/span&gt;
&lt;span class="go"&gt;payment_type                      int64&lt;/span&gt;
&lt;span class="go"&gt;fare_amount                     float64&lt;/span&gt;
&lt;span class="go"&gt;extra                           float64&lt;/span&gt;
&lt;span class="go"&gt;mta_tax                         float64&lt;/span&gt;
&lt;span class="go"&gt;tip_amount                      float64&lt;/span&gt;
&lt;span class="go"&gt;tolls_amount                    float64&lt;/span&gt;
&lt;span class="go"&gt;improvement_surcharge           float64&lt;/span&gt;
&lt;span class="go"&gt;total_amount\r                  float64&lt;/span&gt;
&lt;span class="go"&gt;dtype: object&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We didn’t have to find columns or specify data-types. We didn’t have to parse
each value with an &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;int&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;float&lt;/span&gt;&lt;/code&gt; function as appropriate. We didn’t have to
parse the datetimes, but instead just specified a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;parse_dates=&lt;/span&gt;&lt;/code&gt; keyword.
The CSV parsing happened about as quickly as can be expected for this format,
clocking in at a network total of a bit under 1 GB/s.&lt;/p&gt;
&lt;p&gt;Pandas is well loved because it removes all of these little hurdles from the
life of the analyst. If we tried to invent a new
“Big-Data-Frame” we would have to reimplement all of the work already well done
inside of Pandas. Instead, dask.dataframe just coordinates and reuses the code
within the Pandas library. It is successful largely due to work from core
Pandas developers, notably Masaaki Horikoshi
(&lt;a class="reference external" href="https://github.com/sinhrks/"&gt;&amp;#64;sinhrks&lt;/a&gt;), who have done tremendous work to
align the API precisely with the Pandas core library.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="analyze-tips-and-payment-types"&gt;
&lt;h1&gt;Analyze Tips and Payment Types&lt;/h1&gt;
&lt;p&gt;In an effort to demonstrate the abilities of dask.dataframe we ask a simple
question of our data, &lt;em&gt;“how do New Yorkers tip?”&lt;/em&gt;. The 2015 NYCTaxi data is
quite good about breaking down the total cost of each ride into the fare
amount, tip amount, and various taxes and fees. In particular this lets us
measure the percentage that each rider decided to pay in tip.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fare_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tip_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;payment_type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;fare_amount&lt;/th&gt;
      &lt;th&gt;tip_amount&lt;/th&gt;
      &lt;th&gt;payment_type&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;12.0&lt;/td&gt;
      &lt;td&gt;3.25&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;14.5&lt;/td&gt;
      &lt;td&gt;2.00&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;9.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;3.5&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;15.0&lt;/td&gt;
      &lt;td&gt;0.00&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In the first two lines we see tips of roughly 27% and 14% of the fare, in the
neighborhood of the 15-20% tip standard common in the US. The following three
lines, interestingly, show zero tip.
Judging only by these first five lines (a very small sample) we see a strong
correlation here with the payment type. We analyze this a bit more by counting
occurrences in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;payment_type&lt;/span&gt;&lt;/code&gt; column both for the full dataset, and
filtered by zero tip:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 132 ms, sys: 0 ns, total: 132 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 558 ms&lt;/span&gt;

&lt;span class="go"&gt;1    91574644&lt;/span&gt;
&lt;span class="go"&gt;2    53864648&lt;/span&gt;
&lt;span class="go"&gt;3      503070&lt;/span&gt;
&lt;span class="go"&gt;4      170599&lt;/span&gt;
&lt;span class="go"&gt;5          28&lt;/span&gt;
&lt;span class="go"&gt;Name: payment_type, dtype: int64&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 212 ms, sys: 4 ms, total: 216 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 1.69 s&lt;/span&gt;

&lt;span class="go"&gt;2    53862557&lt;/span&gt;
&lt;span class="go"&gt;1     3365668&lt;/span&gt;
&lt;span class="go"&gt;3      502025&lt;/span&gt;
&lt;span class="go"&gt;4      170234&lt;/span&gt;
&lt;span class="go"&gt;5          26&lt;/span&gt;
&lt;span class="go"&gt;Name: payment_type, dtype: int64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We find that almost all zero-tip rides correspond to payment type 2, and that
almost all payment type 2 rides don’t tip. My unscientific hypothesis here is
that payment type 2 corresponds to cash fares and that we’re observing a tendency of
drivers not to record cash tips. However, we would need more domain knowledge
about our data to actually make this claim with any degree of authority.&lt;/p&gt;
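&lt;p&gt;To put numbers on “almost all”, here is a small sketch in plain Python that computes both shares from the counts printed above. The variable names are made up for illustration.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
# Counts copied from the two value_counts() outputs above, keyed by payment type.
all_rides = {1: 91574644, 2: 53864648, 3: 503070, 4: 170599, 5: 28}
zero_tip = {1: 3365668, 2: 53862557, 3: 502025, 4: 170234, 5: 26}

# Share of all zero-tip rides that were paid with payment type 2.
share_of_zero_tip = zero_tip[2] / sum(zero_tip.values())

# Share of payment type 2 rides that recorded no tip.
share_untipped = zero_tip[2] / all_rides[2]

print(f"{share_of_zero_tip:.1%} of zero-tip rides use payment type 2")
print(f"{share_untipped:.3%} of payment type 2 rides record no tip")
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;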
&lt;/section&gt;
&lt;section id="analyze-tips-fractions"&gt;
&lt;h1&gt;Analyze Tips Fractions&lt;/h1&gt;
&lt;p&gt;Let’s make a new column, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tip_fraction&lt;/span&gt;&lt;/code&gt;, and then look at the average of this
column grouped by day of week and grouped by hour of day.&lt;/p&gt;
&lt;p&gt;First, we need to filter out bad rows, both rows with this odd payment type
and rows with zero fare (there are a surprising number of free cab rides in
NYC). Second, we create a new column equal to the ratio &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tip_amount&lt;/span&gt; &lt;span class="pre"&gt;/&lt;/span&gt; &lt;span class="pre"&gt;fare_amount&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_fraction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Next we choose to groupby the pickup datetime column in order to see how the
average tip fraction changes by day of week and by hour. The groupby and
datetime handling of Pandas makes these operations trivial.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tip_fraction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tip_fraction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img src="/images/distributed-hdfs-groupby-tip-fraction.gif"&gt;
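&lt;p&gt;For readers who want to see what the filter, the new column, and the datetime groupby do on something small, here is a sketch in plain Pandas. The four rides are made up for illustration; only the column names come from the taxi data. The filter is applied in two steps with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.gt()&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.ne()&lt;/span&gt;&lt;/code&gt;, which should select the same rows as the combined boolean expression used above.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
import pandas as pd

# Made-up rides; only the column names follow the NYC taxi data.
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2015-01-05 04:10:00",  # a Monday, 4am
        "2015-01-05 04:40:00",  # a Monday, 4am
        "2015-01-06 13:00:00",  # a Tuesday, 1pm
        "2015-01-06 13:30:00",  # a Tuesday, 1pm
    ]),
    "fare_amount": [10.0, 20.0, 10.0, 0.0],
    "tip_amount": [3.0, 4.0, 2.0, 0.0],
    "payment_type": [1, 1, 1, 2],
})

# Drop zero-fare rides, then drop payment type 2.
df = rides[rides.fare_amount.gt(0)]
df = df[df.payment_type.ne(2)]

# Add the tip fraction as a new column.
df = df.assign(tip_fraction=df.tip_amount / df.fare_amount)

# Group by a value derived from the datetime column via the .dt accessor.
dayofweek = df.groupby(df.tpep_pickup_datetime.dt.dayofweek).tip_fraction.mean()
hour = df.groupby(df.tpep_pickup_datetime.dt.hour).tip_fraction.mean()

print(dayofweek)
print(hour)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;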
&lt;p&gt;Grouping by day-of-week doesn’t show anything too striking to my eye. However
I would like to note how generous NYC cab riders seem to be. A 23-25% tip
can be quite nice:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;tpep_pickup_datetime&lt;/span&gt;
&lt;span class="go"&gt;0    0.237510&lt;/span&gt;
&lt;span class="go"&gt;1    0.236494&lt;/span&gt;
&lt;span class="go"&gt;2    0.236073&lt;/span&gt;
&lt;span class="go"&gt;3    0.246007&lt;/span&gt;
&lt;span class="go"&gt;4    0.242081&lt;/span&gt;
&lt;span class="go"&gt;5    0.232415&lt;/span&gt;
&lt;span class="go"&gt;6    0.259974&lt;/span&gt;
&lt;span class="go"&gt;Name: tip_fraction, dtype: float64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But grouping by hour shows that late night and early morning riders are more
likely to tip extravagantly:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;tpep_pickup_datetime&lt;/span&gt;
&lt;span class="go"&gt;0     0.263602&lt;/span&gt;
&lt;span class="go"&gt;1     0.278828&lt;/span&gt;
&lt;span class="go"&gt;2     0.293536&lt;/span&gt;
&lt;span class="go"&gt;3     0.276784&lt;/span&gt;
&lt;span class="go"&gt;4     0.348649&lt;/span&gt;
&lt;span class="go"&gt;5     0.248618&lt;/span&gt;
&lt;span class="go"&gt;6     0.233257&lt;/span&gt;
&lt;span class="go"&gt;7     0.216003&lt;/span&gt;
&lt;span class="go"&gt;8     0.221508&lt;/span&gt;
&lt;span class="go"&gt;9     0.217018&lt;/span&gt;
&lt;span class="go"&gt;10    0.225618&lt;/span&gt;
&lt;span class="go"&gt;11    0.231396&lt;/span&gt;
&lt;span class="go"&gt;12    0.225186&lt;/span&gt;
&lt;span class="go"&gt;13    0.235662&lt;/span&gt;
&lt;span class="go"&gt;14    0.237636&lt;/span&gt;
&lt;span class="go"&gt;15    0.228832&lt;/span&gt;
&lt;span class="go"&gt;16    0.234086&lt;/span&gt;
&lt;span class="go"&gt;17    0.240635&lt;/span&gt;
&lt;span class="go"&gt;18    0.237488&lt;/span&gt;
&lt;span class="go"&gt;19    0.272792&lt;/span&gt;
&lt;span class="go"&gt;20    0.235866&lt;/span&gt;
&lt;span class="go"&gt;21    0.242157&lt;/span&gt;
&lt;span class="go"&gt;22    0.243244&lt;/span&gt;
&lt;span class="go"&gt;23    0.244586&lt;/span&gt;
&lt;span class="go"&gt;Name: tip_fraction, dtype: float64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We plot this with matplotlib and see a nice trough during business hours and a
surge in the early morning, with an astonishing peak of nearly 35% at 4am:&lt;/p&gt;
&lt;img src="/images/nyctaxi-2015-hourly-tips.png"&gt;
&lt;/section&gt;
&lt;section id="performance"&gt;
&lt;h1&gt;Performance&lt;/h1&gt;
&lt;p&gt;Let’s dive into a few operations that run at different time scales. This gives
a good understanding of the strengths and limits of the scheduler.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 4 ms, sys: 0 ns, total: 4 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 20.9 ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This head computation is about as fast as a film projector. You could perform
this roundtrip computation between every consecutive frame of a movie; to a
human eye this appears fluid. In the &lt;a class="reference internal" href="../../2016/02/17/dask-distributed-part1/"&gt;&lt;span class="doc std std-doc"&gt;last post&lt;/span&gt;&lt;/a&gt;
we asked about how low we could bring latency. In that post we were running
computations from my laptop in California and so were bound by transcontinental
latencies of 200ms. This time, because we’re operating from the cluster, we
can get down to 20ms. We’re only able to be this fast because we touch only a
single data element, the first partition. Things change when we need to touch
the entire dataset.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nyc2015&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 48 ms, sys: 0 ns, total: 48 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 271 ms&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The length computation takes 200-300 ms. This computation takes longer because we
touch every individual partition of the data, of which there are 178. The
scheduler incurs about 1ms of overhead per task; add a bit of latency
and you get the ~200ms total. This means that the scheduler will likely be the
bottleneck whenever computations are very fast, such as is the case for
computing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;len&lt;/span&gt;&lt;/code&gt;. Really, this is good news; it means that by improving the
scheduler we can reduce these durations even further.&lt;/p&gt;
&lt;p&gt;If you look at the groupby computations above you can add the numbers in the
progress bars to show that we computed around 3000 tasks in around 7s. It
looks like this computation is about half scheduler overhead and about half
bound by actual computation.&lt;/p&gt;
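&lt;p&gt;The back-of-the-envelope arithmetic behind these two estimates can be sketched as follows. This assumes the roughly 1ms of overhead per task is paid serially in the scheduler, which is a simplification, and the variable names are made up for illustration.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
# Figures quoted in the text; the 1 ms per-task overhead is approximate.
overhead_per_task = 0.001  # seconds

# len(nyc2015): one task per partition.
partitions = 178
len_overhead = partitions * overhead_per_task
print(f"len: about {len_overhead * 1000:.0f} ms of scheduler overhead")

# Groupby: roughly 3000 tasks finished in roughly 7 seconds.
tasks = 3000
wall_time = 7.0  # seconds
groupby_overhead = tasks * overhead_per_task
print(f"groupby: about {groupby_overhead / wall_time:.0%} scheduler overhead")
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;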
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;We used dask+distributed on a cluster to read CSV data from HDFS
into a dask dataframe. We then used dask.dataframe, which looks identical to
the Pandas dataframe, to manipulate our distributed dataset intuitively and
efficiently.&lt;/p&gt;
&lt;p&gt;We looked a bit at the performance characteristics of simple computations.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-doesn-t-work"&gt;
&lt;h1&gt;What doesn’t work&lt;/h1&gt;
&lt;p&gt;As always I’ll have a section like this that honestly says what doesn’t work
well and what I would have done with more time.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Dask dataframe implements a commonly used &lt;em&gt;subset&lt;/em&gt; of Pandas functionality,
not all of it. It’s surprisingly hard to communicate the exact bounds of
this subset to users. Notably, in the distributed setting we don’t have a
shuffle algorithm, so &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(...).apply(...)&lt;/span&gt;&lt;/code&gt; and some joins are not
yet possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you want to use threads, you’ll need Pandas 0.18.0 which, at the time of
this writing, was still in release candidate stage. This Pandas release
fixes some important GIL related issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The overhead of about 1ms per task is significant. While we can still scale
out to clusters far larger than what we have here, we probably won’t be
able to strongly accelerate very quick operations until we reduce this
number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We use the &lt;a class="reference external" href="http://hdfs3.readthedocs.org/en/latest/"&gt;hdfs3 library&lt;/a&gt; to read
data from HDFS. This library seems to work great but is new and could use
more active users to flush out bugs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="links"&gt;
&lt;h1&gt;Links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org/en/latest/"&gt;dask&lt;/a&gt;, the original project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.readthedocs.org/en/latest/"&gt;dask.distributed&lt;/a&gt;, the
distributed memory scheduler powering the cluster computing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask.pydata.org/en/latest/dataframe.html"&gt;dask.dataframe&lt;/a&gt;, the user
API we’ve used in this post.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml"&gt;NYC Taxi Data Downloads&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://hdfs3.readthedocs.org/en/latest"&gt;hdfs3&lt;/a&gt;: Python library we use for
HDFS interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference internal" href="../../2016/02/17/dask-distributed-part1/"&gt;&lt;span class="doc std std-doc"&gt;previous post&lt;/span&gt;&lt;/a&gt; in this blog series.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="setup-and-data"&gt;
&lt;h1&gt;Setup and Data&lt;/h1&gt;
&lt;p&gt;You can obtain public data from the New York City Taxi and Limousine Commission
&lt;a class="reference external" href="http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml"&gt;here&lt;/a&gt;. I
downloaded this onto the head node and dumped it into HDFS with commands like
the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;wget&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;googleapis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tlc&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;trip&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;yellow_tripdata_2015&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;01..12&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="n"&gt;hdfs&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nyctaxi&lt;/span&gt;
&lt;span class="n"&gt;hdfs&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nyctaxi&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;
&lt;span class="n"&gt;hdfs&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="o"&gt;*.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nyctaxi&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The cluster was hosted on EC2 and consisted of nine &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m3.2xlarge&lt;/span&gt;&lt;/code&gt; instances with 8
cores and 30GB of RAM each. Eight of these nodes were used as workers; they
used processes for parallelism, not threads.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/02/22/dask-distributed-part-2/"/>
    <summary>This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-02-22T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2016/02/17/dask-distributed-part1/</id>
    <title>Introducing Dask distributed</title>
    <updated>2016-02-17T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://continuum.io"&gt;Continuum Analytics&lt;/a&gt;
and the &lt;a class="reference external" href="http://www.darpa.mil/program/XDATA"&gt;XDATA Program&lt;/a&gt;
as part of the &lt;a class="reference external" href="http://blaze.pydata.org"&gt;Blaze Project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;: We analyze JSON data on a cluster using pure Python projects.&lt;/p&gt;
&lt;p&gt;Dask, a Python library for parallel computing, now works on clusters. During
the past few months I and others have extended dask with a new distributed
memory scheduler. This enables dask’s existing parallel algorithms to scale
across 10s to 100s of nodes, and extends a subset of PyData to distributed
computing. Over the next few weeks I and others will write about this system.
Please note that dask+distributed is developing quickly and so the API is
likely to shift around a bit.&lt;/p&gt;
&lt;p&gt;Today we start simple with the typical cluster computing problem, parsing JSON
records, filtering, and counting events using dask.bag and the new distributed
scheduler. We’ll dive into more advanced problems in future posts.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;A video version of this blogpost is available
&lt;a class="reference external" href="https://www.youtube.com/watch?v=W0Q0uwmYD6o"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;section id="github-archive-data-on-s3"&gt;

&lt;p&gt;GitHub releases data dumps of their public event stream as gzip-compressed,
line-delimited JSON. This data is too large to fit comfortably into memory,
even on a sizable workstation. We could stream it from disk but, due to the
compression and JSON encoding, this takes a while and so bogs down interactive
use. For an interactive experience with data like this we need a distributed
cluster.&lt;/p&gt;
&lt;section id="setup-and-data"&gt;
&lt;h2&gt;Setup and Data&lt;/h2&gt;
&lt;p&gt;We provision nine &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m3.2xlarge&lt;/span&gt;&lt;/code&gt; nodes on EC2. These have eight cores and 30GB
of RAM each. On this cluster we provision one scheduler and nine workers (see
&lt;a class="reference external" href="http://distributed.readthedocs.org/en/latest/setup.html"&gt;setup docs&lt;/a&gt;). (More
on launching in later posts.) We have five months of data, from 2015-01-01 to
2015-05-31 on the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;githubarchive-data&lt;/span&gt;&lt;/code&gt; bucket in S3. This data is publicly
available if you want to play with it on EC2. You can download the full
dataset at https://www.githubarchive.org/ .&lt;/p&gt;
&lt;p&gt;The first record looks like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;avatar_url&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://avatars.githubusercontent.com/u/9152315?&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;gravatar_id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;9152315&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;login&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;davidjhulse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;url&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://api.github.com/users/davidjhulse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;created_at&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2015-01-01T00:00:00Z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2489368070&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;payload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;before&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;86ffa724b4d70fce46e760f8cc080f5ec3d7d85f&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;commits&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;author&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;email&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;david.hulse@live.com&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;davidjhulse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
     &lt;span class="s1"&gt;&amp;#39;distinct&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="s1"&gt;&amp;#39;message&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Altered BingBot.jar&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;Fixed issue with multiple account support&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="s1"&gt;&amp;#39;sha&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="s1"&gt;&amp;#39;url&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;distinct_size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;head&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;push_id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;536740396&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;ref&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;refs/heads/master&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;public&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;repo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;28635890&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;davidjhulse/davesbingrewardsbot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s1"&gt;&amp;#39;url&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://api.github.com/repos/davidjhulse/davesbingrewardsbot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;PushEvent&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So we have a large dataset on S3 and a moderate sized play cluster on EC2,
which has access to S3 data at about 100MB/s per node. We’re ready to play.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="play"&gt;
&lt;h1&gt;Play&lt;/h1&gt;
&lt;p&gt;We start an &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ipython&lt;/span&gt;&lt;/code&gt; interpreter on our local laptop and connect to the
dask scheduler running on the cluster. For the purposes of timing, the cluster
is on the East Coast while the local machine is in California on commercial
broadband internet.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;54.173.84.107:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;Executor: scheduler=54.173.84.107:8786 workers=72 threads=72&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Our seventy-two worker processes come from nine machines with eight processes
each. We chose processes rather than threads for this task because
computations will be bound by the GIL. We will change this to threads in later
examples.&lt;/p&gt;
&lt;p&gt;We start by loading a single month of data into distributed memory.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;githubarchive-data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2015-01&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The data lives in S3 in hourly files as gzip-compressed, line-delimited JSON.
The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;s3.read_text&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;text.map&lt;/span&gt;&lt;/code&gt; functions produce
&lt;a class="reference external" href="http://dask.pydata.org/en/latest/bag.html"&gt;dask.bag&lt;/a&gt; objects which track our
operations in a lazily built task graph. When we ask the executor to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;persist&lt;/span&gt;&lt;/code&gt;
this collection we ship those tasks off to the scheduler to run on all of the
workers in parallel. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;persist&lt;/span&gt;&lt;/code&gt; function gives us back another &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.bag&lt;/span&gt;&lt;/code&gt;
pointing to these remotely running results. This persist function returns
immediately, and the computation happens on the cluster in the background
asynchronously. We gain control of our interpreter immediately while the
cluster hums along.&lt;/p&gt;
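&lt;p&gt;To make the file format concrete, here is a minimal standard-library sketch of what decompressing and parsing one hourly file amounts to. It is not the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;s3.read_text&lt;/span&gt;&lt;/code&gt; implementation; the helper name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;parse_gzipped_json_lines&lt;/span&gt;&lt;/code&gt; and the two sample events are made up for illustration, and the file is built in memory rather than fetched from S3.&lt;/p&gt;

```python
import gzip
import io
import json


def parse_gzipped_json_lines(raw_bytes):
    """Decompress gzip bytes and parse one JSON document per non-empty line."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical stand-in for one hourly file: two events, one per line.
sample_events = [
    {"type": "PushEvent", "repo": {"name": "davidjhulse/davesbingrewardsbot"}},
    {"type": "WatchEvent", "repo": {"name": "jupyter/nbviewer"}},
]
payload = "\n".join(json.dumps(event) for event in sample_events).encode("utf-8")
hourly_file = gzip.compress(payload)

parsed = parse_gzipped_json_lines(hourly_file)
print(len(parsed), parsed[0]["type"])
```

&lt;p&gt;On the cluster, the same decompress-and-parse work runs once per hourly file, spread across the workers.&lt;/p&gt;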
&lt;p&gt;The cluster takes around 40 seconds to download, decompress, and parse this
data. If you watch the video embedded above you’ll see fancy progress-bars.&lt;/p&gt;
&lt;p&gt;We ask for a single record. This returns in around 200ms, which is fast enough
that it feels instantaneous to a human.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;({&amp;#39;actor&amp;#39;: {&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/9152315?&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;id&amp;#39;: 9152315,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;login&amp;#39;: &amp;#39;davidjhulse&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/davidjhulse&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;created_at&amp;#39;: &amp;#39;2015-01-01T00:00:00Z&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;id&amp;#39;: &amp;#39;2489368070&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;payload&amp;#39;: {&amp;#39;before&amp;#39;: &amp;#39;86ffa724b4d70fce46e760f8cc080f5ec3d7d85f&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;commits&amp;#39;: [{&amp;#39;author&amp;#39;: {&amp;#39;email&amp;#39;: &amp;#39;david.hulse@live.com&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;      &amp;#39;name&amp;#39;: &amp;#39;davidjhulse&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;     &amp;#39;distinct&amp;#39;: True,&lt;/span&gt;
&lt;span class="go"&gt;     &amp;#39;message&amp;#39;: &amp;#39;Altered BingBot.jar\n\nFixed issue with multiple account support&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;     &amp;#39;sha&amp;#39;: &amp;#39;a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;     &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81&amp;#39;}],&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;distinct_size&amp;#39;: 1,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;head&amp;#39;: &amp;#39;a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;push_id&amp;#39;: 536740396,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;ref&amp;#39;: &amp;#39;refs/heads/master&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;size&amp;#39;: 1},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;public&amp;#39;: True,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;repo&amp;#39;: {&amp;#39;id&amp;#39;: 28635890,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;name&amp;#39;: &amp;#39;davidjhulse/davesbingrewardsbot&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/repos/davidjhulse/davesbingrewardsbot&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;type&amp;#39;: &amp;#39;PushEvent&amp;#39;},)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This particular event is a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'PushEvent'&lt;/span&gt;&lt;/code&gt;. Let’s quickly see all the kinds of
events. For fun, we’ll also time the interaction:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;frequencies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 112 ms, sys: 0 ns, total: 112 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 2.41 s&lt;/span&gt;

&lt;span class="go"&gt;[(&amp;#39;ReleaseEvent&amp;#39;, 44312),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;MemberEvent&amp;#39;, 69757),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;IssuesEvent&amp;#39;, 693363),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;PublicEvent&amp;#39;, 14614),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;CreateEvent&amp;#39;, 1651300),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;PullRequestReviewCommentEvent&amp;#39;, 214288),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;PullRequestEvent&amp;#39;, 680879),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;ForkEvent&amp;#39;, 491256),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;DeleteEvent&amp;#39;, 256987),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;PushEvent&amp;#39;, 7028566),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;IssueCommentEvent&amp;#39;, 1322509),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;GollumEvent&amp;#39;, 150861),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;CommitCommentEvent&amp;#39;, 96468),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;WatchEvent&amp;#39;, 1321546)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
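&lt;p&gt;For readers new to these operations, the following plain-Python sketch shows roughly what &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pluck('type').frequencies()&lt;/span&gt;&lt;/code&gt; computes: count within each partition, then merge the partial counts. It uses &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;collections.Counter&lt;/span&gt;&lt;/code&gt; as a stand-in and a tiny made-up dataset; it is an illustration of the idea, not how dask.bag is implemented.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical data: two partitions of already-parsed event records.
partitions = [
    [{"type": "PushEvent"}, {"type": "WatchEvent"}],
    [{"type": "PushEvent"}, {"type": "ForkEvent"}, {"type": "PushEvent"}],
]

# Step 1: each partition is counted independently (this part can run in parallel).
partial_counts = [Counter(record["type"] for record in part) for part in partitions]

# Step 2: the small per-partition results are merged into one total.
total_counts = sum(partial_counts, Counter())

print(sorted(total_counts.items()))
```

&lt;p&gt;Only the small per-partition counters need to be combined at the end, which is one reason this kind of query returns quickly once the records are in memory.&lt;/p&gt;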
&lt;p&gt;We also compute the total count of all events for this month.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 134 ms, sys: 133 µs, total: 134 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 1.49 s&lt;/span&gt;

&lt;span class="go"&gt;14036706&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We see that it takes a few seconds to walk through the data (and perform all
scheduling overhead). The scheduler adds about a millisecond of overhead per
task, and there are about 1000 partitions/files here (the GitHub data is split
by hour and there are roughly 730 hours in a month), so most of the cost here is
overhead.&lt;/p&gt;
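&lt;p&gt;The back-of-the-envelope estimate above can be written out as a short calculation. The per-task figure is the "about a millisecond" quoted in the text, and the task count of 1000 is the rough number given there; both are approximations, and the function name is made up for this sketch.&lt;/p&gt;

```python
HOURS_IN_JANUARY = 31 * 24      # one hourly file per hour of January 2015
PER_TASK_OVERHEAD_S = 0.001     # "about a millisecond" per task, from the text


def scheduling_overhead_seconds(n_tasks, per_task_s=PER_TASK_OVERHEAD_S):
    """Estimate total scheduler overhead, in seconds, for n_tasks tasks."""
    return n_tasks * per_task_s


print(HOURS_IN_JANUARY)
print(scheduling_overhead_seconds(1000))
```

&lt;p&gt;With roughly a thousand tasks, the estimate comes to about one second, which is the same order as the 1.49 s wall time measured for the count.&lt;/p&gt;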
&lt;/section&gt;
&lt;section id="investigate-jupyter"&gt;
&lt;h1&gt;Investigate Jupyter&lt;/h1&gt;
&lt;p&gt;We investigate the activities of &lt;a class="reference external" href="http://jupyter.org/"&gt;Project Jupyter&lt;/a&gt;. We
chose this project because it’s sizable and because we understand the players
involved and so can check our accuracy. This will require us to filter our
data to a much smaller subset, then find popular repositories and members.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;repo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;jupyter/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="go"&gt;                      .repartition(10))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;All records, regardless of event type, have a repository which has a name like
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'organization/repository'&lt;/span&gt;&lt;/code&gt; in typical GitHub fashion. We keep only the records
whose repository name starts with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'jupyter/'&lt;/span&gt;&lt;/code&gt;. Additionally, because this dataset is likely
much smaller, we push all of these records into just ten partitions. This
dramatically reduces scheduling overhead. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;persist&lt;/span&gt;&lt;/code&gt; call hands this
computation off to the scheduler and then gives us back our collection that
points to that computing result. Filtering this month for Jupyter events takes
about 7.5 seconds. Afterwards computations on this subset feel snappy.&lt;/p&gt;
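&lt;p&gt;As a rough illustration of why repartitioning helps, the sketch below filters many small partitions and then packs the survivors into a handful of larger ones. The round-robin &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;repartition&lt;/span&gt;&lt;/code&gt; helper here is a simplified stand-in written for this example; it is not how dask.bag redistributes data, and the sample records are invented.&lt;/p&gt;

```python
def is_jupyter(record):
    """True when the event belongs to a repository in the jupyter organization."""
    return record["repo"]["name"].startswith("jupyter/")


def repartition(partitions, n):
    """Flatten all partitions and deal the records round-robin into n lists."""
    flat = [record for part in partitions for record in part]
    return [flat[i::n] for i in range(n)]


# Hypothetical input: four tiny partitions, most of which hold nothing we want.
many_small = [
    [{"repo": {"name": "jupyter/nbviewer"}}],
    [{"repo": {"name": "dask/dask"}}],
    [{"repo": {"name": "jupyter/jupyterhub"}}],
    [],
]

filtered = [[r for r in part if is_jupyter(r)] for part in many_small]
compact = repartition(filtered, 2)

print(len(filtered), "partitions before,", len(compact), "after")
```

&lt;p&gt;Since scheduling overhead is paid per task, later queries over two partitions cost far less to coordinate than queries over four mostly empty ones, and the same reasoning applies when going from about a thousand partitions to ten.&lt;/p&gt;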
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 5.19 ms, sys: 97 µs, total: 5.28 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 199 ms&lt;/span&gt;

&lt;span class="go"&gt;747&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 7.01 ms, sys: 259 µs, total: 7.27 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 182 ms&lt;/span&gt;

&lt;span class="go"&gt;({&amp;#39;actor&amp;#39;: {&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/26679?&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;id&amp;#39;: 26679,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;login&amp;#39;: &amp;#39;marksteve&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/users/marksteve&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;created_at&amp;#39;: &amp;#39;2015-01-01T13:25:44Z&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;id&amp;#39;: &amp;#39;2489612400&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;org&amp;#39;: {&amp;#39;avatar_url&amp;#39;: &amp;#39;https://avatars.githubusercontent.com/u/7388996?&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;gravatar_id&amp;#39;: &amp;#39;&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;id&amp;#39;: 7388996,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;login&amp;#39;: &amp;#39;jupyter&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/orgs/jupyter&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;payload&amp;#39;: {&amp;#39;action&amp;#39;: &amp;#39;started&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;public&amp;#39;: True,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;repo&amp;#39;: {&amp;#39;id&amp;#39;: 5303123,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;name&amp;#39;: &amp;#39;jupyter/nbviewer&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;   &amp;#39;url&amp;#39;: &amp;#39;https://api.github.com/repos/jupyter/nbviewer&amp;#39;},&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;type&amp;#39;: &amp;#39;WatchEvent&amp;#39;},)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So the first Jupyter event of the year was by &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'marksteve'&lt;/span&gt;&lt;/code&gt; who decided to watch the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;'nbviewer'&lt;/span&gt;&lt;/code&gt; repository on New Year’s Day.&lt;/p&gt;
&lt;p&gt;Notice that these computations take around 200ms. I can’t get below this from
my local machine, so we’re likely bound by communicating to such a remote
location. A 200ms latency is not great if you’re playing a video game, but
it’s decent for interactive computing.&lt;/p&gt;
&lt;p&gt;Here are all of the Jupyter repositories touched in the month of January:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;repo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 2.84 ms, sys: 4.03 ms, total: 6.86 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 204 ms&lt;/span&gt;

&lt;span class="go"&gt;[&amp;#39;jupyter/dockerspawner&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/design&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/docker-demo-images&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/jupyterhub&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/configurable-http-proxy&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/nbshot&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/sudospawner&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/colaboratory&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/strata-sv-2015-tutorial&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/tmpnb-deploy&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/nature-demo&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/nbcache&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/jupyter.github.io&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/try.jupyter.org&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/jupyter-drive&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/tmpnb&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/tmpnb-redirector&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/nbgrader&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/nbindex&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/nbviewer&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt; &amp;#39;jupyter/oauthenticator&amp;#39;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And the top ten most active people in these Jupyter repositories:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;                  .pluck(&amp;#39;login&amp;#39;)&lt;/span&gt;
&lt;span class="go"&gt;                  .frequencies()&lt;/span&gt;
&lt;span class="go"&gt;                  .topk(10, lambda kv: kv[1])&lt;/span&gt;
&lt;span class="go"&gt;                  .compute())&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 8.03 ms, sys: 90 µs, total: 8.12 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 226 ms&lt;/span&gt;

&lt;span class="go"&gt;[(&amp;#39;rgbkrk&amp;#39;, 156),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;minrk&amp;#39;, 87),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;Carreau&amp;#39;, 87),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;KesterTong&amp;#39;, 74),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jhamrick&amp;#39;, 70),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;bollwyvl&amp;#39;, 25),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;pkt&amp;#39;, 18),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;ssanderson&amp;#39;, 13),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;smashwilson&amp;#39;, 13),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;ellisonbg&amp;#39;, 13)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Nothing too surprising here if you know these folks.&lt;/p&gt;
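&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;topk(10,&lt;/span&gt; &lt;span class="pre"&gt;lambda&lt;/span&gt; &lt;span class="pre"&gt;kv:&lt;/span&gt; &lt;span class="pre"&gt;kv[1])&lt;/span&gt;&lt;/code&gt; step selects the pairs with the largest counts. A plain-Python equivalent, assuming the frequencies already sit in a local dictionary, can be sketched with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;heapq.nlargest&lt;/span&gt;&lt;/code&gt;. The counts below are copied from the first seven rows of the output above; the rest is illustrative.&lt;/p&gt;

```python
import heapq

# Login counts taken from the output above (first seven rows only).
login_counts = {
    "rgbkrk": 156,
    "minrk": 87,
    "Carreau": 87,
    "KesterTong": 74,
    "jhamrick": 70,
    "bollwyvl": 25,
    "pkt": 18,
}

# Same key function as in the post: rank (login, count) pairs by the count.
top3 = heapq.nlargest(3, login_counts.items(), key=lambda kv: kv[1])

print(top3)
```

&lt;p&gt;Selecting the top k this way avoids sorting the full list of counts, which matters little for a few hundred logins but more for larger collections.&lt;/p&gt;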
&lt;/section&gt;
&lt;section id="full-dataset"&gt;
&lt;h1&gt;Full Dataset&lt;/h1&gt;
&lt;p&gt;The full five months of data is too large to fit in memory, even for this
cluster. When we represent semi-structured data like this with dynamic data
structures like lists and dictionaries there is quite a bit of memory bloat.
Some careful attention to efficient semi-structured storage here could save us
from having to switch to such a large cluster, but that will have to be
the topic of another post.&lt;/p&gt;
&lt;p&gt;Instead, we operate efficiently on this dataset by flowing it through
memory, persisting only the records we care about. The distributed dask
scheduler descends from the single-machine dask scheduler, which was quite good
at flowing through a computation and intelligently removing intermediate
results.&lt;/p&gt;
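&lt;p&gt;The idea of flowing data through memory can be sketched in plain Python: load one partition at a time, keep only the matching records, and let the rest be discarded. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;load_partition&lt;/span&gt;&lt;/code&gt; function and its data are hypothetical stand-ins for downloading and parsing an hourly file, and this sequential loop is only an analogy for what the scheduler does across many workers.&lt;/p&gt;

```python
def load_partition(hour):
    """Stand-in for downloading and parsing one hourly file (hypothetical data)."""
    return [
        {"repo": {"name": "jupyter/nbviewer"}, "hour": hour},
        {"repo": {"name": "someone/else"}, "hour": hour},
    ]


def stream_filtered(hours, predicate):
    """Process partitions one at a time, retaining only the matching records."""
    kept = []
    for hour in hours:
        partition = load_partition(hour)  # full partition is alive only here
        kept.extend(record for record in partition if predicate(record))
        # The unfiltered partition is dropped when the loop moves on.
    return kept


jupyter_only = stream_filtered(
    range(5), lambda record: record["repo"]["name"].startswith("jupyter/")
)
print(len(jupyter_only))
```

&lt;p&gt;Peak memory is bounded by one unfiltered partition plus the records kept so far, rather than by the size of the whole dataset.&lt;/p&gt;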
&lt;p&gt;From a user API perspective, we call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;persist&lt;/span&gt;&lt;/code&gt; only on the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;jupyter&lt;/span&gt;&lt;/code&gt; dataset,
and not the full &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;records&lt;/span&gt;&lt;/code&gt; dataset.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;githubarchive-data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2015&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;              .map(json.loads))&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;repo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;jupyter/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="go"&gt;                   .repartition(10))&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;It takes 2m36s to download, decompress, and parse the five months of publicly
available GitHub events and filter them down to the Jupyter events on nine &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m3.2xlarges&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There were about seven thousand such events.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;7065&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We find which repositories saw the most activity during that time:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;repo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;                  .pluck(&amp;#39;name&amp;#39;)&lt;/span&gt;
&lt;span class="go"&gt;                  .frequencies()&lt;/span&gt;
&lt;span class="go"&gt;                  .topk(20, lambda kv: kv[1])&lt;/span&gt;
&lt;span class="go"&gt;                  .compute())&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 6.98 ms, sys: 474 µs, total: 7.46 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 219 ms&lt;/span&gt;

&lt;span class="go"&gt;[(&amp;#39;jupyter/jupyterhub&amp;#39;, 1262),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/nbgrader&amp;#39;, 1235),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/nbviewer&amp;#39;, 846),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_notebook&amp;#39;, 507),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter-drive&amp;#39;, 505),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/notebook&amp;#39;, 451),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/docker-demo-images&amp;#39;, 363),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/tmpnb&amp;#39;, 284),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_client&amp;#39;, 162),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/dockerspawner&amp;#39;, 149),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/colaboratory&amp;#39;, 134),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_core&amp;#39;, 127),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/strata-sv-2015-tutorial&amp;#39;, 108),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_nbconvert&amp;#39;, 103),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/configurable-http-proxy&amp;#39;, 89),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/hubpress.io&amp;#39;, 85),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter.github.io&amp;#39;, 84),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/tmpnb-deploy&amp;#39;, 76),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/nbconvert&amp;#39;, 66),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_qtconsole&amp;#39;, 59)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We see that projects like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;jupyterhub&lt;/span&gt;&lt;/code&gt; were quite active during that time
while, surprisingly, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nbconvert&lt;/span&gt;&lt;/code&gt; saw relatively little action.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/02/17/dask-distributed-part1.md&lt;/span&gt;, line 383)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="local-data"&gt;
&lt;h1&gt;Local Data&lt;/h1&gt;
&lt;p&gt;The Jupyter data is quite small and easily fits on a single machine. Let’s
bring the data to our local machine so that we can compare times:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 4.74 s, sys: 10.9 s, total: 15.7 s&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 30.2 s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;It takes surprisingly long to download the data, but once it’s here, we can
iterate far more quickly with basic Python.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;toolz.curried&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frequencies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;repo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;pluck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;frequencies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="go"&gt;               dict.items, topk(20, key=lambda kv: kv[1]), list)&lt;/span&gt;
&lt;span class="go"&gt;CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms&lt;/span&gt;
&lt;span class="go"&gt;Wall time: 11.5 ms&lt;/span&gt;

&lt;span class="go"&gt;[(&amp;#39;jupyter/jupyterhub&amp;#39;, 1262),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/nbgrader&amp;#39;, 1235),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/nbviewer&amp;#39;, 846),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_notebook&amp;#39;, 507),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter-drive&amp;#39;, 505),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/notebook&amp;#39;, 451),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/docker-demo-images&amp;#39;, 363),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/tmpnb&amp;#39;, 284),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_client&amp;#39;, 162),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/dockerspawner&amp;#39;, 149),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/colaboratory&amp;#39;, 134),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_core&amp;#39;, 127),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/strata-sv-2015-tutorial&amp;#39;, 108),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_nbconvert&amp;#39;, 103),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/configurable-http-proxy&amp;#39;, 89),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/hubpress.io&amp;#39;, 85),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter.github.io&amp;#39;, 84),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/tmpnb-deploy&amp;#39;, 76),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/nbconvert&amp;#39;, 66),&lt;/span&gt;
&lt;span class="go"&gt; (&amp;#39;jupyter/jupyter_qtconsole&amp;#39;, 59)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The difference here is roughly 20x (219 ms versus 11.5 ms), which is a good
reminder that once you no longer have a large problem, you should probably
eschew distributed systems and act locally.&lt;/p&gt;
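&lt;p&gt;As a rough illustration of how little machinery the local version needs, here is a minimal sketch of the same frequency count using only the standard library. It swaps the toolz pipeline for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;collections.Counter&lt;/span&gt;&lt;/code&gt;; the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;events&lt;/span&gt;&lt;/code&gt; list and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;top_repos&lt;/span&gt;&lt;/code&gt; helper are made up for this example and are not part of the original analysis.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical records shaped like the GitHub events above (made-up data).
events = [
    {'repo': {'name': 'jupyter/jupyterhub'}},
    {'repo': {'name': 'jupyter/nbgrader'}},
    {'repo': {'name': 'jupyter/jupyterhub'}},
    {'repo': {'name': 'jupyter/nbviewer'}},
    {'repo': {'name': 'jupyter/jupyterhub'}},
    {'repo': {'name': 'jupyter/nbgrader'}},
]


def top_repos(records, k):
    """Return the k most frequent repository names with their counts."""
    counts = Counter(record['repo']['name'] for record in records)
    return counts.most_common(k)


top = top_repos(events, 2)
print(top)  # [('jupyter/jupyterhub', 3), ('jupyter/nbgrader', 2)]
```

&lt;p&gt;Everything here runs in a single process with no scheduler or network hop, which is where the speedup on small data comes from.&lt;/p&gt;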
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/02/17/dask-distributed-part1.md&lt;/span&gt;, line 430)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Downloading, decompressing, parsing, filtering, and counting JSON records
is the new wordcount. It’s the first problem anyone sees. Fortunately it’s
both easy to solve and the common case. Woo hoo!&lt;/p&gt;
&lt;p&gt;Here we saw that dask+distributed handle the common case decently well and with
a pure Python stack. Typically Python users rely on a JVM technology like
Hadoop/Spark/Storm to distribute their computations. Here we have Python
distributing Python; there are some usability gains to be had here, such as nice
stack traces, a bit less serialization overhead, and attention to other
Pythonic style choices.&lt;/p&gt;
&lt;p&gt;Over the next few posts I intend to deviate from this common case. Most “Big
Data” technologies were designed to solve typical data munging problems found
in web companies or with simple database operations in mind. Python users care
about these things too, but they also reach out to a wide variety of fields.
In dask+distributed development we care about the common case, but also support
less traditional workflows that are commonly found in the life, physical, and
algorithmic sciences.&lt;/p&gt;
&lt;p&gt;By designing to support these more extreme cases we’ve nailed some common pain
points in current distributed systems. Today we’ve seen low latency and remote
control; in the future we’ll see others.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/02/17/dask-distributed-part1.md&lt;/span&gt;, line 455)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-doesn-t-work"&gt;
&lt;h1&gt;What doesn’t work&lt;/h1&gt;
&lt;p&gt;I’ll have an honest section like this at the end of each upcoming post
describing what doesn’t work, what still feels broken, or what I would have
done differently with more time.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The imports for dask and distributed are still strange. They’re two
separate codebases that play very nicely together. Unfortunately the
functionality you need is sometimes in one or in the other and it’s not
immediately clear to the novice user where to go. For example dask.bag, the
collection we’re using for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;records&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;jupyter&lt;/span&gt;&lt;/code&gt;, etc. is in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; but the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;s3&lt;/span&gt;&lt;/code&gt; module is within the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed&lt;/span&gt;&lt;/code&gt; library. We’ll have to merge things
at some point in the near-to-moderate future. Ditto for the API: there are
compute methods both on the dask collections (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;records.compute()&lt;/span&gt;&lt;/code&gt;) and on
the distributed executor (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;e.compute(records)&lt;/span&gt;&lt;/code&gt;) that behave slightly
differently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We lack an efficient distributed shuffle algorithm. This is very important
if you want to use operations like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.groupby&lt;/span&gt;&lt;/code&gt; (which you should avoid
anyway). The user API here doesn’t even warn users cleanly that this is
missing in the distributed case, which is kind of a mess. (It works fine on a
single machine.) Efficient alternatives like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;foldby&lt;/span&gt;&lt;/code&gt; &lt;em&gt;are&lt;/em&gt; available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I would have liked to run this experiment directly on the cluster to see
how low we could have gone below the 200ms barrier we ran into here.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
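&lt;p&gt;To make the point about &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;foldby&lt;/span&gt;&lt;/code&gt; concrete, here is a plain-Python sketch of the idea behind it: fold each partition into a small per-key dictionary, then combine those dictionaries, so no full shuffle of the records is needed. This is an illustration under stated assumptions, not dask’s implementation; the helper names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;foldby_partition&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;combine_partitions&lt;/span&gt;&lt;/code&gt; and the sample partitions are hypothetical.&lt;/p&gt;

```python
from functools import reduce


def foldby_partition(key, binop, initial, records):
    """Fold one partition into a small dict of per-key accumulators."""
    acc = {}
    for record in records:
        k = key(record)
        acc[k] = binop(acc.get(k, initial), record)
    return acc


def combine_partitions(combine, left, right):
    """Merge two per-key dicts; only these small dicts need to move."""
    merged = dict(left)
    for k, value in right.items():
        merged[k] = combine(merged[k], value) if k in merged else value
    return merged


# Made-up partitions standing in for data held on two workers.
partitions = [
    ['jupyter/notebook', 'jupyter/nbgrader', 'jupyter/notebook'],
    ['jupyter/nbgrader', 'jupyter/notebook'],
]

# Step 1: count names independently within each partition.
partials = [
    foldby_partition(lambda name: name, lambda total, _: total + 1, 0, part)
    for part in partitions
]

# Step 2: add the per-partition counts together.
counts = reduce(
    lambda a, b: combine_partitions(lambda x, y: x + y, a, b),
    partials,
)
print(counts)  # {'jupyter/notebook': 3, 'jupyter/nbgrader': 2}
```

&lt;p&gt;A groupby would instead have to bring every record with the same key onto one machine before reducing, which is the step that needs an efficient shuffle.&lt;/p&gt;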
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2016/02/17/dask-distributed-part1.md&lt;/span&gt;, line 481)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="links"&gt;
&lt;h1&gt;Links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org/en/latest/"&gt;dask&lt;/a&gt;, the original project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.readthedocs.org/en/latest/"&gt;dask.distributed&lt;/a&gt;, the
distributed memory scheduler powering the cluster computing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask.pydata.org/en/latest/bag.html"&gt;dask.bag&lt;/a&gt;, the user API we’ve
used in this post.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This post largely repeats work by &lt;a class="reference external" href="https://github.com/cowlicks"&gt;Blake Griffith&lt;/a&gt; in a
&lt;a class="reference external" href="https://www.continuum.io/content/dask-distributed-and-anaconda-cluster"&gt;similar post&lt;/a&gt;
last year with an older iteration of the dask distributed scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2016/02/17/dask-distributed-part1/"/>
    <summary>This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2016-02-17T00:00:00+00:00</published>
  </entry>
</feed>
