<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts by Matthew Rocklin (Coiled)</title>
  <updated>2026-03-05T15:05:20.178558+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/author/matthew-rocklin-coiled/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2020/07/21/faster-scheduling/</id>
    <title>Faster Scheduling</title>
    <updated>2020-07-21T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin (Coiled)</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 8)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;This post discusses Dask overhead costs for task scheduling,
and then lays out a rough plan for acceleration.&lt;/p&gt;
&lt;p&gt;This post is written for other maintainers, and often refers to internal
details. It is not intended for broad readability.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 16)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-does-this-problem-present"&gt;
&lt;h1&gt;How does this problem present?&lt;/h1&gt;
&lt;p&gt;When we submit large graphs there is a bit of a delay between us calling
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.compute()&lt;/span&gt;&lt;/code&gt; and work actually starting on the workers. In some cases, that
delay can affect usability and performance.&lt;/p&gt;
&lt;p&gt;Additionally, in far fewer cases, the gaps in between tasks can be an issue,
especially if those tasks are very short and for some reason can not be made
longer.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 26)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="who-cares"&gt;
&lt;h1&gt;Who cares?&lt;/h1&gt;
&lt;p&gt;First, this is a problem that affects about 1-5% of Dask users. These are people
who want to process millions of tasks relatively quickly. Let’s list a few use
cases:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Xarray/Pangeo workloads at the 10-100TB scale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NVIDIA RAPIDS workloads on large tabular data (GPUs make computing fast, so other costs become
relatively larger)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some mystery use cases inside of some hedge funds&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It does not affect the everyday user, who processes 100GB to a few TB of data,
and doesn’t mind waiting 10s for things to start running.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 40)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="coarse-breakdown-of-costs"&gt;
&lt;h1&gt;Coarse breakdown of costs&lt;/h1&gt;
&lt;p&gt;When you call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x.sum().compute()&lt;/span&gt;&lt;/code&gt; a few things happen:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph generation:&lt;/strong&gt; Some Python code in a Dask collection library, like
dask array, calls the sum function, which generates a task graph on the
client side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Optimization:&lt;/strong&gt; We then optimize that graph, also on the client
side, in order to remove unnecessary work, fuse tasks, apply important high
level optimizations, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Serializtion:&lt;/strong&gt; We now pack up that graph in a form that can be
sent over to the scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Communication:&lt;/strong&gt; We fire those bytes across a wire over to the
scheduler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduler.update_graph:&lt;/strong&gt; The scheduler receives these bytes, unpacks
them, and then updates its own internal data structures&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduling:&lt;/strong&gt; The scheduler then assigns ready tasks to workers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Communicate to workers:&lt;/strong&gt; The scheduler sends out lots of smaller
messages to each of the workers with the tasks that they can perform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workers work:&lt;/strong&gt; The workers then perform this work, and start
communicating back and forth with the scheduler to receive new instructions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Generally most people today are concerned with steps 1-6. Once things get out
to the workers and progress bars start moving people tend to care a bit less
(but not zero).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 66)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-do-other-people-do"&gt;
&lt;h1&gt;What do other people do?&lt;/h1&gt;
&lt;p&gt;Let’s look at a few things that other projects do, and see if there are things
that we can learn. These are commonly suggested, but there are challenges with
most of them.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rewrite the scheduler it in C++/Rust/C/Cython&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Proposal: Python is slow. Want to make it faster? Don’t use Python. See
academic projects.&lt;/p&gt;
&lt;p&gt;Challenge: This makes sense for some parts of the pipeline above, but not
for others. It also makes it harder to attract maintainers.&lt;/p&gt;
&lt;p&gt;What we should consider: Some parts of the scheduler and optimization
algorithms could be written in a lower level language, maybe Cython. We’ll
need to be careful about maintainability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed scheduling&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Proposal: The scheduler is slow, maybe have many schedulers? See Ray.&lt;/p&gt;
&lt;p&gt;Challenge: It’s actually really hard to make the kinds of decisions that
Dask has to make if scheduling state is spread on many computers.
Distributed scheduling works better when the workload is very either
uniform or highly decoupled.
Distributed scheduling is really attractive to people who like solving
interesting/hard problems.&lt;/p&gt;
&lt;p&gt;What we should consider: We can move some simple logic down to the workers.
We’ve already done this with the easy stuff though.
It’s not clear how much additional benefit there is here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build specialty scheduling around collections&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Proposal: If Dask were to become just a dataframe library or just an array
computing library then it could special-case things more effectively. See
Spark, Mars, and others.&lt;/p&gt;
&lt;p&gt;Challenge: Yes, but Dask is not a dataframe library or an array library.
The three use cases we mention above are all very different.&lt;/p&gt;
&lt;p&gt;What we should consider: modules like dask array and dask dataframe should
develop high level query blocks, and we should endeavor to
communicate these subgraphs over the wire directly so that they are more
compact.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 113)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-should-we-actually-do"&gt;
&lt;h1&gt;What should we actually do?&lt;/h1&gt;
&lt;p&gt;Because our pipeline has many stages, each of which can be slow for different
reasons, we’ll have to do many things. Additionally, this is a hard problem
because changing one piece of the project at this level has repurcussions for
many other pieces. The rest of this post tries to lay out a consistent set of
changes. Let’s start with a summary:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;For Dask array/dataframe let’s use high level graphs more aggressively so
that we can communicate only abstract representations between the client
and scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But this breaks low level graph optimizations, fuse, cull, and slice fusion
in particular. We can make these unnecessary with two changes:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We can make high level graphs considerably smarter to handle cull and slice fusion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can move a bit more of the scheduling down to the workers to
replicate the advantages of low-level fusion there&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, once all of the graph manipulation happens on the scheduler, let’s
try to accelerate it, hopefully in a language that the current dev
community can understand, like Cython&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At the same time in parallel, let’s take a look at our network stack&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We’ll go into these in more depth below&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 136)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="graph-generation"&gt;
&lt;h1&gt;Graph Generation&lt;/h1&gt;
&lt;section id="high-level-graph-history"&gt;
&lt;h2&gt;High Level Graph History&lt;/h2&gt;
&lt;p&gt;A year or two ago we moved graph generation costs from user-code-typing time to
graph-optimization-time with high level graphs&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="c1"&gt;# graph generation used to happen here&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;  &lt;span class="c1"&gt;# now it happens here&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This really improved usability, and also let us do some high level
optimizations which sometimes allowed us to skip some lower-level optimization
costs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="can-we-push-this-further"&gt;
&lt;h2&gt;Can we push this further?&lt;/h2&gt;
&lt;p&gt;The first four stages of our pipeline happen on the client:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph generation:&lt;/strong&gt; Some Python code in a Dask collection library, like
dask array, calls the sum function, which generates a task graph on the
client side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Optimization:&lt;/strong&gt; We then optimize that graph, also on the client
side, in order to remove unnecessary work, fuse tasks, apply important high
level optimizations, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Serializtion:&lt;/strong&gt; We now pack up that graph in a form that can be
sent over to the scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Communication:&lt;/strong&gt; We fire those bytes across a wire over to the
scheduler&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If we’re able to stay with the high level graph representation through these
stages all the way until graph communication, then we can communicate a far
more compact representation up to the scheduler. We can drop a lot of these
costs, at least for the high level collection APIs (delayed and client.submit
would still be slow, client.map might be ok though).&lt;/p&gt;
&lt;p&gt;This has a couple of other nice benefits:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;User’s code won’t block, and we can alert the user immediately that we’re
on the job&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve centralized costs in just the scheduler,
so there is now only one place where we might have to think about low-level code&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;(some conversation here: https://github.com/dask/distributed/issues/3872)&lt;/p&gt;
&lt;/section&gt;
&lt;section id="however-low-level-graph-optimizations-are-going-to-be-a-problem"&gt;
&lt;h2&gt;However, low-level graph optimizations are going to be a problem&lt;/h2&gt;
&lt;p&gt;In principle changing the distributed scheduler to accept a variety of graph
layer types is a tedious but straightforward problem. I’m not concerned.&lt;/p&gt;
&lt;p&gt;The bigger concern is what to do with low-level graph optimizations.
Today we have three of these that really matter:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Task fusion: this is what keeps your &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; task merged with your
subsequent blockwise tasks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Culling: this is what makes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.head()&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x[0]&lt;/span&gt;&lt;/code&gt; fast&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slice fusion: this is why &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x[:100][5]&lt;/span&gt;&lt;/code&gt; works well&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In order for us to transmit abstract graph layers up to the scheduler, we need
to remove the need for these low level graph optimizations. I think that we
can do this with a combination of two approaches:&lt;/p&gt;
&lt;section id="more-clever-high-level-graph-manipulation"&gt;
&lt;h3&gt;More clever high level graph manipulation&lt;/h3&gt;
&lt;p&gt;We already do this a bit with blockwise, which has its own fusion, and which
removes much of the need for fusion generally. But other blockwise-like
operations, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_*&lt;/span&gt;&lt;/code&gt; will probably have to join the Blockwise family.&lt;/p&gt;
&lt;p&gt;Getting culling to work properly may require us to teach each of the individual
graph layers how to track dependencies in each layer type and cull themselves.
This may get tricky.&lt;/p&gt;
&lt;p&gt;Slicing is doable, we just need someone to go in, grok all of the current
slicing optimizations, and make high level graph layers for these
computations. This would be a great project for a sharp masters student&lt;/p&gt;
&lt;/section&gt;
&lt;section id="send-speculative-tasks-to-the-workers"&gt;
&lt;h3&gt;Send speculative tasks to the workers&lt;/h3&gt;
&lt;p&gt;High level Blockwise fusion handles many of the use cases for low-level fusion,
but not all. For example I/O layers like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.read_parquet&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.from_zarr&lt;/span&gt;&lt;/code&gt;
aren’t fused at a high level.&lt;/p&gt;
&lt;p&gt;We can resolve this either by making them blockwise layers (this requires
expanding the blockwise abstraction, which may be hard) or alternatively we can
start sending not-yet-ready tasks to workers before all of their dependencies
are finished if we’re highly confident that we know where they’re going to go.
This would give us some of the same results of fusion, but would keep all of
the task types separate (which would be nice for diagnostics) and might still
give us some of the same performance benefits that we get from fusion.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="unpack-abstract-graph-layers-on-the-scheduler"&gt;
&lt;h2&gt;Unpack abstract graph layers on the scheduler&lt;/h2&gt;
&lt;p&gt;So after we’ve removed the need for low level optimizations, and we just send
the abstract graph layers up to the scheduler directly, we’ll need to teach the
scheduler how to unpack those graph layers.&lt;/p&gt;
&lt;p&gt;This is a little tricky because the Scheduler can’t run user Python code (for
security reasons). We’ll have to register layer types (like blockwise,
rechunk, dataframe shuffle) that the scheduler knows about and trusts ahead of
time. We’ll still always support custom layers, and these will be at the same
speed that they’ve always been, but hopefully there will be far less need for
these if we go all-in on high level layers.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 240)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="rewrite-scheduler-in-low-level-language"&gt;
&lt;h1&gt;Rewrite scheduler in low-level language&lt;/h1&gt;
&lt;p&gt;Once most of the finicky bits are moved to the scheduler, we’ll have one place
where we can focus on low level graph state manipulation.&lt;/p&gt;
&lt;p&gt;Dask’s distributed scheduler is two things:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;A Tornado TCP application that receives signals from clients and workers
and send signals out to clients and workers&lt;/p&gt;
&lt;p&gt;This is async-heavy networking code&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A complex state machine internally that responds to those state changes&lt;/p&gt;
&lt;p&gt;This is a complex data structure heavy Python code&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="networking"&gt;
&lt;h2&gt;Networking&lt;/h2&gt;
&lt;p&gt;Jim has an interesting project here that shows promise: https://github.com/jcrist/ery
Reducing latency between workers and the scheduler would be good, and would
help to accelerate stages 7-8 in the pipeline listed at the top of this post.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="state-machine"&gt;
&lt;h2&gt;State machine&lt;/h2&gt;
&lt;p&gt;Rewriting the state machine in some lower level language would be fine.
Ideally this would be in a language that was easy for the current maintainer
community to maintain, (Cython?) but we may also consider making a more firm
interface here that would allow other groups to experiment safely.&lt;/p&gt;
&lt;p&gt;There are some advantages to this (more experimentation by different groups)
but also some costs (splitting of core efforts and mismatches for users).
Also, I suspect that splitting out also probably means that we’ll probably lose the dashboard,
unless those other groups are very careful to expose the same state to Bokeh.&lt;/p&gt;
&lt;p&gt;There is more exploration to do here. Regardless I think that it probably makes
sense to try to isolate the state machine from the networking system.
Maybe this also makes it easier for people to profile in isolation.&lt;/p&gt;
&lt;p&gt;In speaking with a few different groups most people have expressed reservation
about having multiple different state machine codes. This was done in
MapReduce and Spark and resulted in difficult to maintain community dynamics.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 282)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="high-level-graph-optimizations"&gt;
&lt;h1&gt;High Level Graph Optimizations&lt;/h1&gt;
&lt;p&gt;Once we have everything in smarter high level graph layers,
we will also be more ripe for optimization.&lt;/p&gt;
&lt;p&gt;We’ll need a better way to write down these optimizations with a separated
traversal system and a set of rules. A few of us have
written these things before, maybe it’s time we revisit them.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/07/21/faster-scheduling.md&lt;/span&gt;, line 291)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-we-need"&gt;
&lt;h1&gt;What we need&lt;/h1&gt;
&lt;p&gt;This would require some effort, but I think that it would hit several high
profile problems at once. There are a few tricky things to get right:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A framework for high level graph layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An optimization system for high level graph layers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separation of the scheduler into two parts&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For this I think that we’ll need people who are fairly familiar with Dask to do this right.&lt;/p&gt;
&lt;p&gt;And there there is a fair amount of follow-on work&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Build a hierarchy of layers for dask dataframe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build a hierarchy of layers for dask array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build optimizations for those to remove the need for low level graph
optimizations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rewrite core parts of the scheduler in Cython&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experiment with the networking layer, maybe with a new Comm&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I’ve been thinking about the right way to enact this change.
Historically most Dask changes over the past few years have been incremental or
peripheral, due to how burdened the maintainers are. There might be enough
pressure on this problem though that we can get some dedicated engineering
effort from a few organizations though, which might change how possible this is.
We’ve gotten 25% time from a few groups. I’m curious if we can gate 100% time
for some people for a few months.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/07/21/faster-scheduling/"/>
    <summary>Document headings start at H2, not H1 [myst.header]</summary>
    <published>2020-07-21T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2020/05/13/large-svds/</id>
    <title>Large SVDs</title>
    <updated>2020-05-13T00:00:00+00:00</updated>
    <author>
      <name>Alistair Miles (Oxford University)</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 10)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;We perform Singular Value Decomposition (SVD) calculations on large datasets.&lt;/p&gt;
&lt;p&gt;We modify the computation both by using fully precise and approximate methods,
and by using both CPUs and GPUs.&lt;/p&gt;
&lt;p&gt;In the end we compute an approximate SVD of 200GB of simulated data and using a mutli-GPU machine in 15-20 seconds.
Then we run this from a dataset stored in the cloud
where we find that I/O is, predictably, a major bottleneck.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="svd-the-simple-case"&gt;
&lt;h1&gt;SVD - The simple case&lt;/h1&gt;
&lt;p&gt;Dask arrays contain a relatively sophisticated SVD algorithm that works in the
tall-and-skinny or short-and-fat cases, but not so well in the roughly-square
case. It works by taking QR decompositions of each block of the array,
combining the R matrices, doing another smaller SVD on those, and then
performing some matrix multiplication to get back to the full result. It’s
numerically stable and decently fast, assuming that the intermediate R
matrices of the QR decompositions mostly fit in memory.&lt;/p&gt;
&lt;p&gt;The memory constraints here are that if you have an &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt;&lt;/code&gt; by &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt; tall and
skinny array (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt; &lt;span class="pre"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt;) cut into &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;k&lt;/span&gt;&lt;/code&gt; blocks then you need to have about &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m**2&lt;/span&gt; &lt;span class="pre"&gt;*&lt;/span&gt; &lt;span class="pre"&gt;k&lt;/span&gt;&lt;/code&gt; space. This is true in many cases, including typical PCA machine learning
workloads, where you have tabular data with a few columns (hundreds at most)
and many rows.&lt;/p&gt;
&lt;p&gt;It’s easy to use and quite robust.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 1.60 GB &lt;/td&gt; &lt;td&gt; 100.00 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (10000000, 20) &lt;/td&gt; &lt;td&gt; (625000, 20) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 16 Tasks &lt;/td&gt;&lt;td&gt; 16 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; float64 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="75" height="170" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="25" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="7" x2="25" y2="7" /&gt;
  &lt;line x1="0" y1="15" x2="25" y2="15" /&gt;
  &lt;line x1="0" y1="22" x2="25" y2="22" /&gt;
  &lt;line x1="0" y1="30" x2="25" y2="30" /&gt;
  &lt;line x1="0" y1="37" x2="25" y2="37" /&gt;
  &lt;line x1="0" y1="45" x2="25" y2="45" /&gt;
  &lt;line x1="0" y1="52" x2="25" y2="52" /&gt;
  &lt;line x1="0" y1="60" x2="25" y2="60" /&gt;
  &lt;line x1="0" y1="67" x2="25" y2="67" /&gt;
  &lt;line x1="0" y1="75" x2="25" y2="75" /&gt;
  &lt;line x1="0" y1="82" x2="25" y2="82" /&gt;
  &lt;line x1="0" y1="90" x2="25" y2="90" /&gt;
  &lt;line x1="0" y1="97" x2="25" y2="97" /&gt;
  &lt;line x1="0" y1="105" x2="25" y2="105" /&gt;
  &lt;line x1="0" y1="112" x2="25" y2="112" /&gt;
  &lt;line x1="0" y1="120" x2="25" y2="120" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="120" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 25.412617,0.000000 25.412617,120.000000 0.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="12.706308" y="140.000000" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;20&lt;/text&gt;
&lt;text x="45.412617" y="60.000000" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,45.412617,60.000000)"&gt;10000000&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This works fine in the short and fat case too (when you have far more columns
than rows) but we’re always going to assume that one of your dimensions is
unchunked, and that the other dimension has chunks that are quite a bit
longer, otherwise, things might not fit into memory.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 105)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="approximate-svd"&gt;
&lt;h1&gt;Approximate SVD&lt;/h1&gt;
&lt;p&gt;If your dataset is large in both dimensions then the algorithm above won’t work
as is. However, if you don’t need exact results, or if you only need a few of
the components, then there are a number of excellent approximation algorithms.&lt;/p&gt;
&lt;p&gt;Dask array has one of these approximation algorithms implemented in the
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.linalg.svd_compressed"&gt;da.linalg.svd_compressed&lt;/a&gt;
function. And with it we can compute the approximate SVD of very large
matrices.&lt;/p&gt;
&lt;p&gt;We were recently working on a problem (explained below) and found that we were
still running out of memory when dealing with this algorithm. There were two
challenges that we ran into:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The algorithm requires multiple passes over the data, but the Dask task
scheduler was keeping the input matrix in memory after it had been loaded once
in order to avoid recomputation.
Things still worked, but Dask had to move the data to disk and back
repeatedly, which reduced performance significantly.&lt;/p&gt;
&lt;p&gt;We resolved this by including explicit recomputation steps in the algorithm.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Related chunks of data would be loaded at different times, and so would
need to stick around longer than necessary to wait for their associated
chunks.&lt;/p&gt;
&lt;p&gt;We resolved this by engaging task fusion as an optimization pass.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before diving further into the technical solution
we quickly provide the use case that was motivating this work.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 137)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="application-genomics"&gt;
&lt;h1&gt;Application - Genomics&lt;/h1&gt;
&lt;p&gt;Many studies are using genome sequencing to study genetic variation
between different individuals within a species. These includes
studies of human populations, but also other species such as mice,
mosquitoes or disease-causing parasites. These studies will, in
general, find a large number of sites in the genome sequence where
individuals differ from each other. For example, humans have more
than 100 million variable sites in the genome, and modern studies
like the &lt;a class="reference external" href="https://www.ukbiobank.ac.uk/"&gt;UK BioBank&lt;/a&gt; are working towards
sequencing the genomes of 1 million individuals or more.&lt;/p&gt;
&lt;p&gt;In diploid species like humans, mice or mosquitoes, each individual
carries two genome sequences, one inherited from each parent. At each
of the 100 million variable genome sites there will be two or more
“alleles” that a single genome might carry. One way to think about
this is via the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Punnett_square"&gt;Punnett
square&lt;/a&gt;, which
represents the different possible genotypes that one individual might
carry at one of these variable sites:&lt;/p&gt;
&lt;td&gt;
&lt;img src="https://upload.wikimedia.org/wikipedia/commons/9/93/Punnett_Square_Genetic_Carriers.PNG" alt="punnet square" height="40%" width="40%"&gt;
&lt;/td&gt;
&lt;p&gt;In the above there are three possible genotypes: AA, Aa, and aa. For
computational genomics, these genotypes can be encoded as 0, 1, or 2.
In a study of a species with M genetic variants assayed in N
individual samples, we can represent these genotypes as an (M x N)
array of integers. For a modern human genetics study, the scale of
this array might approach (100 million x 1 million). (Although in
practice, the size of the first dimension (number of variants) can be
reduced somewhat, by at least an order of magnitude, because many
genetic variants will carry little information and/or be correlated
with each other.)&lt;/p&gt;
&lt;p&gt;These genetic differences are not random, but carry information about
patterns of genetic similarity and shared ancestry between
individuals, because of the way they have been inherited through many
generations. A common task is to perform a dimensionality reduction
analysis on these data, such as a &lt;a class="reference external" href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190"&gt;principal components
analysis&lt;/a&gt;
(SVD), to identify genetic structure reflecting these differencies in
degree of shared ancestry. This is an essential part of discovering
genetic variants associated with different diseases, and for learning
more about the genetic history of populations and species.&lt;/p&gt;
&lt;p&gt;Reducing the time taken to compute an analysis such as SVD, like all
science, allows for exploring larger datasets and testing more
hypotheses in less time. Practically, this means not simply a fast
SVD but an accelerated pipeline end-to-end, from data loading to
analysis, to understanding.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;We want to run an experiment in less time than it takes to make a cup of tea&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 192)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performant-svds-w-dask"&gt;
&lt;h1&gt;Performant SVDs w/ Dask&lt;/h1&gt;
&lt;p&gt;Now that we have that scientific background, let’s transition back to talking about computation.&lt;/p&gt;
&lt;p&gt;To stop Dask from holding onto the data we intentionally trigger computation as
we build up the graph. This is a bit atypical in Dask calculations (we prefer
to have as much of the computation at once before computing) but useful given
the multiple-pass nature of this problem. This was a fairly easy change, and
is available in &lt;a class="reference external" href="https://github.com/dask/dask/pull/5041"&gt;dask/dask #5041&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, we found that it was helpful to turn on moderately wide task
fusion.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;optimization.fuse.ave-width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 210)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="then-things-work-fine"&gt;
&lt;h1&gt;Then things work fine&lt;/h1&gt;
&lt;p&gt;We’re going to try this SVD on a few different choices of hardware including:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A MacBook Pro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A DGX-2, an NVIDIA worksation with 16 high-end GPUs and fast interconnect&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A twenty-node cluster on AWS&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="macbook-pro"&gt;
&lt;h2&gt;Macbook Pro&lt;/h2&gt;
&lt;p&gt;We can happily perform an SVD on a 20GB array on a Macbook Pro&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This call is no longer entirely lazy, and it recomputes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; a couple times, but
it works, and it works using only a few GB of RAM on a consumer laptop.&lt;/p&gt;
&lt;p&gt;It takes around 2min 30s time to compute that on a laptop.
That’s great! It was super easy to try out, didn’t require any special
hardware or setup, and in many cases is fast enough.
By working locally we can iterate quickly.&lt;/p&gt;
&lt;p&gt;Now that things work, we can experiment with different hardware.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="adding-gpus-a-15-second-svd"&gt;
&lt;h2&gt;Adding GPUs (a 15 second SVD)&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can dramatically increase performance by using a multi-GPU machine.
NVIDIA and other manufacturers now make machines with multiple GPUs co-located in the same physical box.
In the following section, we will run the calculations on a &lt;strong&gt;DGX2&lt;/strong&gt;, a machine with 16 GPUs and fast interconnect between the GPUs.&lt;/p&gt;
&lt;p&gt;Below is almost the same code, running in significantly less same time but we make the
following changes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We increase the size of the array by a factor of &lt;strong&gt;10x&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We switch out NumPy for CuPy, a GPU NumPy implementation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (NVLink will dramatically decrease transfer time between workers)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 to 15 seconds.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4"&gt;full notebook is here&lt;/a&gt;,
but the relevant code snippets are below:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Some GPU specific setup&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the data and run the SVD as normal&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
               &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;uint8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To see this run, we recommend viewing this screencast:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/6hmt1gARqp0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="read-dataset-from-disk"&gt;
&lt;h2&gt;Read dataset from Disk&lt;/h2&gt;
&lt;p&gt;While impressive, the computation above is mostly bound by generating random
data and then performing matrix calculations. GPUs are good at both of these
things.&lt;/p&gt;
&lt;p&gt;In practice though, our input array won’t be randomly generated, it will be
coming from some dataset stored on disk or increasingly more common, stored in the cloud.
To make things more realistic we perform a similar calculation with data
stored in a &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/"&gt;Zarr format&lt;/a&gt;
in &lt;a class="reference external" href="https://cloud.google.com/storage"&gt;GCS&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In this &lt;a class="reference external" href="https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078"&gt;Zarr SVD example&lt;/a&gt;,
we load a 25GB GCS backed data set onto a DGX2,
run a few processing steps, then perform an SVD.
The combination of preprocessing and SVD calculations ran in 18.7 sec and the data loading took 14.3 seconds.&lt;/p&gt;
&lt;p&gt;Again, on a DGX2, from data loading to SVD we are running in time less than it would take to make a cup of tea.
However, the data loading can be accelerated.
From GCS we are reading into data into the main memory of the machine (host memory), uncompressing the zarr bits,
then moving the data from host memory to the GPU (device memory). Passing data back and forth between host and device memory can significantly decrease performance. Reading directly into the GPU, bypassing host memory, would improve the overall pipeline.&lt;/p&gt;
&lt;p&gt;And so we come back to a common lesson of high performance computing:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;High performance computing isn’t about doing one thing exceedingly well, it’s
about doing nothing poorly&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In this case GPUs made our computation fast enough that we now need to focus on
other components of our system, notably disk bandwidth, and a direct reader for
Zarr data to GPU memory.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cloud"&gt;
&lt;h2&gt;Cloud&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Diclaimer: one of the authors (Matthew Rocklin) works for Coiled Computing&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can also run this on the cloud with any number of frameworks.
In this case we used the &lt;a class="reference external" href="https://coiled.io"&gt;Coiled Cloud&lt;/a&gt; service to deploy on AWS&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;coiled_cloud&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Cloud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;
&lt;span class="n"&gt;cloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Cloud&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_cluster_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;friends&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;genomics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;16 GiB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;OMP_NUM_THREADS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;OPENBLAS_NUM_THREADS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# &amp;quot;EXTRA_PIP_PACKAGES&amp;quot;: &amp;quot;zarr&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;friends&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;typename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;genomics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# then proceed as normal&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Using 20 machines with a total of 80 CPU cores on a dataset that was 10x larger
than the MacBook pro example we were able to run in about the same amount of
time. This shows near optimal scaling for this problem, which is nice to see
given how complex the SVD calculation is.&lt;/p&gt;
&lt;p&gt;A screencast of this problem is viewable here&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/qaJcAvhgLy4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 360)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="compression"&gt;
&lt;h1&gt;Compression&lt;/h1&gt;
&lt;p&gt;One of the easiest ways for us to improve performance is to reduce the size of
this data through compression.
This data is highly compressible for two reasons:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The real-world data itself has structure and repetition
(although the random play data does not)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’re storing entries that take on only four values.
We’re using eight-bit integers when we only needed two-bit integers&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s solve the second problem first.&lt;/p&gt;
&lt;section id="compression-with-bit-twiddling"&gt;
&lt;h2&gt;Compression with bit twiddling&lt;/h2&gt;
&lt;p&gt;Ideally Numpy would have a two-bit integer datatype.
Unfortunately it doesn’t, and this is hard because memory in computers is
generally thought of in full bytes.&lt;/p&gt;
&lt;p&gt;To work around this we can use bit arithmetic to shove four values into a single value
Here are functions that do that, assuming that our array is square,
and the last dimension is divisible by four.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00000011&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00001100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00110000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b11000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;back&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then, we can use these functions along with Dask to store our data in a
compressed state, but lazily decompress on-demand.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That’s it. We compress each block our data and store that in memory.
However the output variable that we have, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; will decompress each chunk before
we operate on it, so we don’t need to worry about handling compressed blocks.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="compression-with-zarr"&gt;
&lt;h2&gt;Compression with Zarr&lt;/h2&gt;
&lt;p&gt;A slightly more general but probably less efficient route would be to compress
our arrays with a proper compression library like Zarr.&lt;/p&gt;
&lt;p&gt;The example below shows how we do this in practice.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;zarr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;
&lt;span class="n"&gt;compressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lz4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clevel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BITSHUFFLE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zarr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Additionally, if we’re using the dask-distributed scheduler then we want to
make sure that the Blosc compression library doesn’t use additional threads.
That way we don’t have parallel calls of a parallel library, which can cause
some contention&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;set_no_threads_blosc&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Stop blosc from using multiple threads &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;
    &lt;span class="n"&gt;numcodecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

&lt;span class="c1"&gt;# Run on all workers&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_worker_plugin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_no_threads_blosc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This approach is more general, and probably a good trick to have up ones’
sleeve, but it also doesn’t work on GPUs, which in the end is why we ended up
going with the bit-twiddling approach one section above, which uses API that
was uniformly accessible within the Numpy and CuPy libraries.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 452)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;In this post we did a few things, all around a single important problems in
genomics.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We learned a bit of science&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We translated a science problem into a computational problem,
and in particular into a request to perform large singular value decompositions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We used a canned algorithm in dask.array that performed pretty well,
assuming that we’re comfortable going over the array in a few passes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We then tried that algorithm on three architectures quickly&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A Macbook Pro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A multi-GPU machine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A cluster in the cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally we talked about some tricks to pack more data into the same memory
with compression&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This problem was nice in that we got to dive deep into a technical science
problem, and yet also try a bunch of architecture quickly to investigate
hardware choices that we might make in the future.&lt;/p&gt;
&lt;p&gt;We used several technologies here today, made by several different communities
and companies. It was great to see how they all worked together seamlessly to
provide a flexible-yet-consistent experience.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/05/13/large-svds/"/>
    <summary>Document headings start at H2, not H1 [myst.header]</summary>
    <category term="CuPy" label="CuPy"/>
    <category term="GPU" label="GPU"/>
    <category term="array" label="array"/>
    <published>2020-05-13T00:00:00+00:00</published>
  </entry>
</feed>
