<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged array</title>
  <updated>2026-03-05T15:05:25.157800+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/array/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2020/05/13/large-svds/</id>
    <title>Large SVDs</title>
    <updated>2020-05-13T00:00:00+00:00</updated>
    <author>
      <name>Alistair Miles (Oxford University)</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 10)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="summary"&gt;

&lt;p&gt;We perform Singular Value Decomposition (SVD) calculations on large datasets.&lt;/p&gt;
&lt;p&gt;We vary the computation, using both fully precise and approximate methods,
and running on both CPUs and GPUs.&lt;/p&gt;
&lt;p&gt;In the end we compute an approximate SVD of 200GB of simulated data on a multi-GPU machine in 15-20 seconds.
Then we run the same computation on a dataset stored in the cloud,
where we find that I/O is, predictably, a major bottleneck.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="svd-the-simple-case"&gt;
&lt;h1&gt;SVD - The simple case&lt;/h1&gt;
&lt;p&gt;Dask arrays contain a relatively sophisticated SVD algorithm that works in the
tall-and-skinny or short-and-fat cases, but not so well in the roughly-square
case. It works by taking QR decompositions of each block of the array,
combining the R matrices, doing another smaller SVD on those, and then
performing some matrix multiplication to get back to the full result. It’s
numerically stable and decently fast, assuming that the intermediate R
matrices of the QR decompositions mostly fit in memory.&lt;/p&gt;
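&lt;p&gt;To make this concrete, here is a minimal NumPy sketch of that blocked,
QR-based approach; the names and structure are illustrative only, not Dask’s
internal implementation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;import numpy as np

def tsqr_svd(blocks):
    """Tall-and-skinny SVD sketch: QR each block, SVD the stacked
    R factors, then map the small U back through the Q factors."""
    qs, rs = zip(*(np.linalg.qr(b) for b in blocks))
    u_r, s, vt = np.linalg.svd(np.vstack(rs), full_matrices=False)
    # Each m-row slice of u_r pairs with one Q factor
    u = np.vstack([q @ u_i for q, u_i in zip(qs, np.split(u_r, len(blocks)))])
    return u, s, vt

# Toy usage: four blocks of a 4000 x 20 tall-and-skinny array
blocks = [np.random.random((1000, 20)) for _ in range(4)]
u, s, vt = tsqr_svd(blocks)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;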
&lt;p&gt;The memory constraint here is that if you have an &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt;&lt;/code&gt; by &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt; tall and
skinny array (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt; &lt;span class="pre"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt;) cut into &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;k&lt;/span&gt;&lt;/code&gt; blocks, then you need about &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;m**2&lt;/span&gt; &lt;span class="pre"&gt;*&lt;/span&gt; &lt;span class="pre"&gt;k&lt;/span&gt;&lt;/code&gt; space for the stacked R matrices
(for the 10,000,000 by 20 array below, cut into 16 blocks, that is only about 50 kB).
This holds in many cases, including typical PCA machine learning
workloads, where you have tabular data with a few columns (hundreds at most)
and many rows.&lt;/p&gt;
&lt;p&gt;It’s easy to use and quite robust.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 1.60 GB &lt;/td&gt; &lt;td&gt; 100.00 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (10000000, 20) &lt;/td&gt; &lt;td&gt; (625000, 20) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 16 Tasks &lt;/td&gt;&lt;td&gt; 16 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; float64 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="75" height="170" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="25" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="7" x2="25" y2="7" /&gt;
  &lt;line x1="0" y1="15" x2="25" y2="15" /&gt;
  &lt;line x1="0" y1="22" x2="25" y2="22" /&gt;
  &lt;line x1="0" y1="30" x2="25" y2="30" /&gt;
  &lt;line x1="0" y1="37" x2="25" y2="37" /&gt;
  &lt;line x1="0" y1="45" x2="25" y2="45" /&gt;
  &lt;line x1="0" y1="52" x2="25" y2="52" /&gt;
  &lt;line x1="0" y1="60" x2="25" y2="60" /&gt;
  &lt;line x1="0" y1="67" x2="25" y2="67" /&gt;
  &lt;line x1="0" y1="75" x2="25" y2="75" /&gt;
  &lt;line x1="0" y1="82" x2="25" y2="82" /&gt;
  &lt;line x1="0" y1="90" x2="25" y2="90" /&gt;
  &lt;line x1="0" y1="97" x2="25" y2="97" /&gt;
  &lt;line x1="0" y1="105" x2="25" y2="105" /&gt;
  &lt;line x1="0" y1="112" x2="25" y2="112" /&gt;
  &lt;line x1="0" y1="120" x2="25" y2="120" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="120" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 25.412617,0.000000 25.412617,120.000000 0.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="12.706308" y="140.000000" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;20&lt;/text&gt;
&lt;text x="45.412617" y="60.000000" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,45.412617,60.000000)"&gt;10000000&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This works fine in the short and fat case too (when you have far more columns
than rows), but we always assume that one of your dimensions is
unchunked and that the other dimension has chunks that are quite a bit
longer; otherwise, things might not fit into memory.&lt;/p&gt;
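&lt;p&gt;If your chunks don’t already look like that, one way to arrange it is to
rechunk so that the short dimension lives in a single chunk (a sketch; the
exact chunk sizes are a judgment call):&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;# -1 means a single chunk along that axis; "auto" lets Dask
# pick reasonable row chunk sizes
x = x.rechunk({0: "auto", 1: -1})
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;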
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 105)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="approximate-svd"&gt;
&lt;h1&gt;Approximate SVD&lt;/h1&gt;
&lt;p&gt;If your dataset is large in both dimensions then the algorithm above won’t work
as is. However, if you don’t need exact results, or if you only need a few of
the components, then there are a number of excellent approximation algorithms.&lt;/p&gt;
&lt;p&gt;Dask array has one of these approximation algorithms implemented in the
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.linalg.svd_compressed"&gt;da.linalg.svd_compressed&lt;/a&gt;
function. And with it we can compute the approximate SVD of very large
matrices.&lt;/p&gt;
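&lt;p&gt;For intuition, here is a rough NumPy sketch of the randomized-projection idea
behind this kind of compressed SVD (after Halko et al.); it is illustrative
only, not the actual &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;svd_compressed&lt;/span&gt;&lt;/code&gt; implementation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;import numpy as np

def randomized_svd(x, k, oversample=10, seed=0):
    """Project x onto a small random subspace, then take an exact
    SVD there.  Note the two separate passes over x."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((x.shape[1], k + oversample))
    q, _ = np.linalg.qr(x @ omega)                  # first pass over x
    u_small, s, vt = np.linalg.svd(q.T @ x, full_matrices=False)  # second pass
    return (q @ u_small)[:, :k], s[:k], vt[:k]
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;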
&lt;p&gt;We were recently working on a problem (explained below) and found that we were
still running out of memory when dealing with this algorithm. There were two
challenges that we ran into:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The algorithm requires multiple passes over the data, but the Dask task
scheduler was keeping the input matrix in memory after it had been loaded once
in order to avoid recomputation.
Things still worked, but Dask had to move the data to disk and back
repeatedly, which reduced performance significantly.&lt;/p&gt;
&lt;p&gt;We resolved this by including explicit recomputation steps in the algorithm.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Related chunks of data would be loaded at different times, and so would
need to stick around longer than necessary to wait for their associated
chunks.&lt;/p&gt;
&lt;p&gt;We resolved this by engaging task fusion as an optimization pass.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before diving further into the technical solution,
we quickly describe the use case that motivated this work.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 137)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="application-genomics"&gt;
&lt;h1&gt;Application - Genomics&lt;/h1&gt;
&lt;p&gt;Many studies are using genome sequencing to study genetic variation
between different individuals within a species. These include
studies of human populations, but also other species such as mice,
mosquitoes or disease-causing parasites. These studies will, in
general, find a large number of sites in the genome sequence where
individuals differ from each other. For example, humans have more
than 100 million variable sites in the genome, and modern studies
like the &lt;a class="reference external" href="https://www.ukbiobank.ac.uk/"&gt;UK BioBank&lt;/a&gt; are working towards
sequencing the genomes of 1 million individuals or more.&lt;/p&gt;
&lt;p&gt;In diploid species like humans, mice or mosquitoes, each individual
carries two genome sequences, one inherited from each parent. At each
of the 100 million variable genome sites there will be two or more
“alleles” that a single genome might carry. One way to think about
this is via the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Punnett_square"&gt;Punnett
square&lt;/a&gt;, which
represents the different possible genotypes that one individual might
carry at one of these variable sites:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://upload.wikimedia.org/wikipedia/commons/9/93/Punnett_Square_Genetic_Carriers.PNG" alt="Punnett square" height="40%" width="40%"&gt;&lt;/p&gt;
&lt;p&gt;In the above there are three possible genotypes: AA, Aa, and aa. For
computational genomics, these genotypes can be encoded as 0, 1, or 2.
In a study of a species with M genetic variants assayed in N
individual samples, we can represent these genotypes as an (M x N)
array of integers. For a modern human genetics study, the scale of
this array might approach (100 million x 1 million). (Although in
practice, the size of the first dimension (number of variants) can be
reduced somewhat, by at least an order of magnitude, because many
genetic variants will carry little information and/or be correlated
with each other.)&lt;/p&gt;
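&lt;p&gt;As a toy illustration (made-up numbers, not real data), such a genotype
matrix can be represented directly as a small integer array:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;import numpy as np

# Hypothetical toy study: 1,000 variants (M) x 50 samples (N),
# each entry a genotype encoded as 0, 1, or 2
genotypes = np.random.randint(0, 3, size=(1000, 50), dtype="uint8")
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;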
&lt;p&gt;These genetic differences are not random, but carry information about
patterns of genetic similarity and shared ancestry between
individuals, because of the way they have been inherited through many
generations. A common task is to perform a dimensionality reduction
analysis on these data, such as a &lt;a class="reference external" href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190"&gt;principal components
analysis&lt;/a&gt;
(SVD), to identify genetic structure reflecting these differences in
degree of shared ancestry. This is an essential part of discovering
genetic variants associated with different diseases, and for learning
more about the genetic history of populations and species.&lt;/p&gt;
&lt;p&gt;Reducing the time taken to compute an analysis such as an SVD allows,
as in all science, for exploring larger datasets and testing more
hypotheses in less time. Practically, this means not simply a fast
SVD but an accelerated pipeline end-to-end, from data loading, to
analysis, to understanding.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;We want to run an experiment in less time than it takes to make a cup of tea&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 192)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performant-svds-w-dask"&gt;
&lt;h1&gt;Performant SVDs w/ Dask&lt;/h1&gt;
&lt;p&gt;Now that we have that scientific background, let’s transition back to talking about computation.&lt;/p&gt;
&lt;p&gt;To stop Dask from holding onto the data we intentionally trigger computation as
we build up the graph. This is a bit atypical in Dask calculations (we usually
prefer to build up as much of the computation as possible before computing) but
useful given the multiple-pass nature of this problem. This was a fairly easy
change, and is available in &lt;a class="reference external" href="https://github.com/dask/dask/pull/5041"&gt;dask/dask #5041&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, we found that it was helpful to turn on moderately wide task
fusion.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;optimization.fuse.ave-width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 210)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="then-things-work-fine"&gt;
&lt;h1&gt;Then things work fine&lt;/h1&gt;
&lt;p&gt;We’re going to try this SVD on a few different choices of hardware including:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A MacBook Pro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A DGX-2, an NVIDIA workstation with 16 high-end GPUs and fast interconnect&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A twenty-node cluster on AWS&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="macbook-pro"&gt;
&lt;h2&gt;Macbook Pro&lt;/h2&gt;
&lt;p&gt;We can happily perform an SVD on a 20GB array on a Macbook Pro&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This call is no longer entirely lazy, and it recomputes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; a couple of times, but
it works, and it works using only a few GB of RAM on a consumer laptop.&lt;/p&gt;
&lt;p&gt;It takes around 2min 30s to compute on a laptop.
That’s great! It was super easy to try out, didn’t require any special
hardware or setup, and in many cases is fast enough.
By working locally we can iterate quickly.&lt;/p&gt;
&lt;p&gt;Now that things work, we can experiment with different hardware.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="adding-gpus-a-15-second-svd"&gt;
&lt;h2&gt;Adding GPUs (a 15 second SVD)&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can dramatically increase performance by using a multi-GPU machine.
NVIDIA and other manufacturers now make machines with multiple GPUs co-located in the same physical box.
In the following section, we will run the calculations on a &lt;strong&gt;DGX2&lt;/strong&gt;, a machine with 16 GPUs and fast interconnect between the GPUs.&lt;/p&gt;
&lt;p&gt;Below is almost the same code, running in significantly less time, but with the
following changes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We increase the size of the array by a factor of &lt;strong&gt;10x&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We switch out NumPy for CuPy, a GPU NumPy implementation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (NVLink will dramatically decrease transfer time between workers)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On a DGX-2 we can calculate the SVD of a 200GB Dask array in 10 to 15 seconds.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4"&gt;full notebook is here&lt;/a&gt;,
but the relevant code snippets are below:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Some GPU specific setup&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the data and run the SVD as normal&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
               &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;uint8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To see this run, we recommend viewing this screencast:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/6hmt1gARqp0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="read-dataset-from-disk"&gt;
&lt;h2&gt;Read dataset from Disk&lt;/h2&gt;
&lt;p&gt;While impressive, the computation above is mostly bound by generating random
data and then performing matrix calculations. GPUs are good at both of these
things.&lt;/p&gt;
&lt;p&gt;In practice though, our input array won’t be randomly generated; it will
come from some dataset stored on disk or, increasingly commonly, stored in the cloud.
To make things more realistic we perform a similar calculation with data
stored in a &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/"&gt;Zarr format&lt;/a&gt;
in &lt;a class="reference external" href="https://cloud.google.com/storage"&gt;GCS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this &lt;a class="reference external" href="https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078"&gt;Zarr SVD example&lt;/a&gt;,
we load a 25GB GCS backed data set onto a DGX2,
run a few processing steps, then perform an SVD.
The combination of preprocessing and SVD calculations ran in 18.7 seconds and the data loading took 14.3 seconds.&lt;/p&gt;
&lt;p&gt;Again, on a DGX-2, from data loading to SVD we are running in less time than it would take to make a cup of tea.
However, the data loading can be accelerated.
From GCS we read the data into the main memory of the machine (host memory), uncompress the Zarr bits,
then move the data from host memory to the GPU (device memory). Passing data back and forth between host and device memory can significantly decrease performance. Reading directly into the GPU, bypassing host memory, would improve the overall pipeline.&lt;/p&gt;
&lt;p&gt;And so we come back to a common lesson of high performance computing:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;High performance computing isn’t about doing one thing exceedingly well, it’s
about doing nothing poorly&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In this case GPUs made our computation fast enough that we now need to focus on
other components of our system, notably disk bandwidth, and a direct reader for
Zarr data to GPU memory.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cloud"&gt;
&lt;h2&gt;Cloud&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: one of the authors (Matthew Rocklin) works for Coiled Computing&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can also run this on the cloud with any number of frameworks.
In this case we used the &lt;a class="reference external" href="https://coiled.io"&gt;Coiled Cloud&lt;/a&gt; service to deploy on AWS.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;coiled_cloud&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Cloud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;
&lt;span class="n"&gt;cloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Cloud&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_cluster_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;friends&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;genomics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;16 GiB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;worker_environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;OMP_NUM_THREADS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;OPENBLAS_NUM_THREADS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# &amp;quot;EXTRA_PIP_PACKAGES&amp;quot;: &amp;quot;zarr&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;friends&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;typename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;genomics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# then proceed as normal&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Using 20 machines with a total of 80 CPU cores on a dataset that was 10x larger
than the MacBook Pro example, we were able to run in about the same amount of
time. This shows near-optimal scaling for this problem, which is nice to see
given how complex the SVD calculation is.&lt;/p&gt;
&lt;p&gt;A screencast of this problem is viewable here:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/qaJcAvhgLy4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 360)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="compression"&gt;
&lt;h1&gt;Compression&lt;/h1&gt;
&lt;p&gt;One of the easiest ways for us to improve performance is to reduce the size of
this data through compression.
This data is highly compressible for two reasons:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;The real-world data itself has structure and repetition
(although the random play data does not)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’re storing entries that take on only four values.
We’re using eight-bit integers when we only need two-bit integers&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s solve the second problem first.&lt;/p&gt;
&lt;section id="compression-with-bit-twiddling"&gt;
&lt;h2&gt;Compression with bit twiddling&lt;/h2&gt;
&lt;p&gt;Ideally Numpy would have a two-bit integer datatype.
Unfortunately it doesn’t, and supporting one would be hard because memory in
computers is generally addressed in full bytes.&lt;/p&gt;
&lt;p&gt;To work around this we can use bit arithmetic to shove four two-bit values into a single byte.
Here are functions that do that, assuming that our array is two-dimensional
and the last dimension is divisible by four.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00000011&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00001100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b00110000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mb"&gt;0b11000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;back&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
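&lt;p&gt;A quick round-trip check on some toy data shows that the two functions
invert each other:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;# Compressing then decompressing recovers the input exactly
data = np.random.randint(0, 4, size=(8, 16), dtype="uint8")
assert (decompress(compress(data)) == data).all()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;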
&lt;p&gt;Then, we can use these functions along with Dask to store our data in a
compressed state, but lazily decompress on-demand.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That’s it. We compress each block of our data and store that in memory.
However, the output array &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; will decompress each chunk before
we operate on it, so we don’t need to worry about handling compressed blocks.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="compression-with-zarr"&gt;
&lt;h2&gt;Compression with Zarr&lt;/h2&gt;
&lt;p&gt;A slightly more general but probably less efficient route would be to compress
our arrays with a proper compression library like Zarr.&lt;/p&gt;
&lt;p&gt;The example below shows how we do this in practice.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;zarr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;
&lt;span class="n"&gt;compressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lz4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clevel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BITSHUFFLE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zarr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Additionally, if we’re using the dask-distributed scheduler then we want to
make sure that the Blosc compression library doesn’t use additional threads.
That way we don’t have parallel calls of a parallel library, which can cause
some contention.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;set_no_threads_blosc&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; Stop blosc from using multiple threads &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;
    &lt;span class="n"&gt;numcodecs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

&lt;span class="c1"&gt;# Run on all workers&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_worker_plugin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_no_threads_blosc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This approach is more general, and probably a good trick to have up one’s
sleeve, but it doesn’t work on GPUs, which in the end is why we went
with the bit-twiddling approach in the section above, which uses an API that
is uniformly accessible in both the Numpy and CuPy libraries.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/05/13/large-svds.md&lt;/span&gt;, line 452)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;In this post we did a few things, all around a single important problem in
genomics.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We learned a bit of science&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We translated a science problem into a computational problem,
and in particular into a request to perform large singular value decompositions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We used a canned algorithm in dask.array that performed pretty well,
assuming that we’re comfortable going over the array in a few passes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We then tried that algorithm on three architectures quickly&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A Macbook Pro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A multi-GPU machine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A cluster in the cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally we talked about some tricks to pack more data into the same memory
with compression&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This problem was nice in that we got to dive deep into a technical science
problem, and yet also try a bunch of architectures quickly to investigate
hardware choices that we might make in the future.&lt;/p&gt;
&lt;p&gt;We used several technologies here today, made by several different communities
and companies. It was great to see how they all worked together seamlessly to
provide a flexible-yet-consistent experience.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2020/05/13/large-svds/"/>
    <summary>We perform Singular Value Decomposition (SVD) calculations on large datasets, using both precise and approximate methods on CPUs and GPUs.</summary>
    <category term="CuPy" label="CuPy"/>
    <category term="GPU" label="GPU"/>
    <category term="array" label="array"/>
    <published>2020-05-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps/</id>
    <title>GPU Dask Arrays, first steps</title>
    <updated>2019-01-03T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;The following code creates and manipulates 2 TB of randomly generated data.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;On a single CPU, this computation takes two hours.&lt;/p&gt;
&lt;p&gt;On an eight-GPU single-node system this computation takes nineteen seconds.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 24)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="combine-dask-array-with-cupy"&gt;

&lt;p&gt;Actually this computation isn’t that impressive.
It’s a simple workload,
for which most of the time is spent creating and destroying random data.
The computation and communication patterns are simple,
reflecting the simplicity commonly found in data processing workloads.&lt;/p&gt;
&lt;p&gt;What &lt;em&gt;is&lt;/em&gt; impressive is that we were able to create a distributed parallel GPU
array quickly by composing these four existing libraries:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt; provides a partial implementation of
Numpy on the GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask Array&lt;/a&gt; provides chunked
algorithms on top of Numpy-like libraries like Numpy and CuPy.&lt;/p&gt;
&lt;p&gt;This enables us to operate on more data than we could fit in memory
by operating on that data in chunks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://distributed.dask.org"&gt;Dask distributed&lt;/a&gt; task scheduler runs
those algorithms in parallel, easily coordinating work across many CPU
cores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://github.com/rapidsai/dask-cuda"&gt;Dask CUDA&lt;/a&gt; to extend Dask
distributed with GPU support.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These tools already exist. We had to connect them together with a small amount
of glue code and minor modifications. By mashing these tools together we can
quickly build and switch between different architectures to explore what is
best for our application.&lt;/p&gt;
&lt;p&gt;For this example we relied on the following changes upstream:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/cupy/cupy/pull/1689"&gt;cupy/cupy #1689: Support Numpy arrays as seeds in RandomState&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4041"&gt;dask/dask #4041 Make da.RandomState accessible to other modules&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2432"&gt;dask/distributed #2432: Add LocalCUDACluster&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 62)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="comparison-among-single-multi-cpu-gpu"&gt;
&lt;h1&gt;Comparison among single/multi CPU/GPU&lt;/h1&gt;
&lt;p&gt;We can now easily run some experiments on different architectures. This is
easy because …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We can switch between CPU and GPU by switching between Numpy and CuPy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can switch between single/multi-CPU-core and single/multi-GPU
by switching between Dask’s different task schedulers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These libraries allow us to quickly judge the costs of this computation for
the following hardware choices:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Single-threaded CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-threaded CPU with 40 cores (80 hardware threads)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single-GPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-GPU on a single machine with 8 GPUs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We present code for these four choices below,
but first,
we present a table of results.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1 min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="setup"&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="c1"&gt;# generate chunked dask arrays of mamy numpy random arrays&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nbytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2 TB&lt;/span&gt;
&lt;span class="c1"&gt;# 2000.0&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="cpu-timing"&gt;
&lt;h2&gt;CPU timing&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
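&lt;p&gt;As an aside, here is a minimal sketch of how one might measure runs like these
yourself, using nothing but the standard library:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import time

start = time.time()
(x + 1)[::2, ::2].sum().compute(scheduler=&amp;#39;threads&amp;#39;)
print(f&amp;#39;elapsed: {time.time() - start:.1f} s&amp;#39;)  # wall-clock seconds
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;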
&lt;/section&gt;
&lt;section id="single-gpu-timing"&gt;
&lt;h2&gt;Single GPU timing&lt;/h2&gt;
&lt;p&gt;We switch from CPU to GPU by changing our data source to generate CuPy arrays
rather than NumPy arrays. Everything else should more or less work the same
without special handling for CuPy.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;(This actually isn’t true yet; many things in dask.array will break for
non-NumPy arrays, but we’re working on it actively within Dask, within
NumPy, and within the GPU array libraries. Regardless, everything in this
example works fine.)&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# generate chunked dask arrays of mamy cupy random arrays&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- we specify cupy here&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="multi-gpu-timing"&gt;
&lt;h2&gt;Multi GPU timing&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And again, here are the results:&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1 min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;First, this is my first time playing with a 40-core system. I was surprised
to see that many cores. I was also pleased to see that Dask’s normal threaded
scheduler happily saturates many cores.&lt;/p&gt;
&lt;img src="https://matthewrocklin.com/blog/images/python-gil-8000-percent.png" width="100%"&gt;
&lt;p&gt;Although later on utilization did dive down to around 5000-6000%, and if you do
the math (2hr 39min is about 9540s, against 11min 30s, or 690s) you’ll see that
we’re getting roughly a 14x speedup rather than a 40x speedup. My &lt;em&gt;guess&lt;/em&gt; is that
performance would improve if we were to play with some mixture of threads and
processes, like having ten processes with eight threads each, as sketched below.&lt;/p&gt;
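&lt;p&gt;Here is a minimal sketch of that mixture using Dask’s distributed scheduler.
The ten-by-eight split is just the guess above, not something we benchmarked:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from dask.distributed import Client, LocalCluster

# ten worker processes with eight threads each (a hypothetical split)
cluster = LocalCluster(n_workers=10, threads_per_worker=8)
client = Client(cluster)

# the same computation, now on the distributed scheduler
(x + 1)[::2, ::2].sum().compute()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;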
&lt;p&gt;The jump from the biggest multi-core CPU to a single GPU is still an order of
magnitude though. The jump to multi-GPU is another order of magnitude, and
brings the computation down to 19s, which is short enough that I’m willing to
wait for it to finish before walking away from my computer.&lt;/p&gt;
&lt;p&gt;Actually, it’s quite fun to watch on the dashboard (especially after you’ve
been waiting for three hours for the sequential solution to run):&lt;/p&gt;
&lt;blockquote class="imgur-embed-pub"
            lang="en"
            data-id="a/6hkPPwA"&gt;
&lt;a href="//imgur.com/6hkPPwA"&gt;&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src="//s.imgur.com/min/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 221)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This computation was simple, but the range of architectures we just explored
was extensive. We swapped the underlying array library from CPU (NumPy) to GPU
(CuPy), an entirely different codebase, and tried both multi-core CPU parallelism
as well as multi-GPU many-core parallelism.&lt;/p&gt;
&lt;p&gt;We did this in less than twenty lines of code, making this experiment something
that an undergraduate student or other novice could perform at home.
We’re approaching a point where experimenting with multi-GPU systems is
accessible to non-experts (at least for array computing).&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/57be0ca4143974e6015732d0baacc1cb"&gt;Here is a notebook for the experiment above&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/01/03/dask-array-gpus-first-steps.md&lt;/span&gt;, line 235)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="room-for-improvement"&gt;
&lt;h1&gt;Room for improvement&lt;/h1&gt;
&lt;p&gt;We can expand the computation above in a variety of directions, and there is
still a ton of work to do to make this reliable.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use more complex array computing workloads&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Dask Array algorithms were designed first around Numpy. We’ve only
recently started making them more generic to other kinds of arrays (like
GPU arrays, sparse arrays, and so on). As a result there are still many
bugs when exploring these non-Numpy workloads.&lt;/p&gt;
&lt;p&gt;For example, if you were to switch &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; in the computation above
you would get an error, because our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; implementation contains an
easy-to-fix bug that assumes Numpy arrays exactly (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Pandas and cuDF instead of Numpy and CuPy&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The cuDF library aims to reimplement the Pandas API on the GPU,
much like how CuPy reimplements the NumPy API.
Using Dask DataFrame with cuDF will require some work on both sides,
but is quite doable.&lt;/p&gt;
&lt;p&gt;I believe that there is plenty of low-hanging fruit here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve and move LocalCUDACluster&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCUDAClutster&lt;/span&gt;&lt;/code&gt; class used above is an experimental &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; type
that creates as many workers locally as you have GPUs, and assigns each
worker to prefer a different GPU. This makes it easy for people to load
balance across GPUs on a single-node system without thinking too much about
it. This appears to be a common pain-point in the ecosystem today.&lt;/p&gt;
&lt;p&gt;However, the LocalCUDACluster probably shouldn’t live in the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask/distributed&lt;/span&gt;&lt;/code&gt; repository (it seems too CUDA specific) so will probably
move to some dask-cuda repository. Additionally there are still many
questions about how to handle concurrency on top of GPUs, balancing between
CPU cores and GPU cores, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-node computation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There’s no reason that we couldn’t accelerate computations like these
further by using multiple multi-GPU nodes. This is doable today with
manual setup, but we should also improve the existing deployment solutions
&lt;a class="reference external" href="https://kubernetes.dask.org"&gt;dask-kubernetes&lt;/a&gt;,
&lt;a class="reference external" href="https://yarn.dask.org"&gt;dask-yarn&lt;/a&gt;, and
&lt;a class="reference external" href="https://jobqueue.dask.org"&gt;dask-jobqueue&lt;/a&gt;, to make this easier for
non-experts who want to use a cluster of multi-GPU resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expense&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The machine I ran this on is expensive. Well, it’s nowhere near as expensive
to own and operate as the traditional cluster you would need for these kinds
of results, but it’s still well beyond the price point of a hobbyist or
student.&lt;/p&gt;
&lt;p&gt;It would be useful to run this on a more budget system to get a sense of
the tradeoffs on more reasonably priced systems. I should probably also
learn more about provisioning GPUs on the cloud.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
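&lt;p&gt;To make the first item concrete, here is the one-line swap described above.
This is a minimal sketch; the exact error you get will depend on your Dask
version:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# identical to the working computation, but with mean instead of sum;
# at the time of writing this raised an error on CuPy-backed chunks,
# because Dask&amp;#39;s mean implementation assumed NumPy arrays
(x + 1)[::2, ::2].mean().compute(scheduler=&amp;#39;single-threaded&amp;#39;)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;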
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help!&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging, high-impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. The NVIDIA corporation is hiring around the use of Dask
with GPUs.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a fairly generic posting. If you’re interested but the posting doesn’t
seem to fit, then please apply anyway and we’ll tweak things.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps/"/>
    <summary>The following code creates and manipulates 2 TB of randomly generated data.</summary>
    <category term="GPU" label="GPU"/>
    <category term="array" label="array"/>
    <category term="cupy" label="cupy"/>
    <published>2019-01-03T00:00:00+00:00</published>
  </entry>
</feed>
