<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged cupy</title>
  <updated>2026-03-05T15:05:25.209332+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/cupy/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps/</id>
    <title>GPU Dask Arrays, first steps</title>
    <updated>2019-01-03T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;The following code creates and manipulates 2 TB of randomly generated data.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;On a single CPU core, this computation takes over two and a half hours.&lt;/p&gt;
&lt;p&gt;On an eight-GPU single-node system this computation takes nineteen seconds.&lt;/p&gt;
&lt;section id="combine-dask-array-with-cupy"&gt;
&lt;h1&gt;Combine Dask Array with CuPy&lt;/h1&gt;
&lt;p&gt;Actually this computation isn’t that impressive.
It’s a simple workload,
for which most of the time is spent creating and destroying random data.
The computation and communication patterns are simple,
reflecting the simplicity commonly found in data processing workloads.&lt;/p&gt;
&lt;p&gt;What &lt;em&gt;is&lt;/em&gt; impressive is that we were able to create a distributed parallel GPU
array quickly by composing these four existing libraries:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt; provides a partial implementation of
NumPy on the GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask Array&lt;/a&gt; provides chunked
algorithms on top of NumPy-like libraries such as NumPy and CuPy.&lt;/p&gt;
&lt;p&gt;This enables us to operate on more data than we could fit in memory
by operating on that data in chunks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://distributed.dask.org"&gt;Dask distributed&lt;/a&gt; task scheduler runs
those algorithms in parallel, easily coordinating work across many CPU
cores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/dask-cuda"&gt;Dask CUDA&lt;/a&gt; extends Dask
distributed with GPU support.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
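&lt;p&gt;To make the idea of chunked algorithms concrete, here is a minimal sketch in plain Python. It is not how Dask Array is implemented: the helper names (make_grid, strided_sum, chunked_strided_sum) are hypothetical, nested lists stand in for NumPy or CuPy chunks, and the loop runs eagerly rather than through a task scheduler. It only shows that the computation from the top of this post can be evaluated block by block and then combined.&lt;/p&gt;
&lt;pre&gt;
```python
# Toy illustration with nested lists; a real chunked array library works on
# NumPy-like blocks and schedules the per-block work as tasks.

def make_grid(n):
    """Build an n-by-n grid of deterministic values (stand-in for random data)."""
    return [[(i * n + j) % 7 for j in range(n)] for i in range(n)]


def strided_sum(block):
    """Compute the equivalent of (block + 1)[::2, ::2].sum() for one block."""
    return sum(value + 1 for row in block[::2] for value in row[::2])


def chunked_strided_sum(grid, chunk):
    """Split the grid into chunk-by-chunk blocks, reduce each, then combine."""
    n = len(grid)
    total = 0
    for r in range(0, n, chunk):
        for c in range(0, n, chunk):
            block = [row[c:c + chunk] for row in grid[r:r + chunk]]
            total += strided_sum(block)
    return total


grid = make_grid(8)
whole = strided_sum(grid)                        # reduce everything at once
by_chunks = chunked_strided_sum(grid, chunk=4)   # reduce block by block
print(whole, by_chunks)
```
&lt;/pre&gt;
&lt;p&gt;The two printed numbers should match. The sketch assumes an even chunk size so that the stride of two lines up with chunk boundaries; a real chunked array library has to handle the offsets for the general case.&lt;/p&gt;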
&lt;p&gt;These tools already exist. We had to connect them together with a small amount
of glue code and minor modifications. By mashing these tools together we can
quickly build and switch between different architectures to explore what is
best for our application.&lt;/p&gt;
&lt;p&gt;For this example we relied on the following changes upstream:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/cupy/cupy/pull/1689"&gt;cupy/cupy #1689: Support Numpy arrays as seeds in RandomState&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/4041"&gt;dask/dask #4041 Make da.RandomState accessible to other modules&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2432"&gt;dask/distributed #2432: Add LocalCUDACluster&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="comparison-among-single-multi-cpu-gpu"&gt;
&lt;h1&gt;Comparison among single/multi CPU/GPU&lt;/h1&gt;
&lt;p&gt;We can now run experiments on different architectures with little effort,
because:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We can switch between CPU and GPU by switching between NumPy and CuPy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can switch between single/multi-CPU-core and single/multi-GPU
by switching between Dask’s different task schedulers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These libraries allow us to quickly judge the costs of this computation for
the following hardware choices:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Single-threaded CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-threaded CPU with 40 cores (80 H/T)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single-GPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-GPU on a single machine with 8 GPUs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We present code for these four choices below,
but first,
we present a table of results.&lt;/p&gt;
&lt;section id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="setup"&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="c1"&gt;# generate chunked dask arrays of many numpy random arrays&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nbytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2 TB&lt;/span&gt;
&lt;span class="c1"&gt;# 2000.0&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
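&lt;p&gt;As a quick sanity check on those numbers, the arithmetic behind the array can be worked out without allocating anything. The sketch below uses only the shape and chunk sizes from the code above and assumes 8-byte float64 elements, which is consistent with the 2000.0 printed by nbytes. The variable names are only for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
# Shape and chunk sizes from the post; 8 bytes per element assumes float64.
shape = (500_000, 500_000)
chunks = (10_000, 10_000)
itemsize = 8

chunks_per_axis = [s // c for s, c in zip(shape, chunks)]
n_chunks = chunks_per_axis[0] * chunks_per_axis[1]
chunk_bytes = chunks[0] * chunks[1] * itemsize
total_bytes = shape[0] * shape[1] * itemsize

print(n_chunks)            # 2500 chunks
print(chunk_bytes / 1e6)   # 800.0 MB per chunk
print(total_bytes / 1e12)  # 2.0 TB in total
```
&lt;/pre&gt;
&lt;p&gt;So the scheduler has 2,500 chunks of roughly 800 MB each to spread across cores or GPUs.&lt;/p&gt;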
&lt;/section&gt;
&lt;section id="cpu-timing"&gt;
&lt;h2&gt;CPU timing&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="single-gpu-timing"&gt;
&lt;h2&gt;Single GPU timing&lt;/h2&gt;
&lt;p&gt;We switch from CPU to GPU by changing our data source to generate CuPy arrays
rather than NumPy arrays. Everything else should more or less work the same
without special handling for CuPy.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;(This isn’t entirely true yet: many things in dask.array still break for
non-NumPy arrays, but we’re actively working on it within Dask, within
NumPy, and within the GPU array libraries. Regardless, everything in this
example works fine.)&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# generate chunked dask arrays of many cupy random arrays&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- we specify cupy here&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="multi-gpu-timing"&gt;
&lt;h2&gt;Multi GPU timing&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And again, here are the results:&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;First, this is my first time playing with a 40-core system. I was surprised
to see that many cores. I was also pleased to see that Dask’s normal threaded
scheduler happily saturates many cores.&lt;/p&gt;
&lt;img src="https://matthewrocklin.com/blog/images/python-gil-8000-percent.png" width="100%"&gt;
&lt;p&gt;Later on, though, CPU usage dropped to around 5000-6000%, and if you do the math
(2hr 39min versus 11min 30s) you’ll see that we’re getting roughly a 14x speedup
rather than 40x. My &lt;em&gt;guess&lt;/em&gt; is that
performance would improve if we were to play with some mixture of threads and
processes, like having ten processes with eight threads each.&lt;/p&gt;
&lt;p&gt;The jump from the biggest multi-core CPU to a single GPU is still close to an
order of magnitude though (about 7x). The jump to multi-GPU is roughly another 5x, and
brings the computation down to 19s, which is short enough that I’m willing to
wait for it to finish before walking away from my computer.&lt;/p&gt;
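&lt;p&gt;For readers who want to do the math themselves, here is a small sketch that converts the results table to seconds and computes the speedups. The timings are copied from the table; the variable names are only for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
# Wall-clock times copied from the results table, converted to seconds.
timings = {
    "Single CPU Core": 2 * 3600 + 39 * 60,  # 2hr 39min
    "Forty CPU Cores": 11 * 60 + 30,        # 11min 30s
    "One GPU": 1 * 60 + 37,                 # 1min 37s
    "Eight GPUs": 19,                       # 19s
}

names = list(timings)
baseline = timings[names[0]]
speedups = {}
for previous, current in zip([None] + names[:-1], names):
    # Speedup relative to the previous row (1.0 for the first row).
    step = timings[previous] / timings[current] if previous else 1.0
    # Speedup relative to the single-core baseline.
    overall = baseline / timings[current]
    speedups[current] = (overall, step)
    print(f"{current:16s} {overall:7.1f}x vs one core, {step:4.1f}x vs previous row")
```
&lt;/pre&gt;
&lt;p&gt;With the numbers above this gives roughly 14x for forty cores over one, about 7x more for one GPU, and about 5x more for eight GPUs.&lt;/p&gt;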
&lt;p&gt;Actually, it’s quite fun to watch on the dashboard (especially after you’ve
been waiting for three hours for the sequential solution to run):&lt;/p&gt;
&lt;blockquote class="imgur-embed-pub"
            lang="en"
            data-id="a/6hkPPwA"&gt;
&lt;a href="//imgur.com/6hkPPwA"&gt;&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src="//s.imgur.com/min/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This computation was simple, but the range in architecture just explored was
extensive. We swapped out the underlying architecture from CPU to GPU (which
had an entirely different codebase) and tried both multi-core CPU parallelism
and multi-GPU many-core parallelism.&lt;/p&gt;
&lt;p&gt;We did this in less than twenty lines of code, making this experiment something
that an undergraduate student or other novice could perform at home.
We’re approaching a point where experimenting with multi-GPU systems is
accessible to non-experts (at least for array computing).&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/57be0ca4143974e6015732d0baacc1cb"&gt;Here is a notebook for the experiment above&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="room-for-improvement"&gt;
&lt;h1&gt;Room for improvement&lt;/h1&gt;
&lt;p&gt;We can work to expand the computation above in a variety of directions.
There is a ton of work we still have to do to make this reliable.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use more complex array computing workloads&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Dask Array algorithms were designed first around NumPy. We’ve only
recently started making them more generic to other kinds of arrays (like
GPU arrays, sparse arrays, and so on). As a result there are still many
bugs when exploring these non-NumPy workloads.&lt;/p&gt;
&lt;p&gt;For example, if you were to switch &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; in the computation above
you would get an error because our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; computation contains an easy-to-fix
bug that assumes the chunks are exactly NumPy arrays.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Pandas and cuDF instead of NumPy and CuPy&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The cuDF library aims to reimplement the Pandas API on the GPU,
much like how CuPy reimplements the NumPy API.
Using Dask DataFrame with cuDF will require some work on both sides,
but is quite doable.&lt;/p&gt;
&lt;p&gt;I believe that there is plenty of low-hanging fruit here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve and move LocalCUDACluster&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCUDACluster&lt;/span&gt;&lt;/code&gt; class used above is an experimental &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Cluster&lt;/span&gt;&lt;/code&gt; type
that creates as many workers locally as you have GPUs, and assigns each
worker to prefer a different GPU. This makes it easy for people to load
balance across GPUs on a single-node system without thinking too much about
it. This appears to be a common pain-point in the ecosystem today.&lt;/p&gt;
&lt;p&gt;However, LocalCUDACluster probably shouldn’t live in the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask/distributed&lt;/span&gt;&lt;/code&gt; repository (it seems too CUDA-specific), so it will probably
move to some dask-cuda repository. Additionally, there are still many
questions about how to handle concurrency on top of GPUs, balancing between
CPU cores and GPU cores, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-node computation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There’s no reason that we couldn’t accelerate computations like these
further by using multiple multi-GPU nodes. This is doable today with
manual setup, but we should also improve the existing deployment solutions
&lt;a class="reference external" href="https://kubernetes.dask.org"&gt;dask-kubernetes&lt;/a&gt;,
&lt;a class="reference external" href="https://yarn.dask.org"&gt;dask-yarn&lt;/a&gt;, and
&lt;a class="reference external" href="https://jobqueue.dask.org"&gt;dask-jobqueue&lt;/a&gt;, to make this easier for
non-experts who want to use a cluster of multi-GPU resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expense&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The machine I ran this on is expensive. Well, it’s nowhere close to as
expensive to own and operate as a traditional cluster that you would need
for these kinds of results, but it’s still well beyond the price point of a
hobbyist or student.&lt;/p&gt;
&lt;p&gt;It would be useful to run this on a more budget system to get a sense of
the tradeoffs on more reasonably priced systems. I should probably also
learn more about provisioning GPUs on the cloud.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
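&lt;p&gt;To illustrate the idea behind item 3, here is a hedged sketch of one way to give each local worker a different preferred GPU: rotate the device list and pass it through the CUDA_VISIBLE_DEVICES environment variable. The helper names gpu_preference and worker_environments are hypothetical, and this is an assumption about the general approach rather than a description of the actual LocalCUDACluster code.&lt;/p&gt;
&lt;pre&gt;
```python
def gpu_preference(worker_index, n_gpus):
    """Return a device ordering that puts a different GPU first for each worker.

    Hypothetical helper: rotating the device list is one simple way to give
    every worker a distinct preferred GPU while keeping the others visible.
    """
    devices = list(range(n_gpus))
    shift = worker_index % n_gpus
    return devices[shift:] + devices[:shift]


def worker_environments(n_gpus):
    """Build one CUDA_VISIBLE_DEVICES value per worker, one worker per GPU."""
    return [
        {"CUDA_VISIBLE_DEVICES": ",".join(str(d) for d in gpu_preference(i, n_gpus))}
        for i in range(n_gpus)
    ]


envs = worker_environments(8)
for i, env in enumerate(envs):
    print(i, env["CUDA_VISIBLE_DEVICES"])
```
&lt;/pre&gt;
&lt;p&gt;In this sketch each worker still sees every GPU but lists a different one first, so a CUDA library that defaults to the first visible device would end up on a different GPU in each worker.&lt;/p&gt;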
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help!&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging, high-impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. The NVIDIA corporation is hiring around the use of Dask
with GPUs.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a fairly generic posting. If you’re interested but the posting doesn’t seem
to fit, then please apply anyway and we’ll tweak things.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps/"/>
    <summary>The following code creates and manipulates 2 TB of randomly generated data.</summary>
    <category term="GPU" label="GPU"/>
    <category term="array" label="array"/>
    <category term="cupy" label="cupy"/>
    <published>2019-01-03T00:00:00+00:00</published>
  </entry>
</feed>
