<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged python</title>
  <updated>2026-03-05T15:05:26.736165+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/python/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/06/20/load-image-data/</id>
    <title>Load Large Image Data with Dask Array</title>
    <updated>2019-06-20T00:00:00+00:00</updated>
    <author>
      <name>John Kirkham</name>
    </author>
    <content type="html">&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 9)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;

&lt;p&gt;This post explores simple workflows to load large stacks of image data with Dask array.&lt;/p&gt;
&lt;p&gt;In particular, we start with a &lt;a class="reference external" href="https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ"&gt;directory full of TIFF
files&lt;/a&gt;
of images like the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ $ ls raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;and show how to stitch these together into large lazy arrays
using the &lt;a class="reference external" href="https://image.dask.org/en/latest/"&gt;dask-image&lt;/a&gt; library&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;raw/*.tif&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or by writing your own Dask delayed image reader function.&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (2010, 1024, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="176" height="181" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="61" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="61" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="17" y2="68" /&gt;
  &lt;line x1="24" y1="14" x2="24" y2="75" /&gt;
  &lt;line x1="31" y1="21" x2="31" y2="82" /&gt;
  &lt;line x1="38" y1="28" x2="38" y2="89" /&gt;
  &lt;line x1="45" y1="35" x2="45" y2="96" /&gt;
  &lt;line x1="52" y1="42" x2="52" y2="103" /&gt;
  &lt;line x1="59" y1="49" x2="59" y2="110" /&gt;
  &lt;line x1="66" y1="56" x2="66" y2="117" /&gt;
  &lt;line x1="73" y1="63" x2="73" y2="124" /&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 80.588235,70.588235 80.588235,131.722564 10.000000,61.134328" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="55" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="62" y2="7" /&gt;
  &lt;line x1="24" y1="14" x2="69" y2="14" /&gt;
  &lt;line x1="31" y1="21" x2="77" y2="21" /&gt;
  &lt;line x1="38" y1="28" x2="84" y2="28" /&gt;
  &lt;line x1="45" y1="35" x2="91" y2="35" /&gt;
  &lt;line x1="52" y1="42" x2="98" y2="42" /&gt;
  &lt;line x1="59" y1="49" x2="105" y2="49" /&gt;
  &lt;line x1="66" y1="56" x2="112" y2="56" /&gt;
  &lt;line x1="73" y1="63" x2="119" y2="63" /&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="55" y1="0" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 55.850746,0.000000 126.438982,70.588235 80.588235,70.588235" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="80" y1="131" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;line x1="126" y1="70" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="80.588235,70.588235 126.438982,70.588235 126.438982,131.722564 80.588235,131.722564" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="103.513608" y="151.722564" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="146.438982" y="101.155399" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,146.438982,101.155399)"&gt;1024&lt;/text&gt;
&lt;text x="35.294118" y="116.428446" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,35.294118,116.428446)"&gt;2010&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Some day we’ll eventually be able to perform complex calculations on this dask array.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-index-1.jpg"
     width="45%"
     alt="Light Microscopy data rendered with NVidia IndeX"&gt;
&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-index-2.jpg"
     width="45%"
     alt="Light Microscopy data rendered with NVidia IndeX"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: we’re not going to produce rendered images like the above in this
post. These were created with &lt;a class="reference external" href="https://developer.nvidia.com/index"&gt;NVidia
IndeX&lt;/a&gt;, a completely separate tool chain
from what is being discussed here. This post covers the first step of image
loading.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 128)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="series-overview"&gt;
&lt;h1&gt;Series Overview&lt;/h1&gt;
&lt;p&gt;A common case in fields that acquire large amounts of imaging data is to write
out smaller acquisitions into many small files. These files can tile a larger
space, sub-sample from a larger time period, and may contain multiple channels.
The acquisition techniques themselves are often state of the art and constantly
pushing the envelope in term of how large a field of view can be acquired, at
what resolution, and what quality.&lt;/p&gt;
&lt;p&gt;Once acquired this data presents a number of challenges. Algorithms often
designed and tested to work on very small pieces of this data need to be scaled
up to work on the full dataset. It might not be clear at the outset what will
actually work and so exploration still plays a very big part of the whole
process.&lt;/p&gt;
&lt;p&gt;Historically this analytical process has involved a lot of custom code. Often
the analytical process is stitched together by a series of scripts possibly in
several different languages that write various intermediate results to disk.
Thanks to advances in modern tooling these process can be significantly
improved. In this series of blogposts, we will outline ways for image
scientists to leverage different tools to move towards a high level, friendly,
cohesive, interactive analytical pipeline.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 151)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="post-overview"&gt;
&lt;h1&gt;Post Overview&lt;/h1&gt;
&lt;p&gt;This post in particular focuses on loading and managing large stacks of image
data in parallel from Python.&lt;/p&gt;
&lt;p&gt;Loading large image data can be a complex and often unique problem. Different
groups may choose to store this across many files on disk, a commodity or
custom database solution, or they may opt to store it in the cloud. Not all
datasets within the same group may be treated the same for a variety of
reasons. In short, this means loading data is a hard and expensive problem.&lt;/p&gt;
&lt;p&gt;Despite data being stored in many different ways, often groups want to reapply
the same analytical pipeline to these datasets. However if the data pipeline is
tightly coupled to a particular way of loading the data for later analytical
steps, it may be very difficult if not impossible to reuse an existing
pipeline. In other words, there is friction between the loading and analysis
steps, which frustrates efforts to make things reusable.&lt;/p&gt;
&lt;p&gt;Having a modular and general way to load data makes it easy to present data
stored differently in a standard way. Further having a standard way to present
data to analytical pipelines allows that part of the pipeline to focus on what
it does best, analysis! In general, this should decouple these to components in
a way that improves the experience of users involved in all parts of the
pipeline.&lt;/p&gt;
&lt;p&gt;We will use
&lt;a class="reference external" href="https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ"&gt;image data&lt;/a&gt;
generously provided by
&lt;a class="reference external" href="https://scholar.google.com/citations?user=nxwNAEgAAAAJ&amp;amp;amp;hl=en"&gt;Gokul Upadhyayula&lt;/a&gt;
at the
&lt;a class="reference external" href="http://microscopy.berkeley.edu/"&gt;Advanced Bioimaging Center&lt;/a&gt;
at UC Berkeley and discussed in
&lt;a class="reference external" href="https://science.sciencemag.org/content/360/6386/eaaq1392"&gt;this paper&lt;/a&gt;
(&lt;a class="reference external" href="https://www.biorxiv.org/content/10.1101/243352v2"&gt;preprint&lt;/a&gt;),
though the workloads presented here should work for any kind of imaging data,
or array data generally.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 188)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="load-image-data-with-dask"&gt;
&lt;h1&gt;Load image data with Dask&lt;/h1&gt;
&lt;p&gt;Let’s start again with our image data from the top of the post:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ $ ls /path/to/files/raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="load-a-single-sample-image-with-scikit-image"&gt;
&lt;h2&gt;Load a single sample image with Scikit-Image&lt;/h2&gt;
&lt;p&gt;To load a single image, we use &lt;a class="reference external" href="https://scikit-image.org/"&gt;Scikit-Image&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;glob&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/path/to/files/raw/*.tif&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;597&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;imageio&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imageio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;span class="go"&gt;(201, 1024, 768)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each filename corresponds to some 3d chunk of a larger image. We can look at a
few 2d slices of this single 3d chunk to get some context.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;skimage.io&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-sample-1.png"
     width="60%"&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-sample-2.png"
     width="60%"&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;skimage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/mrocklin/raw-host/gh-pages/images/aollsm-sample-3.png"
     width="60%"&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="investigate-filename-structure"&gt;
&lt;h2&gt;Investigate Filename Structure&lt;/h2&gt;
&lt;p&gt;These are slices from only one chunk of a much larger aggregate image.
Our interest here is combining the pieces into a large image stack.
It is common to see a naming structure in the filenames. Each
filename then may indicate a channel, time step, and spatial location with the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;lt;i&amp;gt;&lt;/span&gt;&lt;/code&gt; being some numeric values (possibly with units). Individual filenames may
have more or less information and may notate it differently than we have.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;mydata_ch&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;t_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;x_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;y_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tif&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In principle with NumPy we might allocate a giant array and then iteratively
load images and place them into the giant array.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;full_array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imageio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_location_from_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# We need to write this function&lt;/span&gt;
    &lt;span class="n"&gt;full_array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However if our data is large then we can’t load it all into memory at once like
this into a single Numpy array, and instead we need to be a bit more clever to
handle it efficiently. One approach here is to use &lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt;,
which handles larger-than-memory workloads easily.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="lazily-load-images-with-dask-array"&gt;
&lt;h2&gt;Lazily load images with Dask Array&lt;/h2&gt;
&lt;p&gt;Now we learn how to lazily load and stitch together image data with Dask array.
We’ll start with simple examples first and then move onto the full example with
this more complex dataset afterwards.&lt;/p&gt;
&lt;p&gt;We can delay the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;imageio.imread&lt;/span&gt;&lt;/code&gt; calls with &lt;a class="reference external" href="https://docs.dassk.org/en/latest/delayed.html"&gt;Dask
Delayed&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="n"&gt;lazy_arrays&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imageio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;lazy_arrays&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note: here we’re assuming that all of the images have the same shape and dtype
as the sample file that we loaded above. This is not always the case. See the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask_image&lt;/span&gt;&lt;/code&gt; note below in the Future Work section for an alternative.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We haven’t yet stitched these together. We have hundreds of single-chunk Dask
arrays, each of which lazily loads a single 3d chunk of data from disk. Lets look at a single array.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 316.15 MB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (201, 1024, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 2 Tasks &lt;/td&gt;&lt;td&gt; 1 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="174" height="194" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="34" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="120" x2="34" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="34" y1="24" x2="34" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 34.664918,24.664918 34.664918,144.664918 10.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="100" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="34" y1="24" x2="124" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="34" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="100" y1="0" x2="124" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 100.000000,0.000000 124.664918,24.664918 34.664918,24.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="34" y1="24" x2="124" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="34" y1="144" x2="124" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="34" y1="24" x2="34" y2="144" style="stroke-width:2" /&gt;
  &lt;line x1="124" y1="24" x2="124" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="34.664918,24.664918 124.664918,24.664918 124.664918,144.664918 34.664918,144.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="79.664918" y="164.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="144.664918" y="84.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,144.664918,84.664918)"&gt;1024&lt;/text&gt;
&lt;text x="12.332459" y="152.332459" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,12.332459,152.332459)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;This is a lazy 3-dimensional Dask array of a &lt;em&gt;single&lt;/em&gt; 300MB chunk of data.
That chunk is created by loading in a particular TIFF file. Normally Dask
arrays are composed of &lt;em&gt;many&lt;/em&gt; chunks. We can concatenate many of these
single-chunked Dask arrays into a multi-chunked Dask array with functions like
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.concatenate"&gt;da.concatenate&lt;/a&gt;
and
&lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.stack"&gt;da.stack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here we concatenate the first ten Dask arrays along a few axes, to get an
easier-to-understand picture of how this looks. Take a look both at how the
shape changes as we change the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;axis=&lt;/span&gt;&lt;/code&gt; parameter both in the table on the left
and the image on the right.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (2010, 1024, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="176" height="181" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="61" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="61" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="17" y2="68" /&gt;
  &lt;line x1="24" y1="14" x2="24" y2="75" /&gt;
  &lt;line x1="31" y1="21" x2="31" y2="82" /&gt;
  &lt;line x1="38" y1="28" x2="38" y2="89" /&gt;
  &lt;line x1="45" y1="35" x2="45" y2="96" /&gt;
  &lt;line x1="52" y1="42" x2="52" y2="103" /&gt;
  &lt;line x1="59" y1="49" x2="59" y2="110" /&gt;
  &lt;line x1="66" y1="56" x2="66" y2="117" /&gt;
  &lt;line x1="73" y1="63" x2="73" y2="124" /&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 80.588235,70.588235 80.588235,131.722564 10.000000,61.134328" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="55" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="17" y1="7" x2="62" y2="7" /&gt;
  &lt;line x1="24" y1="14" x2="69" y2="14" /&gt;
  &lt;line x1="31" y1="21" x2="77" y2="21" /&gt;
  &lt;line x1="38" y1="28" x2="84" y2="28" /&gt;
  &lt;line x1="45" y1="35" x2="91" y2="35" /&gt;
  &lt;line x1="52" y1="42" x2="98" y2="42" /&gt;
  &lt;line x1="59" y1="49" x2="105" y2="49" /&gt;
  &lt;line x1="66" y1="56" x2="112" y2="56" /&gt;
  &lt;line x1="73" y1="63" x2="119" y2="63" /&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="80" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="55" y1="0" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 55.850746,0.000000 126.438982,70.588235 80.588235,70.588235" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="80" y1="70" x2="126" y2="70" style="stroke-width:2" /&gt;
  &lt;line x1="80" y1="131" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="80" y1="70" x2="80" y2="131" style="stroke-width:2" /&gt;
  &lt;line x1="126" y1="70" x2="126" y2="131" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="80.588235,70.588235 126.438982,70.588235 126.438982,131.722564 80.588235,131.722564" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="103.513608" y="151.722564" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="146.438982" y="101.155399" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,146.438982,101.155399)"&gt;1024&lt;/text&gt;
&lt;text x="35.294118" y="116.428446" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,35.294118,116.428446)"&gt;2010&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (201, 10240, 768) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="113" height="187" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="12" x2="27" y2="29" /&gt;
  &lt;line x1="10" y1="24" x2="27" y2="41" /&gt;
  &lt;line x1="10" y1="36" x2="27" y2="53" /&gt;
  &lt;line x1="10" y1="48" x2="27" y2="65" /&gt;
  &lt;line x1="10" y1="60" x2="27" y2="77" /&gt;
  &lt;line x1="10" y1="72" x2="27" y2="89" /&gt;
  &lt;line x1="10" y1="84" x2="27" y2="101" /&gt;
  &lt;line x1="10" y1="96" x2="27" y2="113" /&gt;
  &lt;line x1="10" y1="108" x2="27" y2="125" /&gt;
  &lt;line x1="10" y1="120" x2="27" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="27" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 27.014952,17.014952 27.014952,137.014952 10.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="46" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="63" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="46" y1="0" x2="63" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 46.948234,0.000000 63.963186,17.014952 27.014952,17.014952" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="27" y1="17" x2="63" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="29" x2="63" y2="29" /&gt;
  &lt;line x1="27" y1="41" x2="63" y2="41" /&gt;
  &lt;line x1="27" y1="53" x2="63" y2="53" /&gt;
  &lt;line x1="27" y1="65" x2="63" y2="65" /&gt;
  &lt;line x1="27" y1="77" x2="63" y2="77" /&gt;
  &lt;line x1="27" y1="89" x2="63" y2="89" /&gt;
  &lt;line x1="27" y1="101" x2="63" y2="101" /&gt;
  &lt;line x1="27" y1="113" x2="63" y2="113" /&gt;
  &lt;line x1="27" y1="125" x2="63" y2="125" /&gt;
  &lt;line x1="27" y1="137" x2="63" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="27" y1="17" x2="27" y2="137" style="stroke-width:2" /&gt;
  &lt;line x1="63" y1="17" x2="63" y2="137" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="27.014952,17.014952 63.963186,17.014952 63.963186,137.014952 27.014952,137.014952" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="45.489069" y="157.014952" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="83.963186" y="77.014952" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,83.963186,77.014952)"&gt;10240&lt;/text&gt;
&lt;text x="8.507476" y="148.507476" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,8.507476,148.507476)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (201, 1024, 7680) &lt;/td&gt; &lt;td&gt; (201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="197" height="108" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="10" y1="40" x2="27" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="10" y2="40" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="27" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 27.988258,17.988258 27.988258,58.112379 10.000000,40.124121" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="10" y1="0" x2="130" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="17" x2="147" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="10" y1="0" x2="27" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="22" y1="0" x2="39" y2="17" /&gt;
  &lt;line x1="34" y1="0" x2="51" y2="17" /&gt;
  &lt;line x1="46" y1="0" x2="63" y2="17" /&gt;
  &lt;line x1="58" y1="0" x2="75" y2="17" /&gt;
  &lt;line x1="70" y1="0" x2="87" y2="17" /&gt;
  &lt;line x1="82" y1="0" x2="99" y2="17" /&gt;
  &lt;line x1="94" y1="0" x2="111" y2="17" /&gt;
  &lt;line x1="106" y1="0" x2="123" y2="17" /&gt;
  &lt;line x1="118" y1="0" x2="135" y2="17" /&gt;
  &lt;line x1="130" y1="0" x2="147" y2="17" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="10.000000,0.000000 130.000000,0.000000 147.988258,17.988258 27.988258,17.988258" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="27" y1="17" x2="147" y2="17" style="stroke-width:2" /&gt;
  &lt;line x1="27" y1="58" x2="147" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="27" y1="17" x2="27" y2="58" style="stroke-width:2" /&gt;
  &lt;line x1="39" y1="17" x2="39" y2="58" /&gt;
  &lt;line x1="51" y1="17" x2="51" y2="58" /&gt;
  &lt;line x1="63" y1="17" x2="63" y2="58" /&gt;
  &lt;line x1="75" y1="17" x2="75" y2="58" /&gt;
  &lt;line x1="87" y1="17" x2="87" y2="58" /&gt;
  &lt;line x1="99" y1="17" x2="99" y2="58" /&gt;
  &lt;line x1="111" y1="17" x2="111" y2="58" /&gt;
  &lt;line x1="123" y1="17" x2="123" y2="58" /&gt;
  &lt;line x1="135" y1="17" x2="135" y2="58" /&gt;
  &lt;line x1="147" y1="17" x2="147" y2="58" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="27.988258,17.988258 147.988258,17.988258 147.988258,58.112379 27.988258,58.112379" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="87.988258" y="78.112379" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;7680&lt;/text&gt;
&lt;text x="167.988258" y="38.050318" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,167.988258,38.050318)"&gt;1024&lt;/text&gt;
&lt;text x="8.994129" y="69.118250" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,8.994129,69.118250)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Or, if we wanted to make a new dimension, we would use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.stack&lt;/span&gt;&lt;/code&gt;. In this
case note that we’ve run out of easily visible dimensions, so you should take
note of the listed shape in the table input on the left more than the picture
on the right. Notice that we’ve stacked these 3d images into a 4d image.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 3.16 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (10, 201, 1024, 768) &lt;/td&gt; &lt;td&gt; (1, 201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 30 Tasks &lt;/td&gt;&lt;td&gt; 10 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="354" height="194" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="25" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="25" x2="25" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" style="stroke-width:2" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 25.412617,0.000000 25.412617,25.412617 0.000000,25.412617" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="12.706308" y="45.412617" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;10&lt;/text&gt;
&lt;text x="45.412617" y="12.706308" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,45.412617,12.706308)"&gt;1&lt;/text&gt;&lt;/p&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="95" y1="0" x2="119" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="95" y1="120" x2="119" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="95" y1="0" x2="95" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="119" y1="24" x2="119" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="95.000000,0.000000 119.664918,24.664918 119.664918,144.664918 95.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="95" y1="0" x2="185" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="119" y1="24" x2="209" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="95" y1="0" x2="119" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="185" y1="0" x2="209" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="95.000000,0.000000 185.000000,0.000000 209.664918,24.664918 119.664918,24.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="119" y1="24" x2="209" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="119" y1="144" x2="209" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="119" y1="24" x2="119" y2="144" style="stroke-width:2" /&gt;
  &lt;line x1="209" y1="24" x2="209" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="119.664918,24.664918 209.664918,24.664918 209.664918,144.664918 119.664918,144.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="164.664918" y="164.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="229.664918" y="84.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,229.664918,84.664918)"&gt;1024&lt;/text&gt;
&lt;text x="97.332459" y="152.332459" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,97.332459,152.332459)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;These are the common case situations, where you have a single axis along which
you want to stitch images together.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="full-example"&gt;
&lt;h2&gt;Full example&lt;/h2&gt;
&lt;p&gt;This works fine for combining along a single axis. However if we need to
combine across multiple we need to perform multiple concatenate steps.
Fortunately there is a simpler option &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.block"&gt;da.block&lt;/a&gt;, which can
concatenate along multiple axes at once if you give it a nested list of dask
arrays.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;laxy_array_00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lazy_array_01&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lazy_array_10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lazy_array_11&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We now do the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Parse each filename to learn where it should live in the larger array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;See how many files are in each of our relevant dimensions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allocate a NumPy object-dtype array of the appropriate size, where each
element of this array will hold a single-chunk Dask array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go through our filenames and insert the proper Dask array into the right
position&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;da.block&lt;/span&gt;&lt;/code&gt; on the result&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This code is a bit complex, but shows what this looks like in a real-world
setting&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Get various dimensions&lt;/span&gt;

&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;splitext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="n"&gt;remap_comps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))),&lt;/span&gt;
    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn_comp_sets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Create an empty object array to organize each chunk that loads a TIFF&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remap_comps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lazy_arrays&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_ch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_stack&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Stitch together the many blocks into a single array&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;table&gt;  &lt;thead&gt;    &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;th&gt; Array &lt;/th&gt;&lt;th&gt; Chunk &lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;th&gt; Bytes &lt;/th&gt;&lt;td&gt; 188.74 GB &lt;/td&gt; &lt;td&gt; 316.15 MB &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Shape &lt;/th&gt;&lt;td&gt; (3, 199, 201, 1024, 768) &lt;/td&gt; &lt;td&gt; (1, 1, 201, 1024, 768) &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Count &lt;/th&gt;&lt;td&gt; 2985 Tasks &lt;/td&gt;&lt;td&gt; 597 Chunks &lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;th&gt; Type &lt;/th&gt;&lt;td&gt; uint16 &lt;/td&gt;&lt;td&gt; numpy.ndarray &lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;svg width="386" height="194" style="stroke:rgb(0,0,0);stroke-width:1" &gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="0" y1="0" x2="41" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="8" x2="41" y2="8" /&gt;
  &lt;line x1="0" y1="16" x2="41" y2="16" /&gt;
  &lt;line x1="0" y1="25" x2="41" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" style="stroke-width:2" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="0" y1="0" x2="0" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="1" y1="0" x2="1" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="2" y1="0" x2="2" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="3" y1="0" x2="3" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="4" y1="0" x2="4" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="5" y1="0" x2="5" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="6" y1="0" x2="6" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="7" y1="0" x2="7" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="8" y1="0" x2="8" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="9" y1="0" x2="9" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="10" y1="0" x2="10" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="11" y1="0" x2="11" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="12" y1="0" x2="12" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="13" y1="0" x2="13" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="14" y1="0" x2="14" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="15" y1="0" x2="15" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="16" y1="0" x2="16" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="17" y1="0" x2="17" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="18" y1="0" x2="18" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="19" y1="0" x2="19" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="20" y1="0" x2="20" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="21" y1="0" x2="21" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="22" y1="0" x2="22" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="23" y1="0" x2="23" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="24" y1="0" x2="24" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="25" y1="0" x2="25" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="26" y1="0" x2="26" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="27" y1="0" x2="27" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="28" y1="0" x2="28" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="29" y1="0" x2="29" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="30" y1="0" x2="30" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="31" y1="0" x2="31" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="32" y1="0" x2="32" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="33" y1="0" x2="33" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="34" y1="0" x2="34" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="35" y1="0" x2="35" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="36" y1="0" x2="36" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="37" y1="0" x2="37" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="38" y1="0" x2="38" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="39" y1="0" x2="39" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="40" y1="0" x2="40" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" /&gt;
  &lt;line x1="41" y1="0" x2="41" y2="25" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="0.000000,0.000000 41.887587,0.000000 41.887587,25.412617 0.000000,25.412617" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="20.943793" y="45.412617" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;199&lt;/text&gt;
&lt;text x="61.887587" y="12.706308" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(0,61.887587,12.706308)"&gt;3&lt;/text&gt;&lt;/p&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="111" y1="0" x2="135" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="111" y1="120" x2="135" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="111" y1="0" x2="111" y2="120" style="stroke-width:2" /&gt;
  &lt;line x1="135" y1="24" x2="135" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="111.000000,0.000000 135.664918,24.664918 135.664918,144.664918 111.000000,120.000000" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="111" y1="0" x2="201" y2="0" style="stroke-width:2" /&gt;
  &lt;line x1="135" y1="24" x2="225" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="111" y1="0" x2="135" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="201" y1="0" x2="225" y2="24" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="111.000000,0.000000 201.000000,0.000000 225.664918,24.664918 135.664918,24.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Horizontal lines --&gt;
  &lt;line x1="135" y1="24" x2="225" y2="24" style="stroke-width:2" /&gt;
  &lt;line x1="135" y1="144" x2="225" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Vertical lines --&gt;
  &lt;line x1="135" y1="24" x2="135" y2="144" style="stroke-width:2" /&gt;
  &lt;line x1="225" y1="24" x2="225" y2="144" style="stroke-width:2" /&gt;
  &lt;!-- Colored Rectangle --&gt;
  &lt;polygon points="135.664918,24.664918 225.664918,24.664918 225.664918,144.664918 135.664918,144.664918" style="fill:#ECB172A0;stroke-width:0"/&gt;
  &lt;!-- Text --&gt;
&lt;p&gt;&lt;text x="180.664918" y="164.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" &gt;768&lt;/text&gt;
&lt;text x="245.664918" y="84.664918" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(-90,245.664918,84.664918)"&gt;1024&lt;/text&gt;
&lt;text x="113.332459" y="152.332459" font-size="1.0rem" font-weight="100" text-anchor="middle" transform="rotate(45,113.332459,152.332459)"&gt;201&lt;/text&gt;
&lt;/svg&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;That’s a 180 GB logical array, composed of around 600 chunks, each of size 300
MB. We can now do normal NumPy like computations on this array using &lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask
Array&lt;/a&gt;, but we’ll save that for a
future post.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="c1"&gt;# array computations would work fine, and would run in low memory&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="c1"&gt;# but we&amp;#39;ll save actual computation for future posts&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 1056)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="save-data"&gt;
&lt;h1&gt;Save Data&lt;/h1&gt;
&lt;p&gt;To simplify data loading in the future, we store this in a large chunked
array format like &lt;a class="reference external" href="https://zarr.readthedocs.io/"&gt;Zarr&lt;/a&gt; using the &lt;a class="reference external" href="https://docs.dask.org/en/latest/array-api.html#dask.array.Array.to_zarr"&gt;to_zarr&lt;/a&gt;
method.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mydata.zarr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We may add additional information about the image data as &lt;a class="reference external" href="https://zarr.readthedocs.io/en/stable/tutorial.html#user-attributes"&gt;attributes&lt;/a&gt;. This
both makes things simpler for future users (they can read the full dataset with
a single line using &lt;a class="reference external" href="http://docs.dask.org/en/latest/array-api.html#dask.array.from_zarr"&gt;da.from_zarr&lt;/a&gt;) and much
more performant because Zarr is an &lt;em&gt;analysis ready format&lt;/em&gt; that is efficiently
encoded for computation.&lt;/p&gt;
&lt;p&gt;Zarr uses the &lt;a class="reference external" href="http://blosc.org/"&gt;Blosc&lt;/a&gt; library for compression by default.
For scientific imaging data, we can optionally pass compression options that provide
a good compression ratio to speed tradeoff and optimize compression
performance.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numcodecs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blosc&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mydata.zarr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;zstd&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clevel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Blosc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BITSHUFFLE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/20/load-image-data.md&lt;/span&gt;, line 1082)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;The workload above is generic and straightforward. It works well in simple
cases and also extends well to more complex cases, providing you’re willing to
write some for-loops and parsing code around your custom logic. It works on a
single small-scale laptop as well as a large HPC or Cloud cluster. If you have
a function that turns a filename into a NumPy array, you can generate large
lazy Dask array using that function, &lt;a class="reference external" href="https://docs.dask.org/en/latest/delayed.html"&gt;Dask
Delayed&lt;/a&gt; and &lt;a class="reference external" href="https://docs.dask.org/en/latest/array.html"&gt;Dask
Array&lt;/a&gt;.&lt;/p&gt;
&lt;section id="dask-image"&gt;
&lt;h2&gt;Dask Image&lt;/h2&gt;
&lt;p&gt;However, we can make things a bit easier for users if we specialize a bit. For
example the &lt;a class="reference external" href="https://image.dask.org/en/latest/"&gt;Dask Image&lt;/a&gt; library has a
parallel image reader function, which automates much of our work above in the
simple case.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_image&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_image&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;raw/*.tif&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Similarly libraries like &lt;a class="reference external" href="https://xarray.pydata.org/en/stable/"&gt;Xarray&lt;/a&gt; have
readers for other file formats, like GeoTIFF.&lt;/p&gt;
&lt;p&gt;As domains do more and more work like what we did above they tend to write down
common patterns into domain-specific libraries, which then increases the
accessibility and user base of these tools.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="gpus"&gt;
&lt;h2&gt;GPUs&lt;/h2&gt;
&lt;p&gt;If we have special hardware lying around like a few GPUs, we can move the data
over to it and perform computations with a library like CuPy, which mimics
NumPy very closely. Thus benefiting from the same operations listed above, but
with the added performance of GPUs behind them.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cp&lt;/span&gt;
&lt;span class="n"&gt;a_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="computation"&gt;
&lt;h2&gt;Computation&lt;/h2&gt;
&lt;p&gt;Finally, in future blogposts we plan to talk about how to compute on our large
Dask arrays using common image-processing workloads like overlapping stencil
functions, segmentation and deconvolution, and integrating with other libraries
like ITK.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/20/load-image-data/"/>
    <summary>Document headings start at H2, not H1 [myst.header]</summary>
    <category term="dask-image" label="dask-image"/>
    <category term="python" label="python"/>
    <category term="scikit-image" label="scikit-image"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-20T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/19/python-gpus-status-update/</id>
    <title>Python and GPUs: A Status Update</title>
    <updated>2019-06-19T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This blogpost was delivered in talk form at the recent &lt;a class="reference external" href="https://pasc19.pasc-conference.org/"&gt;PASC
2019&lt;/a&gt; conference.
&lt;a class="reference external" href="https://docs.google.com/presentation/d/e/2PACX-1vSajAH6FzgQH4OwOJD5y-t9mjF9tTKEeljguEsfcjavp18pL4LkpABy4lW2uMykIUvP2dC-1AmhCq6l/pub?start=false&amp;amp;amp;loop=false&amp;amp;amp;delayms=60000"&gt;Slides for that talk are
here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 14)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;

&lt;p&gt;We’re improving the state of scalable GPU computing in Python.&lt;/p&gt;
&lt;p&gt;This post lays out the current status, and describes future work.
It also summarizes and links to several other more blogposts from recent months that drill down into different topics for the interested reader.&lt;/p&gt;
&lt;p&gt;Broadly we cover briefly the following categories:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Python libraries written in CUDA like CuPy and RAPIDS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python-CUDA compilers, specifically Numba&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scaling these libraries out with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network communication with UCX&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Packaging with Conda&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 29)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performance-of-gpu-accelerated-python-libraries"&gt;
&lt;h1&gt;Performance of GPU accelerated Python Libraries&lt;/h1&gt;
&lt;p&gt;Probably the easiest way for a Python programmer to get access to GPU
performance is to use a GPU-accelerated Python library. These provide a set of
common operations that are well tuned and integrate well together.&lt;/p&gt;
&lt;p&gt;Many users know libraries for deep learning like PyTorch and TensorFlow, but
there are several other for more general purpose computing. These tend to copy
the APIs of popular Python projects:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU: &lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU (again): &lt;a class="reference external" href="https://github.com/google/jax"&gt;Jax&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pandas on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cudf/nightly/"&gt;RAPIDS cuDF&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cuml/nightly/"&gt;RAPIDS cuML&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These libraries build GPU accelerated variants of popular Python
libraries like NumPy, Pandas, and Scikit-Learn. In order to better understand
the relative performance differences
&lt;a class="reference external" href="https://github.com/pentschev"&gt;Peter Entschev&lt;/a&gt; recently put together a
&lt;a class="reference external" href="https://github.com/pentschev/pybench"&gt;benchmark suite&lt;/a&gt; to help with comparisons.
He has produced the following image showing the relative speedup between GPU
and CPU:&lt;/p&gt;
&lt;style&gt;
.vega-actions a {
    margin-right: 12px;
    color: #757575;
    font-weight: normal;
    font-size: 13px;
}
.error {
    color: red;
}
&lt;/style&gt;
&lt;script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega@5"&gt;&lt;/script&gt;
&lt;script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-lite@3.3.0"&gt;&lt;/script&gt;
&lt;script type="text/javascript" src="https://cdn.jsdelivr.net/npm//vega-embed@4"&gt;&lt;/script&gt;
&lt;div id="vis"&gt;&lt;/div&gt;
&lt;p&gt;There are lots of interesting results there.
Peter goes into more depth in this in &lt;a class="reference external" href="https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks"&gt;his blogpost&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;More broadly though, we see that there is variability in performance.
Our mental model for what is fast and slow on the CPU doesn’t neccessarily
carry over to the GPU. Fortunately though, due consistent APIs, users that are
familiar with Python can easily experiment with GPU acceleration without
learning CUDA.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 78)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="numba-compiling-python-to-cuda"&gt;
&lt;h1&gt;Numba: Compiling Python to CUDA&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;See also this &lt;a class="reference external" href="https://blog.dask.org/2019/04/09/numba-stencil"&gt;recent blogpost about Numba
stencils&lt;/a&gt; and the attached &lt;a class="reference external" href="https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c"&gt;GPU
notebook&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The built-in operations in GPU libraries like CuPy and RAPIDS cover most common
operations. However, in real-world settings we often find messy situations
that require writing a little bit of custom code. Switching down to C/C++/CUDA
in these cases can be challenging, especially for users that are primarily
Python developers. This is where Numba can come in.&lt;/p&gt;
&lt;p&gt;Python has this same problem on the CPU as well. Users often couldn’t be
bothered to learn C/C++ to write fast custom code. To address this there are
tools like Cython or Numba, which let Python programmers write fast numeric
code without learning much beyond the Python language.&lt;/p&gt;
&lt;p&gt;For example, Numba accelerates the for-loop style code below about 500x on the
CPU, from slow Python speeds up to fast C/Fortran speeds.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;  &lt;span class="c1"&gt;# We added these two lines for a 500x speedup&lt;/span&gt;

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;    &lt;span class="c1"&gt;# We added these two lines for a 500x speedup&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The ability to drop down to low-level performant code without context switching
out of Python is useful, particularly if you don’t already know C/C++ or
have a compiler chain set up for you (which is the case for most Python users
today).&lt;/p&gt;
&lt;p&gt;This benefit is even more pronounced on the GPU. While many Python programmers
know a little bit of C, very few of them know CUDA. Even if they did, they
would probably have difficulty in setting up the compiler tools and development
environment.&lt;/p&gt;
&lt;p&gt;Enter &lt;a class="reference external" href="https://numba.pydata.org/numba-doc/dev/cuda/index.html"&gt;numba.cuda.jit&lt;/a&gt;
Numba’s backend for CUDA. Numba.cuda.jit allows Python users to author,
compile, and run CUDA code, written in Python, interactively without leaving a
Python session. Here is an image of writing a stencil computation that
smoothes a 2d-image all from within a Jupyter Notebook:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/numba.cuda.jit.png"
     width="100%"
     alt="Numba.cuda.jit in a Jupyter Notebook"&gt;&lt;/p&gt;
&lt;p&gt;Here is a simplified comparison of Numba CPU/GPU code to compare programming
style..
The GPU code gets a 200x speed improvement over a single CPU core.&lt;/p&gt;
&lt;section id="cpu-600-ms"&gt;
&lt;h2&gt;CPU – 600 ms&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or if we use the fancy numba.stencil decorator …&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stencil&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="gpu-3-ms"&gt;
&lt;h2&gt;GPU – 3 ms&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;smooth_gpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                     &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                     &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Numba.cuda.jit has been out in the wild for years.
It’s accessible, mature, and fun to play with.
If you have a machine with a GPU in it and some curiosity
then we strongly recommend that you try it out.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;
&lt;span class="c1"&gt;# or&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba.cuda&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 186)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="scaling-with-dask"&gt;
&lt;h1&gt;Scaling with Dask&lt;/h1&gt;
&lt;p&gt;As mentioned in previous blogposts
(
&lt;a class="reference external" href="https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps"&gt;1&lt;/a&gt;,
&lt;a class="reference external" href="https://blog.dask.org/2019/01/13/dask-cudf-first-steps"&gt;2&lt;/a&gt;,
&lt;a class="reference external" href="https://blog.dask.org/2019/03/04/building-gpu-groupbys"&gt;3&lt;/a&gt;,
&lt;a class="reference external" href="https://blog.dask.org/2019/03/18/dask-nep18"&gt;4&lt;/a&gt;
)
we’ve been generalizing &lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt;, to operate not just with
Numpy arrays and Pandas dataframes, but with anything that looks enough like
Numpy (like &lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt; or
&lt;a class="reference external" href="https://sparse.pydata.org/en/latest/"&gt;Sparse&lt;/a&gt; or
&lt;a class="reference external" href="https://github.com/google/jax"&gt;Jax&lt;/a&gt;) or enough like Pandas (like &lt;a class="reference external" href="https://docs.rapids.ai/api/cudf/nightly/"&gt;RAPIDS
cuDF&lt;/a&gt;)
to scale those libraries out too. This is working out well. Here is a brief
video showing Dask array computing an SVD in parallel, and seeing what happens
when we swap out the Numpy library for CuPy.&lt;/p&gt;
&lt;iframe width="560"
        height="315"
        src="https://www.youtube.com/embed/QyyxpzNPuIE?start=1046"
        frameborder="0"
        allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
        allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;We see that there is about a 10x speed improvement on the computation. Most
importantly, we were able to switch between a CPU implementation and a GPU
implementation with a small one-line change, but continue using the
sophisticated algorithms with Dask Array, like it’s parallel SVD
implementation.&lt;/p&gt;
&lt;p&gt;We also saw a relative slowdown in communication. In general almost all
non-trivial Dask + GPU work today is becoming communication-bound. We’ve
gotten fast enough at computation that the relative importance of communication
has grown significantly. We’re working to resolve this with our next topic,
UCX.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 224)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="communication-with-ucx"&gt;
&lt;h1&gt;Communication with UCX&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;See &lt;a class="reference external" href="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9679/s9679-ucx-python-a-flexible-communication-library-for-python-applications.mp4"&gt;this talk&lt;/a&gt; by &lt;a class="reference external" href="https://github.com/Akshay-Venkatesh"&gt;Akshay
Venkatesh&lt;/a&gt; or view &lt;a class="reference external" href="https://www.slideshare.net/MatthewRocklin/ucxpython-a-flexible-communication-library-for-python-applications"&gt;the
slides&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Also see &lt;a class="reference external" href="https://blog.dask.org/2019/06/09/ucx-dgx"&gt;this recent blogpost about UCX and
Dask&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We’ve been integrating the &lt;a class="reference external" href="https://openucx.org"&gt;OpenUCX&lt;/a&gt; library into Python
with &lt;a class="reference external" href="https://github.com/rapidsai/ucx-py"&gt;UCX-Py&lt;/a&gt;. UCX provides uniform access
to transports like TCP, InfiniBand, shared memory, and NVLink. UCX-Py is the
first time that access to many of these transports has been easily accessible
from the Python language.&lt;/p&gt;
&lt;p&gt;Using UCX and Dask together we’re able to get significant speedups. Here is a
trace of the SVD computation from before both before and after adding UCX:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_lcc_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;strong&gt;After UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_dgx_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;p&gt;There is still a great deal to do here though (the blogpost linked above has
several items in the Future Work section).&lt;/p&gt;
&lt;p&gt;People can try out UCX and UCX-Py with highly experimental conda packages:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;jakirkham&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="n"&gt;cudatoolkit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;9.2&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="o"&gt;=*=&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.7&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We hope that this work will also affect non-GPU users on HPC systems with
Infiniband, or even users on consumer hardware due to the easy access to shared
memory communication.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 263)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="packaging"&gt;
&lt;h1&gt;Packaging&lt;/h1&gt;
&lt;p&gt;In an &lt;a class="reference external" href="https://matthewrocklin.com/blog/work/2018/12/17/gpu-python-challenges"&gt;earlier blogpost&lt;/a&gt;
we discussed the challenges around installing the wrong versions of CUDA
enabled packages that don’t match the CUDA driver installed on the system.
Fortunately due to recent work from &lt;a class="reference external" href="https://github.com/seibert"&gt;Stan Seibert&lt;/a&gt;
and &lt;a class="reference external" href="https://github.com/msarahan"&gt;Michael Sarahan&lt;/a&gt; at Anaconda, Conda 4.7 now
has a special &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cuda&lt;/span&gt;&lt;/code&gt; meta-package that is set to the version of the installed
driver. This should make it much easier for users in the future to install the
correct package.&lt;/p&gt;
&lt;p&gt;Conda 4.7 was just releasead, and comes with many new features other than the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cuda&lt;/span&gt;&lt;/code&gt; meta-package. You can read more about it &lt;a class="reference external" href="https://www.anaconda.com/how-we-made-conda-faster-4-7/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;There is still plenty of work to do in the packaging space today.
Everyone who builds conda packages does it their own way,
resulting in headache and heterogeneity.
This is largely due to not having centralized infrastructure
to build and test CUDA enabled packages,
like we have in &lt;a class="reference external" href="https://conda-forge.org"&gt;Conda Forge&lt;/a&gt;.
Fortunately, the Conda Forge community is working together with Anaconda and
NVIDIA to help resolve this, though that will likely take some time.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/19/python-gpus-status-update.md&lt;/span&gt;, line 290)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="summary"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;This post gave an update of the status of some of the efforts behind GPU
computing in Python. It also provided a variety of links for future reading.
We include them below if you would like to learn more:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.google.com/presentation/d/e/2PACX-1vSajAH6FzgQH4OwOJD5y-t9mjF9tTKEeljguEsfcjavp18pL4LkpABy4lW2uMykIUvP2dC-1AmhCq6l/pub?start=false&amp;amp;amp;loop=false&amp;amp;amp;delayms=60000"&gt;Slides&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU: &lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Numpy on the GPU (again): &lt;a class="reference external" href="https://github.com/google/jax"&gt;Jax&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pandas on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cudf/nightly/"&gt;RAPIDS cuDF&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn on the GPU: &lt;a class="reference external" href="https://docs.rapids.ai/api/cuml/nightly/"&gt;RAPIDS cuML&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/pentschev/pybench"&gt;Benchmark suite&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c"&gt;Numba CUDA JIT notebook&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9679/s9679-ucx-python-a-flexible-communication-library-for-python-applications.mp4"&gt;A talk on UCX&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://blog.dask.org/2019/06/09/ucx-dgx"&gt;A blogpost on UCX and Dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://www.anaconda.com/how-we-made-conda-faster-4-7/"&gt;Conda 4.7&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;script&gt;
  var spec = {
  "config": {
    "view": {
      "width": 300,
      "height": 200
    },
    "mark": {
      "tooltip": null
    },
    "axis": {
      "grid": false,
      "labelColor": "#666666",
      "labelFontSize": 16,
      "titleColor": "#666666",
      "titleFontSize": 20
    },
    "axisX": {
      "labelAngle": -30,
      "labelColor": "#666666",
      "labelFontSize": 0,
      "titleColor": "#666666",
      "titleFontSize": 0
    },
    "header": {
      "labelAngle": -20,
      "labelColor": "#666666",
      "labelFontSize": 16,
      "titleColor": "#666666",
      "titleFontSize": 20
    },
    "legend": {
      "fillColor": "#fefefe",
      "labelColor": "#666666",
      "labelFontSize": 18,
      "padding": 10,
      "strokeColor": "gray",
      "titleColor": "#666666",
      "titleFontSize": 18
    }
  },
  "data": {
    "name": "data-4957f64f65957150f8029f7df2e6936f"
  },
  "facet": {
    "column": {
      "type": "nominal",
      "field": "operation",
      "sort": {
        "field": "speedup",
        "op": "sum",
        "order": "descending"
      },
      "title": "Operation"
    }
  },
  "spec": {
    "layer": [
      {
        "mark": {
          "type": "bar",
          "fontSize": 18,
          "opacity": 1.0
        },
        "encoding": {
          "color": {
            "type": "nominal",
            "field": "size",
            "scale": {
              "domain": [
                "800MB",
                "8MB"
              ],
              "range": [
                "#7306ff",
                "#36c9dd"
              ]
            },
            "title": "Array Size"
          },
          "x": {
            "type": "nominal",
            "field": "size"
          },
          "y": {
            "type": "quantitative",
            "axis": {
              "title": "GPU Speedup Over CPU"
            },
            "field": "speedup",
            "scale": {
              "domain": [
                0,
                1000
              ],
              "type": "symlog"
            },
            "stack": null
          }
        },
        "height": 300,
        "width": 50
      },
      {
        "layer": [
          {
            "mark": {
              "type": "text",
              "dy": -5
            },
            "encoding": {
              "color": {
                "type": "nominal",
                "field": "size",
                "scale": {
                  "domain": [
                    "800MB",
                    "8MB"
                  ],
                  "range": [
                    "#7306ff",
                    "#36c9dd"
                  ]
                },
                "title": "Array Size"
              },
              "text": {
                "type": "quantitative",
                "field": "speedup"
              },
              "x": {
                "type": "nominal",
                "field": "size"
              },
              "y": {
                "type": "quantitative",
                "axis": {
                  "title": "GPU Speedup Over CPU"
                },
                "field": "speedup",
                "scale": {
                  "domain": [
                    0,
                    1000
                  ],
                  "type": "symlog"
                },
                "stack": null
              }
            },
            "height": 300,
            "width": 50
          },
          {
            "mark": {
              "type": "text",
              "dy": 7
            },
            "encoding": {
              "color": {
                "type": "nominal",
                "field": "size",
                "scale": {
                  "domain": [
                    "800MB",
                    "8MB"
                  ],
                  "range": [
                    "#7306ff",
                    "#36c9dd"
                  ]
                },
                "title": "Array Size"
              },
              "text": {
                "type": "quantitative",
                "field": "speedup"
              },
              "x": {
                "type": "nominal",
                "field": "size"
              },
              "y": {
                "type": "quantitative",
                "axis": {
                  "title": "GPU Speedup Over CPU"
                },
                "field": "speedup",
                "scale": {
                  "domain": [
                    0,
                    1000
                  ],

                  "type": "symlog"
                },
                "stack": null
              }
            },
            "height": 300,
            "width": 50
          }
        ]
      }
    ]
  },
  "$schema": "https://vega.github.io/schema/vega-lite/v3.3.0.json",
  "datasets": {
    "data-4957f64f65957150f8029f7df2e6936f": [
      {
        "operation": "FFT",
        "speedup": 5.3,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "FFT",
        "speedup": 210.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Sum",
        "speedup": 8.3,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Sum",
        "speedup": 66.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Standard Deviation",
        "speedup": 1.1,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Standard Deviation",
        "speedup": 3.5,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Elementwise",
        "speedup": 150.0,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Elementwise",
        "speedup": 270.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Matrix Multiplication",
        "speedup": 18.0,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Matrix Multiplication",
        "speedup": 11.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "Array Slicing",
        "speedup": 3.6,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Array Slicing",
        "speedup": 190.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      },
      {
        "operation": "SVD",
        "speedup": 1.5,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "SVD",
        "speedup": 17.0,
        "shape0": 10000,
        "shape1": 1000,
        "shape": "10000x1000",
        "size": "800MB"
      },
      {
        "operation": "Stencil",
        "speedup": 5.1,
        "shape0": 1000,
        "shape1": 1000,
        "shape": "1000x1000",
        "size": "8MB"
      },
      {
        "operation": "Stencil",
        "speedup": 150.0,
        "shape0": 10000,
        "shape1": 10000,
        "shape": "10000x10000",
        "size": "800MB"
      }
    ]
  }
};

  var embedOpt = {"mode": "vega-lite"};

  function showError(el, error){
      el.innerHTML = ('&lt;div class="error" style="color:red;"&gt;'
                      + '&lt;p&gt;JavaScript Error: ' + error.message + '&lt;/p&gt;'
                      + "&lt;p&gt;This usually means there's a typo in your chart specification. "
                      + "See the javascript console for the full traceback.&lt;/p&gt;"
                      + '&lt;/div&gt;');
      throw error;
  }
  vegaEmbed("#vis", spec, embedOpt)
    .catch(error =&gt; showError(el, error));
&lt;/script&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/19/python-gpus-status-update/"/>
    <summary>This blogpost was delivered in talk form at the recent PASC
2019 conference.
Slides for that talk are
here.</summary>
    <category term="python" label="python"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-19T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/06/09/ucx-dgx/</id>
    <title>Experiments in High Performance Networking with UCX and DGX</title>
    <updated>2019-06-09T00:00:00+00:00</updated>
    <author>
      <name>Rick Zamora</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This post is about experimental and rapidly changing software.
Code examples in this post should not be relied upon to work in the future.&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 12)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="executive-summary"&gt;

&lt;p&gt;This post talks about connecting UCX, a high performance networking library, to
Dask, a parallel Python library, to accelerate communication-heavy workloads,
particularly when using GPUs.&lt;/p&gt;
&lt;p&gt;Additionally, we do this work on a DGX, a high-end multi-CPU multi-GPU machine
with a complex internal network. Working in this context was good to force
improvements in setting up Dask in heterogeneous situations targeting
different network cards, CPU sockets, GPUs, and so on..&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 23)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="motivation"&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;Many distributed computing workloads are communication-bound.
This is common in cases like the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Dataframe joins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning algorithms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complex array computations&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Communication becomes a bigger bottleneck as we accelerate our computation,
such as when we use GPUs for computing.&lt;/p&gt;
&lt;p&gt;Historically, high performance communication was only available using MPI, or
with custom solutions. This post describes an effort to get close to the
communication bandwidth of MPI while still maintaining the ease of
programmability and accessibility of a dynamic system like Dask.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 40)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="ucx-python-and-dask"&gt;
&lt;h1&gt;UCX, Python, and Dask&lt;/h1&gt;
&lt;p&gt;To get high performance networking in Dask, we wrapped UCX with Python and
then connected that to Dask.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://www.openucx.org/"&gt;OpenUCX&lt;/a&gt; project provides a uniform API around
various high performance networking libraries like InfiniBand, traditional
networking protocols like TCP/shared memory, and GPU-specific protocols like
NVLink. It is a layer beneath something like OpenMPI (the main user of OpenUCX
today) that figures out which networking system to use.&lt;/p&gt;
&lt;a href="http://www.openucx.org/wp-content/uploads/2015/07/ucx-architecture-1024x505.jpg"&gt;
&lt;img src="http://www.openucx.org/wp-content/uploads/2015/07/ucx-architecture-1024x505.jpg"
     width="100%" /&gt;&lt;/a&gt;
&lt;p&gt;Python users today don’t have much access to these network libraries, except
through MPI, which is sometimes not ideal. (&lt;a class="reference external" href="https://pypi.org/search/?q=infiniband"&gt;Try searching for “infiniband” on
PyPI.&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;This led us to create &lt;a class="reference external" href="https://github.com/rapidsai/ucx-py/"&gt;UCX-Py&lt;/a&gt;
.
UCX-Py is a Python wrapper around the UCX C library, which provides a Pythonic
API, both with blocking syntax appropriate for traditional HPC programs, as
well as a non-blocking &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;async/await&lt;/span&gt;&lt;/code&gt; syntax for more concurrent programs (like
Dask).
For more information on UCX I recommend watching Akshay’s &lt;a class="reference external" href="https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9679-ucx-python%3a+a+flexible+communication+library+for+python+applications"&gt;UCX
talk&lt;/a&gt;
from the GPU Technology Conference 2019.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: UCX-Py was primarily developed by &lt;a class="reference external" href="https://github.com/Akshay-Venkatesh/"&gt;Akshay Venkatesh&lt;/a&gt; (UCX, NVIDIA)
&lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom Augspurger&lt;/a&gt; (Dask, Pandas, Anaconda),
and &lt;a class="reference external" href="https://github.com/quasiben/"&gt;Ben Zaitlen&lt;/a&gt; (NVIDIA, RAPIDS, Dask))&lt;/em&gt;&lt;/p&gt;
&lt;video width="560" height="315" controls&gt;
    &lt;source src="https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9679/s9679-ucx-python-a-flexible-communication-library-for-python-applications.mp4"
            type="video/mp4"&gt;
&lt;/video&gt;
&lt;p&gt;We then &lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/distributed/comm/ucx.py"&gt;extended Dask communications to optionally use UCX&lt;/a&gt;.
If you have UCX and UCX-Py installed, then you can use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ucx://&lt;/span&gt;&lt;/code&gt; protocol in
addresses or the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;--protocol&lt;/span&gt; &lt;span class="pre"&gt;ucx&lt;/span&gt;&lt;/code&gt; flag when starting things up, something like
this.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ dask-scheduler --protocol ucx
Scheduler started at ucx://127.0.0.1:8786

$ dask-worker ucx://127.0.0.1:8786
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ucx://127.0.0.1:8786&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 95)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="experiment"&gt;
&lt;h1&gt;Experiment&lt;/h1&gt;
&lt;p&gt;We modified our &lt;a class="reference external" href="https://github.com/mrocklin/dask-gpu-benchmarks/blob/master/cupy-svd.ipynb"&gt;SVD with Dask and CuPy
benchmark&lt;/a&gt;
benchmark to use the UCX protocol for inter-process communication and ran it on
half of a DGX machine, using four GPUs. Here is a minimal implementation of the
UCX-enabled code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DGX&lt;/span&gt;

&lt;span class="c1"&gt;# Define DGX cluster and client&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DGX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create random data&lt;/span&gt;
&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Perform distributed SVD&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;By using UCX the overall communication times are reduced by an order of
magnitude. To produce the task-stream figures below, the benchmark was run on a
DGX-1 with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;CUDA_VISIBLE_DEVICES=[0,1,2,3]&lt;/span&gt;&lt;/code&gt;. It is clear that the red task
bars, corresponding to inter-process communication, are significantly
compressed. Communications that were taking 500ms-1s before now take around 20ms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_lcc_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;strong&gt;After UCX&lt;/strong&gt;:&lt;/p&gt;
&lt;iframe src="https://matthewrocklin.com/raw-host/task_stream_dgx_dgx16.html" width="100%" height="200"&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 139)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="diving-into-the-details"&gt;
&lt;h1&gt;Diving into the Details&lt;/h1&gt;
&lt;p&gt;On a GPU using NVLink we can get somewhere between 5-10 GB/s throughput between
pairs of GPUs. On a CPU this drops down to 1-2 GB/s (which seems well below
optimal).
These speeds can affect all Dask workloads (array, dataframe, xarray, ML, …),
but when the proper hardware is present, other bottlenecks may occur,
such as serialization when dealing with text or JSON-like data.&lt;/p&gt;
&lt;p&gt;This of course, depends on this fancy networking hardware being present.
On the GPU example above we’re mostly relying on NVLink, but we would also get
improved performance on an HPC InfiniBand network or even on a single laptop
machine using shared memory transports.&lt;/p&gt;
&lt;p&gt;The examples above was run on a DGX machine, which includes all of these
transports and more (as well as numerous GPUs).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 156)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dgx"&gt;
&lt;h1&gt;DGX&lt;/h1&gt;
&lt;p&gt;The test machine used above was a
&lt;a class="reference external" href="https://www.nvidia.com/en-us/data-center/dgx-1/"&gt;DGX-1&lt;/a&gt;, which has eight GPUs,
two CPU sockets, four Infiniband network cards, and a complex NVLink
arrangement. This is a good example of non-uniform hardware. Certain CPUs
are closer to certain GPUs and network cards, and understanding this proximity
has an order-of-magnitude effect on performance. This situation isn’t unique
to DGX machines. The same situation arises when we have …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Multiple workers in one node, with several nodes in a cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple nodes in one rack, with several racks in a data center&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple data centers, such as is the case with hybrid cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Working with the DGX was interesting because it forced us to start thinking
about heterogeneity, and making it easier to specify complex deployment scenarios
with Dask.&lt;/p&gt;
&lt;p&gt;Here is a diagram showing how the GPUs, CPUs, and Infiniband
cards are connected to each other in a DGX-1:&lt;/p&gt;
&lt;a href="https://docs.nvidia.com/dgx/bp-dgx/index.html#networking"&gt;
  &lt;img src="https://docs.nvidia.com/dgx/bp-dgx/graphics/networks.png"
         width="100%" /&gt;
&lt;/a&gt;
&lt;p&gt;And here the output of nvidia-smi showing the NVLink, networking, and CPU affinity
structure (this is mostly orthogonal to the structure displayed above).&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ nvidia-smi  topo -m
     GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7   ib0   ib1   ib2   ib3
GPU0   X    NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX   SYS   PHB   SYS
GPU1  NV1    X    NV2   NV1   SYS   NV2   SYS   SYS   PIX   SYS   PHB   SYS
GPU2  NV1   NV2    X    NV2   SYS   SYS   NV1   SYS   PHB   SYS   PIX   SYS
GPU3  NV2   NV1   NV2    X    SYS   SYS   SYS   NV1   PHB   SYS   PIX   SYS
GPU4  NV2   SYS   SYS   SYS    X    NV1   NV1   NV2   SYS   PIX   SYS   PHB
GPU5  SYS   NV2   SYS   SYS   NV1    X    NV2   NV1   SYS   PIX   SYS   PHB
GPU6  SYS   SYS   NV1   SYS   NV1   NV2    X    NV2   SYS   PHB   SYS   PIX
GPU7  SYS   SYS   SYS   NV1   NV2   NV1   NV2    X    SYS   PHB   SYS   PIX
ib0   PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS    X    SYS   PHB   SYS
ib1   SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS    X    SYS   PHB
ib2   PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB   SYS    X    SYS
ib3   SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS   PHB   SYS    X

    CPU Affinity
GPU0  0-19,40-59
GPU1  0-19,40-59
GPU2  0-19,40-59
GPU3  0-19,40-59
GPU4  20-39,60-79
GPU5  20-39,60-79
GPU6  20-39,60-79
GPU7  20-39,60-79

Legend:

  X    = Self
  SYS  = Traverse PCIe as well as the SMP interconnect between NUMA nodes
  NODE = Travrese PCIe as well as the interconnect between PCIe Host Bridges
  PHB  = Traverse PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Traverse multiple PCIe switches (without PCIe Host Bridge)
  PIX  = Traverse a single PCIe switch
  NV#  = Traverse a bonded set of # NVLinks
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The DGX was originally designed for deep learning
applications. The complex network infrastructure above can be well used by
specialized NVIDIA networking libraries like
&lt;a class="reference external" href="https://developer.nvidia.com/nccl"&gt;NCCL&lt;/a&gt;, which knows how to route things
correctly, but is something of a challenge for other more general purpose
systems like Dask to adapt to.&lt;/p&gt;
&lt;p&gt;Fortunately, in meeting this challenge we were able to clean up a number of
related issues in Dask. In particular we can now:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Specify a more heterogeneous worker configuration when starting up a local cluster
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2675"&gt;dask/distributed #2675&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learn bandwidth over time
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2658"&gt;dask/distributed #2658&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add Worker plugins to help handle things like CPU affinity (though this is
quite general)
&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2453"&gt;dask/distributed #2453&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With these changes we’re now able to describe most of the DGX structure as
configuration in the Python function below:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SpecCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.worker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TOTAL_MEMORY&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda.local_cuda_cluster&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cuda_visible_devices&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;CPUAffinity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; A Worker plugin to pin CPU affinity &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sched_setaffinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;affinity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="c1"&gt;# See nvidia-smi topo -m&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;DGX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dashboard_address&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;:8787&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threads_per_worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot; A Local Cluster for a DGX 1 machine&lt;/span&gt;

&lt;span class="sd"&gt;    NVIDIA&amp;#39;s DGX-1 machine has a complex architecture mapping CPUs,&lt;/span&gt;
&lt;span class="sd"&gt;    GPUs, and network hardware.  This function creates a local cluster&lt;/span&gt;
&lt;span class="sd"&gt;    that tries to respect this hardware as much as possible.&lt;/span&gt;

&lt;span class="sd"&gt;    It creates one Dask worker process per GPU, and assigns each worker&lt;/span&gt;
&lt;span class="sd"&gt;    process the correct CPU cores and Network interface cards to&lt;/span&gt;
&lt;span class="sd"&gt;    maximize performance.&lt;/span&gt;

&lt;span class="sd"&gt;    That being said, things aren&amp;#39;t perfect.  Today a DGX has very high&lt;/span&gt;
&lt;span class="sd"&gt;    performance between certain sets of GPUs and not others.  A Dask DGX&lt;/span&gt;
&lt;span class="sd"&gt;    cluster that uses only certain tightly coupled parts of the computer&lt;/span&gt;
&lt;span class="sd"&gt;    will have significantly higher bandwidth than a deployment on the&lt;/span&gt;
&lt;span class="sd"&gt;    entire thing.&lt;/span&gt;

&lt;span class="sd"&gt;    Parameters&lt;/span&gt;
&lt;span class="sd"&gt;    ----------&lt;/span&gt;
&lt;span class="sd"&gt;    interface: str&lt;/span&gt;
&lt;span class="sd"&gt;        The interface prefix for the infiniband networking cards.  This is&lt;/span&gt;
&lt;span class="sd"&gt;        often &amp;quot;ib&amp;quot;` or &amp;quot;bond&amp;quot;.  We will add the numeric suffix 0,1,2,3 as&lt;/span&gt;
&lt;span class="sd"&gt;        appropriate.  Defaults to &amp;quot;ib&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;    dashboard_address: str&lt;/span&gt;
&lt;span class="sd"&gt;        The address for the scheduler dashboard.  Defaults to &amp;quot;:8787&amp;quot;.&lt;/span&gt;
&lt;span class="sd"&gt;    CUDA_VISIBLE_DEVICES: str&lt;/span&gt;
&lt;span class="sd"&gt;        String like ``&amp;quot;0,1,2,3&amp;quot;`` or ``[0, 1, 2, 3]`` to restrict&lt;/span&gt;
&lt;span class="sd"&gt;        activity to different GPUs&lt;/span&gt;

&lt;span class="sd"&gt;    Examples&lt;/span&gt;
&lt;span class="sd"&gt;    --------&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; from dask_cuda import DGX&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; from dask.distributed import Client&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; cluster = DGX(interface=&amp;#39;ib&amp;#39;)&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; client = Client(cluster)&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;0,1,2,3,4,5,6,7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;memory_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOTAL_MEMORY&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;

    &lt;span class="n"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Nanny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;env&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="s2"&gt;&amp;quot;CUDA_VISIBLE_DEVICES&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cuda_visible_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;ii&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="s2"&gt;&amp;quot;UCX_TLS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;rc,cuda_copy,cuda_ipc&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;interface&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;protocol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ucx&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;ncores&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;threads_per_worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;preload&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dask_cuda.initialize_context&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;dashboard_address&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;:0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;plugins&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CPUAffinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;affinity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;silence_logs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;memory_limit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ii&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;cls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;interface&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;protocol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ucx&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;dashboard_address&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dashboard_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SpecCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However, we never got the NVLink structure down. The Dask scheduler currently
still assumes uniform bandwidths between workers. We’ve started to make small
steps towards changing this, but we’re not there yet (this will be useful as
well for people that want to think about in-rack or cross-data-center
deployments).&lt;/p&gt;
&lt;p&gt;As usual, in solving a highly specific problem, we were able to solve a number
of lingering general features, which then made our specific problem easy to
write down.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 373)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;There has been significant effort over the last few months make everything
above work. In particular we …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Modified UCX to support client-server workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrapped UCX with UCX-Py and design a Python async-await friendly interface&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrapped UCX-Py with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hooked everything together to make generic workloads function well&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is quite nice, especially for more communication heavy workloads.
However there is still plenty to do. This section details what we’re thinking
about now to continue this work.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Routing within complex networks&lt;/strong&gt;:
If you restrict yourself to four of the eight GPUs in a DGX, you can get 5-12 GB/s
between pairs of GPUs. For some workloads this can be significant. It
makes the system feel much more like a single unit than a bunch of isolated
machines.&lt;/p&gt;
&lt;p&gt;However we still can’t get great performance across the whole DGX because
there are many GPU-pairs that are not connected by NVLink, and so we get 10x
slower speeds. These dominate communication costs if you naively try to use
the full DGX.&lt;/p&gt;
&lt;p&gt;This might be solved either by:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Teaching Dask to avoid these communications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teaching UCX to route communications like these through a chain of
multiple NVLink connections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoiding complex networks altogether. Newer systems like the DGX-2 use
NVSwitch, which provides uniform connectivity, with each GPU connected
to every other GPU.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Edit&lt;/em&gt;: I’ve since learned that UCX should be able to handle this. We
should still get PCIe speeds (around 4-7 GB/s) even when we don’t have
NVLink once an upstream bug gets fixed. Hooray!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt; We can get 1-2 GB/s across InfiniBand, which isn’t bad, but also
wasn’t the full 5-8 GB/s that we were hoping for. This deserves more serious
profiling to determine what is going wrong. The current guess is that this
has to do with memory allocations.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000000000&lt;/span&gt;  &lt;span class="c1"&gt;# 1 GB&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;248&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;223&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;472&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;470&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;&amp;lt;----- Around 2 GB/s.  Slower than I expected&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Probably we’re just doing something dumb here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Package UCX:&lt;/strong&gt; Currently I’m building the UCX and UCX-Py libraries from
source (see appendix below for instructions). Ideally these would become
conda packages. &lt;a class="reference external" href="https://github.com/jakirkham"&gt;John Kirkham&lt;/a&gt; (Conda Forge,
NVIDIA, Dask) is taking a look at this along with the UCX developers from
Mellanox.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/rapidsai/ucx-py/issues/65"&gt;ucx-py #65&lt;/a&gt; for
more information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learn Heterogeneous Bandwidths:&lt;/strong&gt; In order to make good scheduling
decisions Dask needs to estimate how long it will take to move data between
machines. This question is now becoming much more complex, and depends on
both the source and destination machines (the network topology) the data
type (NumPy array, GPU array, Pandas Dataframe with text) and more. In
complex situations our bandwidths can span a 100x range (100 MB/s to 10
GB/s).&lt;/p&gt;
&lt;p&gt;Dask will have to develop more complex models for bandwidth, and
learn these over time.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/dask/distributed/issues/2743"&gt;dask/distributed
#2743&lt;/a&gt; for more
information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support other GPU libraries:&lt;/strong&gt; To send GPU data around we need to teach
Dask how to serialize Python objects into GPU buffers. There is code in
the dask/distributed repository to do this for Numba, CuPy, and RAPIDS cuDF
objects, but we’ve really only tested CuPy seriously. We should expand
this by some of the following steps:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Try a distributed Dask cuDF join computation&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/dask/distributed/pull/2746"&gt;dask/distributed #2746&lt;/a&gt; for initial work here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teach Dask to serialize array GPU libraries, like PyTorch and
TensorFlow, or possibly anything that supports the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__cuda_array_interface__&lt;/span&gt;&lt;/code&gt; protocol.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track down communication failures:&lt;/strong&gt; We still occasionally get
unexplained communication failures. We should stress test this system to
discover rough corners.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TCP&lt;/strong&gt;: Groups with high performing TCP networks can’t yet make use of UCX+Dask (though they can use either one individually).&lt;/p&gt;
&lt;p&gt;Currently using UCX in a client-server mode as we’re doing with
Dask requires access to RDMA libraries, which are often not found on systems
without networking systems like InfiniBand. This means that groups with
high performing TCP networks can’t make use of UCX+Dask.&lt;/p&gt;
&lt;p&gt;This is in progress at &lt;a class="reference external" href="https://github.com/openucx/ucx/pull/3570"&gt;openucx/ucx #3570&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commodity Hardware&lt;/strong&gt;: Currently this code is only really useful on
high performance Linux systems that have InfiniBand or NVLink. However,
it would be nice to also use this on more commodity systems, including
personal laptop computers using TCP and shared memory.&lt;/p&gt;
&lt;p&gt;Currently Dask uses TCP for inter-process communication on a single machine.
Using UCX on a personal computer would give us access to shared memory
speeds, which tend to be an order of magnitude faster.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/openucx/ucx/issues/3663"&gt;openucx/ucx #3663&lt;/a&gt; for more
information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tune Performance:&lt;/strong&gt; The 5-10 GB/s bandwidths that we see with NVLink
today are sub-optimal. With UCX-Py alone we’re able to get something like
15 GB/s on large message tranfers. We should benchmark and tune our
implementation to see what is taking up the extra time. Until things work
more robustly though, this is a secondary priority.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/06/09/ucx-dgx.md&lt;/span&gt;, line 493)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="appendix-setup"&gt;
&lt;h1&gt;Appendix: Setup&lt;/h1&gt;
&lt;p&gt;Performing these experiments depends currently on development branches in a few
repositories. This section includes my current setup.&lt;/p&gt;
&lt;section id="create-conda-environment"&gt;
&lt;h2&gt;Create Conda Environment&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ucx&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.7&lt;/span&gt; &lt;span class="n"&gt;libtool&lt;/span&gt; &lt;span class="n"&gt;cmake&lt;/span&gt; &lt;span class="n"&gt;automake&lt;/span&gt; &lt;span class="n"&gt;autoconf&lt;/span&gt; &lt;span class="n"&gt;cython&lt;/span&gt; &lt;span class="n"&gt;bokeh&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="n"&gt;ipython&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note: for some reason using conda-forge makes the autogen step below fail.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="set-up-ucx"&gt;
&lt;h2&gt;Set up UCX&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Clone UCX repository and get branch
git clone https://github.com/openucx/ucx
cd ucx
git remote add Akshay-Venkatesh git@github.com:Akshay-Venkatesh/ucx.git
git remote update Akshay-Venkatesh
git checkout ucx-cuda

# Build
git clean -xfd
export CUDA_HOME=/usr/local/cuda-9.2/
./autogen.sh
mkdir build
cd build
../configure --prefix=$CONDA_PREFIX --enable-debug --with-cuda=$CUDA_HOME --enable-mt --disable-cma CPPFLAGS=&amp;quot;-I//usr/local/cuda-9.2/include&amp;quot;
make -j install

# Verify
ucx_info -d
which ucx_info  # verify that this is in the conda environment

# Verify that we see NVLink speeds
ucx_perftest -t tag_bw -m cuda -s 1048576 -n 1000 &amp;amp; ucx_perftest dgx15 -t tag_bw -m cuda -s 1048576 -n 1000
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="set-up-ucx-py"&gt;
&lt;h2&gt;Set up UCX-Py&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git clone git@github.com:rapidsai/ucx-py
cd ucx-py

export UCX_PATH=$CONDA_PREFIX
make install
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="set-up-dask"&gt;
&lt;h2&gt;Set up Dask&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;clone&lt;/span&gt; &lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="nd"&gt;@github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;clone&lt;/span&gt; &lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="nd"&gt;@github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;distributed&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="optionally-set-up-cupy"&gt;
&lt;h2&gt;Optionally set up cupy&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cuda92&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="optionally-set-up-cudf"&gt;
&lt;h2&gt;Optionally set up cudf&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;rapidsai&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nightly&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt; &lt;span class="n"&gt;cudatoolkit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;9.2&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="optionally-set-up-jupyterlab"&gt;
&lt;h2&gt;Optionally set up JupyterLab&lt;/h2&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;ipykernel&lt;/span&gt; &lt;span class="n"&gt;jupyterlab&lt;/span&gt; &lt;span class="n"&gt;nb_conda_kernels&lt;/span&gt; &lt;span class="n"&gt;nodejs&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For the Dask dashboard&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask_labextension&lt;/span&gt;
&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;labextension&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;labextension&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="my-benchmark"&gt;
&lt;h2&gt;My Benchmark&lt;/h2&gt;
&lt;p&gt;I’ve been using the following benchmark to test communication. It allocates a
chunked Dask array, and then adds it to its transpose, which forces a lot of
communication, but not much computation.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pprint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.utils&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;format_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_bytes&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="c1"&gt;# Set up workers on the local machine&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;DGX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asynchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;silence_logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asynchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

            &lt;span class="c1"&gt;# Create a simple random array&lt;/span&gt;
            &lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;128 MiB&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;chunks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Add X to its transpose, forcing computation&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Collect, aggregate, and print peer-to-peer bandwidths&lt;/span&gt;
            &lt;span class="n"&gt;incoming_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;dask_worker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dask_worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incoming_transfer_log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;bandwidths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;incoming_logs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;total&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;bandwidths&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;who&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bandwidth&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;bandwidths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;format_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/s&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bandwidths&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bandwidths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note: most of this example is just getting back diagnostics, which can be
easily ignored. Also, you can drop the async/await code if you like. I think
that there should probably be more examples in the world using Dask with
async/await syntax, so I decided to leave it in.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/09/ucx-dgx/"/>
    <summary>This post is about experimental and rapidly changing software.
Code examples in this post should not be relied upon to work in the future.</summary>
    <category term="python" label="python"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-09T00:00:00+00:00</published>
  </entry>
</feed>
