<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged pangeo</title>
  <updated>2026-03-05T15:05:26.686186+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/pangeo/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2017/09/18/pangeo-1/</id>
    <title>Dask on HPC - Initial Work</title>
    <updated>2017-09-18T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc.&lt;/a&gt; and the &lt;a class="reference external" href="https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504780"&gt;NSF
EarthCube&lt;/a&gt; program.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We &lt;a class="reference external" href="http://blogs.ei.columbia.edu/2017/09/13/pangeo-project-will-improve-access-to-climate-data/"&gt;recently
announced&lt;/a&gt;
a collaboration between the &lt;a class="reference external" href="https://ncar.ucar.edu/"&gt;National Center for Atmospheric Research
(NCAR)&lt;/a&gt;, &lt;a class="reference external" href="http://www.ldeo.columbia.edu/"&gt;Columbia
University&lt;/a&gt;, and Anaconda Inc. to accelerate the
analysis of atmospheric and oceanographic data on high performance computers
(HPC) with XArray and Dask. The &lt;a class="reference external" href="https://figshare.com/articles/Pangeo_NSF_Earthcube_Proposal/5361094"&gt;full
text&lt;/a&gt; of
the proposed work is &lt;a class="reference external" href="https://figshare.com/articles/Pangeo_NSF_Earthcube_Proposal/5361094"&gt;available
here&lt;/a&gt;. We
are very grateful to the NSF EarthCube program for funding this work, which
feels particularly relevant today in the wake (and continued threat) of the
major storms Harvey, Irma, and Jose.&lt;/p&gt;
&lt;p&gt;This is a collaboration of academic scientists (Columbia), infrastructure
stewards (NCAR), and software developers (Anaconda and Columbia and NCAR) to
scale current workflows with XArray and Jupyter onto big-iron HPC systems and
peta-scale datasets. In the first week after the grant closed a few of us
focused on the quickest path to get science groups up and running with XArray,
Dask, and Jupyter on these HPC systems. This blogpost details what we achieved
and some of the new challenges that we’ve found in that first week. We hope to
follow this blogpost with many more to come in the future.
Today we cover the following topics:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Deploying Dask with MPI&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive deployments on a batch job scheduler, in this case PBS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The virtues of JupyterLab in a remote system&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network performance and 3GB/s Infiniband&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Modernizing XArray’s interactions with Dask’s distributed scheduler&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A video walkthrough of deploying Dask and XArray on an HPC system is available &lt;a class="reference external" href="https://www.youtube.com/watch?v=7i5m78DSr34"&gt;on
YouTube&lt;/a&gt;, and instructions for
atmospheric scientists with access to the &lt;a class="reference external" href="https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne"&gt;Cheyenne
Supercomputer&lt;/a&gt;
are available
&lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/wiki/Getting-Started-with-Dask-on-Cheyenne"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now let’s start with the technical issues:&lt;/p&gt;
&lt;section id="deploying-dask-with-mpi"&gt;
&lt;h1&gt;Deploying Dask with MPI&lt;/h1&gt;
&lt;p&gt;HPC systems use job schedulers like SGE, SLURM, PBS, LSF, and others. Dask
has been deployed on all of these systems before either by academic groups or
financial companies. However every time we do this it’s a little different and
generally tailored to a particular cluster.&lt;/p&gt;
&lt;p&gt;We wanted to make something more general. This started out as a &lt;a class="reference external" href="https://github.com/dask/distributed/issues/1260"&gt;GitHub issue
on PBS scripts&lt;/a&gt; that tried to
make a simple common template that people could copy-and-modify.
Unfortunately, there were significant challenges with this. HPC systems and
their job schedulers seem to focus on, and easily support, only two common use
cases:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Embarrassingly parallel “run this script 1000 times” jobs. This is too
simple for what we have to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Message_Passing_Interface"&gt;MPI&lt;/a&gt; jobs. This
seemed like overkill, but is the approach that we ended up taking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Deploying dask is somewhere between these two. It falls into the master-slave
pattern (or perhaps more appropriately coordinator-workers). We ended up
building an &lt;a class="reference external" href="http://mpi4py.readthedocs.io/en/stable/"&gt;MPI4Py&lt;/a&gt; program that
launches Dask. MPI is well supported, and more importantly consistently
supported, by all HPC job schedulers so depending on MPI provides a level of
stability across machines. Now dask.distributed ships with a new &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-mpi&lt;/span&gt;&lt;/code&gt;
executable:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;mpirun --np 4 dask-mpi
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To be clear, Dask isn’t using MPI for inter-process communication. It’s still
using TCP. We’re just using MPI to launch a scheduler and several workers and
hook them all together. In pseudocode the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-mpi&lt;/span&gt;&lt;/code&gt; executable looks
something like this:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;mpi4py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MPI&lt;/span&gt;
&lt;span class="n"&gt;comm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MPI&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMM_WORLD&lt;/span&gt;
&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;comm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get_rank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start_dask_scheduler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start_dask_worker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
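&lt;p&gt;As a rough illustration of that dispatch, the sketch below maps rank numbers to roles without needing MPI installed. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;plan_roles&lt;/span&gt;&lt;/code&gt; helper is hypothetical and is not part of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-mpi&lt;/span&gt;&lt;/code&gt;; it only mirrors the rank test in the pseudocode above.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
def plan_roles(world_size):
    """Return the role each MPI rank would take, following the pseudocode above."""
    # Rank 0 becomes the scheduler; every other rank becomes a worker.
    return ["scheduler" if rank == 0 else "worker" for rank in range(world_size)]


if __name__ == "__main__":
    # Four processes, as in the mpirun example: one scheduler and three workers.
    for rank, role in enumerate(plan_roles(4)):
        print(rank, role)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Under this reading, asking MPI for four processes yields three workers, so the process count should be one more than the number of workers wanted.&lt;/p&gt;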
&lt;p&gt;Socially this is useful because &lt;em&gt;every&lt;/em&gt; cluster management team knows how to
support MPI, so anyone with access to such a cluster has someone they can ask
for help. We’ve successfully translated the question “How do I start Dask?” to
the question “How do I run this MPI program?” which is a question that the
technical staff at supercomputer facilities are generally much better equipped
to handle.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="working-interactively-on-a-batch-scheduler"&gt;
&lt;h1&gt;Working Interactively on a Batch Scheduler&lt;/h1&gt;
&lt;p&gt;Our collaboration is focused on interactive analysis of big datasets. This
means that people expect to open up Jupyter notebooks, connect to clusters
of many machines, and compute on those machines while they sit at their
computer.&lt;/p&gt;
&lt;img src="/images/pangeo-dask-client.png" width="60%"&gt;
&lt;p&gt;Unfortunately most job schedulers were designed for batch scheduling. They
will try to run your job quickly, but don’t mind waiting for a few hours for a
nice set of machines on the supercomputer to open up. As you ask for more
time and more machines, waiting times can increase drastically. For most MPI
jobs this is fine because people aren’t expecting to get a result right away
and they’re certainly not interacting with the program, but in our case we
really do want some results right away, even if they’re only part of what we
asked for.&lt;/p&gt;
&lt;p&gt;Handling this problem long term will require both technical work and policy
decisions. In the short term we take advantage of two facts:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Many small jobs can start more quickly than a few large ones. These take
advantage of holes in the schedule that are too small to be used by larger
jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask doesn’t need to be started all at once. Workers can come and go.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And so I find that if I ask for several single-machine jobs I can easily cobble
together a sizable cluster that starts very quickly. In practice this looks
like the following:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;      &lt;span class="c1"&gt;# only ask for one machine&lt;/span&gt;
&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;  &lt;span class="c1"&gt;# ask for one more machine&lt;/span&gt;
&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;  &lt;span class="c1"&gt;# ask for one more machine&lt;/span&gt;
&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;  &lt;span class="c1"&gt;# ask for one more machine&lt;/span&gt;
&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;  &lt;span class="c1"&gt;# ask for one more machine&lt;/span&gt;
&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;  &lt;span class="c1"&gt;# ask for one more machine&lt;/span&gt;
&lt;span class="n"&gt;qsub&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;  &lt;span class="c1"&gt;# ask for one more machine&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Our main job has a wall time of about an hour. The workers have shorter wall
times. They can come and go as needed throughout the computation as our
computational needs change.&lt;/p&gt;
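&lt;p&gt;The same submissions could be generated from a small script rather than typed by hand. This is only a sketch: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;submission_commands&lt;/span&gt;&lt;/code&gt; is a hypothetical helper, the script names are the ones used above, and the commands are printed rather than submitted.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
def submission_commands(n_workers):
    """Build the qsub command lines for one scheduler job plus n_workers worker jobs."""
    commands = [["qsub", "start-dask.sh"]]  # scheduler job: one machine
    commands += [["qsub", "add-one-worker.sh"] for _ in range(n_workers)]
    return commands


if __name__ == "__main__":
    # Print rather than submit, so the sketch is safe to run anywhere.
    for argv in submission_commands(6):
        print(" ".join(argv))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;On a PBS system each list could be passed to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;subprocess.run&lt;/span&gt;&lt;/code&gt; instead of being printed.&lt;/p&gt;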
&lt;/section&gt;
&lt;section id="jupyter-lab-and-web-frontends"&gt;
&lt;h1&gt;Jupyter Lab and Web Frontends&lt;/h1&gt;
&lt;p&gt;Our scientific collaborators enjoy building Jupyter notebooks of their work.
This allows them to manage their code, scientific thoughts, and visual outputs
all at once and for them serves as an artifact that they can share with their
scientific teams and collaborators. To help them with this we start a Jupyter
server on the same machine in their allocation that is running the Dask
scheduler. We then provide them with SSH-tunneling lines that they can
copy-and-paste to get access to the Jupyter server from their personal
computer.&lt;/p&gt;
&lt;p&gt;We’ve been using the new Jupyter Lab rather than the classic notebook. This is
especially convenient because it restores much of the interactive
experience that users lose by not working on their local machine. They get a
file browser, terminals, easy visualization of text files, and so on without
having to repeatedly SSH into the HPC system. We get all of this functionality
on a single connection and with an intuitive Jupyter interface.&lt;/p&gt;
&lt;p&gt;For now we give them a script to set all of this up. It uses Dask to start
Jupyter Lab on the scheduler’s machine and then prints out the SSH-tunneling line.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scheduler.json&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;socket&lt;/span&gt;
&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on_scheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gethostname&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;start_jlab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dask_scheduler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;subprocess&lt;/span&gt;
    &lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;jupyter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;lab&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;--ip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;--no-browser&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;dask_scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jlab_proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proc&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on_scheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_jlab&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ssh -N -L 8787:&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s2"&gt;:8787 -L 8888:&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s2"&gt;:8888 -L 8789:&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s2"&gt;:8789 cheyenne.ucar.edu&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
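&lt;p&gt;The tunneling line printed at the end of that script can be factored into a small function if the ports or login node differ on another system. This is a sketch under assumptions: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tunnel_command&lt;/span&gt;&lt;/code&gt; is a hypothetical helper and the node name in the usage line is made up; the default ports and login node are the ones used in the script above.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
def tunnel_command(host, login_node="cheyenne.ucar.edu", ports=(8787, 8888, 8789)):
    """Return an ssh command that forwards each port on host through the login node."""
    # One -L local:host:remote clause per port, using the same number on both ends.
    forwards = " ".join("-L {0}:{1}:{0}".format(port, host) for port in ports)
    return "ssh -N {} {}".format(forwards, login_node)


if __name__ == "__main__":
    # "r1i0n12" is a made-up compute node name used only for illustration.
    print(tunnel_command("r1i0n12"))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;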
&lt;p&gt;Long term we would like to switch to an entirely point-and-click interface
(perhaps something like JupyterHub) but this will require additional thinking
about deploying distributed resources along with the Jupyter server instance.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="network-performance-on-infiniband"&gt;
&lt;h1&gt;Network Performance on Infiniband&lt;/h1&gt;
&lt;p&gt;The intended computations move several terabytes across the cluster.
On this cluster Dask gets about 1GB/s simultaneous read/write network bandwidth
per machine using the high-speed Infiniband network. For any commodity or
cloud-based system this is &lt;em&gt;very fast&lt;/em&gt; (about 10x faster than what I observe on
Amazon). However, for a supercomputer this is only about 30% of what’s
possible (see &lt;a class="reference external" href="https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne"&gt;hardware specs&lt;/a&gt;).&lt;/p&gt;
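&lt;p&gt;To put those numbers in perspective, here is a back-of-the-envelope calculation. It assumes decimal units, a sustained rate with no latency or contention, and treats the 3GB/s figure from the topic list at the top of this post as the achievable peak; none of these values were measured for this sketch.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
def transfer_seconds(total_bytes, bytes_per_second):
    """Time to move total_bytes at a sustained rate, ignoring latency and contention."""
    return total_bytes / bytes_per_second


GB = 10**9
TB = 10**12

observed = 1 * GB  # per-machine rate reported in this post
peak = 3 * GB      # assumed peak, taken from the 3GB/s figure quoted earlier

if __name__ == "__main__":
    # One terabyte through a single machine at each rate, in seconds.
    print(transfer_seconds(1 * TB, observed))
    print(transfer_seconds(1 * TB, peak))
    # Observed rate as a percentage of the assumed peak.
    print(round(100 * observed / peak))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;At the observed rate each terabyte that passes through one machine costs roughly a quarter of an hour, which is why closing the gap matters for workloads that move several terabytes.&lt;/p&gt;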
&lt;p&gt;I suspect that this is due to byte-handling in Tornado, the networking library
that Dask uses under the hood. The following image shows the diagnostic
dashboard for one worker after a communication-heavy workload. We see 1GB/s
for both read and write. We also see 100% CPU usage.&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/pangeo-network.png"&gt;&lt;img src="/images/pangeo-network.png" width="70%"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Network performance is a big question for HPC users looking at Dask. If we can
get near MPI bandwidth then that may help to reduce concerns for this
performance-oriented community.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/issues/6"&gt;GitHub issue for this project&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/tornadoweb/tornado/issues/2147"&gt;GitHub issue for Tornado&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a class="reference external" href="https://stackoverflow.com/questions/43881157/how-do-i-use-an-infiniband-network-with-dask"&gt;&lt;em&gt;How do I use Infiniband network with Dask?&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="xarray-and-dask-distributed"&gt;
&lt;h1&gt;XArray and Dask.distributed&lt;/h1&gt;
&lt;p&gt;XArray was the first major project to use Dask internally. This early
integration was critical to prove out Dask’s internals with user feedback.
However it also means that some parts of XArray were designed well before some
of the newer parts of Dask, notably the asynchronous distributed scheduling
features.&lt;/p&gt;
&lt;p&gt;XArray can still use Dask on a distributed cluster, but only with the subset of
features that are also available with the single machine scheduler. This means
that persisting data in distributed RAM, parallel debugging, publishing shared
datasets, and so on all require significantly more work today with XArray than
they should.&lt;/p&gt;
&lt;p&gt;To address this we plan to update XArray to follow a newly proposed &lt;a class="reference external" href="https://github.com/dask/dask/pull/1068#issuecomment-326591640"&gt;Dask
interface&lt;/a&gt;.
This is complex enough to handle all Dask scheduling features, but lightweight
enough not to actually require any dependence on the Dask library itself.
(Work by &lt;a class="reference external" href="http://jcrist.github.io/"&gt;Jim Crist&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;We will also eventually need to look at reducing overhead for inspecting
several NetCDF files, but we haven’t yet run into this, so I plan to wait.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;We think we’re at a decent point for scientific users to start playing with the
system. We have a &lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/wiki/Getting-Started-with-Dask-on-Cheyenne"&gt;Getting Started with Dask on Cheyenne&lt;/a&gt;
wiki page that our first set of guinea pig users have successfully run through
without much trouble. We’ve also identified a number of issues that the
software developers can work on while the scientific teams spin up.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/tornadoweb/tornado/issues/2147"&gt;Zero copy Tornado writes&lt;/a&gt; to &lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/issues/6"&gt;improve network bandwidth&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/issues/5"&gt;Enable Dask.distributed features in XArray&lt;/a&gt; by &lt;a class="reference external" href="https://github.com/dask/dask/pull/1068"&gt;formalizing dask’s expected interface&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/issues/8"&gt;Dynamic deployments on batch job schedulers&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We would love to engage other collaborators throughout this process. If you or
your group work on related problems we would love to hear from you. This grant
isn’t just about serving the scientific needs of researchers at Columbia and
NCAR, but about building long-term systems that can benefit the entire
atmospheric and oceanographic community. Please engage on the
&lt;a class="reference external" href="https://github.com/pangeo-data/pangeo-discussion/issues"&gt;Pangeo GitHub issue tracker&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2017/09/18/pangeo-1/"/>
    <summary>This work is supported by Anaconda Inc. and the NSF
EarthCube program.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="pangeo" label="pangeo"/>
    <category term="scipy" label="scipy"/>
    <published>2017-09-18T00:00:00+00:00</published>
  </entry>
</feed>
