<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged HPC</title>
  <updated>2026-03-05T15:05:23.140276+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/hpc/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/09/13/jupyter-on-dask/</id>
    <title>Co-locating a Jupyter Server and Dask Scheduler</title>
    <updated>2019-09-13T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;If you want, you can have Dask set up a Jupyter notebook server for you,
co-located with the Dask scheduler. There are many ways to do this, but this
blog post lists two.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 13)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="first-why-would-you-do-this"&gt;

&lt;p&gt;Sometimes people inside of large institutions have complex deployment pains.
It takes them a while to stand up a process running on a machine in their
cluster, with all of the appropriate networking ports open and such.
In that situation, it can sometimes be nice to do this just once, say for Dask,
rather than twice, say for Dask and for Jupyter.&lt;/p&gt;
&lt;p&gt;Probably in these cases people should invest in a long term solution like
&lt;a class="reference external" href="https://jupyter.org/hub"&gt;JupyterHub&lt;/a&gt;,
or one of its enterprise variants,
but this blogpost gives a couple of hacks in the meantime.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 26)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hack-1-create-a-jupyter-server-from-a-python-function-call"&gt;
&lt;h1&gt;Hack 1: Create a Jupyter server from a Python function call&lt;/h1&gt;
&lt;p&gt;If your Dask scheduler is already running, connect to it with a Client and run
a Python function that starts up a Jupyter server.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;scheduler-address:8786&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;start_juptyer_server&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;notebook.notebookapp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;  &lt;span class="c1"&gt;# add command line args here if you want&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on_scheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_jupyter_server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you have a complex networking setup (maybe you’re on the cloud or HPC and
had to open up a port explicitly) then you might want to install
&lt;a class="reference external" href="https://jupyter-server-proxy.readthedocs.io/en/latest/"&gt;jupyter-server-proxy&lt;/a&gt;
(which Dask also uses by default if installed), and then go to
&lt;a class="reference external" href="https://example.com"&gt;http://scheduler-address:8787/proxy/8888&lt;/a&gt; . The Dask dashboard can route your
connection to Jupyter (Jupyter is also kind enough to do the same for Dask if
it is the main service).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 52)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="hack-2-preload-script"&gt;
&lt;h1&gt;Hack 2: Preload script&lt;/h1&gt;
&lt;p&gt;This is also a great opportunity to learn about the various ways of &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup/custom-startup.html"&gt;adding
custom startup and teardown&lt;/a&gt;.
One such way, is a preload script like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# jupyter-preload.py&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;notebook.notebookapp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;dask_setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;dask-scheduler&lt;span class="w"&gt; &lt;/span&gt;--preload&lt;span class="w"&gt; &lt;/span&gt;jupyter-preload.py
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That script will run at an appropriate time during scheduler startup. You can
also put this into configuration&lt;/p&gt;
&lt;div class="highlight-yaml notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;distributed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;preload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/path/to/jupyter-preload.py&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 80)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="really-though-you-should-use-something-else"&gt;
&lt;h1&gt;Really though, you should use something else&lt;/h1&gt;
&lt;p&gt;This is mostly a hack. If you’re at an institution then you should ask for
something like &lt;a class="reference external" href="https://jupyter.org/hub"&gt;JuptyerHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Or, you might also want to run this in a separate subprocess, so that Jupyter
and the Dask scheduler don’t collide with each other. This shouldn’t be so
much of a problem (they’re both pretty light weight), but isolating them
probably makes sense.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/13/jupyter-on-dask.md&lt;/span&gt;, line 90)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="thanks-nick"&gt;
&lt;h1&gt;Thanks Nick!&lt;/h1&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/bollwyvl"&gt;Nick Bollweg&lt;/a&gt;, who answered a &lt;a class="reference external" href="https://github.com/jupyter/notebook/issues/4873"&gt;questions on this topic here&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/09/13/jupyter-on-dask/"/>
    <summary>If you want, you can have Dask set up a Jupyter notebook server for you,
co-located with the Dask scheduler. There are many ways to do this, but this
blog post lists two.</summary>
    <category term="HPC" label="HPC"/>
    <published>2019-09-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/08/28/dask-on-summit/</id>
    <title>Dask on HPC: a case study</title>
    <updated>2019-08-28T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">&lt;p&gt;Dask is deployed on traditional HPC machines with increasing frequency.
In the past week I’ve personally helped four different groups get set up.
This is a surprisingly individual process,
because every HPC machine has its own idiosyncrasies.
Each machine uses a job scheduler like SLURM/PBS/SGE/LSF/…, a network file
system, and fast interconnect, but each of those sub-systems have slightly
different policies on a machine-by-machine basis, which is where things get tricky.&lt;/p&gt;
&lt;p&gt;Typically we can solve these problems in about 30 minutes if we have both:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Someone familiar with the machine, like a power-user or an IT administrator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Someone familiar with setting up Dask&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These systems span a large range of scale. At different ends of this scale
this week I’ve seen both:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;A small in-house 24-node SLURM cluster for research work inside of a
bio-imaging lab&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Summit, the world’s most powerful supercomputer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post I’m going to share a few notes of what I went through in dealing
with Summit, which was particularly troublesome. Hopefully this gives a sense
for the kinds of situations that arise. These tips likely don’t apply to your
particular system, but hopefully they give a flavor of what can go wrong,
and the processes by which we track things down.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 35)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="power-architecture"&gt;

&lt;p&gt;First, Summit is an IBM PowerPC machine, meaning that packages compiled on
normal Intel chips won’t work. Fortunately, Anaconda maintains a download of
their distribution that works well with the Power architecture, so that gave me
a good starting point.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.anaconda.com/distribution/#linux"&gt;https://www.anaconda.com/distribution/#linux&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Packages do seem to be a few months older than for the normal distribution, but
I can live with that.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 47)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="install-dask-jobqueue-and-configure-basic-information"&gt;
&lt;h1&gt;Install Dask-Jobqueue and configure basic information&lt;/h1&gt;
&lt;p&gt;We need to tell Dask how many cores and how much memory is on each machine.
This process is fairly straightforward, is well documented at
&lt;a class="reference external" href="https://jobqueue.dask.org"&gt;jobqueue.dask.org&lt;/a&gt; with an informative screencast,
and even self-directing with error messages.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;specify&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;cores&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="n"&gt;like&lt;/span&gt; &lt;span class="err"&gt;``&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="err"&gt;``&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I’m going to skip this section for now because, generally, novice users are
able to handle this. For more information, consider watching this YouTube
video (30m).&lt;/p&gt;
&lt;iframe width="560" height="315"
        src="https://www.youtube.com/embed/FXsgmwpRExM?rel=0"
        frameborder="0" allow="autoplay; encrypted-media"
        allowfullscreen&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 69)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="invalid-operations-in-the-job-script"&gt;
&lt;h1&gt;Invalid operations in the job script&lt;/h1&gt;
&lt;p&gt;So we make a cluster object with all of our information, we call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.scale&lt;/span&gt;&lt;/code&gt; and
we get some error message from the job scheduler.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;600 GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GEN119&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;00:30&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ask for three nodes&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Command:
bsub /tmp/tmp4874eufw.sh
stdout:

Typical usage:
  bsub [LSF arguments] jobscript
  bsub [LSF arguments] -Is $SHELL
  bsub -h[elp] [options]
  bsub -V

NOTES:
 * All jobs must specify a walltime (-W) and project id (-P)
 * Standard jobs must specify a node count (-nnodes) or -ln_slots. These jobs cannot specify a resource string (-R).
 * Expert mode jobs (-csm y) must specify a resource string and cannot specify -nnodes or -ln_slots.

stderr:
ERROR: Resource strings (-R) are not supported in easy mode. Please resubmit without a resource string.
ERROR: -n is no longer supported. Please request nodes with -nnodes.
ERROR: No nodes requested. Please request nodes with -nnodes.
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask-Jobqueue tried to generate a sensible job script from the inputs that you
provided, but the resource manager that you’re using may have additional
policies that are unique to that cluster. We debug this by looking at the
generated script, and comparing against scripts that are known to work on the
HPC machine.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_script&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="ch"&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class="c1"&gt;#BSUB -J dask-worker&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -P GEN119&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -n 128&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -R &amp;quot;span[hosts=1]&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -M 600000&lt;/span&gt;
&lt;span class="c1"&gt;#BSUB -W 00:30&lt;/span&gt;
&lt;span class="nv"&gt;JOB_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LSB_JOBID&lt;/span&gt;&lt;span class="p"&gt;%.*&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;

/ccs/home/mrocklin/anaconda/bin/python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;distributed.cli.dask_worker&lt;span class="w"&gt; &lt;/span&gt;tcp://scheduler:8786&lt;span class="w"&gt; &lt;/span&gt;--nthreads&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--nprocs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--memory-limit&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;75&lt;/span&gt;.00GB&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;--nanny&lt;span class="w"&gt; &lt;/span&gt;--death-timeout&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;60&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--interface&lt;span class="w"&gt; &lt;/span&gt;ib0&lt;span class="w"&gt; &lt;/span&gt;--interface&lt;span class="w"&gt; &lt;/span&gt;ib0
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;After comparing notes with existing scripts that we know to work on Summit,
we modify keywords to add and remove certain lines in the header.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;500 GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GEN119&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;00:30&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-nnodes 1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;          &lt;span class="c1"&gt;# &amp;lt;--- new!&lt;/span&gt;
    &lt;span class="n"&gt;header_skip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-n &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;--- new!&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And when we call scale this seems to make LSF happy. It no longer dumps out
large error messages.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# things seem to pass&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 153)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="workers-don-t-connect-to-the-scheduler"&gt;
&lt;h1&gt;Workers don’t connect to the Scheduler&lt;/h1&gt;
&lt;p&gt;So things seem fine from LSF’s perspective, but when we connect up a client to
our cluster we don’t see anything arriving.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;Client: scheduler=&amp;#39;tcp://10.41.0.34:41107&amp;#39; processes=0 cores=0&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Two things to check, have the jobs actually made it through the queue?
Typically we use a resource manager operation, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;qstat&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;squeue&lt;/span&gt;&lt;/code&gt;, or
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bjobs&lt;/span&gt;&lt;/code&gt; for this. Maybe our jobs are trapped in the queue?&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ bash
JOBID   USER       STAT   SLOTS    QUEUE       START_TIME    FINISH_TIME   JOB_NAME
600785  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
600786  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
600784  mrocklin   RUN    43       batch       Aug 26 13:11  Aug 26 13:41  dask-worker
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Nope, it looks like they’re in a running state. Now we go and look at their
logs. It can sometimes be tricky to track down the log files from your jobs,
but your IT administrator should know where they are. Often they’re where you
ran your job from, and have the Job ID in the filename.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ cat dask-worker.600784.err
distributed.worker - INFO -       Start worker at: tcp://128.219.134.81:44053
distributed.worker - INFO -          Listening to: tcp://128.219.134.81:44053
distributed.worker - INFO -          dashboard at:       128.219.134.81:34583
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         16
distributed.worker - INFO -                Memory:                   75.00 GB
distributed.worker - INFO -       Local Directory: /autofs/nccs-svm1_home1/mrocklin/worker-ybnhk4ib
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
distributed.worker - INFO - Waiting to connect to: tcp://128.219.134.74:34153
...
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So the worker processes have started, but they’re having difficulty connecting
to the scheduler. When we ask at IT administrator they identify the address
here as on the wrong network interface:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="mf"&gt;128.219.134.74&lt;/span&gt;  &lt;span class="o"&gt;&amp;lt;---&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;accessible&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So we run &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ifconfig&lt;/span&gt;&lt;/code&gt;, and find the infiniband network interface, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ib0&lt;/span&gt;&lt;/code&gt;, which
is more broadly accessible.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LSFCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;500 GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GEN119&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;00:30&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-nnodes 1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;header_skip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-n &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ib0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;# &amp;lt;--- new!&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We try this out and still, no luck :(&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 227)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="interactive-nodes"&gt;
&lt;h1&gt;Interactive nodes&lt;/h1&gt;
&lt;p&gt;The expert user then says “Oh, our login nodes are pretty locked-down, lets try
this from an interactive compute node. Things tend to work better there”. We
run some arcane bash command (I’ve never seen two of these that look alike so
I’m going to omit it here), and things magically start working. Hooray!&lt;/p&gt;
&lt;p&gt;We run a tiny Dask computation just to prove that we can do some work.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;11&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Actually, it turns out that we were eventually able to get things running from
the login nodes on Summit using a slightly different &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bsub&lt;/span&gt;&lt;/code&gt; command in LSF, but
I’m going to omit details here because we’re fixing this in Dask and it’s
unlikely to affect future users (I hope?). Locked down login nodes remain a
common cause of no connections across a variety of systems. I’ll say something
like 30% of the systems that I interact with.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 249)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="ssh-tunneling"&gt;
&lt;h1&gt;SSH Tunneling&lt;/h1&gt;
&lt;p&gt;It’s important to get the dashboard up and running so that you can see what’s
going on. Typically we do this with SSH tunnelling. Most HPC people know how
to do this and it’s covered in the Youtube screencast above, so I’m going to
skip it here.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 256)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="jupyter-lab"&gt;
&lt;h1&gt;Jupyter Lab&lt;/h1&gt;
&lt;p&gt;Many interactive Dask users on HPC today are moving towards using JupyterLab.
This choice gives them a notebook, terminals, file browser, and Dask’s
dashboard all in a single web tab. This greatly reduces the number of times
they have to SSH in, and, with the magic of web proxies, means that they only
need to tunnel once.&lt;/p&gt;
&lt;p&gt;I conda installed JupyterLab and a proxy library, and then tried to
&lt;a class="reference external" href="https://github.com/dask/dask-labextension#installation"&gt;set up the Dask JupyterLab extension&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;jupyterlab&lt;/span&gt;
&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;  &lt;span class="c1"&gt;# to route dashboard through Jupyter&amp;#39;s port&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Next, we’re going to install the
&lt;a class="reference external" href="https://github.com/dask/dask-labextension"&gt;Dask Labextension&lt;/a&gt; into JupyterLab
in order to get the Dask Dashboard directly into our Jupyter session..
For that, we need &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;nodejs&lt;/span&gt;&lt;/code&gt; in order to install things into JupyterLab.
I thought that this was going to be a pain, given the Power architecture, but
amazingly, this also seems to be in Anaconda’s default Power channel.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;mrocklin@login2.summit $ conda install nodejs  # Thanks conda packaging devs!
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then I install Dask-Labextension, which is both a Python and a JavaScript
package:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask_labextension&lt;/span&gt;
&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;labextension&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;labextension&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then I set up a password for my Jupyter sessions&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;notebook&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And run JupyterLab in a network friendly way&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;mrocklin@login2.summit $ jupyter lab --no-browser --ip=&amp;quot;login2&amp;quot;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And set up a single SSH tunnel from my home machine to the login node&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Be sure to match the login node&amp;#39;s hostname and the Jupyter port below

mrocklin@my-laptop $ ssh -L 8888:login2:8888 summit.olcf.ornl.gov
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I can now connect to Jupyter from my laptop by navigating to
&lt;a class="reference external" href="http://localhost:8888"&gt;http://localhost:8888&lt;/a&gt; , run the cluster commands above in a notebook, and
things work great. Additionally, thanks to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;jupyter-server-proxy&lt;/span&gt;&lt;/code&gt;, Dask’s
dashboard is also available at &lt;a class="reference external" href="http://localhost:8888/proxy/####/status"&gt;http://localhost:8888/proxy/####/status&lt;/a&gt; , where
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;####&lt;/span&gt;&lt;/code&gt; is the port currently hosting Dask’s dashboard. You can probably find
this by looking at &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cluster.dashboard_link&lt;/span&gt;&lt;/code&gt;. It defaults to 8787, but if
you’ve started a bunch of Dask schedulers on your system recently it’s possible
that that port is taken up and so Dask had to resort to using random ports.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 320)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="configuration-files"&gt;
&lt;h1&gt;Configuration files&lt;/h1&gt;
&lt;p&gt;I don’t want to keep typing all of these commands, so now I put things into a
single configuration file, and plop that file into &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;~/.config/dask/summit.yaml&lt;/span&gt;&lt;/code&gt;
(any filename that ends in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.yaml&lt;/span&gt;&lt;/code&gt; will do).&lt;/p&gt;
&lt;div class="highlight-yaml notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;jobqueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;lsf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;128&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;processes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;8&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;500 GB&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;job-extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-nnodes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ib0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;header-skip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-n&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;

&lt;span class="nt"&gt;labextension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;dask_jobqueue&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;LSFCluster&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;your-project-id&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 349)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="slow-worker-startup"&gt;
&lt;h1&gt;Slow worker startup&lt;/h1&gt;
&lt;p&gt;Now that things are easier to use I find myself using the system more, and some
other problems arise.&lt;/p&gt;
&lt;p&gt;I notice that it takes a long time to start up a worker. It seems to hang
intermittently during startup, so I add a few lines to
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed/__init__.py&lt;/span&gt;&lt;/code&gt; to print out the state of the main Python thread
every second, to see where this is happening:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;threading&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;

&lt;span class="n"&gt;main_thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_ident&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_frames&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;main_thread&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;thraed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This prints out a traceback that brings us to this code in Dask:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_locking_enabled&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;DIR_LOCK_EXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Locking &lt;/span&gt;&lt;span class="si"&gt;%r&lt;/span&gt;&lt;span class="s2"&gt;...&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Avoid a race condition before locking the file&lt;/span&gt;
        &lt;span class="c1"&gt;# by taking the global lock&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_global_lock&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;locket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock_file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;It looks like Dask is trying to use a file-based lock.
Unfortunately some NFS systems don’t like file-based locks, or handle them very
slowly. In the case of Summit, the home directory is actually mounted
read-only from the compute nodes, so a file-based lock will simply fail.
Looking up the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;is_locking_enabled&lt;/span&gt;&lt;/code&gt; function we see that it checks a
configuration value.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is_locking_enabled&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;distributed.worker.use-file-locking&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So we add that to our config file. At the same time I switch from the
forkserver to spawn multiprocessing method (I thought that this might help, but
it didn’t), which is relatively harmless.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;distributed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;spawn&lt;/span&gt;
    &lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;locking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;

&lt;span class="n"&gt;jobqueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;lsf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
    &lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="n"&gt;GB&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-nnodes 1&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ib0&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-R&amp;quot;&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-n &amp;quot;&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;-M&amp;quot;&lt;/span&gt;

&lt;span class="n"&gt;labextension&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;dask_jobqueue&amp;#39;&lt;/span&gt;
     &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;LSFCluster&amp;#39;&lt;/span&gt;
     &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
     &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 435)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This post outlines many issues that I ran into when getting Dask to run on
one specific HPC system. These problems aren’t universal, so you may not run
into them, but they’re also not super-rare. Mostly my objective in writing
this up is to give people a sense of the sorts of problems that arise when
Dask and an HPC system interact.&lt;/p&gt;
&lt;p&gt;None of the problems above are that serious. They’ve all happened before and
they all have solutions that can be written down in a configuration file.
Finding what the problem is though can be challenging, and often requires the
combined expertise of individuals that are experienced with Dask and with that
particular HPC system.&lt;/p&gt;
&lt;p&gt;There are a few configuration files posted here
&lt;a class="reference external" href="https://jobqueue.dask.org/en/latest/configurations.html"&gt;jobqueue.dask.org/en/latest/configurations.html&lt;/a&gt;, which may be informative. The &lt;a class="reference external" href="https://github.com/dask/dask-jobqueue/issues"&gt;Dask Jobqueue issue tracker&lt;/a&gt; is also a fairly friendly place, full of both IT professionals and Dask experts.&lt;/p&gt;
&lt;p&gt;Also, as a reminder, you don’t need to have an HPC machine in order to use
Dask. Dask is conveniently deployable from other Cloud, Hadoop, and local
systems. See the &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup.html"&gt;Dask setup
documentation&lt;/a&gt; for more
information.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 458)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="future-work-gpus"&gt;
&lt;h1&gt;Future work: GPUs&lt;/h1&gt;
&lt;p&gt;Summit is fast because it has a ton of GPUs. I’m going to work on that next,
but that will probably cover enough content to fill up a whole other blogpost :)&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/08/28/dask-on-summit.md&lt;/span&gt;, line 463)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="branches"&gt;
&lt;h1&gt;Branches&lt;/h1&gt;
&lt;p&gt;For anyone playing along at home (or on Summit). I’m operating from the
following development branches:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/dask/distributed&amp;#64;master"&gt;dask/distributed&amp;#64;master&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="github reference external" href="https://github.com/mrocklin/dask-jobqueue&amp;#64;spec-rewrite"&gt;mrocklin/dask-jobqueue&amp;#64;spec-rewrite&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although hopefully within a month of writing this article, everything should be
in a nicely released state.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/08/28/dask-on-summit/"/>
    <summary>Dask is deployed on traditional HPC machines with increasing frequency.
In the past week I’ve personally helped four different groups get set up.
This is a surprisingly individual process,
because every HPC machine has its own idiosyncrasies.
Each machine uses a job scheduler like SLURM/PBS/SGE/LSF/…, a network file
system, and fast interconnect, but each of those sub-systems have slightly
different policies on a machine-by-machine basis, which is where things get tricky.</summary>
    <category term="HPC" label="HPC"/>
    <published>2019-08-28T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/10/08/Dask-Jobqueue/</id>
    <title>Dask-jobqueue</title>
    <updated>2018-10-08T00:00:00+00:00</updated>
    <author>
      <name>Joe Hamman</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work was done in collaboration with &lt;a class="reference external" href="https://github.com/mrocklin"&gt;Matthew Rocklin&lt;/a&gt; (Anaconda), Jim Edwards (NCAR), &lt;a class="reference external" href="https://github.com/guillaumeeb"&gt;Guillaume Eynard-Bontemps&lt;/a&gt; (CNES), and &lt;a class="reference external" href="https://github.com/lesteve"&gt;Loïc Estève&lt;/a&gt; (INRIA), and is supported, in part, by the US National Science Foundation &lt;a class="reference external" href="https://www.earthcube.org/"&gt;Earth Cube program&lt;/a&gt;. The dask-jobqueue package is a spinoff of the &lt;a class="reference external" href="https://medium.com/pangeo"&gt;Pangeo Project&lt;/a&gt;. This blogpost was previously published &lt;a class="reference external" href="https://medium.com/pangeo/dask-jobqueue-d7754e42ca53"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TLDR;&lt;/strong&gt; &lt;em&gt;Dask-jobqueue&lt;/em&gt; allows you to seamlessly deploy &lt;a class="reference external" href="https://dask.org/"&gt;dask&lt;/a&gt; on HPC clusters that use a variety of job queuing systems such as PBS, Slurm, SGE, or LSF. Dask-jobqueue provides a &lt;em&gt;Pythonic&lt;/em&gt; user interface that manages dask workers/clusters through the submission, execution, and deletion of individual jobs on a HPC system. It gives users the ability to interactively scale workloads across large HPC systems; turning an interactive &lt;a class="reference external" href="http://jupyter.org/"&gt;Jupyter&lt;/a&gt; Notebook into a powerful tool for scalable computation on very large datasets.&lt;/p&gt;
&lt;p&gt;Install with:&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;conda-forge&lt;span class="w"&gt; &lt;/span&gt;dask-jobqueue
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;dask-jobqueue
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And checkout the dask-jobqueue documentation: &lt;a class="reference external" href="http://jobqueue.dask.org"&gt;http://jobqueue.dask.org&lt;/a&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/10/08/Dask-Jobqueue.md&lt;/span&gt;, line 28)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="introduction"&gt;

&lt;p&gt;Large high-performance computer (HPC) clusters are ubiquitous throughout the computational sciences. These HPC systems include powerful hardware, including many large compute nodes, high-speed interconnects, and parallel file systems. An example of such systems that we use at &lt;a class="reference external" href="https://ncar.ucar.edu/"&gt;NCAR&lt;/a&gt; is named &lt;a class="reference external" href="https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne"&gt;Cheyenne&lt;/a&gt;. Cheyenne is a fairly large machine, with about 150k cores and over 300 TB of total memory.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cheyenne is a 5.34-petaflops, high-performance computer operated by NCAR." src="https://cdn-images-1.medium.com/max/2000/1*Jqm612rTcdWFkmcZWhcrTw.jpeg" /&gt;&lt;em&gt;Cheyenne is a 5.34-petaflops, high-performance computer operated by NCAR.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;These systems frequently use a job queueing system, such as PBS, Slurm, or SGE, to manage the queueing and execution of many concurrent jobs from numerous users. A “job” is a single execution of a program that is to be run on some set of resources on the user’s HPC system. These jobs are often submitted via the command line:&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;qsub&lt;span class="w"&gt; &lt;/span&gt;do_thing_a.sh
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Where do_thing_a.sh is a shell script that might look something like this:&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="ch"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -N thing_a&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -q premium&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -A 123456789&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -l select=1:ncpus=36:mem=109G&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;“doing&lt;span class="w"&gt; &lt;/span&gt;thing&lt;span class="w"&gt; &lt;/span&gt;A”
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this example “-N” specifies the name of this job, “-q” specifies the queue where the job should be run, “-A” specifies a project code to bill for the CPU time used while the job is run, and “-l” specifies the hardware specifications for this job. Each job queueing system has slightly different syntax for configuring and submitting these jobs.&lt;/p&gt;
&lt;p&gt;This interface has led to the development of a few common workflow patterns:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;MPI if you want to scale&lt;/em&gt;. MPI stands for the Message Passing Interface. It is a widely adopted interface allowing parallel computation across traditional HPC clusters. Many large computational models are written in languages like C and Fortran and use MPI to manage their parallel execution. For the old-timers out there, this is the go-to solution when it comes time to scale complex computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Batch it&lt;/em&gt;. It is quite common for scientific processing pipelines to include a few steps that can be easily parallelized by submitting multiple jobs in parallel. Maybe you want to “do_thing_a.sh” 500 times with slightly different inputs — easy, just submit all the jobs separately (or in what some queueing systems refer to as “array-job”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Serial is still okay&lt;/em&gt;. Computers are pretty fast these days, right? Maybe you don’t need to parallelize your programing at all. Okay, so keep it serial and get some coffee while your job is running.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/10/08/Dask-Jobqueue.md&lt;/span&gt;, line 62)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-problem"&gt;
&lt;h1&gt;The Problem&lt;/h1&gt;
&lt;p&gt;None of the workflow patterns listed above allow for interactive analysis on very large data analysis. When I’m prototyping new processing method, I often want to work interactively, say in a Jupyter Notebook. Writing MPI code on the fly is hard and expensive, batch jobs are inherently not interactive, and serial just won’t do when I start working on many TBs of data. Our experience is that these workflows tend to be fairly inelegant and difficult to transfer between applications, yielding lots of duplicated effort along the way.&lt;/p&gt;
&lt;p&gt;One of the aims of the Pangeo project is to facilitate interactive data on very large datasets. Pangeo leverages Jupyter and dask, along with a number of more domain specific packages like &lt;a class="reference external" href="http://xarray.pydata.org"&gt;xarray&lt;/a&gt; to make this possible. The problem is we didn’t have a particularly palatable method for deploying dask on our HPC clusters.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/10/08/Dask-Jobqueue.md&lt;/span&gt;, line 68)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-system"&gt;
&lt;h1&gt;The System&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Jupyter Notebooks&lt;/em&gt; are web applications that support interactive code execution, display of figures and animations, and in-line explanatory text and equations. They are quickly becoming the standard open-source format for interactive computing in Python.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Dask&lt;/em&gt; is a library for parallel computing that coordinates well with Python’s existing scientific software ecosystem, including libraries like &lt;a class="reference external" href="http://www.numpy.org/"&gt;NumPy&lt;/a&gt;, &lt;a class="reference external" href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;, &lt;a class="reference external" href="http://scikit-learn.org/stable/"&gt;Scikit-Learn&lt;/a&gt;, and xarray. In many cases, it offers users the ability to take existing workflows and quickly scale them to much larger applications. &lt;a class="reference external" href="http://distributed.dask.org"&gt;*Dask-distributed&lt;/a&gt;* is an extension of dask that facilitates parallel execution across many computers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Dask-jobqueue&lt;/em&gt; is a new Python package that we’ve built to facilitate the deployment of &lt;em&gt;dask&lt;/em&gt; on HPC clusters and interfacing with a number of job queuing systems. Its usage is concise and Pythonic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;108GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;premium&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="whats-happening-under-the-hood"&gt;
&lt;h2&gt;What’s happening under the hood?&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;In the call to PBSCluster() we are telling dask-jobqueue how we want to configure each job. In this case, we set each job to have 1 &lt;em&gt;Worker&lt;/em&gt;, each using 36 cores (threads) and 108 GB of memory. We also tell the PBS queueing system we’d like to submit this job to the “premium” queue. This step also starts a Scheduler to manage workers that we’ll add later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is not until we call the cluster.scale() method that we interact with the PBS system. Here we start 10 workers, or equivalently 10 PBS jobs. For each job, dask-jobqueue creates a shell command similar to the one above (except dask-worker is called instead of echo) and submits the job via a subprocess call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we connect to the cluster by instantiating the Client class. From here, the rest of our code looks just as it would if we were using one of &lt;a class="reference external" href="http://docs.dask.org/en/stable/scheduler-overview.html"&gt;dask’s local schedulers&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dask-jobqueue is easily customizable to help users capitalize on advanced HPC features. A more complicated example that would work on NCAR’s Cheyenne super computer is:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;108GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;P48500028&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;premium&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;resource_spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;select=1:ncpus=36:mem=109G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;02:00:00&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ib0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;$TMPDIR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this example, we instruct the PBSCluster to 1) use up to 36 cores per job, 2) use 18 worker processes per job, 3) use the large memory nodes with 109 GB each, 4) use a longer walltime than is standard, 5) use the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/InfiniBand"&gt;InfiniBand&lt;/a&gt; network interface (ib0), and 6) use the fast SSD disks as its local directory space.&lt;/p&gt;
&lt;p&gt;Finally, Dask offers the ability to “autoscale” clusters based on a set of heuristics. When the cluster needs more CPU or memory, it will scale up. When the cluster has unused resources, it will scale down. Dask-jobqueue supports this with a simple interface:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maximum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this example, we tell our cluster to autoscale between 18 and 360 workers (or 1 and 20 jobs).&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/10/08/Dask-Jobqueue.md&lt;/span&gt;, line 119)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="demonstration"&gt;
&lt;h1&gt;Demonstration&lt;/h1&gt;
&lt;p&gt;We have put together a fairly comprehensive screen cast that walks users through all the steps of setting up Jupyter and Dask (and dask-jobqueue) on an HPC cluster:&lt;/p&gt;
&lt;center&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/FXsgmwpRExM" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;/center&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/10/08/Dask-Jobqueue.md&lt;/span&gt;, line 125)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusions"&gt;
&lt;h1&gt;Conclusions&lt;/h1&gt;
&lt;p&gt;Dask jobqueue makes it much easier to deploy Dask on HPC clusters. The package provides a Pythonic interface to common job-queueing systems. It is also easily customizable.&lt;/p&gt;
&lt;p&gt;The autoscaling functionality allows for a fundamentally different way to do science on HPC clusters. Start your Jupyter Notebook, instantiate your dask cluster, and then do science — let dask determine when to scale up and down depending on the computational demand. We think this bursting approach to interactive parallel computing offers many benefits.&lt;/p&gt;
&lt;p&gt;Finally, in developing dask-jobqueue, we’ve run into a few challenges that are worth mentioning.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Queueing systems are highly customizable. System administrators seem to have a lot of control over their particularly implementation of each queueing system. In practice, this means that it is often difficult to simultaneously cover all permutations of a particular queueing system. We’ve generally found that things seem to be flexible enough and welcome feedback in the cases where they are not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI testing has required a fair bit of work to setup. The target environment for using dask-jobqueue is on existing HPC clusters. In order to facilitate continuous integration testing of dask-jobqueue, we’ve had to configure multiple queueing systems (PBS, Slurm, SGE) to run in docker using Travis CI. This has been a laborious task and one we’re still working on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve built dask-jobqueue to operate in the dask-deploy framework. If you are familiar with &lt;a class="reference external" href="http://kubernetes.dask.org"&gt;dask-kubernetes&lt;/a&gt; or &lt;a class="reference external" href="http://yarn.dask.org"&gt;dask-yarn&lt;/a&gt;, you’ll recognize the basic syntax in dask-jobqueue as well. The coincident development of these dask deployment packages has recently brought up some important coordination discussions (e.g. &lt;a class="github reference external" href="https://github.com/dask/distributed/issues/2235"&gt;dask/distributed#2235&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/10/08/Dask-Jobqueue/"/>
    <summary>This work was done in collaboration with Matthew Rocklin (Anaconda), Jim Edwards (NCAR), Guillaume Eynard-Bontemps (CNES), and Loïc Estève (INRIA), and is supported, in part, by the US National Science Foundation Earth Cube program. The dask-jobqueue package is a spinoff of the Pangeo Project. This blogpost was previously published here</summary>
    <category term="HPC" label="HPC"/>
    <category term="distributed" label="distributed"/>
    <category term="jobqueue" label="jobqueue"/>
    <published>2018-10-08T00:00:00+00:00</published>
  </entry>
</feed>
