<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts by Kevin Paul (NCAR)</title>
  <updated>2026-03-05T15:05:19.913854+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/author/kevin-paul-ncar/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/06/12/dask-on-hpc/</id>
    <title>Dask on HPC</title>
    <updated>2019-06-12T00:00:00+00:00</updated>
    <author>
      <name>Joe Hamman (NCAR)</name>
    </author>
    <content type="html">&lt;p&gt;We analyze large datasets on HPC systems with Dask, a parallel computing
library that integrates well with the existing Python software ecosystem, and
works comfortably with native HPC hardware.&lt;/p&gt;
&lt;p&gt;This article explains why this approach makes sense for us.
Our motivation is to share our experiences with our colleagues,
and to highlight opportunities for future work.&lt;/p&gt;
&lt;p&gt;We start with six reasons why we use Dask,
followed by seven issues that affect us today.&lt;/p&gt;
&lt;section id="reasons-why-we-use-dask"&gt;
&lt;h1&gt;Reasons why we use Dask&lt;/h1&gt;
&lt;section id="ease-of-use"&gt;
&lt;h2&gt;1. Ease of use&lt;/h2&gt;
&lt;p&gt;Dask extends libraries like NumPy, pandas, and scikit-learn, whose APIs are already well known to scientists and engineers. It also extends simpler APIs for
multi-node multiprocessing. This makes it easy for our existing user base to
get up to speed.&lt;/p&gt;
&lt;p&gt;By abstracting the parallelism away from the user/developer, our analysis tools can be written by people who are not computer science experts, such as the scientists
themselves, so our software engineers can take on a supporting role rather than a leadership role.
Experience has shown that, with tools like Dask and Jupyter, scientists spend less time coding and more time thinking about science, as they should.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="smooth-hpc-integration"&gt;
&lt;h2&gt;2. Smooth HPC integration&lt;/h2&gt;
&lt;p&gt;With tools like &lt;a class="reference external" href="https://jobqueue.dask.org"&gt;Dask Jobqueue&lt;/a&gt; and &lt;a class="reference external" href="https://mpi.dask.org"&gt;Dask MPI&lt;/a&gt;, there is no need for the boilerplate shell scripts commonly found with job queueing systems.&lt;/p&gt;
&lt;p&gt;Dask interacts natively with our existing job schedulers (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SLURM&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SGE&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LSF&lt;/span&gt;&lt;/code&gt;/&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;PBS&lt;/span&gt;&lt;/code&gt;/…)
so there is no additional system to set up and manage between users and IT.
All the infrastructure that we need is already in place.&lt;/p&gt;
&lt;p&gt;Interactive analysis at scale is powerful, and lets
us use our existing infrastructure in new ways.
Auto scaling improves our occupancy and helps with acceptance by HPC operators / owners.
Dask’s resilience against the death of all or part of its workers offers new ways of leveraging job-preemption when co-locating classical HPC workloads with analytics jobs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="aimed-for-scientific-processing"&gt;
&lt;h2&gt;3. Aimed at Scientific Processing&lt;/h2&gt;
&lt;p&gt;In addition to being integrated with the SciPy and PyData software ecosystems,
Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like Xarray, which already have strong support for scientific data
formats and processing, and with C/C++/Fortran codes, as is common for Python libraries.&lt;/p&gt;
&lt;p&gt;This native support is one of the major advantages that we’ve seen of Dask over Apache Spark.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="versatility-of-apis"&gt;
&lt;h2&gt;4. Versatility of APIs&lt;/h2&gt;
&lt;p&gt;And yet Dask is not designed for any particular workflow, but instead can
provide infrastructure to cover a variety of different problems within an
institution. Many different kinds of workloads are possible:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;You can easily handle NumPy arrays or pandas DataFrames at scale, doing numerical work or data analysis/cleaning,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can handle any collection of objects, like JSON files, text, or log files,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can express more arbitrary task or job scheduling workloads with Dask Delayed, or real time and reactive processing with Dask Futures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
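&lt;p&gt;As a minimal sketch of the last point, the snippet below builds a small task graph with Dask Delayed. The functions &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;load&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;clean&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;summarize&lt;/span&gt;&lt;/code&gt; are hypothetical stand-ins for real per-file work, and the synchronous scheduler is chosen here only to keep the example deterministic.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;from dask import delayed


@delayed
def load(i):
    # Hypothetical stand-in for reading one input file.
    return list(range(i * 10, i * 10 + 10))


@delayed
def clean(values):
    # Hypothetical per-file step: keep the even values only.
    return [v for v in values if v % 2 == 0]


@delayed
def summarize(parts):
    # Combine the per-file results into a single number.
    return sum(sum(p) for p in parts)


# Nothing runs yet: these calls only record tasks in a graph.
parts = [clean(load(i)) for i in range(4)]
total = summarize(parts)

# compute() hands the graph to a scheduler. The synchronous scheduler is
# an assumption made for this sketch; it runs everything in one thread.
result = total.compute(scheduler="synchronous")
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The four load and clean steps are independent of each other, so a threaded or distributed scheduler would be free to run them in parallel without any change to the functions.&lt;/p&gt;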
&lt;p&gt;Dask covers and simplifies many of the wide range of HPC workflows we’ve seen over the years. Many workflows that were previously implemented using job arrays, simplified MPI (e.g. mpi4py) or plain bash scripts seem to be easier for our users with Dask.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="versatility-of-infrastructure"&gt;
&lt;h2&gt;5. Versatility of Infrastructure&lt;/h2&gt;
&lt;p&gt;Dask is compatible with laptops, servers, HPC systems, and cloud computing. The environment can change with very little code adaptation, which reduces our burden to rewrite code as we migrate analysis between systems, such as from a laptop to a supercomputer or between a supercomputer and the cloud.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Local machines&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# HPC Job Schedulers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SLURMCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SGECluster&lt;/span&gt;  &lt;span class="c1"&gt;# ... and others&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SLURMCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;default&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ABCD1234&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Hadoop/Spark clusters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_yarn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;environment.tar.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_vcores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cloud/Kubernetes clusters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_kubernetes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KubeCluster&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KubeCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod_spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Dask is more than just a tool to us; it is a gateway to thinking about a completely different way of providing computing infrastructure to our users. Dask opens up the door to cloud computing technologies (such as elastic scaling and object storage) and makes us rethink what an HPC center should really look like.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cost-and-collaboration"&gt;
&lt;h2&gt;6. Cost and Collaboration&lt;/h2&gt;
&lt;p&gt;Dask is free and open source, which means we do not have to rebalance our budget and staff to address the new, immediate need for data analysis tools.
We don’t have to pay for licenses, and we have the ability to make changes to the code when necessary. The HPC community has good representation among Dask developers. It’s easy for us to participate and our concerns are well understood.&lt;/p&gt;
&lt;!-- WR: Mention quick response to new use cases / demands as another benefit of the collaborative
approach?  And hint towards dask-mpi here? --&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="what-needs-work"&gt;
&lt;h1&gt;What needs work&lt;/h1&gt;
&lt;section id="heterogeneous-resources-handling"&gt;
&lt;h2&gt;1. Heterogeneous resources handling&lt;/h2&gt;
&lt;p&gt;Often we want to include different kinds of HPC nodes in the same deployment.
This includes situations like the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Workers with low or high memory,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers with GPUs,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers from different node pools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dask provides some support for this heterogeneity already, but not enough.
We see two major opportunities for improvement.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Tools like Dask-Jobqueue should make it easier to manage multiple worker
pools within the same cluster. Currently the deployment solution assumes
homogeneity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It should be easier for users to specify which parts of a computation
require different hardware. The solution today works, but requires more
detail from the user than is ideal.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
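&lt;p&gt;To illustrate the level of detail the second point refers to, here is a minimal sketch using Dask’s worker resources. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GPU&lt;/span&gt;&lt;/code&gt; label and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;train_on_gpu&lt;/span&gt;&lt;/code&gt; function are hypothetical, and an in-process &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LocalCluster&lt;/span&gt;&lt;/code&gt; stands in for real HPC nodes; on a real system we assume the workers would be started with a matching resource declaration.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;from dask.distributed import Client, LocalCluster


def train_on_gpu(x):
    # Hypothetical stand-in for work that needs an accelerator.
    return x * 2


# An in-process cluster whose single worker advertises one "GPU" resource.
# The label is arbitrary: Dask only does the bookkeeping and does not
# check that the hardware actually exists.
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=1,
    processes=False,
    resources={"GPU": 1},
)
client = Client(cluster)

# The user has to attach the constraint to each task that needs it.
future = client.submit(train_on_gpu, 21, resources={"GPU": 1})
result = future.result()

client.close()
cluster.close()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The constraint has to be repeated at every call site that needs the special hardware, which is the extra burden on users described above.&lt;/p&gt;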
&lt;/section&gt;
&lt;section id="coarse-grained-diagnostics-and-history"&gt;
&lt;h2&gt;2. Coarse-Grained Diagnostics and History&lt;/h2&gt;
&lt;p&gt;Dask provides a number of profiling tools that deliver real-time diagnostics at the individual task-level, but there is no way today to analyze or profile your Dask application at a coarse-grained level, and no built-in way to track performance over long periods of time.&lt;/p&gt;
&lt;p&gt;Having more tools to analyze bulk performance would be helpful when making design decisions and future architecture choices.&lt;/p&gt;
&lt;p&gt;Having the ability to persist or store history of computations (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute()&lt;/span&gt;&lt;/code&gt; calls)
and tasks executed on a scheduler could be really helpful to track problems and potential performance improvements.&lt;/p&gt;
&lt;!-- JJH: One tangible idea here would be a benchmarking suite that helps users make decisions about how to use dask most effectively.
 --&gt;
&lt;/section&gt;
&lt;section id="scheduler-performance-on-large-graphs"&gt;
&lt;h2&gt;3. Scheduler Performance on Large Graphs&lt;/h2&gt;
&lt;p&gt;HPC users want to analyze petabyte-scale datasets on clusters of thousands of large nodes.&lt;/p&gt;
&lt;p&gt;While Dask can theoretically handle this scale, it does tend to slow down a bit,
reducing the pleasure of interactive large-scale computing. Handling millions of tasks can lead to tens of seconds of latency before a computation actually starts. This is perfectly fine for our Dask batch jobs, but tends to frustrate interactive Jupyter users.&lt;/p&gt;
&lt;p&gt;Much of this slowdown is due to task-graph construction time and centralized scheduling, both of which can be accelerated through a variety of means. We expect that, with some cleverness, we can increase the scale at which Dask continues to run smoothly by another order of magnitude.&lt;/p&gt;
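&lt;p&gt;Until such improvements land, one thing users can do on their side is keep the number of tasks down, for example by choosing larger chunks. This is our own illustration rather than one of the scheduler-level fixes mentioned above; the array and chunk sizes are arbitrary, and nothing is computed or allocated.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;import dask.array as da

# The same 20000 x 20000 array, split two different ways.
fine = da.ones((20000, 20000), chunks=(500, 500))
coarse = da.ones((20000, 20000), chunks=(2000, 2000))

# Each chunk becomes at least one task per operation, so the chunk count
# is a rough proxy for how much work the scheduler has to track.
n_fine = fine.npartitions      # 40 * 40 blocks
n_coarse = coarse.npartitions  # 10 * 10 blocks
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Larger chunks mean fewer tasks but more memory per task, so the right size depends on the workers available.&lt;/p&gt;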
&lt;/section&gt;
&lt;section id="launch-batch-jobs-with-mpi"&gt;
&lt;h2&gt;4. &lt;del&gt;Launch Batch Jobs with MPI&lt;/del&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This issue was resolved while we prepared this blogpost.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Most Dask workflows today are interactive. People log into a Jupyter notebook, import Dask, and then Dask asks the job scheduler (like SLURM, PBS, …) for resources dynamically. This is great because Dask is able to fit into small gaps in the schedule and release workers when they’re not needed, giving users a pleasant interactive experience while lessening the load on the cluster.&lt;/p&gt;
&lt;p&gt;However, not all jobs are interactive. Often scientists want to submit a large job similar to how they submit MPI jobs. They submit a single job script with the necessary resources, walk away, and the resource manager runs that job when those resources become available (which may be many hours from now). While not as novel as the interactive workloads, these workloads are critical to common processes, and important to support.&lt;/p&gt;
&lt;p&gt;This point was raised by Kevin Paul at NCAR during discussion of this blogpost. Between when we started planning and when we released this blogpost, Kevin had already solved the problem by providing &lt;a class="reference external" href="https://dask-mpi.readthedocs.org"&gt;dask-mpi&lt;/a&gt;, a project that launches Dask using normal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpirun&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mpiexec&lt;/span&gt;&lt;/code&gt; commands, making it easy to deploy Dask anywhere that MPI is deployable.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="more-data-formats"&gt;
&lt;h2&gt;5. More Data Formats&lt;/h2&gt;
&lt;p&gt;Dask works well today with bread-and-butter scientific data formats like HDF5, GRIB, and NetCDF, as well as common data science formats like CSV, JSON, Parquet, ORC, and so on.&lt;/p&gt;
&lt;p&gt;However, the space of data formats is vast and Dask users find themselves struggling a little, or even solving the data ingestion problem manually for a number of common formats in different domains:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Remote sensing datasets: GeoTIFF, JPEG 2000,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Astronomical data: FITS, VOTable,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;… and so on&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Supporting these isn’t hard (indeed many of us have built our own support for them in Dask), but it would be handy to have a high quality centralized solution.&lt;/p&gt;
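&lt;p&gt;The home-grown support mentioned above often follows one pattern: wrap a per-file reader in Dask Delayed and assemble the pieces into a Dask array. The sketch below shows that pattern under stated assumptions: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_tile&lt;/span&gt;&lt;/code&gt; is a hypothetical reader that fabricates data instead of opening a real GeoTIFF or FITS file, and every tile is assumed to have the same known shape and dtype.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;import numpy as np
import dask.array as da
from dask import delayed


def read_tile(index, shape=(4, 4)):
    # Hypothetical reader: a real one would open file number `index` with
    # a format-specific library and return its pixels as a NumPy array.
    return np.full(shape, index, dtype="float64")


# Wrap each read lazily; shape and dtype must be declared up front
# because Dask cannot inspect a file it has not read yet.
tiles = [
    da.from_delayed(delayed(read_tile)(i), shape=(4, 4), dtype="float64")
    for i in range(3)
]

# Stack the per-file arrays into one lazy (3, 4, 4) Dask array.
stack = da.stack(tiles, axis=0)
mean_per_tile = stack.mean(axis=(1, 2)).compute()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each domain ends up rewriting roughly this glue for its own formats, which is why a shared, well-tested solution would be valuable.&lt;/p&gt;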
&lt;/section&gt;
&lt;section id="link-with-deep-learning"&gt;
&lt;h2&gt;6. Link with Deep Learning&lt;/h2&gt;
&lt;p&gt;Many of our institutions are excited to leverage recent advances in deep learning and integrate powerful tools like Keras, TensorFlow, and PyTorch and powerful hardware like GPUs into our workflows.&lt;/p&gt;
&lt;p&gt;However, we often find that our data and architecture look a bit different from what we find in standard deep learning tutorials. We like using Dask for data ingestion, cleanup, and pre-processing, but would like to establish better practices and smooth tooling to transition from scientific workflows on HPC using Dask to deep learning as efficiently as possible.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;For an example topic, see &lt;a class="reference external" href="https://github.com/pangeo-data/pangeo/issues/567"&gt;this GitHub
issue&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="more-calculation-guidelines"&gt;
&lt;h2&gt;7. More calculation guidelines&lt;/h2&gt;
&lt;p&gt;While there are means to analyse and diagnose computations interactively, and
a decent set of examples for common Dask calculations, trial and error appears to be the norm for big HPC computations before arriving at an optimized workflow.&lt;/p&gt;
&lt;p&gt;We should develop more guidelines and strategies for performing large-scale computation, and we need to foster the community around Dask, as projects such as Pangeo already do. Note that these guidelines may be infrastructure dependent.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/06/12/dask-on-hpc/"/>
    <summary>We analyze large datasets on HPC systems with Dask, a parallel computing
library that integrates well with the existing Python software ecosystem, and
works comfortably with native HPC hardware.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2019-06-12T00:00:00+00:00</published>
  </entry>
</feed>
