<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posted in 2018</title>
  <updated>2026-03-05T15:05:21.924234+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/2018/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2018/11/29/version-1.0/</id>
    <title>Dask Version 1.0</title>
    <updated>2018-11-29T00:00:00+00:00</updated>
    <author>
      <name>the Dask Team</name>
    </author>
    <content type="html">&lt;p&gt;We are pleased to announce the release of Dask version 1.0.0!&lt;/p&gt;
&lt;p&gt;Usually in release blogposts we outline important features and changes since
the last major version. Because of the 1.0 version number, this post will be a
bit different. Instead we’ll talk about what this version number means to us,
and discuss the broader context of Dask projects more generally.&lt;/p&gt;
&lt;section id="what-1-0-means-to-us"&gt;
&lt;h1&gt;What 1.0 means to us&lt;/h1&gt;

&lt;p&gt;Version 1.0 software means different things to different groups.
In some communities it might mean …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The first version of a package&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a package is first ready for production use&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a package has reached API stability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is common in the PyData ecosystem to wait a &lt;em&gt;long time&lt;/em&gt; before releasing a
version 1.0. For example, neither Pandas nor Scikit-Learn, arguably two of the
most heavily used PyData packages in production, has yet declared a 1.0 version
number (today they are at versions 0.23 and 0.20 respectively). And yet each
package is widely used in production by organizations that demand high degrees
of stability.&lt;/p&gt;
&lt;p&gt;Dask is not as API-stable as Pandas or Scikit-Learn, but it’s pretty close.
The project rarely invents new APIs, instead preferring to implement
pre-existing APIs (like the NumPy/Pandas/Scikit-Learn APIs) or standard language
protocols (like async-await, concurrent.futures, Queues, Locks, and so on).
Additionally, Dask is widely used in production today across sectors ranging from
risk-tolerant industries like startups and quantitative finance shops, to
risk-averse institutions like banks, large enterprises, and governments.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;When we say that Dask has reached 1.0 we mean that it is ready to be used in
production. We are late in saying this. This happened a long time ago.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="development-will-continue-as-before"&gt;
&lt;h1&gt;Development will continue as before&lt;/h1&gt;
&lt;p&gt;Dask is living software that exists in a rapidly evolving space. Nothing is
changing about our internal stability practices. We will continue to add new
features, deprecate old ones, and fix bugs with the same policies. We always
try to minimize negative effects on users when making these internal changes
while maximizing the speed at which we can deliver new bugfixes and features.
This is hard and requires care, but we believe that we’ve done this decently in
the past so hopefully you haven’t noticed much. We will continue to operate
the same way into the future.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The 1.0 version change does not affect our development cycle.
There are no LTS versions beyond what we already provide.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="different-dask-packages-move-at-different-speeds"&gt;
&lt;h1&gt;Different Dask packages move at different speeds&lt;/h1&gt;
&lt;p&gt;Dask is able to evolve and experiment rapidly while maintaining a stable core
because it is split into sub-packages, each of which evolves independently, has
its own maintainers, its own versions, and its own release cycle. Some Dask
subprojects have had versions above 1.0 for a long time, while others are still
unstable.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Dask’s version number is hard to define today because it is composed of so
many independent efforts by different groups. This is similar to the situation in
Jupyter, or in the Numeric Python ecosystem itself.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="thanks"&gt;
&lt;h1&gt;Thanks&lt;/h1&gt;
&lt;p&gt;Finally, we’re grateful to everyone who has contributed to the project over the
years, whether through code, reviews, documentation, discussion, bug
reports, well-written questions and answers, visual designs, or well wishes.
This means a lot to us.&lt;/p&gt;
&lt;p&gt;Today there are dozens of &lt;a class="reference external" href="https://pypi.org/search/?q=dask"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-*&lt;/span&gt;&lt;/code&gt; packages on
PyPI&lt;/a&gt; that support thousands of users and
several more that incorporate Dask for parallelism. We’re thankful to play a
role in such a vibrant community.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/11/29/version-1.0/"/>
    <summary>We are pleased to announce the release of Dask version 1.0.0!</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-11-29T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/10/08/Dask-Jobqueue/</id>
    <title>Dask-jobqueue</title>
    <updated>2018-10-08T00:00:00+00:00</updated>
    <author>
      <name>Joe Hamman</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work was done in collaboration with &lt;a class="reference external" href="https://github.com/mrocklin"&gt;Matthew Rocklin&lt;/a&gt; (Anaconda), Jim Edwards (NCAR), &lt;a class="reference external" href="https://github.com/guillaumeeb"&gt;Guillaume Eynard-Bontemps&lt;/a&gt; (CNES), and &lt;a class="reference external" href="https://github.com/lesteve"&gt;Loïc Estève&lt;/a&gt; (INRIA), and is supported, in part, by the US National Science Foundation &lt;a class="reference external" href="https://www.earthcube.org/"&gt;Earth Cube program&lt;/a&gt;. The dask-jobqueue package is a spinoff of the &lt;a class="reference external" href="https://medium.com/pangeo"&gt;Pangeo Project&lt;/a&gt;. This blogpost was previously published &lt;a class="reference external" href="https://medium.com/pangeo/dask-jobqueue-d7754e42ca53"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TLDR;&lt;/strong&gt; &lt;em&gt;Dask-jobqueue&lt;/em&gt; allows you to seamlessly deploy &lt;a class="reference external" href="https://dask.org/"&gt;dask&lt;/a&gt; on HPC clusters that use a variety of job queuing systems such as PBS, Slurm, SGE, or LSF. Dask-jobqueue provides a &lt;em&gt;Pythonic&lt;/em&gt; user interface that manages dask workers/clusters through the submission, execution, and deletion of individual jobs on an HPC system. It gives users the ability to interactively scale workloads across large HPC systems, turning an interactive &lt;a class="reference external" href="http://jupyter.org/"&gt;Jupyter&lt;/a&gt; Notebook into a powerful tool for scalable computation on very large datasets.&lt;/p&gt;
&lt;p&gt;Install with:&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;conda-forge&lt;span class="w"&gt; &lt;/span&gt;dask-jobqueue
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;dask-jobqueue
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Also check out the dask-jobqueue documentation: &lt;a class="reference external" href="http://jobqueue.dask.org"&gt;http://jobqueue.dask.org&lt;/a&gt;&lt;/p&gt;
&lt;section id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Large high-performance computing (HPC) clusters are ubiquitous throughout the computational sciences. These systems feature powerful hardware, including many large compute nodes, high-speed interconnects, and parallel file systems. One such system that we use at &lt;a class="reference external" href="https://ncar.ucar.edu/"&gt;NCAR&lt;/a&gt; is &lt;a class="reference external" href="https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne"&gt;Cheyenne&lt;/a&gt;. Cheyenne is a fairly large machine, with about 150k cores and over 300 TB of total memory.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cheyenne is a 5.34-petaflops, high-performance computer operated by NCAR." src="https://cdn-images-1.medium.com/max/2000/1*Jqm612rTcdWFkmcZWhcrTw.jpeg" /&gt;&lt;em&gt;Cheyenne is a 5.34-petaflops, high-performance computer operated by NCAR.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;These systems frequently use a job queueing system, such as PBS, Slurm, or SGE, to manage the queueing and execution of many concurrent jobs from numerous users. A “job” is a single execution of a program that is to be run on some set of resources on the user’s HPC system. These jobs are often submitted via the command line:&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;qsub&lt;span class="w"&gt; &lt;/span&gt;do_thing_a.sh
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Where do_thing_a.sh is a shell script that might look something like this:&lt;/p&gt;
&lt;div class="highlight-bash notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="ch"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -N thing_a&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -q premium&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -A 123456789&lt;/span&gt;
&lt;span class="c1"&gt;#PBS -l select=1:ncpus=36:mem=109G&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;doing thing A&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this example “-N” specifies the name of this job, “-q” specifies the queue where the job should be run, “-A” specifies a project code to bill for the CPU time used while the job is run, and “-l” specifies the hardware specifications for this job. Each job queueing system has slightly different syntax for configuring and submitting these jobs.&lt;/p&gt;
&lt;p&gt;This interface has led to the development of a few common workflow patterns:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;MPI if you want to scale&lt;/em&gt;. MPI stands for the Message Passing Interface. It is a widely adopted interface allowing parallel computation across traditional HPC clusters. Many large computational models are written in languages like C and Fortran and use MPI to manage their parallel execution. For the old-timers out there, this is the go-to solution when it comes time to scale complex computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Batch it&lt;/em&gt;. It is quite common for scientific processing pipelines to include a few steps that can be easily parallelized by submitting multiple jobs in parallel. Maybe you want to run “do_thing_a.sh” 500 times with slightly different inputs. Easy: just submit all the jobs separately (or in what some queueing systems refer to as an “array job”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Serial is still okay&lt;/em&gt;. Computers are pretty fast these days, right? Maybe you don’t need to parallelize your program at all. Okay, so keep it serial and get some coffee while your job is running.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
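&lt;p&gt;As a rough illustration of the “batch it” pattern, the sketch below builds one qsub command line per input without submitting anything. The helper name build_qsub_commands and the INPUT environment variable are hypothetical, and the sketch assumes a PBS-style qsub whose -v option passes environment variables to the job; other queueing systems use different syntax.&lt;/p&gt;

```python
import shlex


def build_qsub_commands(script, inputs):
    """Build one qsub command line per input (hypothetical helper).

    Nothing is submitted here; each command is returned as a list of
    arguments that could later be passed to subprocess.run.
    """
    # Assumption: the job script reads its input from an INPUT variable.
    return [["qsub", "-v", f"INPUT={value}", script] for value in inputs]


commands = build_qsub_commands("do_thing_a.sh", range(500))
print(len(commands))
print(shlex.join(commands[0]))
```

&lt;p&gt;Building the commands as data first makes it easy to inspect them before handing anything to the queueing system.&lt;/p&gt;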
&lt;/section&gt;
&lt;section id="the-problem"&gt;
&lt;h1&gt;The Problem&lt;/h1&gt;
&lt;p&gt;None of the workflow patterns listed above allow for interactive analysis of very large datasets. When I’m prototyping a new processing method, I often want to work interactively, say in a Jupyter Notebook. Writing MPI code on the fly is hard and expensive, batch jobs are inherently not interactive, and serial just won’t do when I start working on many TBs of data. Our experience is that these workflows tend to be fairly inelegant and difficult to transfer between applications, yielding lots of duplicated effort along the way.&lt;/p&gt;
&lt;p&gt;One of the aims of the Pangeo project is to facilitate interactive data analysis on very large datasets. Pangeo leverages Jupyter and dask, along with a number of more domain-specific packages like &lt;a class="reference external" href="http://xarray.pydata.org"&gt;xarray&lt;/a&gt;, to make this possible. The problem is that we didn’t have a particularly palatable method for deploying dask on our HPC clusters.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="the-system"&gt;
&lt;h1&gt;The System&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Jupyter Notebooks&lt;/em&gt; are web applications that support interactive code execution, display of figures and animations, and in-line explanatory text and equations. They are quickly becoming the standard open-source format for interactive computing in Python.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Dask&lt;/em&gt; is a library for parallel computing that coordinates well with Python’s existing scientific software ecosystem, including libraries like &lt;a class="reference external" href="http://www.numpy.org/"&gt;NumPy&lt;/a&gt;, &lt;a class="reference external" href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;, &lt;a class="reference external" href="http://scikit-learn.org/stable/"&gt;Scikit-Learn&lt;/a&gt;, and xarray. In many cases, it offers users the ability to take existing workflows and quickly scale them to much larger applications. &lt;em&gt;&lt;a class="reference external" href="http://distributed.dask.org"&gt;Dask-distributed&lt;/a&gt;&lt;/em&gt; is an extension of dask that facilitates parallel execution across many computers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Dask-jobqueue&lt;/em&gt; is a new Python package that we’ve built to facilitate the deployment of &lt;em&gt;dask&lt;/em&gt; on HPC clusters and to interface with a number of job queuing systems. Its usage is concise and Pythonic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_jobqueue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;108GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;premium&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="whats-happening-under-the-hood"&gt;
&lt;h2&gt;What’s happening under the hood?&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;In the call to PBSCluster() we are telling dask-jobqueue how we want to configure each job. In this case, we set each job to have 1 &lt;em&gt;Worker&lt;/em&gt;, each using 36 cores (threads) and 108 GB of memory. We also tell the PBS queueing system we’d like to submit this job to the “premium” queue. This step also starts a Scheduler to manage workers that we’ll add later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is not until we call the cluster.scale() method that we interact with the PBS system. Here we start 10 workers, or equivalently 10 PBS jobs. For each job, dask-jobqueue creates a shell command similar to the one above (except dask-worker is called instead of echo) and submits the job via a subprocess call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we connect to the cluster by instantiating the Client class. From here, the rest of our code looks just as it would if we were using one of &lt;a class="reference external" href="http://docs.dask.org/en/stable/scheduler-overview.html"&gt;dask’s local schedulers&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
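&lt;p&gt;To make step 2 more concrete, here is a minimal sketch of how keyword arguments like those above could be turned into a PBS job script. This is not dask-jobqueue’s actual implementation: render_pbs_job_script is a hypothetical helper, the job name is made up, and the worker command line is only a placeholder for whatever dask-jobqueue really generates.&lt;/p&gt;

```python
def render_pbs_job_script(name, queue, cores, memory, worker_command):
    """Return the text of a PBS job script.

    Illustrative helper only; it is not part of dask-jobqueue's API.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",
        f"#PBS -q {queue}",
        f"#PBS -l select=1:ncpus={cores}:mem={memory}",
        "",
        worker_command,
    ]
    return "\n".join(lines) + "\n"


script = render_pbs_job_script(
    name="dask-worker",  # assumed job name
    queue="premium",
    cores=36,
    memory="108GB",
    # Placeholder: the real worker invocation is built by dask-jobqueue.
    worker_command="dask-worker SCHEDULER_ADDRESS --nthreads 36",
)
print(script)
```

&lt;p&gt;Each call to cluster.scale() would then amount to submitting a script like this once per requested job.&lt;/p&gt;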
&lt;p&gt;Dask-jobqueue is easily customizable to help users capitalize on advanced HPC features. A more complicated example that would work on NCAR’s Cheyenne supercomputer is:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PBSCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;108GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;P48500028&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;premium&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;resource_spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;select=1:ncpus=36:mem=109G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;walltime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;02:00:00&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ib0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;$TMPDIR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this example, we instruct the PBSCluster to 1) use up to 36 cores per job, 2) use 18 worker processes per job, 3) use the large memory nodes with 109 GB each, 4) use a longer walltime than is standard, 5) use the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/InfiniBand"&gt;InfiniBand&lt;/a&gt; network interface (ib0), and 6) use the fast SSD disks as its local directory space.&lt;/p&gt;
&lt;p&gt;Finally, Dask offers the ability to “autoscale” clusters based on a set of heuristics. When the cluster needs more CPU or memory, it will scale up. When the cluster has unused resources, it will scale down. Dask-jobqueue supports this with a simple interface:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maximum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this example, we tell our cluster to autoscale between 18 and 360 workers (or 1 and 20 jobs).&lt;/p&gt;
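&lt;p&gt;The relationship between workers and jobs here is simple arithmetic. The sketch below assumes each job hosts 18 worker processes, as in the PBSCluster example above; jobs_needed is a hypothetical helper, not part of dask-jobqueue.&lt;/p&gt;

```python
import math

# Assumption: matches processes=18 in the PBSCluster example.
PROCESSES_PER_JOB = 18


def jobs_needed(workers, processes_per_job=PROCESSES_PER_JOB):
    """Number of whole jobs required to host the requested worker processes."""
    return math.ceil(workers / processes_per_job)


# The adapt() bounds from the example: 18 and 360 workers.
print(jobs_needed(18), jobs_needed(360))
```

&lt;p&gt;With these numbers the bounds work out to 1 and 20 jobs. A worker count that is not a multiple of 18 would round up to the next whole job.&lt;/p&gt;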
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="demonstration"&gt;
&lt;h1&gt;Demonstration&lt;/h1&gt;
&lt;p&gt;We have put together a fairly comprehensive screencast that walks users through all the steps of setting up Jupyter and Dask (and dask-jobqueue) on an HPC cluster:&lt;/p&gt;
&lt;center&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/FXsgmwpRExM" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;/center&gt;
&lt;/section&gt;
&lt;section id="conclusions"&gt;
&lt;h1&gt;Conclusions&lt;/h1&gt;
&lt;p&gt;Dask-jobqueue makes it much easier to deploy Dask on HPC clusters. The package provides a Pythonic interface to common job-queueing systems. It is also easily customizable.&lt;/p&gt;
&lt;p&gt;The autoscaling functionality allows for a fundamentally different way to do science on HPC clusters. Start your Jupyter Notebook, instantiate your dask cluster, and then do science — let dask determine when to scale up and down depending on the computational demand. We think this bursting approach to interactive parallel computing offers many benefits.&lt;/p&gt;
&lt;p&gt;Finally, in developing dask-jobqueue, we’ve run into a few challenges that are worth mentioning.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Queueing systems are highly customizable. System administrators seem to have a lot of control over their particular implementation of each queueing system. In practice, this means that it is often difficult to simultaneously cover all permutations of a particular queueing system. We’ve generally found that things seem to be flexible enough and welcome feedback in the cases where they are not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI testing has required a fair bit of work to set up. The target environment for dask-jobqueue is existing HPC clusters. In order to facilitate continuous integration testing of dask-jobqueue, we’ve had to configure multiple queueing systems (PBS, Slurm, SGE) to run in docker using Travis CI. This has been a laborious task and one we’re still working on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve built dask-jobqueue to operate in the dask-deploy framework. If you are familiar with &lt;a class="reference external" href="http://kubernetes.dask.org"&gt;dask-kubernetes&lt;/a&gt; or &lt;a class="reference external" href="http://yarn.dask.org"&gt;dask-yarn&lt;/a&gt;, you’ll recognize the basic syntax in dask-jobqueue as well. The coincident development of these dask deployment packages has recently brought up some important coordination discussions (e.g. &lt;a class="github reference external" href="https://github.com/dask/distributed/issues/2235"&gt;dask/distributed#2235&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/10/08/Dask-Jobqueue/"/>
    <summary>This work was done in collaboration with Matthew Rocklin (Anaconda), Jim Edwards (NCAR), Guillaume Eynard-Bontemps (CNES), and Loïc Estève (INRIA), and is supported, in part, by the US National Science Foundation Earth Cube program. The dask-jobqueue package is a spinoff of the Pangeo Project. This blogpost was previously published here</summary>
    <category term="HPC" label="HPC"/>
    <category term="distributed" label="distributed"/>
    <category term="jobqueue" label="jobqueue"/>
    <published>2018-10-08T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/09/27/docs-refactor/</id>
    <title>Refactor Documentation</title>
    <updated>2018-09-27T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin and Tom Augspurger</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;section id="summary"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;

&lt;p&gt;We recently changed how we organize and connect Dask’s documentation.
Our approach may prove useful for other umbrella projects that spread
documentation across many different builds and sites.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-splits-documentation-into-many-pages"&gt;
&lt;h1&gt;Dask splits documentation into many pages&lt;/h1&gt;
&lt;p&gt;Dask’s documentation is split into several different websites, each managed by
a different team for a different sub-project:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org"&gt;dask.pydata.org&lt;/a&gt; : Main site&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.readthedocs.org"&gt;distributed.readthedocs.org&lt;/a&gt; : Distributed scheduler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-ml.readthedocs.io"&gt;dask-ml.readthedocs.io&lt;/a&gt; : Dask for machine learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-kubernetes.readthedocs.io"&gt;dask-kubernetes.readthedocs.io&lt;/a&gt; : Dask on Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-jobqueue.readthedocs.io"&gt;dask-jobqueue.readthedocs.io&lt;/a&gt; : Dask on HPC systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-yarn.readthedocs.io"&gt;dask-yarn.readthedocs.io&lt;/a&gt; : Dask on Hadoop systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-examples.readthedocs.io"&gt;dask-examples.readthedocs.io&lt;/a&gt; : Examples that use Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://matthewrocklin.com/blog"&gt;matthewrocklin.com/blog&lt;/a&gt;,
&lt;a class="reference external" href="https://jcrist.github.io"&gt;jcrist.github.io&lt;/a&gt;,
&lt;a class="reference external" href="https://tomaugspurger.github.io"&gt;tomaugspurger.github.io&lt;/a&gt;,
&lt;a class="reference external" href="https://martindurant.github.io/blog"&gt;martindurant.github.io/blog&lt;/a&gt; :
Developers’ personal blogs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This split in documentation matches the split in development teams. Each
sub-project’s team manages its own docs in its own way. They release at their
own pace and make their own decisions about technology. This makes it much
more likely that developers maintain the documentation as they develop and
change software libraries.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;We make it easy to write documentation. This choice causes many different documentation systems to emerge.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This approach is common. A web search for Jupyter Documentation yields the
following list:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jupyter.readthedocs.io/en/latest/"&gt;jupyter.readthedocs.io/en/latest/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jupyter-notebook.readthedocs.io/en/stable/"&gt;jupyter-notebook.readthedocs.io/en/stable/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jupyter.org/"&gt;jupyter.org/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jupyterhub.readthedocs.io/en/stable/"&gt;jupyterhub.readthedocs.io/en/stable/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nteract.io/"&gt;nteract.io/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://ipython.org/"&gt;ipython.org/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Different teams developing semi-independently create different web pages. This
is inevitable. Asking a large distributed team to coordinate on a single
cohesive website adds substantial friction, which results in worse
documentation coverage.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;However, while using separate websites results in excellent coverage, it
also fragments the documentation. This makes it harder for users to smoothly
navigate between sites and discover appropriate content.&lt;/p&gt;
&lt;p&gt;Monolithic documentation is good for readers,
modular documentation is good for writers.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="our-solutions"&gt;
&lt;h1&gt;Our Solutions&lt;/h1&gt;
&lt;p&gt;Over the last month we took steps to connect our documentation and make it more
cohesive, while still enabling independent development. This post outlines the
following steps:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Organize under a single domain, dask.org&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include a cross-project navbar in addition to the within-project
table-of-contents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Develop a Sphinx template project for uniform style&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We did some other things along the way that we find useful, but that are
probably more specific to Dask.&lt;/p&gt;
&lt;ol class="arabic simple" start="4"&gt;
&lt;li&gt;&lt;p&gt;We moved this blog to blog.dask.org&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We improved our example notebooks so that they are available both as a static site and as a live Binder&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="organize-under-a-single-domains-dask-org"&gt;
&lt;h1&gt;1: Organize under a single domain, dask.org&lt;/h1&gt;
&lt;p&gt;Previously we had some documentation under &lt;a class="reference external" href="https://rtfd.org"&gt;readthedocs&lt;/a&gt;,
some under the &lt;a class="reference external" href="https://dask.pydata.org"&gt;dask.pydata.org&lt;/a&gt; subdomain (thanks
NumFOCUS!) and some pages on personal websites, like
&lt;a class="reference external" href="https://matthewrocklin.com/blog"&gt;matthewrocklin.com/blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While looking for a new dask domain to host all of our content we noticed that
&lt;a class="reference external" href="https://dask.org"&gt;dask.org&lt;/a&gt; redirected to
&lt;a class="reference external" href="https://anaconda.org"&gt;anaconda.org&lt;/a&gt;, and were pleased to learn that someone at
&lt;a class="reference external" href="https://anaconda.com"&gt;Anaconda Inc&lt;/a&gt; had the foresight to register the domain
early on.&lt;/p&gt;
&lt;p&gt;Anaconda was happy to transfer ownership of the domain to NumFOCUS, which now
helps us maintain it. All of our documentation is now available under that
single domain as subdomains:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.org"&gt;dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org"&gt;docs.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://distributed.dask.org"&gt;distributed.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://ml.dask.org"&gt;ml.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://kubernetes.dask.org"&gt;kubernetes.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://yarn.dask.org"&gt;yarn.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jobqueue.dask.org"&gt;jobqueue.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://examples.dask.org"&gt;examples.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://stories.dask.org"&gt;stories.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://blog.dask.org"&gt;blog.dask.org&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This uniformity means that the thing you want is probably at that-thing.dask.org, which is a bit easier to guess than otherwise.&lt;/p&gt;
&lt;p&gt;Many thanks to &lt;a class="reference external" href="https://andy.terrel.us/"&gt;Andy Terrel&lt;/a&gt; and &lt;a class="reference external" href="https://tomaugspurger.github.io"&gt;Tom
Augspurger&lt;/a&gt; for managing this move, and to
Anaconda for generously donating the domain.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cross-project-navigation-bar"&gt;
&lt;h1&gt;2: Cross-project Navigation Bar&lt;/h1&gt;
&lt;p&gt;We wanted a way for readers to quickly discover the other sites that were
available to them. All of our sites have side-navigation-bars to help readers
navigate within a particular site, but now they also have a top-navigation-bar
to help them navigate between projects.&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/docs-navbar-sidebar.png"
     width="100%"
     alt="adding a navbar to dask docs"&gt;&lt;/p&gt;
&lt;p&gt;This navigation bar is managed independently of the individual documentation projects, in
our new Sphinx theme.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-sphinx-theme"&gt;
&lt;h1&gt;3: Dask Sphinx Theme&lt;/h1&gt;
&lt;p&gt;To give a uniform sense of style we developed our own Sphinx HTML theme. It
inherits from the Read the Docs theme, with styling changed to match Dask’s colors
and visual style. We publish this theme as a &lt;a class="reference external" href="https://pypi.org/project/dask-sphinx-theme/"&gt;package on
PyPI&lt;/a&gt; that all of our projects’
Sphinx builds can import and use if they want. We can change style in this one
package and publish to PyPI and all of the projects will pick up those changes
on their next build without having to copy stylesheets around to different
repositories.&lt;/p&gt;
&lt;p&gt;This allows several different projects to evolve content (which they care
about) and build process separately from style (which they typically don’t care
as much about). We have a single style sheet that gets used everywhere easily.&lt;/p&gt;
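&lt;p&gt;As a rough sketch of what this looks like from a project’s side, a Sphinx &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;conf.py&lt;/span&gt;&lt;/code&gt; might select the shared theme as shown below. This assumes the package has been installed from PyPI and that the theme is registered under the name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask_sphinx_theme&lt;/span&gt;&lt;/code&gt;; the package’s README is the authority on the exact settings.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# conf.py (sketch): assumes `pip install dask-sphinx-theme` has been run
# and that the theme name below matches what the package registers.
html_theme = "dask_sphinx_theme"
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;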
&lt;/section&gt;
&lt;section id="move-dask-blogging-to-blog-dask-org"&gt;
&lt;h1&gt;4: Move Dask Blogging to blog.dask.org&lt;/h1&gt;
&lt;p&gt;Previously most announcements about Dask were written and published from one of
the maintainers’ personal blogs. This split information about the project and
made it hard for people to discover good content. There also wasn’t a good way
for a community member to suggest a blog post for distribution to the general
community, other than by starting their own blog.&lt;/p&gt;
&lt;p&gt;Now we have an official blog at &lt;a class="reference external" href="https://blog.dask.org"&gt;blog.dask.org&lt;/a&gt; which
serves files submitted to
&lt;a class="reference external" href="https://github.com/dask/dask-blog"&gt;github.com/dask/dask-blog&lt;/a&gt;. These posts
are simple markdown files that should be easy for people to generate. For
example the source for this post is available at
&lt;a class="reference external" href="https://github.com/dask/dask-blog/blob/gh-pages/_posts/2018-09-27-docs-refactor.md"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;github.com/dask/dask-blog/blob/gh-pages/_posts/2018-09-27-docs-refactor.md&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We encourage community members to share posts about work they’ve done with Dask
by submitting pull requests to that repository.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="host-examples-as-both-static-html-and-live-binder-sessions"&gt;
&lt;h1&gt;5: Host Examples as both static HTML and live Binder sessions&lt;/h1&gt;
&lt;p&gt;The Dask community maintains a set of example notebooks that show people how to
use Dask in a variety of ways. These notebooks live at
&lt;a class="reference external" href="https://github.com/dask/dask-examples"&gt;github.com/dask/dask-examples&lt;/a&gt; and are
easy for users to download and run.&lt;/p&gt;
&lt;p&gt;To get more value from these notebooks we now expose them in two additional
ways:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;As static HTML at &lt;a class="reference external" href="https://examples.dask.org"&gt;examples.dask.org&lt;/a&gt;, rendered
with the &lt;a class="reference external" href="https://nbsphinx.readthedocs.io/en/latest/"&gt;nbsphinx&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;Seeing them statically rendered and being able to quickly navigate between
them really increases the pleasure of exploring them. We hope that this
encourages users to explore more broadly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As live-runnable notebooks on the cloud using &lt;a class="reference external" href="https://mybinder.org"&gt;mybinder.org&lt;/a&gt;.
You can play with any of these notebooks by clicking on this button:
&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=lab"&gt;&lt;img alt="Binder" src="https://mybinder.org/badge.svg" /&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This allows people to explore more deeply. Also, because we’ve connected
the Dask JupyterLab extension to this environment, users get an
immediate, intuitive sense of what parallel computing feels like (if
you haven’t used the Dask dashboard during computation you really should
give that link a try).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
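&lt;p&gt;For projects that want to try the static rendering described in the first item, enabling nbsphinx is mostly a matter of Sphinx configuration. The fragment below is a minimal sketch rather than a copy of the dask-examples configuration; the exclude patterns are an assumption about a typical notebook repository.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# conf.py (sketch): render Jupyter notebooks as documentation pages
extensions = ["nbsphinx"]

# Assumed layout: skip build output and Jupyter checkpoint copies
exclude_patterns = ["_build", "**.ipynb_checkpoints"]
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;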
&lt;p&gt;Now that these examples get much more exposure we hope that this encourages
community members to submit new examples. We hope that by providing
infrastructure more content creators will come as well.&lt;/p&gt;
&lt;p&gt;We also encourage other projects to take a look at what we’ve done in
&lt;a class="reference external" href="https://github.com/dask/dask-examples"&gt;github.com/dask/dask-examples&lt;/a&gt;. We
think that this model might be broadly useful across other projects.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Thank you for reading. We hope that this post pushes readers to re-explore
Dask’s documentation, and that it pushes developers to consider some of the
approaches above for their own projects.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/09/27/docs-refactor/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-09-27T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/09/17/dask-dev/</id>
    <title>Dask Development Log</title>
    <updated>2018-09-17T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.&lt;/p&gt;
&lt;p&gt;Since the last update in the &lt;a class="reference internal" href="../../2018/09/05/dask-0.19.0/"&gt;&lt;span class="doc std std-doc"&gt;0.19.0 release blogpost&lt;/span&gt;&lt;/a&gt; two weeks ago we’ve seen activity in the following areas:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Update Dask examples to use JupyterLab on Binder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Render Dask examples into static HTML pages for easier viewing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consolidate and unify disparate documentation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retire the &lt;a class="reference external" href="https://hdfs3.readthedocs.io/en/latest/"&gt;hdfs3 library&lt;/a&gt; in favor of the solution in &lt;a class="reference external" href="https://arrow.apache.org/docs/python/filesystems.html"&gt;Apache Arrow&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continue work on hyper-parameter selection for incrementally trained models&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publish two small bugfix releases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blogpost from the Pangeo community about combining Binder with Dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Skein/Yarn Update&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="update-dask-examples-to-use-jupyterlab-extension"&gt;

&lt;p&gt;The new &lt;a class="reference external" href="https://github.com/dask/dask-labextension"&gt;dask-labextension&lt;/a&gt; embeds
Dask’s dashboard plots into a JupyterLab session so that you can get easy
access to information about your computations from Jupyter directly. This was
released a few weeks ago as part of the previous release post.&lt;/p&gt;
&lt;p&gt;However since then we’ve hooked this up to our live examples system that lets
users try out Dask on a small cloud instance using
&lt;a class="reference external" href="https://mybinder.org"&gt;mybinder.org&lt;/a&gt;. If you want to try out Dask and
JupyterLab together then head here:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=lab"&gt;&lt;img alt="Binder" src="https://mybinder.org/badge.svg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/ian-r-rose"&gt;Ian Rose&lt;/a&gt; for managing this.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="render-dask-examples-as-static-documentation"&gt;
&lt;h1&gt;2: Render Dask Examples as static documentation&lt;/h1&gt;
&lt;p&gt;Using the &lt;a class="reference external" href="https://nbsphinx.readthedocs.io/en/0.3.5/"&gt;nbsphinx&lt;/a&gt; Sphinx
extension to automatically run and render Jupyter Notebooks we’ve turned our
live examples repository into static documentation for easy viewing.&lt;/p&gt;
&lt;p&gt;These examples are currently available at
&lt;a class="reference external" href="https://dask.org/dask-examples/"&gt;https://dask.org/dask-examples/&lt;/a&gt; but will
soon be available at &lt;a class="reference external" href="https://dask.org/dask-examples/"&gt;examples.dask.org&lt;/a&gt; and
from the navbar at all dask pages.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom Augspurger&lt;/a&gt; for putting this
together.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="consolidate-documentation-under-a-single-org-and-style"&gt;
&lt;h1&gt;3: Consolidate documentation under a single org and style&lt;/h1&gt;
&lt;p&gt;Dask documentation is currently spread out in many small hosted sites, each
associated with a particular subpackage like dask-ml, dask-kubernetes,
dask-distributed, etc. This eases development (developers are encouraged to
modify documentation as they modify code) but results in a fragmented
experience because users don’t know how to discover and efficiently explore our
full documentation.&lt;/p&gt;
&lt;p&gt;To resolve this we’re doing two things:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Moving all sites under the dask.org domain&lt;/p&gt;
&lt;p&gt;Anaconda Inc, the company that employs several of the Dask developers
(myself included) recently donated the domain &lt;a class="reference external" href="http://dask.org"&gt;dask.org&lt;/a&gt;
to NumFOCUS. We’ve been slowly moving over all of our independent sites to
use that location for our documentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Develop a uniform Sphinx theme &lt;a class="reference external" href="http://github.com/dask/dask-sphinx-theme"&gt;dask-sphinx-theme&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This has both uniform styling and also includes a navbar that gets
automatically shared between the projects. The navbar makes it easy to
discover and explore content and is something that we can keep up-to-date
in a single repository.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You can see how this works by going to any of the Dask sites, like
&lt;a class="reference external" href="http://docs.dask.org/en/latest/docs.html"&gt;docs.dask.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom Augspurger&lt;/a&gt; for managing this
work and &lt;a class="reference external" href="http://andy.terrel.us/"&gt;Andy Terrel&lt;/a&gt; for patiently handling things on
the NumFOCUS side and domain name side.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="retire-the-hdfs3-library"&gt;
&lt;h1&gt;4: Retire the hdfs3 library&lt;/h1&gt;
&lt;p&gt;For years the Dask community has maintained the
&lt;a class="reference external" href="https://hdfs3.readthedocs.io/en/latest/"&gt;hdfs3&lt;/a&gt; library that allows for native
access to the Hadoop file system from Python. It used Pivotal’s libhdfs3
library, written in C++, and was for a long while the only mature and
performant way to manipulate HDFS from Python.&lt;/p&gt;
&lt;p&gt;Since then, though, PyArrow has developed efficient bindings to the standard
libhdfs library and exposed it through their Pythonic &lt;a class="reference external" href="https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs"&gt;file system
interface&lt;/a&gt;,
which is fortunately Dask-compatible.&lt;/p&gt;
&lt;p&gt;We’ve been telling people to use the Arrow solution for a while now and thought
we’d now do so officially
(see &lt;a class="reference external" href="https://github.com/dask/hdfs3/pull/170"&gt;dask/hdfs3 #170&lt;/a&gt;). As of the
last bugfix release Dask will use Arrow by default and, while the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hdfs3&lt;/span&gt;&lt;/code&gt;
library is still available, Dask maintainers probably won’t spend much time on
it in the future.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://hdfs3.readthedocs.io/en/latest/"&gt;Martin Durant&lt;/a&gt; for building
and maintaining HDFS3 over all this time.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="hyper-parameter-selection-for-incrementally-trained-models"&gt;
&lt;h1&gt;5: Hyper-parameter selection for incrementally trained models&lt;/h1&gt;
&lt;p&gt;In Dask-ML we continue to work on hyper-parameter selection for models that
implement the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; API. We’ve built algorithms and infrastructure to
handle this well, and are currently fine-tuning the API, parameter names, etc.&lt;/p&gt;
&lt;p&gt;If you have any interest in this process, come on over to &lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/356"&gt;dask/dask-ml #356&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom Augspurger&lt;/a&gt; and &lt;a class="reference external" href="https://stsievert.com/"&gt;Scott
Sievert&lt;/a&gt; for this work.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="two-small-bugfix-releases"&gt;
&lt;h1&gt;6: Two small bugfix releases&lt;/h1&gt;
&lt;p&gt;We’ve been trying to increase the frequency of bugfix releases while things are
stable. Since our last writing there have been two minor bugfix releases. You
can read more about them here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="binder-dask"&gt;
&lt;h1&gt;7: Binder + Dask&lt;/h1&gt;
&lt;p&gt;The Pangeo community has done work to integrate Binder with Dask and has
written about the process here: &lt;a class="reference external" href="https://medium.com/pangeo/pangeo-meets-binder-2ea923feb34f"&gt;Pangeo meets Binder&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="http://joehamman.com/"&gt;Joe Hamman&lt;/a&gt; for this work and the blogpost.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="skein-yarn-update"&gt;
&lt;h1&gt;8: Skein/Yarn Update&lt;/h1&gt;
&lt;p&gt;The Dask-Yarn connection to deploy Dask on Hadoop clusters uses a library
&lt;a class="reference external" href="https://jcrist.github.io/skein/"&gt;Skein&lt;/a&gt; to easily manage Yarn jobs from
Python.&lt;/p&gt;
&lt;p&gt;Skein has seen a lot of activity over the last few weeks, including the
following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A Web UI for the project. See &lt;a class="reference external" href="https://github.com/jcrist/skein/pull/68"&gt;jcrist/skein #68&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Tensorflow on Yarn project from Criteo that uses Skein. See
&lt;a class="reference external" href="https://github.com/criteo/tf-yarn"&gt;github.com/criteo/tf-yarn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This work is mostly managed by &lt;a class="reference external" href="http://jcrist.github.io/"&gt;Jim Crist&lt;/a&gt; and other
Skein contributors.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/09/17/dask-dev/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-09-17T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/09/05/dask-0.19.0/</id>
    <title>Dask Release 0.19.0</title>
    <updated>2018-09-05T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’m pleased to announce the release of Dask version 0.19.0. This is a major
release with bug fixes and new features. The last release was 0.18.2 on July
23rd. This blogpost outlines notable changes since the last release blogpost
for 0.18.0 on June 14th.&lt;/p&gt;
&lt;p&gt;You can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install dask[complete] --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="notable-changes"&gt;

&lt;p&gt;A ton of work has happened over the past two months, but most of the changes
are small and diffuse. Stability, feature parity with upstream libraries (like
Numpy and Pandas), and performance have all significantly improved, but in ways
that are difficult to condense into blogpost form.&lt;/p&gt;
&lt;p&gt;That being said, here are a few of the more exciting changes in the new
release.&lt;/p&gt;
&lt;section id="python-versions"&gt;
&lt;h2&gt;Python Versions&lt;/h2&gt;
&lt;p&gt;We’ve dropped official support for Python 3.4 and added official support for
Python 3.7.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="deploy-on-hadoop-clusters"&gt;
&lt;h2&gt;Deploy on Hadoop Clusters&lt;/h2&gt;
&lt;p&gt;Over the past few months &lt;a class="reference external" href="https://jcrist.github.io/"&gt;Jim Crist&lt;/a&gt; has built a
suite of tools to deploy applications on YARN, the primary cluster manager used
in Hadoop clusters.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://conda.github.io/conda-pack/"&gt;Conda-pack&lt;/a&gt;: packs up Conda
environments for redistribution to distributed clusters, especially when
Python or Conda may not be present.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jcrist.github.io/skein/"&gt;Skein&lt;/a&gt;: easily launches and manages YARN
applications from non-JVM systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-yarn.readthedocs.io/en/latest/"&gt;Dask-Yarn&lt;/a&gt;: a thin library
around Skein to launch and manage Dask clusters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Jim has written about Skein and Dask-Yarn in two recent blogposts:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jcrist.github.io/dask-on-yarn"&gt;jcrist.github.io/dask-on-yarn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jcrist.github.io/introducing-skein.html"&gt;jcrist.github.io/introducing-skein.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="implement-actors"&gt;
&lt;h2&gt;Implement Actors&lt;/h2&gt;
&lt;p&gt;Some advanced workloads want to directly manage and mutate state on workers. A
task-based framework like Dask can be forced into this kind of workload using
long-running tasks, but it’s an uncomfortable experience.&lt;/p&gt;
&lt;p&gt;To address this we’ve added an experimental Actors framework to Dask alongside
the standard task-scheduling system. This provides reduced latencies, removes
scheduling overhead, and provides the ability to directly mutate state on a
worker, but loses niceties like resilience and diagnostics.
The idea to adopt Actors was shamelessly stolen from the &lt;a class="reference external" href="http://ray.readthedocs.io/en/latest/"&gt;Ray Project&lt;/a&gt; :)&lt;/p&gt;
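&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# starts a local cluster when called without an address&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;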
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;

&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You can read more about actors in the &lt;a class="reference external" href="https://distributed.readthedocs.io/en/latest/actors.html"&gt;Actors documentation&lt;/a&gt;.&lt;/p&gt;
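&lt;p&gt;To make the trade-off concrete, here is a minimal sketch of the actor idea using only the Python standard library. It is not how Dask implements actors; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ToyActor&lt;/span&gt;&lt;/code&gt; and its &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;call&lt;/span&gt;&lt;/code&gt; method are hypothetical names invented for this illustration. One thread owns the object, method calls arrive as messages, and each call returns a future, much like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;counter.increment()&lt;/span&gt;&lt;/code&gt; above.&lt;/p&gt;

```python
import queue
import threading
from concurrent.futures import Future


class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


class ToyActor:
    """Hypothetical helper: own one object on one thread and run calls in order."""

    def __init__(self, cls):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, args=(cls,), daemon=True)
        self._thread.start()

    def _run(self, cls):
        state = cls()  # the state lives on this thread and is mutated in place
        while True:
            future, name, args = self._inbox.get()
            try:
                future.set_result(getattr(state, name)(*args))
            except Exception as exc:
                future.set_exception(exc)

    def call(self, name, *args):
        """Send a method call to the owning thread and return a future."""
        future = Future()
        self._inbox.put((future, name, args))
        return future


actor = ToyActor(Counter)
futures = [actor.call("increment") for _ in range(3)]
results = [f.result() for f in futures]
print(results)
```

&lt;p&gt;Because the state is never copied or rescheduled, each call is cheap. For the same reason, nothing in this sketch can rebuild the counter if its owner dies, which is the kind of resilience an actor gives up.&lt;/p&gt;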
&lt;/section&gt;
&lt;section id="dashboard-improvements"&gt;
&lt;h2&gt;Dashboard improvements&lt;/h2&gt;
&lt;p&gt;The Dask dashboard is a critical tool for understanding distributed performance.
This release addresses a few accessibility issues that trip up beginning
users.&lt;/p&gt;
&lt;section id="save-task-stream-plots"&gt;
&lt;h3&gt;Save task stream plots&lt;/h3&gt;
&lt;p&gt;You can now save a task stream record by wrapping a computation in the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;get_task_stream&lt;/span&gt;&lt;/code&gt; context manager.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_task_stream&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeseries&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;get_task_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;save&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;my-task-stream.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="go"&gt;[{&amp;#39;key&amp;#39;: &amp;quot;(&amp;#39;make-timeseries-edc372a35b317f328bf2bb5e636ae038&amp;#39;, 0)&amp;quot;,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;nbytes&amp;#39;: 8175440,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;startstops&amp;#39;: [(&amp;#39;compute&amp;#39;, 1535661384.2876947, 1535661384.3366017)],&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;status&amp;#39;: &amp;#39;OK&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;thread&amp;#39;: 139754603898624,&lt;/span&gt;
&lt;span class="go"&gt;  &amp;#39;worker&amp;#39;: &amp;#39;inproc://192.168.50.100/15417/2&amp;#39;},&lt;/span&gt;

&lt;span class="go"&gt;  ...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This gives you the start and stop time of every task on every worker done
during that time. It also saves that data as an HTML file that you can share
with others. This is very valuable for communicating performance issues within
a team. I typically upload the HTML file as a gist and then share it with
rawgit.com.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ gist my-task-stream.html
https://gist.github.com/f48a121bf03c869ec586a036296ece1a
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;iframe src="https://rawgit.com/mrocklin/f48a121bf03c869ec586a036296ece1a/raw/d2c1a83d5dc62996eeabca495d5284e324d71d0c/my-task-stream.html" width="800" height="400"&gt;&lt;/iframe&gt;
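&lt;p&gt;The records in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ts.data&lt;/span&gt;&lt;/code&gt; shown earlier in this section are plain Python dictionaries, so they can also be summarized without the plot. The sketch below assumes the record layout printed above, where each &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;startstops&lt;/span&gt;&lt;/code&gt; entry is an (action, start, stop) tuple; other versions of distributed may lay this out differently. The helper &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;seconds_by_action&lt;/span&gt;&lt;/code&gt; is a hypothetical name, not part of Dask.&lt;/p&gt;

```python
# One record copied from the ts.data output shown in this post.
records = [
    {
        "key": "('make-timeseries-edc372a35b317f328bf2bb5e636ae038', 0)",
        "nbytes": 8175440,
        "startstops": [("compute", 1535661384.2876947, 1535661384.3366017)],
        "status": "OK",
        "thread": 139754603898624,
        "worker": "inproc://192.168.50.100/15417/2",
    },
]


def seconds_by_action(records):
    """Sum elapsed seconds per action (for example 'compute') over all records."""
    totals = {}
    for record in records:
        for action, start, stop in record["startstops"]:
            totals[action] = totals.get(action, 0.0) + (stop - start)
    return totals


totals = seconds_by_action(records)
for action, seconds in totals.items():
    print(f"{action}: {seconds:.3f} s")
```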
&lt;/section&gt;
&lt;section id="robust-to-different-screen-sizes"&gt;
&lt;h3&gt;Robust to different screen sizes&lt;/h3&gt;
&lt;p&gt;The Dashboard’s layout was designed to be used on a single screen, side-by-side
with a Jupyter notebook. This is how many Dask developers operate when working
on a laptop. However, many users do not work this way, for one of two reasons:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;They are working in an office setting where they have several screens&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They are new to Dask and uncomfortable splitting their screen into two
halves&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In these cases the styling of the dashboard becomes odd. Fortunately, &lt;a class="reference external" href="https://github.com/canavandl"&gt;Luke
Canavan&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dsludwig"&gt;Derek
Ludwig&lt;/a&gt; recently improved the CSS for the
dashboard considerably, allowing it to switch between narrow and wide screens.
Here is a snapshot.&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/dashboard-widescreen.png"&gt;&lt;img src="/images/dashboard-widescreen.png" width="70%"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="jupyter-lab-extension"&gt;
&lt;h3&gt;Jupyter Lab Extension&lt;/h3&gt;
&lt;p&gt;You can now embed Dashboard panes directly within Jupyter Lab using the newly
updated &lt;a class="reference external" href="https://github.com/dask/dask-labextension/"&gt;dask-labextension&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="n"&gt;labextension&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;labextension&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This allows you to lay out your own dashboard directly within JupyterLab. You
can combine plots from different pages, control their sizing, and so on. You
will need to provide the address of the dashboard server
(&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;http://localhost:8787&lt;/span&gt;&lt;/code&gt; by default on local machines) but after that
everything should persist between sessions. Now when I open up JupyterLab and
start up a Dask Client, I get this:&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/dashboard-jupyterlab.png"&gt;&lt;img src="/images/dashboard-jupyterlab.png" width="70%"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/ian-r-rose"&gt;Ian Rose&lt;/a&gt; for doing most of the work
here.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="outreach"&gt;
&lt;h1&gt;Outreach&lt;/h1&gt;
&lt;section id="dask-stories"&gt;
&lt;h2&gt;Dask Stories&lt;/h2&gt;
&lt;p&gt;People who use Dask have been writing about their experiences at &lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/"&gt;Dask
Stories&lt;/a&gt;. In the last couple of
months the following people have written about and contributed their experience:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/sidewalk-labs.html"&gt;Civic Modelling at Sidewalk Labs&lt;/a&gt; by &lt;a class="reference external" href="https://github.com/bnaul"&gt;Brett Naul&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/mosquito-sequencing.html"&gt;Genome Sequencing for Mosquitoes&lt;/a&gt; by &lt;a class="reference external" href="http://alimanfoo.github.io/about/"&gt;Alistair Miles&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/fullspectrum.html"&gt;Lending and Banking at Full Spectrum&lt;/a&gt; by &lt;a class="reference external" href="https://www.linkedin.com/in/hussainsultan/"&gt;Hussain Sultan&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/icecube-cosmic-rays.html"&gt;Detecting Cosmic Rays at IceCube&lt;/a&gt; by &lt;a class="reference external" href="https://github.com/jrbourbeau"&gt;James Bourbeau&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/pangeo.html"&gt;Large Data Earth Science at Pangeo&lt;/a&gt; by &lt;a class="reference external" href="http://rabernat.github.io/"&gt;Ryan Abernathey&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/hydrologic-modeling.html"&gt;Hydrological Modelling at the National Center for Atmospheric Research&lt;/a&gt; by &lt;a class="reference external" href="http://joehamman.com/about/"&gt;Joe Hamman&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/network-modeling.html"&gt;Mobile Networks Modeling&lt;/a&gt; by &lt;a class="reference external" href="https://www.linkedin.com/in/lalwanisameer/"&gt;Sameer Lalwani&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-stories.readthedocs.io/en/latest/satellite-imagery.html"&gt;Satellite Imagery Processing at the Space Science and Engineering Center&lt;/a&gt; by &lt;a class="reference external" href="http://github.com/djhoese"&gt;David Hoese&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These stories help people understand where Dask is and is not applicable, and
provide useful context around how it gets used in practice. We welcome further
contributions to this project. It’s very valuable to the broader community.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-examples"&gt;
&lt;h2&gt;Dask Examples&lt;/h2&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://github.com/dask/dask-examples"&gt;Dask-Examples repository&lt;/a&gt; maintains
easy-to-run examples using Dask on a small machine, suitable for an entry-level
laptop or for a small cloud instance. These are hosted on
&lt;a class="reference external" href="https://mybinder.org"&gt;mybinder.org&lt;/a&gt; and are integrated into our documentation.
A number of new examples have arisen recently, particularly in machine
learning. We encourage people to try them out by clicking the link below.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main"&gt;&lt;img alt="Binder" src="https://mybinder.org/badge.svg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="other-projects"&gt;
&lt;h1&gt;Other Projects&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://dask-image.readthedocs.io/en/latest/"&gt;dask-image&lt;/a&gt; project was
recently released. It includes a number of image processing routines around
dask arrays.&lt;/p&gt;
&lt;p&gt;This project is mostly maintained by &lt;a class="reference external" href="https://github.com/jakirkham"&gt;John Kirkham&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-ml.readthedocs.io/en/latest/"&gt;Dask-ML&lt;/a&gt; saw a recent bugfix release&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="http://epistasislab.github.io/tpot/"&gt;TPOT&lt;/a&gt; library for automated
machine learning recently published a new release that adds Dask support to
parallelize their model training. More information is available on the
&lt;a class="reference external" href="http://epistasislab.github.io/tpot/using/#parallel-training-with-dask"&gt;TPOT documentation&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;Since June 14th, the following people have contributed to the following repositories:&lt;/p&gt;
&lt;p&gt;The core Dask repository for parallel algorithms:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Andre Thrill&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aurélien Ponte&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Christoph Moehl&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloves Almeida&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Rothenberg&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Danilo Horta&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Davis Bennett&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elliott Sales de Andrade&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eric Bonfadini&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GPistre&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;George Sakkis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guido Imperiale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hans Moritz Günther&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Henrique Ribeiro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hugo&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Irina Truong&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Itamar Turner-Trauring&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jacob Tomlinson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jan Margeta&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Javad&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jeremy Chen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Hamman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Mrziglod&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Julia Signell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marco Rossi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mark Harfouche&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matt Lee&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mike Neish&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Robert Sare&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stephan Hoyer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tobias de Jong&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;WZY&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yu Feng&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yuval Langer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;minebogy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nmiles2718&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;rtobar&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dask/distributed repository for distributed computing:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aurélien Ponte&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bartosz Marcinkowski&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dave Hirschfeld&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Derek Ludwig&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dror Birkman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guillaume EB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jacob Tomlinson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Hamman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Luke Canavan&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marius van Niekerk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matt Nicolls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mike DePalatis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Olivier Grisel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Phil Tooley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ray Bell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yu Feng&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dask/dask-examples repository for easy-to-run examples:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Albert DeFusco&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dan Vatterott&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guillaume EB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mholtzscher&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/09/05/dask-0.19.0/"/>
    <summary>This work is supported by Anaconda Inc.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-09-05T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/08/28/dataframe-performance-high-level/</id>
    <title>High level performance of Pandas, Dask, Spark, and Arrow</title>
    <updated>2018-08-28T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;section id="question"&gt;

&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;How does Dask dataframe performance compare to Pandas? Also, what about
Spark dataframes and what about Arrow? How do they compare?&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;I get this question every few weeks. This post is to avoid repetition.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="caveats"&gt;
&lt;h1&gt;Caveats&lt;/h1&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;This answer is likely to change over time. I’m writing this in August 2018.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This question and answer are very high level.
More technical answers are possible, but not contained here.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="answers"&gt;
&lt;h1&gt;Answers&lt;/h1&gt;
&lt;section id="pandas"&gt;
&lt;h2&gt;Pandas&lt;/h2&gt;
&lt;p&gt;If you’re coming from Python and have smallish datasets then Pandas is the
right choice. It’s usable, widely understood, efficient, and well maintained.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="benefits-of-parallelism"&gt;
&lt;h2&gt;Benefits of Parallelism&lt;/h2&gt;
&lt;p&gt;The performance benefit (or drawback) of using a parallel dataframe like Dask
dataframes or Spark dataframes over Pandas will differ based on the kinds of
computations you do:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;If you’re doing small computations then Pandas is always the right choice.
The administrative costs of parallelizing will outweigh any benefit.
You should not parallelize if your computations are taking less than, say,
100ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For simple operations like filtering, cleaning, and aggregating large data
you should expect linear speedup by using a parallel dataframe.&lt;/p&gt;
&lt;p&gt;If you’re on a 20-core computer you might expect a 20x speedup. If you’re
on a 1000-core cluster you might expect a 1000x speedup, assuming that you
have a problem big enough to spread across 1000 cores. As you scale up
administrative overhead will increase, so you should expect the speedup to
decrease a bit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For complex operations like distributed joins it’s more complicated. You
might get linear speedups like above, or you might even get slowdowns.
Someone experienced in database-like computations and parallel computing
can probably predict pretty well which computations will do well.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However, configuration may be required. Often people find that parallel
solutions don’t meet expectations when they first try them out. Unfortunately
most distributed systems require some configuration to perform optimally.&lt;/p&gt;
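&lt;p&gt;The first two cases above can be sketched with a toy cost model: the work divides evenly across cores, and every parallel run pays a fixed coordination cost. Both the model and the 0.2 second overhead are assumptions made for illustration, not measurements of Dask or Spark.&lt;/p&gt;

```python
def parallel_seconds(serial_seconds, cores, overhead_seconds):
    """Toy model: work divides evenly across cores, plus a fixed coordination cost."""
    return serial_seconds / cores + overhead_seconds


def speedup(serial_seconds, cores, overhead_seconds=0.2):
    """Ratio of serial time to modeled parallel time.

    The 0.2 s default overhead is a made-up figure for illustration only.
    """
    return serial_seconds / parallel_seconds(serial_seconds, cores, overhead_seconds)


small = speedup(0.1, cores=20)    # a 100 ms computation
large = speedup(600.0, cores=20)  # a 10 minute computation
print(f"100 ms job on 20 cores: {small:.2f}x")
print(f"10 min job on 20 cores: {large:.2f}x")
```

&lt;p&gt;Under these assumed numbers the 100 ms job runs at roughly half its serial speed, while the 10 minute job gets just under a 20x speedup.&lt;/p&gt;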
&lt;/section&gt;
&lt;section id="there-are-other-options-to-speed-up-pandas"&gt;
&lt;h2&gt;There are other options to speed up Pandas&lt;/h2&gt;
&lt;p&gt;Many people looking to speed up Pandas don’t need parallelism. There are often
several other tricks like encoding text data, using efficient file formats,
avoiding groupby.apply, and so on that are more effective at speeding up Pandas
than switching to parallelism.&lt;/p&gt;
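&lt;p&gt;As a small sketch of two of these tricks, the example below encodes a repetitive text column as a categorical and uses a built-in aggregation where one might otherwise reach for groupby.apply. The column names and data are hypothetical, and the actual savings depend on your data and Pandas version.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data: a text column with only a few distinct values.
df = pd.DataFrame({
    "city": ["Paris", "Lima", "Oslo", "Lima"] * 25000,
    "sales": range(100000),
})

# Trick 1: encode repetitive text as a categorical column.
text_bytes = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
category_bytes = df["city"].memory_usage(deep=True)
print(f"text: {text_bytes} bytes, categorical: {category_bytes} bytes")

# Trick 2: use a built-in aggregation rather than groupby.apply
# with a Python function.
totals = df.groupby("city", observed=True)["sales"].sum()
print(totals)
```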
&lt;/section&gt;
&lt;section id="comparing-apache-spark-and-dask"&gt;
&lt;h2&gt;Comparing Apache Spark and Dask&lt;/h2&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Assuming that yes, I do want parallelism, should I choose Apache Spark, or Dask dataframes?&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;This is often decided more by cultural preferences (JVM vs Python,
all-in-one-tool vs integration with other tools) than performance differences,
but I’ll try to outline a few things here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Spark dataframes will be much better when you have large SQL-style queries
(think 100+ line queries) where their query optimizer can kick in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask dataframes will be much better when queries go beyond typical database
queries. This happens most often in time series, random access, and other
complex computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spark will integrate better with JVM and data engineering technology.
Spark will also come with everything pre-packaged. Spark is its own
ecosystem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask will integrate better with Python code. Dask is designed to integrate
with other libraries and pre-existing systems. If you’re coming from an
existing Pandas-based workflow then it’s usually much easier to evolve to
Dask.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Generally speaking, for most operations you’ll be fine using either one. People
often choose between Pandas/Dask and Spark based on cultural preference.
Either they have people that really like the Python ecosystem, or they have
people that really like the Spark ecosystem.&lt;/p&gt;
&lt;p&gt;Dataframes are also only a small part of each project. Spark and Dask both do
many other things that aren’t dataframes. For example Spark has a graph
analysis library, Dask doesn’t. Dask supports multi-dimensional arrays, Spark
doesn’t. Spark is generally higher level and all-in-one while Dask is
lower-level and focuses on integrating into other tools.&lt;/p&gt;
&lt;p&gt;For more information, see &lt;a class="reference external" href="http://dask.pydata.org/en/latest/spark.html"&gt;Dask’s “Comparison to Spark” documentation&lt;/a&gt;
or &lt;a class="reference external" href="https://www.youtube.com/watch?v=jR0Y7NqKJs8&amp;amp;list=PLJ0vO2F_f6OAE1xiEUE7DwMFWbdLCbN3P&amp;amp;index=11&amp;amp;t=413s"&gt;this interview with Steppingblocks&lt;/a&gt;, a data analytics company, on why they switched from Spark to Dask.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="apache-arrow"&gt;
&lt;h2&gt;Apache Arrow&lt;/h2&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;What about Arrow? Is Arrow faster than Pandas?&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;This question doesn’t quite make sense… &lt;em&gt;yet&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Arrow is not a replacement for Pandas. Today Arrow is useful to people
building &lt;em&gt;systems&lt;/em&gt;, not directly to analysts the way Pandas is. Arrow is used to
move data between different computational systems and file formats. Arrow does
not do computation today, but is commonly used as a component in other
libraries that do do computation. For example, if you use Pandas or Spark or
Dask today you may be using Arrow without knowing it. Today Arrow is more
useful for other libraries than it is to end-users.&lt;/p&gt;
&lt;p&gt;However, this is likely to change in the future. Arrow developers plan
to write computational code around Arrow that we would expect to be faster than
the code in either Pandas or Spark. This is probably a year or two away
though. There will probably be some effort to make this semi-compatible with
Pandas, but it’s much too early to tell.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/08/28/dataframe-performance-high-level/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-08-28T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/08/07/incremental-saga/</id>
    <title>Building SAGA optimization for Dask arrays</title>
    <updated>2018-08-07T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="https://www.ethz.ch/en.html"&gt;ETH Zurich&lt;/a&gt;, &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda
Inc&lt;/a&gt;, and the &lt;a class="reference external" href="https://bids.berkeley.edu/"&gt;Berkeley Institute for Data
Science&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;At a recent Scikit-learn/Scikit-image/Dask sprint at BIDS, &lt;a class="reference external" href="http://fa.bianp.net"&gt;Fabian Pedregosa&lt;/a&gt; (a
machine learning researcher and Scikit-learn developer) and Matthew
Rocklin (Dask core developer) sat down together to develop an implementation of the incremental optimization algorithm
&lt;a class="reference external" href="https://arxiv.org/pdf/1407.0202.pdf"&gt;SAGA&lt;/a&gt; on parallel Dask datasets. The result is a sequential algorithm that can be run on any Dask array, and so allows the data to be stored on disk or even distributed among different machines.&lt;/p&gt;
&lt;p&gt;It was interesting both to see how the algorithm performed and to see
what was easy and what was challenging about running a research algorithm on a Dask distributed dataset.&lt;/p&gt;
&lt;section id="start"&gt;

&lt;p&gt;We started with an initial implementation that Fabian had written for NumPy
arrays using Numba. The following code solves an optimization problem of the form&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[
\min_x \sum_{i=1}^n f(a_i^t x, b_i)
\]&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;njit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model.sag&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_auto_step_size&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.utils.extmath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;row_norms&lt;/span&gt;

&lt;span class="nd"&gt;@njit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;deriv_logistic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# derivative of logistic loss&lt;/span&gt;
    &lt;span class="c1"&gt;# same as in lightning (with minus sign)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;exp_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp_t&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;exp_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="nd"&gt;@njit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;SAGA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;  SAGA algorithm&lt;/span&gt;

&lt;span class="sd"&gt;  A : n_samples x n_features numpy array&lt;/span&gt;
&lt;span class="sd"&gt;  b : n_samples numpy array with values -1 or 1&lt;/span&gt;
&lt;span class="sd"&gt;  &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="n"&gt;memory_gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# vector of coefficients&lt;/span&gt;
    &lt;span class="n"&gt;step_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;get_auto_step_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_norms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;log&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# sample randomly&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# .. inner iteration ..&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;grad_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;deriv_logistic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

            &lt;span class="c1"&gt;# .. update coefficients ..&lt;/span&gt;
            &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# .. update memory terms ..&lt;/span&gt;
            &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;
            &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grad_i&lt;/span&gt;

        &lt;span class="c1"&gt;# monitor convergence&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gradient norm:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This implementation is a simplified version of the &lt;a class="reference external" href="https://github.com/openopt/copt/blob/master/copt/randomized.py"&gt;SAGA
implementation&lt;/a&gt;
that Fabian uses regularly as part of his research. It assumes that &lt;span class="math notranslate nohighlight"&gt;\(f\)&lt;/span&gt; is the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss"&gt;logistic loss&lt;/a&gt;, i.e., &lt;span class="math notranslate nohighlight"&gt;\(f(z) = \log(1 + \exp(-z))\)&lt;/span&gt;. It can be used to solve problems with other choices of &lt;span class="math notranslate nohighlight"&gt;\(f\)&lt;/span&gt; by replacing the function &lt;code&gt;deriv_logistic&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We wanted to run it across a parallel Dask array by applying it to each chunk of the Dask array (a smaller Numpy array) one at a time, carrying a set of parameters along the way.&lt;/p&gt;
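&lt;p&gt;To make the remark about other choices of &lt;span class="math notranslate nohighlight"&gt;\(f\)&lt;/span&gt; concrete, here is a minimal sketch of what a replacement derivative might look like for the squared loss. The names &lt;code&gt;deriv_squared&lt;/code&gt; and &lt;code&gt;squared_loss&lt;/code&gt; are hypothetical and not part of the original code. The sketch is written in plain Python and checked by finite differences; to plug it into the Numba code above it would presumably also need the &lt;code&gt;njit&lt;/code&gt; decorator.&lt;/p&gt;

```python
import numpy as np


def squared_loss(p, y):
    # Hypothetical alternative loss: f(p, y) = 0.5 * (p - y) ** 2,
    # where p plays the role of the prediction a_i^T x.
    return 0.5 * (p - y) ** 2


def deriv_squared(p, y):
    # Derivative of squared_loss with respect to p.
    # It takes the same (p, y) arguments as deriv_logistic.
    return p - y


# Finite-difference check that the derivative matches the loss.
p, y, eps = 0.7, -1.0, 1e-6
numeric = (squared_loss(p + eps, y) - squared_loss(p - eps, y)) / (2 * eps)
analytic = deriv_squared(p, y)
print(numeric, analytic)
```

&lt;p&gt;A quick numerical check like this is a cheap way to catch sign errors before running the full optimization loop.&lt;/p&gt;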
&lt;/section&gt;
&lt;section id="development-process"&gt;
&lt;h1&gt;Development Process&lt;/h1&gt;
&lt;p&gt;In order to better understand the challenges of writing Dask algorithms, Fabian
did most of the actual coding to start. Fabian is a good example of a researcher who
knows how to program well and how to design ML algorithms, but has no direct
exposure to the Dask library. This was an educational opportunity both for
Fabian and for Matt. Fabian learned how to use Dask, and Matt learned how to
introduce Dask to researchers like Fabian.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="step-1-build-a-sequential-algorithm-with-pure-functions"&gt;
&lt;h1&gt;Step 1: Build a sequential algorithm with pure functions&lt;/h1&gt;
&lt;p&gt;To start, we didn’t use Dask at all. Instead, Fabian modified his implementation in a few ways:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;It should operate over a list of Numpy arrays. A list of Numpy arrays is similar to a Dask array, but simpler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It should separate blocks of logic into separate functions. These will
eventually become tasks, so they should be sizable chunks of work. In this
case, this led to the creation of the function &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;_chunk_saga&lt;/span&gt;&lt;/code&gt; that
performs an iteration of the SAGA algorithm on a subset of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;These functions should not modify their inputs, nor should they depend on
global state. All information that those functions require (like
the parameters that we’re learning in our algorithm) should be
explicitly provided as inputs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These requested modifications affect performance a bit, because we end up making more
copies of the parameters and more copies of intermediate state. In terms of
programming difficulty, this took a bit of time (around a couple of hours) but was a
straightforward task that Fabian didn’t seem to find challenging or foreign.&lt;/p&gt;
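&lt;p&gt;As a small illustration of the third requirement, the sketch below contrasts an update that mutates its input with one that copies first. The function names &lt;code&gt;step_inplace&lt;/code&gt; and &lt;code&gt;step_pure&lt;/code&gt; are made up for this example and do not appear in Fabian’s code.&lt;/p&gt;

```python
import numpy as np


def step_inplace(x, grad, step_size):
    # Mutates the array owned by the caller and returns that same array.
    x -= step_size * grad
    return x


def step_pure(x, grad, step_size):
    # Copies first, so the input array is left untouched.
    x = x.copy()
    x -= step_size * grad
    return x


x0 = np.ones(3)
grad = np.array([1.0, 2.0, 3.0])

x1 = step_pure(x0, grad, 0.1)
unchanged = np.array_equal(x0, np.ones(3))  # x0 still holds its old values

x2 = step_inplace(x0, grad, 0.1)
aliased = x2 is x0  # the result and the input are the same object
```

&lt;p&gt;The extra copy is the performance cost mentioned above. In exchange, each call depends only on its arguments, which is what lets it be treated as an independent task later.&lt;/p&gt;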
&lt;p&gt;These changes resulted in the following code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;njit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.utils.extmath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;row_norms&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model.sag&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_auto_step_size&lt;/span&gt;


&lt;span class="nd"&gt;@njit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_chunk_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f_deriv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Make explicit copies of inputs&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;memory_gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Sample randomly&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# .. inner iteration ..&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;grad_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f_deriv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# .. update coefficients ..&lt;/span&gt;
        &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# .. update memory terms ..&lt;/span&gt;
        &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;
        &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grad_i&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;full_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;  data: list of (A, b), where A is a n_samples x n_features&lt;/span&gt;
&lt;span class="sd"&gt;  numpy array and b is a n_samples numpy array&lt;/span&gt;
&lt;span class="sd"&gt;  &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;memory_gradients&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_auto_step_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_norms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;log&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;step_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_chunk_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deriv_logistic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
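&lt;p&gt;The &lt;code&gt;full_saga&lt;/code&gt; function expects &lt;code&gt;data&lt;/code&gt; to be a list of &lt;code&gt;(A, b)&lt;/code&gt; pairs. As a rough sketch of how such a list could be built from one in-memory array, the example below splits synthetic data into chunks with &lt;code&gt;np.array_split&lt;/code&gt;. The sizes, the random labels, and the number of chunks are arbitrary assumptions chosen for illustration.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_chunks = 1000, 5, 4

# Synthetic features and labels in {-1, 1}, matching the docstring of SAGA.
A = rng.standard_normal((n_samples, n_features))
b = rng.choice([-1.0, 1.0], size=n_samples)

# A list of (A_chunk, b_chunk) pairs, similar to the blocks of a Dask array.
data = list(zip(np.array_split(A, n_chunks), np.array_split(b, n_chunks)))

# Total number of rows across chunks, as full_saga computes it.
total = sum(A_chunk.shape[0] for A_chunk, _ in data)
print(len(data), total)
```

&lt;p&gt;Each pair plays the role of one chunk of a Dask array, which is why a list of Numpy arrays is a convenient stand-in while developing.&lt;/p&gt;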
&lt;/section&gt;
&lt;section id="step-2-apply-dask-delayed"&gt;
&lt;h1&gt;Step 2: Apply dask.delayed&lt;/h1&gt;
&lt;p&gt;Once the functions neither modified their inputs nor relied on global state, we went
over a &lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fdelayed.ipynb"&gt;dask.delayed example&lt;/a&gt;,
and then applied the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;#64;dask.delayed&lt;/span&gt;&lt;/code&gt; decorator to the functions that
Fabian had written. Fabian did this at first in about five minutes and, to our
mutual surprise, things actually worked:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                               &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- New&lt;/span&gt;
&lt;span class="nd"&gt;@njit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_chunk_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f_deriv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;full_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                      &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- New&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_chunk_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deriv_logistic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- Changed&lt;/span&gt;

        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- New&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However, they didn’t work &lt;em&gt;that well&lt;/em&gt;. When we took a look at the Dask
dashboard we found a lot of dead space, a sign that we were still
doing a lot of computation on the client side.&lt;/p&gt;
&lt;a href="/images/saga-1.png"&gt;
  &lt;img src="/images/saga-1.png" width="90%"&gt;
&lt;/a&gt;
&lt;/section&gt;
&lt;section id="step-3-diagnose-and-add-more-dask-delayed-calls"&gt;
&lt;h1&gt;Step 3: Diagnose and add more dask.delayed calls&lt;/h1&gt;
&lt;p&gt;While things worked, they were also fairly slow. If you look at the
dashboard plot above you’ll see that there is plenty of white in between
colored rectangles. This shows that there are long periods where none of the
workers is doing any work.&lt;/p&gt;
&lt;p&gt;This is a common sign that we’re mixing work between the workers (which shows
up on the dashboard) and the client. The solution to this is usually more
targeted use of dask.delayed. Dask delayed is trivial to start using, but
does require some experience to use well. It’s important to keep track of
which operations and variables are delayed and which aren’t. There is some
cost to mixing between them.&lt;/p&gt;
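&lt;p&gt;As a minimal sketch of that bookkeeping, the snippet below keeps every piece
of state delayed from the start, so the loop only builds a task graph and no
values bounce back to the client between steps. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;update&lt;/span&gt;&lt;/code&gt; function and
its arithmetic are hypothetical stand-ins, not the SAGA step itself.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import dask


@dask.delayed(nout=2)              # nout lets us unpack the delayed result
def update(x, total):
    # Hypothetical stand-in for one block update (an assumption for illustration).
    x = x + 1
    return x, total + x


x = dask.delayed(0)                # keep the state delayed from the start
total = dask.delayed(0)
for _ in range(3):
    x, total = update(x, total)    # only extends the task graph; nothing is computed yet

x, total = dask.persist(x, total)  # compute, keeping the results as delayed objects
result = dask.compute(x, total)    # bring concrete values back to the client once
print(result)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;x&lt;/span&gt;&lt;/code&gt; started out as a plain Python or NumPy value that was computed
on the client inside the loop, each iteration would force a round trip, which
is the kind of mixing that shows up as white space on the dashboard.&lt;/p&gt;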
&lt;p&gt;At this point Matt stepped in and added delayed in a few more places and the
dashboard plot started looking cleaner.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                               &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- New&lt;/span&gt;
&lt;span class="nd"&gt;@njit&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_chunk_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f_deriv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;full_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                      &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- New&lt;/span&gt;
    &lt;span class="n"&gt;memory_gradients&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;         &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- Changed&lt;/span&gt;
    &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;#  Changed&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- Changed&lt;/span&gt;

    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_auto_step_size&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;
                &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_norms&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;log&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                    &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- Changed&lt;/span&gt;
    &lt;span class="n"&gt;step_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- Changed&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_chunk_saga&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deriv_logistic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- Changed&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; \
            &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_gradients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# New&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;                         &lt;span class="c1"&gt;# &amp;lt;&amp;lt;&amp;lt;---- changed&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/saga-2.png"&gt;
  &lt;img src="/images/saga-2.png" width="90%"&gt;
&lt;/a&gt;
&lt;p&gt;From a dask perspective this now looks good. We see that one &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;
call is active at any given time with no large horizontal gaps between
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; calls. We’re not getting any parallelism (this is just a
sequential algorithm) but we don’t have much dead space. The model seems to
jump between the various workers, processing a chunk of data before moving
on to new data.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="step-4-profile"&gt;
&lt;h1&gt;Step 4: Profile&lt;/h1&gt;
&lt;p&gt;The dashboard image above gives confidence that our algorithm is operating as
it should. The block-sequential nature of the algorithm comes out cleanly, and
the gaps between tasks are very short.&lt;/p&gt;
&lt;p&gt;However, when we look at the profile plot of the computation across all of our
cores (Dask constantly runs a profiler on all threads on all workers to get
this information) we see that most of our time is spent compiling Numba code.&lt;/p&gt;
&lt;a href="/images/saga-profile.png"&gt;
  &lt;img src="/images/saga-profile.png" width="100%"&gt;
&lt;/a&gt;
&lt;p&gt;We started a conversation for this on the &lt;a class="reference external" href="https://github.com/numba/numba/issues/3026"&gt;numba issue
tracker&lt;/a&gt; which has since been
resolved. That same computation over the same time now looks like this:&lt;/p&gt;
&lt;a href="/images/saga-3.png"&gt;
  &lt;img src="/images/saga-3.png" width="90%"&gt;
&lt;/a&gt;
&lt;p&gt;The tasks, which used to take seconds, now take tens of milliseconds, so we can
process through many more chunks in the same amount of time.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future Work&lt;/h1&gt;
&lt;p&gt;Building this algorithm was a useful experience. Most of the
work above took place in an afternoon. We came away from this activity
with a few tasks of our own:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Build a normal Scikit-Learn style estimator class for this algorithm
so that people can use it without thinking too much about delayed objects,
and can instead just use dask arrays or dataframes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integrate some of Fabian’s research on this algorithm that improves performance with
&lt;a class="reference external" href="https://arxiv.org/pdf/1707.06468.pdf"&gt;sparse data and in multi-threaded environments&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Think about how to improve the learning experience so that dask.delayed can
teach new users how to use it correctly&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="links"&gt;
&lt;h1&gt;Links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/5282dcf47505e2a1d214fd15c7da0ec3"&gt;Notebooks for different stages of SAGA+Dask implementation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/scisprints/2018_05_sklearn_skimage_dask"&gt;Scikit-Learn/Image + Dask Sprint issue tracker&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/scisprints/2018_05_sklearn_skimage_dask"&gt;Paper on SAGA algorithm&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/openopt/copt/blob/master/copt/randomized.py"&gt;Fabian’s more fully featured non-Dask SAGA implementation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/numba/numba/issues/3026"&gt;Numba issue on repeated deserialization&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/08/07/incremental-saga/"/>
    <summary>This work is supported by ETH Zurich, Anaconda
Inc, and the Berkeley Institute for Data
Science</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-08-07T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/08/02/dask-dev/</id>
    <title>Dask Development Log</title>
    <updated>2018-08-02T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.&lt;/p&gt;
&lt;p&gt;Over the last two weeks we’ve seen activity in the following areas:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;An experimental Actor solution for stateful processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning experiments with hyper-parameter selection and parameter
servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Development of more preprocessing transformers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Statistical profiling of the distributed scheduler’s internal event loop
thread and internal optimizations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new release of dask-yarn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new narrative on dask-stories about modelling mobile networks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for LSF clusters in dask-jobqueue&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test suite cleanup for intermittent failures&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="stateful-processing-with-actors"&gt;

&lt;p&gt;Some advanced workloads want to directly manage and mutate state on workers. A
task-based framework like Dask can be forced into this kind of workload using
long-running tasks, but it’s an uncomfortable experience. To address this
we’ve been adding an experimental Actors framework to Dask alongside the
standard task-scheduling system. This provides reduced latencies, removes
scheduling overhead, and provides the ability to directly mutate state on a
worker, but loses niceties like resilience and diagnostics.&lt;/p&gt;
&lt;p&gt;The idea to adopt Actors was shamelessly stolen from the &lt;a class="reference external" href="http://ray.readthedocs.io/en/latest/"&gt;Ray Project&lt;/a&gt; :)&lt;/p&gt;
&lt;p&gt;Work for Actors is happening in &lt;a class="reference external" href="https://github.com/dask/distributed/pull/2133"&gt;dask/distributed #2133&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;

&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="machine-learning-experiments"&gt;
&lt;h1&gt;Machine learning experiments&lt;/h1&gt;
&lt;section id="hyper-parameter-optimization-on-incrementally-trained-models"&gt;
&lt;h2&gt;Hyper parameter optimization on incrementally trained models&lt;/h2&gt;
&lt;p&gt;Many Scikit-Learn-style estimators feature a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; method that enables
incremental training on batches of data. This is particularly well suited for
systems like Dask array or Dask dataframe, that are built from many batches of
Numpy arrays or Pandas dataframes. It’s a nice fit because all of the
computational algorithm work is already done in Scikit-Learn; Dask just has to
administratively move models around to data and call scikit-learn (or other
machine learning models that follow the fit/transform/predict/score API). This
approach provides a nice community interface between parallelism and machine
learning developers.&lt;/p&gt;
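&lt;p&gt;To make the pattern concrete, here is a toy sketch in plain Python. The
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RunningMean&lt;/span&gt;&lt;/code&gt; class and the list-of-lists batches are hypothetical
stand-ins for a real estimator and for the chunks of a Dask array; only the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; method name follows the Scikit-Learn convention.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;class RunningMean:
    """Toy estimator with a Scikit-Learn-style partial_fit method."""

    def __init__(self):
        self.n = 0
        self.mean_ = 0.0

    def partial_fit(self, batch):
        # Fold one batch into the running mean without revisiting old data.
        for value in batch:
            self.n += 1
            self.mean_ += (value - self.mean_) / self.n
        return self


batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # stand-ins for array chunks
model = RunningMean()
for batch in batches:          # the model sees one batch at a time, in order
    model.partial_fit(batch)
print(model.mean_)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;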
&lt;p&gt;However, this training is inherently sequential because the model only trains
on one batch of data at a time. We’re leaving a lot of processing power on the
table.&lt;/p&gt;
&lt;p&gt;To address this we can combine incremental training with hyper-parameter
selection and train several models on the same data at the same time. This is
often required anyway, and lets us be more efficient with our computation.&lt;/p&gt;
&lt;p&gt;However there are many ways to do incremental training with hyper-parameter
selection, and the right algorithm likely depends on the problem at hand.
This is an active field of research and so it’s hard for a general project like
Dask to pick and implement a single method that works well for everyone. There
is probably a handful of methods that will be necessary with various options on
them.&lt;/p&gt;
&lt;p&gt;To help experimentation here we’ve been working on some lower-level
tooling that we think will be helpful in a variety of cases. This accepts a
policy from the user as a Python function that gets scores from recent
evaluations, and asks for how much further to progress on each set of
hyper-parameters before checking in again. This allows us to model a few
common situations like random search with early stopping conditions, successive
halving, and variations of those easily without having to write any Dask code:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/288"&gt;dask/dask-ml #288&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/mrocklin/4c95bd26d15281d82e0bf2d27632e294"&gt;Notebook showing a few approaches&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://gist.github.com/stsievert/c675b3a237a60efbd01dcb112e29115b"&gt;Another notebook showing convergence&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This work is done by &lt;a class="reference external" href="http://github.com/stsievert"&gt;Scott Sievert&lt;/a&gt; and myself.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://user-images.githubusercontent.com/1320475/43540881-7184496a-95b8-11e8-975a-96c2f17ee269.png"
     width="70%"
     alt="Successive halving and random search"&gt;&lt;/p&gt;
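&lt;p&gt;The policy idea described in this section might look roughly like the
following in plain Python. This is a hypothetical sketch, not the interface
from dask/dask-ml #288: the function name, the shape of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scores&lt;/span&gt;&lt;/code&gt;
dictionary, and the toy driver loop with made-up scores are all assumptions
made for illustration.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;def successive_halving(scores, calls=1):
    """Hypothetical policy: keep the better-scoring half of the models.

    scores maps a model identifier to the list of scores seen so far.
    The return value maps each surviving identifier to the number of
    additional partial_fit calls to make before checking in again.
    """
    if len(scores) == 1:
        return {}                      # a single survivor: stop training
    ranked = sorted(scores, key=lambda k: scores[k][-1], reverse=True)
    keep = ranked[:max(1, len(ranked) // 2)]
    return {k: calls for k in keep}


# Toy driver with made-up scores, standing in for real partial_fit calls.
scores = {"a": [0.60], "b": [0.72], "c": [0.55], "d": [0.70]}
history = []
while True:
    plan = successive_halving(scores)
    if not plan:
        break
    history.append(sorted(plan))
    # Pretend each surviving model improves by a fixed amount per call.
    scores = {k: scores[k] + [scores[k][-1] + 0.01 * n] for k, n in plan.items()}
print(history)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Swapping in a different function, for example one that stops any model
whose score has not improved recently, would give random search with early
stopping without touching the driver.&lt;/p&gt;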
&lt;/section&gt;
&lt;section id="parameter-servers"&gt;
&lt;h2&gt;Parameter Servers&lt;/h2&gt;
&lt;p&gt;To improve the speed of training large models &lt;a class="reference external" href="https://github.com/stsievert"&gt;Scott
Sievert&lt;/a&gt; has been using Actors (mentioned above)
to develop simple examples for parameter servers. These are helping to
identify and motivate performance and diagnostic improvements
within Dask itself:&lt;/p&gt;
&lt;script src="https://gist.github.com/ff8a1df9300a82f15a2704e913469522.js"&gt;&lt;/script&gt;
&lt;p&gt;These parameter servers manage the communication of models produced by
different workers, and leave the computation to the underlying deep learning
library. This is ongoing work.&lt;/p&gt;
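&lt;p&gt;To make the idea concrete, here is a minimal sketch of the state a parameter
server holds, written as a plain Python class and exercised locally. It is not
the code from the gist above: the class name, its methods, and the toy
objective are assumptions for illustration only.&lt;/p&gt;

```python
class ParameterServer:
    """Hold the current parameters and apply updates sent by workers."""

    def __init__(self, initial, learning_rate=0.1):
        self.params = list(initial)
        self.learning_rate = learning_rate
        self.n_updates = 0

    def pull(self):
        # Workers fetch a copy of the latest parameters.
        return list(self.params)

    def push(self, gradient):
        # Workers send a gradient; the server applies a plain SGD step.
        self.params = [
            p - self.learning_rate * g for p, g in zip(self.params, gradient)
        ]
        self.n_updates += 1


# Local demonstration: two pretend workers minimizing sum((p - 3) ** 2).
server = ParameterServer([0.0, 0.0], learning_rate=0.25)
for step in range(20):
    for worker in range(2):
        current = server.pull()
        # In a real setup this gradient would come from the learning library.
        gradient = [2 * (p - 3.0) for p in current]
        server.push(gradient)

print(server.params, server.n_updates)
```

&lt;p&gt;With Dask Actors, a class like this could presumably be placed on a worker
with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;client.submit(ParameterServer,&lt;/span&gt; &lt;span class="pre"&gt;[0.0,&lt;/span&gt; &lt;span class="pre"&gt;0.0],&lt;/span&gt; &lt;span class="pre"&gt;actor=True)&lt;/span&gt;&lt;/code&gt;, so that
other tasks call its methods remotely while the state stays in one place.&lt;/p&gt;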
&lt;/section&gt;
&lt;section id="dataframe-preprocessing-transformers"&gt;
&lt;h2&gt;Dataframe Preprocessing Transformers&lt;/h2&gt;
&lt;p&gt;We’ve started to orient some of the Dask-ML work around case studies. Our
first, written by &lt;a class="reference external" href="https://github.com/stsievert"&gt;Scott Sievert&lt;/a&gt;, uses the
Criteo dataset for ads. It’s a good example of a combined dense/sparse dataset
that can be somewhat large (around 1TB). The first challenge we’re running
into is preprocessing. This has led to a few preprocessing improvements:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/310"&gt;Label Encoder supports Pandas Categorical dask/dask-ml #310&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/11"&gt;Add Imputer with mean and median strategies dask/dask-ml #11&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/313"&gt;Ad OneHotEncoder dask/dask-ml #313&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/122"&gt;Add Hashing Vectorizer dask/dask-ml #122&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/315"&gt;Add ColumnTransformer dask/dask-ml #315&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of these are also based on improved dataframe handling features in the
upcoming 0.20 release for Scikit-Learn.&lt;/p&gt;
&lt;p&gt;This work is done by
&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/122"&gt;Roman Yurchak&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/jrbourbeau"&gt;James Bourbeau&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/daniel-severo"&gt;Daniel Severo&lt;/a&gt;, and
&lt;a class="reference external" href="https://github.com/TomAugspurger"&gt;Tom Augspurger&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="profiling-the-main-thread"&gt;
&lt;h2&gt;Profiling the main thread&lt;/h2&gt;
&lt;p&gt;Profiling concurrent code is hard. Traditional profilers like cProfile become
confused by passing control between all of the different coroutines. This
means that we haven’t done a very comprehensive job of profiling and tuning the
distributed scheduler and workers. Statistical profilers on the other hand
tend to do a bit better. We’ve taken the statistical profiler that we usually
use on Dask worker threads (available in the dashboard on the “Profile” tab)
and have applied it to the central administrative threads running the Tornado
event loop as well. This has highlighted a few issues that we weren’t able to
spot before, and should hopefully result in reduced overhead in future
releases.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2144"&gt;dask/distributed #2144&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://stackoverflow.com/questions/51582394/which-functions-are-free-when-profiling-tornado-asyncio"&gt;stackoverflow.com/questions/51582394/which-functions-are-free-when-profiling-tornado-asyncio&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src="https://user-images.githubusercontent.com/306380/43368136-4574f46c-930d-11e8-9d5b-6f4b4f6aeffe.png"
     width="70%"
     alt="Profile of event loop thread"&gt;&lt;/p&gt;
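&lt;p&gt;For readers unfamiliar with statistical profiling, the sketch below shows the
basic technique using only the standard library: periodically look at which
function a target thread is executing and count what you see. It is an
illustration under simplifying assumptions, not the profiler that ships with
Dask; the function names and the busy loop standing in for an event loop
thread are made up for this example.&lt;/p&gt;

```python
import collections
import sys
import threading
import time


def sample_thread(thread_id, n_samples=30, interval=0.01):
    """Count which function a thread is running at regular intervals."""
    counts = collections.Counter()
    for _ in range(n_samples):
        # sys._current_frames() maps thread ids to their current top frame.
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts


def busy_loop(stop):
    # Stand-in for an administrative thread that is doing too much work.
    total = 0
    while not stop.is_set():
        total += sum(range(1000))


stop = threading.Event()
worker = threading.Thread(target=busy_loop, args=(stop,), daemon=True)
worker.start()
profile = sample_thread(worker.ident)
stop.set()
worker.join()
print(profile.most_common(3))
```

&lt;p&gt;Because the sampler only observes the thread from the outside, it does not
depend on control flow passing cleanly in and out of functions, which is the
part that coroutine-heavy code makes difficult for tracing profilers.&lt;/p&gt;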
&lt;/section&gt;
&lt;section id="new-release-of-dask-yarn"&gt;
&lt;h2&gt;New release of Dask-Yarn&lt;/h2&gt;
&lt;p&gt;There is a new release of &lt;a class="reference external" href="http://dask-yarn.readthedocs.io/en/latest"&gt;Dask-Yarn&lt;/a&gt;
and the underlying library for managing Yarn jobs,
&lt;a class="reference external" href="https://jcrist.github.io/skein/"&gt;Skein&lt;/a&gt;. These include a number of bug-fixes
and improved concurrency primitives for YARN applications. The new features are
documented &lt;a class="reference external" href="https://jcrist.github.io/skein/key-value-store.html"&gt;here&lt;/a&gt;, and were
implemented in &lt;a class="reference external" href="https://github.com/jcrist/skein/pull/40"&gt;jcrist/skein #40&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This work was done by &lt;a class="reference external" href="https://jcrist.github.io/"&gt;Jim Crist&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="support-for-lsf-clusters-in-dask-jobqueue"&gt;
&lt;h2&gt;Support for LSF clusters in Dask-Jobqueue&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://dask-jobqueue.readthedocs.io/en/latest/"&gt;Dask-jobqueue&lt;/a&gt; supports Dask
use on traditional HPC cluster managers like SGE, SLURM, PBS, and others.
We’ve recently &lt;a class="reference external" href="http://dask-jobqueue.readthedocs.io/en/latest/generated/dask_jobqueue.LSFCluster.html#dask_jobqueue.LSFCluster"&gt;added support for LSF clusters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Work was done in &lt;a class="reference external" href="https://github.com/dask/dask-jobqueue/pull/78"&gt;dask/dask-jobqueue #78&lt;/a&gt; by &lt;a class="reference external" href="https://github.com/raybellwaves"&gt;Ray Bell&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="new-dask-story-on-mobile-networks"&gt;
&lt;h2&gt;New Dask Story on mobile networks&lt;/h2&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/"&gt;Dask Stories&lt;/a&gt;
repository holds narratives about how people use Dask.
&lt;a class="reference external" href="https://www.linkedin.com/in/lalwanisameer/"&gt;Sameer Lalwani&lt;/a&gt;
recently added a story about using Dask to
&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/network-modeling.html"&gt;model mobile communication networks&lt;/a&gt;.
It’s worth a read.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="test-suite-cleanup"&gt;
&lt;h2&gt;Test suite cleanup&lt;/h2&gt;
&lt;p&gt;The dask.distributed test suite has been suffering from intermittent failures
recently. These are tests that fail very infrequently, and so are hard to
catch when writing them, but show up when future unrelated PRs run the test
suite on continuous integration and get failures. They add friction to the
development process, but are expensive to track down (testing distributed
systems is hard).&lt;/p&gt;
&lt;p&gt;We’re taking a bit of time this week to track these down. Progress here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2146"&gt;dask/distributed #2146&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/2152"&gt;dask/distributed #2152&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/08/02/dask-dev/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-08-02T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/07/23/protocols-pickle/</id>
    <title>Pickle isn't slow, it's a protocol</title>
    <updated>2018-07-23T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr:&lt;/strong&gt; &lt;em&gt;Pickle isn’t slow, it’s a protocol. Protocols are important for
ecosystems.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A recent Dask issue showed that using Dask with PyTorch was
slow because sending PyTorch models between Dask workers took a long time
(&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/281"&gt;Dask GitHub issue&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This turned out to be because serializing PyTorch models with pickle was very
slow (1 MB/s for GPU based models, 50 MB/s for CPU based models). There is no
architectural reason why this needs to be this slow. Every part of the
hardware pipeline is much faster than this.&lt;/p&gt;
&lt;p&gt;We could have fixed this in Dask by special-casing PyTorch models (Dask has
its own optional serialization system for performance), but being good
ecosystem citizens, we decided to raise the performance problem in an issue
upstream (&lt;a class="reference external" href="https://github.com/pytorch/pytorch/issues/9168"&gt;PyTorch GitHub
issue&lt;/a&gt;). This resulted in a
five-line fix to PyTorch that turned a 1-50 MB/s serialization bandwidth into a
1 GB/s bandwidth, which is more than fast enough for many use cases (&lt;a class="reference external" href="https://github.com/pytorch/pytorch/pull/9184"&gt;PR to
PyTorch&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight-diff notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;    def __reduce__(self):
&lt;span class="gd"&gt;-        return type(self), (self.tolist(),)&lt;/span&gt;
&lt;span class="gi"&gt;+        b = io.BytesIO()&lt;/span&gt;
&lt;span class="gi"&gt;+        torch.save(self, b)&lt;/span&gt;
&lt;span class="gi"&gt;+        return (_load_from_bytes, (b.getvalue(),))&lt;/span&gt;


&lt;span class="gi"&gt;+def _load_from_bytes(b):&lt;/span&gt;
&lt;span class="gi"&gt;+    return torch.load(io.BytesIO(b))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Thanks to the PyTorch maintainers this problem was solved pretty easily.
PyTorch tensors and models now serialize efficiently in Dask or in &lt;em&gt;any other
Python library&lt;/em&gt; that might want to use them in distributed systems like
PySpark, IPython parallel, Ray, or anything else without having to add
special-case code or do anything special. We didn’t solve a Dask problem, we
solved an ecosystem problem.&lt;/p&gt;
&lt;p&gt;However before we solved this problem we discussed things a bit. This comment
stuck with me:&lt;/p&gt;
&lt;a href="https://github.com/pytorch/pytorch/issues/9168#issuecomment-402514019"&gt;
  &lt;img src="/images/pytorch-pickle-is-slow-comment.png"
     alt="Github Image of maintainer saying that PyTorch's pickle implementation is slow"
     width="100%"&gt;&lt;/a&gt;
&lt;p&gt;This comment contains two beliefs that are both very common, and that I find
somewhat counter-productive:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Pickle is slow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You should use our specialized methods instead&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I’m sort of picking on the PyTorch maintainers here a bit (sorry!) but I’ve
found that these beliefs are quite widespread, so I’d like to address them here.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/23/protocols-pickle.md&lt;/span&gt;, line 67)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="pickle-is-slow"&gt;

&lt;p&gt;Pickle is &lt;em&gt;not&lt;/em&gt; slow. Pickle is a protocol. &lt;em&gt;We&lt;/em&gt; implement pickle. If it’s slow
then it is &lt;em&gt;our&lt;/em&gt; fault, not Pickle’s.&lt;/p&gt;
&lt;p&gt;To be clear, there are many reasons not to use Pickle.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;It’s not cross-language&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s not very easy to parse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It doesn’t provide random access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s insecure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So you shouldn’t store your data or create public services using Pickle, but
for things like moving data on a wire it’s a great default choice if you’re
moving strictly from Python processes to Python processes in a trusted and
uniform environment.&lt;/p&gt;
&lt;p&gt;It’s great because it’s as fast as you can make it (up to a memory copy) and
other libraries in the ecosystem can use it without needing to special case
your code into theirs.&lt;/p&gt;
&lt;p&gt;This is the change we made for PyTorch.&lt;/p&gt;
&lt;div class="highlight-diff notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;    def __reduce__(self):
&lt;span class="gd"&gt;-        return type(self), (self.tolist(),)&lt;/span&gt;
&lt;span class="gi"&gt;+        b = io.BytesIO()&lt;/span&gt;
&lt;span class="gi"&gt;+        torch.save(self, b)&lt;/span&gt;
&lt;span class="gi"&gt;+        return (_load_from_bytes, (b.getvalue(),))&lt;/span&gt;


&lt;span class="gi"&gt;+def _load_from_bytes(b):&lt;/span&gt;
&lt;span class="gi"&gt;+    return torch.load(io.BytesIO(b))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The slow part wasn’t Pickle, it was the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.tolist()&lt;/span&gt;&lt;/code&gt; call within &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__reduce__&lt;/span&gt;&lt;/code&gt;
that converted a PyTorch tensor into a list of Python ints and floats. I
suspect that the common belief of “Pickle is just slow” stopped anyone else
from investigating the poor performance here. I was surprised to learn that a
project as active and well maintained as PyTorch hadn’t fixed this already.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;As a reminder, you can implement the pickle protocol by providing the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__reduce__&lt;/span&gt;&lt;/code&gt; method on your class. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__reduce__&lt;/span&gt;&lt;/code&gt; function returns a
loading function and sufficient arguments to reconstitute your object. Here we
used torch’s existing save/load functions to create a bytestring that we could
pass around.&lt;/em&gt;&lt;/p&gt;
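&lt;p&gt;The same pattern can be tried without PyTorch. The sketch below uses a
hypothetical &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Samples&lt;/span&gt;&lt;/code&gt; class backed by the standard library’s
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;array&lt;/span&gt;&lt;/code&gt; module as a stand-in for a tensor. The class and the loader function
are invented for this example; only the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__reduce__&lt;/span&gt;&lt;/code&gt; contract itself comes from
the pickle protocol.&lt;/p&gt;

```python
import array
import pickle


def _load_from_bytes(typecode, payload):
    # Loading function named by __reduce__: rebuild the object from raw bytes.
    values = array.array(typecode)
    values.frombytes(payload)
    return Samples(values)


class Samples:
    """Hypothetical numeric container, standing in for a tensor-like object."""

    def __init__(self, values):
        self.values = array.array("d", values)

    def __reduce__(self):
        # Return (callable, args). Pickle records a reference to the callable
        # plus the arguments, and calls callable(*args) when loading.
        # Passing one bytes object avoids building a Python float per element.
        return (_load_from_bytes, (self.values.typecode, self.values.tobytes()))


original = Samples([0.5, 1.5, 2.5])
restored = pickle.loads(pickle.dumps(original))
print(restored.values.tolist())
```

&lt;p&gt;Any library that moves objects with pickle picks up this faster path
automatically, without knowing anything about the class.&lt;/p&gt;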
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/23/protocols-pickle.md&lt;/span&gt;, line 115)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="just-use-our-specialized-option"&gt;
&lt;h1&gt;Just use our specialized option&lt;/h1&gt;
&lt;p&gt;Specialized options can be great. They can have nice APIs with many options,
they can tune themselves to specialized communication hardware if it exists
(like RDMA or NVLink), and so on. But people need to learn about them first, and
learning about them can be hard in two ways.&lt;/p&gt;
&lt;section id="hard-for-users"&gt;
&lt;h2&gt;Hard for users&lt;/h2&gt;
&lt;p&gt;Today we use a large and rapidly changing set of libraries. It’s hard
for users to become experts in all of them. Increasingly we rely on new
libraries making it easy for us by adhering to standard APIs, providing
informative error messages that lead to good behavior, and so on.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="hard-for-other-libraries"&gt;
&lt;h2&gt;Hard for other libraries&lt;/h2&gt;
&lt;p&gt;Other libraries that need to interact &lt;em&gt;definitely&lt;/em&gt; won’t read the
documentation, and even if they did it’s not sensible for every library to
special case every other library’s favorite method to turn their objects into
bytes. Ecosystems of libraries depend strongly on the presence of protocols
and a strong consensus around implementing them consistently and efficiently.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/23/protocols-pickle.md&lt;/span&gt;, line 137)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="sometimes-specialized-options-are-appropriate"&gt;
&lt;h1&gt;Sometimes Specialized Options are Appropriate&lt;/h1&gt;
&lt;p&gt;There &lt;em&gt;are&lt;/em&gt; good reasons to support specialized options. Sometimes you need
more than 1GB/s bandwidth. While this is rare in general (very few pipelines
process faster than 1GB/s/node), it is true in the particular case of PyTorch
when they are doing parallel training on a single machine with multiple
processes. Soumith (PyTorch maintainer) writes the following:&lt;/p&gt;
&lt;p&gt;When sending Tensors over multiprocessing, our custom serializer actually
shortcuts them through shared memory, i.e. it moves the underlying Storages
to shared memory and restores the Tensor in the other process to point to the
shared memory. We did this for the following reasons:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed:&lt;/strong&gt; we save on memory copies, especially if we amortize the cost of
moving a Tensor to shared memory before sending it into the multiprocessing
Queue. The total cost of actually moving a Tensor from one process to another
ends up being O(1), and independent of the Tensor’s size&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sharing:&lt;/strong&gt; If Tensor A and Tensor B are views of each other, once we
serialize and send them, we want to preserve this property of them being
views. This is critical for neural-nets where it’s common to re-view the
weights / biases and use them for another. With the default pickle solution,
this property is actually lost.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
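&lt;p&gt;The speed argument can be illustrated with the standard library’s
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;multiprocessing.shared_memory&lt;/span&gt;&lt;/code&gt; module (Python 3.8 and later). This is only
a sketch of the general idea, not PyTorch’s mechanism: the payload is placed in
shared memory once, and what travels between processes is a short name rather
than the data. Both sides are shown in one process here to keep the example
short.&lt;/p&gt;

```python
from multiprocessing import shared_memory

# Producer side: place the payload in a named shared-memory block once.
payload = bytes(range(16))
block = shared_memory.SharedMemory(create=True, size=len(payload))
block.buf[:len(payload)] = payload

# Only this short name needs to travel to the other process, regardless of
# how large the payload is.
handle = block.name

# Consumer side (same process here for brevity): attach by name.
attached = shared_memory.SharedMemory(name=handle)
received = bytes(attached.buf[:len(payload)])

# A write through one handle is visible through the other, so both sides
# share one buffer rather than holding separate copies.
attached.buf[0] = 255
shared_first_byte = block.buf[0]

# Release the handles and free the block.
attached.close()
block.close()
block.unlink()
print(received == payload, shared_first_byte)
```

&lt;p&gt;A byte-oriented protocol like pickle cannot express this kind of aliasing on
its own, which is why a specialized path is reasonable for this case.&lt;/p&gt;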
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/07/23/protocols-pickle/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-07-23T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/07/17/dask-dev/</id>
    <title>Dask Development Log, Scipy 2018</title>
    <updated>2018-07-17T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.&lt;/p&gt;
&lt;p&gt;Last week many Dask developers gathered for the annual SciPy 2018 conference.
As a result, very little work was completed, but many projects were started or
discussed. To reflect this change in activity this blogpost will highlight
possible changes and opportunities for readers to further engage in
development.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="dask-on-hpc-machines"&gt;

&lt;p&gt;The &lt;a class="reference external" href="https://dask-jobqueue.readthedocs.io/"&gt;dask-jobqueue&lt;/a&gt; project was a hit at
the conference. Dask-jobqueue helps people launch Dask on traditional job
schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly
found on high performance computers. These are &lt;em&gt;very common&lt;/em&gt; among scientific,
research, and high performance machine learning groups but commonly a bit hard
to use with anything other than MPI.&lt;/p&gt;
&lt;p&gt;This project came up in the &lt;a class="reference external" href="https://youtu.be/2rgD5AJsAbE"&gt;Pangeo talk&lt;/a&gt;,
lightning talks, and the Dask Birds of a Feather session.&lt;/p&gt;
&lt;p&gt;During sprints a number of people came up and we went through the process of
configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This
process usually takes around fifteen minutes and will likely be the subject of
a future blogpost. We published known-good configurations for these clusters
on our &lt;a class="reference external" href="http://dask-jobqueue.readthedocs.io/en/latest/configurations.html"&gt;configuration documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, there is a &lt;a class="reference external" href="https://github.com/jupyterhub/batchspawner/issues/101"&gt;JupyterHub
issue&lt;/a&gt; to improve
documentation on best practices to deploy JupyterHub on these machines. The
community has done this well a few times now, and it might be time to write up
something for everyone else.&lt;/p&gt;
&lt;section id="get-involved"&gt;
&lt;h2&gt;Get involved&lt;/h2&gt;
&lt;p&gt;If you have access to a supercomputer then please try things out. There is a
30-minute YouTube video screencast on the
&lt;a class="reference external" href="https://dask-jobqueue.readthedocs.io/"&gt;dask-jobqueue&lt;/a&gt; documentation that should
help you get started.&lt;/p&gt;
&lt;p&gt;If you are an administrator on a supercomputer you might consider helping to
build a configuration file and place it in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;/etc/dask&lt;/span&gt;&lt;/code&gt; for your users. You
might also want to get involved in the &lt;a class="reference external" href="http://dask-jobqueue.readthedocs.io/en/latest/configurations.html"&gt;JupyterHub on
HPC&lt;/a&gt;
conversation.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 58)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dask-scikit-learn-talk"&gt;
&lt;h1&gt;Dask / Scikit-learn talk&lt;/h1&gt;
&lt;p&gt;Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the
current state of the new Dask-ML project.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/ccfsbuqsjgI"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 66)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="mybinder-and-bokeh-servers"&gt;
&lt;h1&gt;MyBinder and Bokeh Servers&lt;/h1&gt;
&lt;p&gt;Not a Dask change, but Min Ragan-Kelley showed how to run services through
&lt;a class="reference external" href="https://mybinder.org/"&gt;mybinder.org&lt;/a&gt; that are not only Jupyter. As an example,
here is a repository that deploys a Bokeh server application with a single
click.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/minrk/binder-bokeh-server"&gt;Github repository&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/minrk/binder-bokeh-server/master?urlpath=%2Fproxy%2F5006%2Fbokeh-app"&gt;Binder link&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I think that by composing with Binder, Min effectively just created the
free-to-use hosted Bokeh server service. Presumably this same model could be
adapted to other applications just as easily.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 80)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-and-automated-machine-learning-with-tpot"&gt;
&lt;h1&gt;Dask and Automated Machine Learning with TPOT&lt;/h1&gt;
&lt;p&gt;Dask and TPOT developers are discussing parallelizing the
automatic-machine-learning tool &lt;a class="reference external" href="http://epistasislab.github.io/tpot/"&gt;TPOT&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;TPOT uses genetic algorithms to search over a space of scikit-learn style
pipelines to automatically find a decently performing pipeline and model. This
involves a fair amount of computation which Dask can help to parallelize out to
multiple machines.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/EpistasisLab/tpot/issues/304"&gt;Issue: EpistasisLab/tpot #304&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/EpistasisLab/tpot/pull/730"&gt;PR: EpistasisLab/tpot #730&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/QrJlj0VCHys"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;
&lt;section id="id1"&gt;
&lt;h2&gt;Get involved&lt;/h2&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: INFO/1 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 96); &lt;em&gt;&lt;a href="#id1"&gt;backlink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Duplicate implicit target name: “get involved”.&lt;/p&gt;
&lt;/aside&gt;
&lt;p&gt;Trivial things work now, but to make this efficient we’ll need to dive in a bit
more deeply. Extending that pull request to dive within pipelines would be a
good task if anyone wants to get involved. This would help to share
intermediate results between pipelines.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 103)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dask-and-scikit-optimize"&gt;
&lt;h1&gt;Dask and Scikit-Optimize&lt;/h1&gt;
&lt;p&gt;Among various features, &lt;a class="reference external" href="https://scikit-optimize.github.io/"&gt;Scikit-optimize&lt;/a&gt;
offers a &lt;a class="reference external" href="https://scikit-optimize.github.io/#skopt.BayesSearchCV"&gt;BayesSearchCV&lt;/a&gt;
object that is like Scikit-Learn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;, but is a
bit smarter about how to choose new parameters to test given previous results.
Hyper-parameter optimization is low-hanging fruit for Dask-ML workloads today,
so we investigated how the project might help here.&lt;/p&gt;
&lt;p&gt;So far we’re just experimenting using Scikit-Learn/Dask integration through
joblib to see what opportunities there are. Discussion among Dask and
Scikit-Optimize developers is happening here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/300"&gt;Issue: dask/dask-ml #300&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 118)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="centralize-pydata-scipy-tutorials-on-binder"&gt;
&lt;h1&gt;Centralize PyData/Scipy tutorials on Binder&lt;/h1&gt;
&lt;p&gt;We’re putting a bunch of the PyData/Scipy tutorials on Binder, and hope to
embed snippets of Youtube videos into the notebooks themselves.&lt;/p&gt;
&lt;p&gt;This effort lives here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://pydata-tutorials.readthedocs.io"&gt;pydata-tutorials.readthedocs.io&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="motivation"&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;The PyData and SciPy community delivers tutorials as part of most conferences.
This activity generates both educational Jupyter notebooks and explanatory
videos that teach people how to use the ecosystem.&lt;/p&gt;
&lt;p&gt;However, this content isn’t very discoverable &lt;em&gt;after&lt;/em&gt; the conference. People
can search on Youtube for their topic of choice and hopefully find a link to
the notebooks to download locally, but this is a somewhat noisy process. It’s
not clear which tutorial to choose and it’s difficult to match up the video
with the notebooks during exercises.
We’re probably not getting as much value out of these resources as we could be.&lt;/p&gt;
&lt;p&gt;To help increase access we’re going to try a few things:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Produce a centralized website with links to recent tutorials delivered for
each topic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that those notebooks run easily on Binder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embed sections of the talk on Youtube within each notebook so that the
explanation of the section is tied to the exercises&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="id2"&gt;
&lt;h2&gt;Get involved&lt;/h2&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: INFO/1 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 148); &lt;em&gt;&lt;a href="#id2"&gt;backlink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Duplicate implicit target name: “get involved”.&lt;/p&gt;
&lt;/aside&gt;
&lt;p&gt;This only really works long-term under a community maintenance model. So far
we’ve only done a few hours of work and there is still plenty to do in the
following tasks:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Find good tutorials for inclusion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that they work well on &lt;a class="reference external" href="https://mybinder.org/"&gt;mybinder.org&lt;/a&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;are self-contained and don’t rely on external scripts to run&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;have an environment.yml or requirements.txt&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;don’t require a lot of resources&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find video for the tutorial&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Submit a pull request to the tutorial repository that embeds a link to the
YouTube talk in the top cell of each notebook, starting at the proper time for
that notebook&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/07/17/dask-dev.md&lt;/span&gt;, line 164)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dask-actors-and-ray"&gt;
&lt;h1&gt;Dask, Actors, and Ray&lt;/h1&gt;
&lt;p&gt;I really enjoyed the &lt;a class="reference external" href="https://youtu.be/D_oz7E4v-U0"&gt;talk on Ray&lt;/a&gt;, another
distributed task scheduler for Python. I suspect that Dask will steal ideas
for &lt;a class="reference external" href="https://github.com/dask/distributed/issues/2109"&gt;actors for stateful operation&lt;/a&gt;.
I hope that Ray takes on ideas for using standard Python interfaces so that
more of the community can adopt it more quickly. I encourage people to check
out the talk and give Ray a try. It’s pretty slick.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/D_oz7E4v-U0"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="planning-conversations-for-dask-ml"&gt;
&lt;h1&gt;Planning conversations for Dask-ML&lt;/h1&gt;
&lt;p&gt;Dask and Scikit-learn developers had the opportunity to sit down again and
raise a number of issues to help plan near-term development. The discussion
focused mostly on building important case studies to motivate future
development, and on identifying algorithms and other projects to target for
near-term integration.&lt;/p&gt;
&lt;section id="case-studies"&gt;
&lt;h2&gt;Case Studies&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/302"&gt;What is the purpose of a case study: dask/dask-ml #302&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/295"&gt;Case study: Sparse Criteo Dataset: dask/dask-ml #295&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/296"&gt;Case study: Large scale text classification: dask/dask-ml #296&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/297"&gt;Case study: Transfer learning from pre-trained model: dask/dask-ml #297&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="algorithms"&gt;
&lt;h2&gt;Algorithms&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/299"&gt;Gradient boosted trees with Numba: dask/dask-ml #299&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/300"&gt;Parallelize Scikit-Optimize for hyperparameter optimization: dask/dask-ml #300&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="id3"&gt;
&lt;h2&gt;Get involved&lt;/h2&gt;
&lt;p&gt;We could use help in building out case studies to drive future development in
the project. There are also several algorithmic areas that need work.
Dask-ML is a young and fast-moving project with many opportunities for new
developers to get involved.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dask-and-umap-for-low-dimensional-embeddings"&gt;
&lt;h1&gt;Dask and UMAP for low-dimensional embeddings&lt;/h1&gt;
&lt;p&gt;Leland McInnes gave a great talk, &lt;a class="reference external" href="https://youtu.be/nq6iPZVUxZU"&gt;Uniform Manifold Approximation and
Projection for Dimensionality Reduction&lt;/a&gt;, in which
he lays out a well-founded algorithm for dimensionality reduction, similar to
PCA or t-SNE, but with some nice properties. He worked with some Dask
developers, and together we identified some challenges caused by slicing Dask
arrays with random-ish slices.&lt;/p&gt;
&lt;p&gt;A proposal to fix this problem lives here, if anyone wants a fun problem to work on:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/issues/3409#issuecomment-405254656"&gt;dask/dask #3409 (comment)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/nq6iPZVUxZU"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/section&gt;
&lt;section id="dask-stories"&gt;
&lt;h1&gt;Dask stories&lt;/h1&gt;
&lt;p&gt;We soft-launched &lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/"&gt;Dask Stories&lt;/a&gt;,
a webpage and project to collect and share user stories about how people use
Dask in practice. We’re also publishing a separate blogpost about this today.&lt;/p&gt;
&lt;p&gt;See blogpost: &lt;a class="reference internal" href="../../2018/07/16/dask-stories/"&gt;&lt;span class="doc std std-doc"&gt;Who uses Dask?&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you use Dask and want to share your story we would absolutely welcome your
experience. Having people like yourself share how they use Dask is incredibly
important for the project.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/07/17/dask-dev/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-07-17T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/07/16/dask-stories/</id>
    <title>Who uses Dask?</title>
    <updated>2018-07-16T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;People often ask general questions like “Who uses Dask?” or more specific
questions like the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;For what applications do people use Dask dataframe?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How many machines do people often use with Dask?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How far does Dask scale?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does dask get used on imaging data?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does anyone use Dask with Kubernetes/Yarn/SGE/Mesos/… ?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does anyone in the insurance industry use Dask?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These questions lead to interesting and productive conversations in which new
users can dive into historical use cases, which inform whether and how they use
the project in the future.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;New users can learn a lot from existing users.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To further enable this conversation we’ve made a new tiny project,
&lt;a class="reference external" href="https://dask-stories.readthedocs.io"&gt;dask-stories&lt;/a&gt;. This is a small
documentation page where people can submit how they use Dask and have that
published for others to see.&lt;/p&gt;
&lt;p&gt;To seed this site six generous users have written down how their group uses
Dask. You can read about them here:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/sidewalk-labs.html"&gt;Sidewalk Labs: Civic Modeling&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/mosquito-sequencing.html"&gt;Genome Sequencing for Mosquitoes&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/fullspectrum.html"&gt;Full Spectrum: Credit and Banking&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/icecube-cosmic-rays.html"&gt;Ice Cube: Detecting Cosmic Rays&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/pangeo.html"&gt;Pangeo: Earth Science&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/hydrologic-modeling.html"&gt;NCAR: Hydrologic Modeling&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We’ve focused on a few questions, available in &lt;a class="reference external" href="http://dask-stories.readthedocs.io/en/latest/template.html"&gt;our
template&lt;/a&gt;, that
focus on problems over technology and include negative as well as positive
feedback to give a complete picture.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Who am I?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What problem am I trying to solve?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How Dask helps?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What pain points did I run into with Dask?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What technology do I use around Dask?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="easy-to-contribute"&gt;

&lt;p&gt;Contributions to this site are simple Markdown documents submitted as pull
requests to
&lt;a class="reference external" href="https://github.com/dask/dask-stories"&gt;github.com/dask/dask-stories&lt;/a&gt;. The site
is then built with ReadTheDocs and updated immediately. We tried to make this
as smooth and familiar to our existing userbase as possible.&lt;/p&gt;
&lt;p&gt;This is important. Sharing real-world experiences like this is probably more
valuable than code contributions to the Dask project at this stage. Dask is
more technically mature than it is well-known. Users look to other users to
help them understand a project (think of every time you’ve Googled for “&lt;em&gt;some
tool&lt;/em&gt; in &lt;em&gt;some topic&lt;/em&gt;”).&lt;/p&gt;
&lt;p&gt;If you use Dask today in an interesting way then please share your story.
The world would love to hear your voice.&lt;/p&gt;
&lt;p&gt;If you maintain another project you might consider implementing the same model.
I hope that this proves successful enough for other projects in the ecosystem
to reuse.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/07/16/dask-stories/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-07-16T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/07/08/dask-dev/</id>
    <title>Dask Development Log</title>
    <updated>2018-07-08T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To increase transparency I’m trying to blog more often about the current work
going on around Dask and related projects. Nothing here is ready for
production. This blogpost is written in haste, so refined polish should not be
expected.&lt;/p&gt;
&lt;p&gt;Current efforts for June 2018 in Dask and Dask-related projects include
the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Yarn Deployment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More examples for machine learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Incremental machine learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HPC Deployment configuration&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="yarn-deployment"&gt;

&lt;p&gt;Dask developers often get asked, &lt;em&gt;How do I deploy Dask on my Hadoop/Spark/Hive
cluster?&lt;/em&gt; We haven’t had a very good answer until recently.&lt;/p&gt;
&lt;p&gt;Most Hadoop/Spark/Hive clusters are actually &lt;em&gt;Yarn&lt;/em&gt; clusters. Yarn is the most
common cluster manager for clusters that run Hadoop/Spark/Hive jobs, including
any cluster purchased from a vendor like Cloudera or Hortonworks. If your
application can run on Yarn then it can be a first-class citizen here.&lt;/p&gt;
&lt;p&gt;Unfortunately Yarn has really only been accessible through a Java API, and so
has been difficult for Dask to interact with. That’s changing now with a few
projects, including:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask-yarn.readthedocs.io"&gt;dask-yarn&lt;/a&gt;: an easy way to launch Dask on
Yarn clusters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jcrist.github.io/skein/"&gt;skein&lt;/a&gt;: an easy way to launch generic
services on Yarn clusters (this is primarily what backs dask-yarn)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://conda.github.io/conda-pack/"&gt;conda-pack&lt;/a&gt;: an easy way to bundle
a conda environment into a redeployable archive, which is useful
when launching Python applications on Yarn&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This work is all being done by &lt;a class="reference external" href="http://jcrist.github.io/"&gt;Jim Crist&lt;/a&gt; who is, I
believe, currently writing up a blogpost about the topic at large. Dask-yarn
was soft-released last week though, so people should give it a try and report
feedback on the &lt;a class="reference external" href="https://github.com/dask/dask-yarn"&gt;dask-yarn issue tracker&lt;/a&gt;.
If you ever wanted direct help on your cluster, now is the right time because
Jim is working on this actively and is not yet drowned in user requests so
generally has a fair bit of time to investigate particular cases.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_yarn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="c1"&gt;# Create a cluster where each worker has two cores and eight GB of memory&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YarnCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;environment.tar.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;worker_vcores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;worker_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8GB&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Scale out to ten such workers&lt;/span&gt;
&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to the cluster&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="more-examples-for-machine-learning"&gt;
&lt;h1&gt;More examples for machine learning&lt;/h1&gt;
&lt;p&gt;Dask maintains a Binder of simple examples that show off various ways to use
the project. This allows people to click a link on the web and quickly be
taken to a Jupyter notebook running on the cloud. It’s a fun way to quickly
experience and learn about a new project.&lt;/p&gt;
&lt;p&gt;Previously we had a single example each for arrays, dataframes, delayed, machine
learning, etc.&lt;/p&gt;
&lt;p&gt;Now &lt;a class="reference external" href="https://stsievert.com/"&gt;Scott Sievert&lt;/a&gt; is expanding the examples within
the machine learning section. He has submitted the following two so far:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fmachine-learning%2Fincremental.ipynb"&gt;Incremental training with Scikit-Learn and large datasets&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fmachine-learning%2Fxgboost.ipynb"&gt;Dask and XGBoost&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I believe he’s planning on more. If you use
&lt;a class="reference external" href="http://dask-ml.readthedocs.io/en/latest/"&gt;dask-ml&lt;/a&gt; and have recommendations or
want to help, you might want to engage in the &lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/new"&gt;dask-ml issue
tracker&lt;/a&gt; or &lt;a class="reference external" href="https://github.com/dask/dask-examples/issues/new"&gt;dask-examples issue
tracker&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="incremental-training"&gt;
&lt;h1&gt;Incremental training&lt;/h1&gt;
&lt;p&gt;The incremental training mentioned as an example above is also new-ish. This
is a Scikit-Learn style meta-estimator that wraps around other estimators that
support the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; method. It enables training on large datasets in an
incremental or batchwise fashion.&lt;/p&gt;
&lt;section id="before"&gt;
&lt;h2&gt;Before&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;

&lt;span class="n"&gt;sgd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="n"&gt;sgd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partial_fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="after"&gt;
&lt;h2&gt;After&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.wrappers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Incremental&lt;/span&gt;

&lt;span class="n"&gt;sgd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;inc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Incremental&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sgd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;inc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="analysis"&gt;
&lt;h2&gt;Analysis&lt;/h2&gt;
&lt;p&gt;From a parallel computing perspective this is a very simple and un-sexy way of
doing things. However my understanding is that it’s also quite pragmatic. In
a distributed context we leave a lot of possible computation on the table (the
solution is inherently sequential) but it’s fun to see the model jump around
the cluster as it absorbs various chunks of data and then moves on.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://user-images.githubusercontent.com/1320475/42237033-2bddf11e-7eec-11e8-88c5-5f0ebd2fb4df.png"
     width="70%"
     alt="Incremental training with Dask-ML"&gt;&lt;/p&gt;
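&lt;p&gt;The sequential pattern described above can be sketched in plain Python. This
is a minimal illustration and not Dask-ML’s implementation:
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RunningMean&lt;/span&gt;&lt;/code&gt; and
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;incremental_fit&lt;/span&gt;&lt;/code&gt; are hypothetical names, and the toy
estimator stands in for any Scikit-Learn style estimator that supports
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;

```python
class RunningMean:
    """Toy estimator (hypothetical) that learns the mean of y incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, X, y):
        # Fold one block of targets into the running mean
        for value in y:
            self.count += 1
            self.mean += (value - self.mean) / self.count
        return self


def incremental_fit(model, blocks):
    """Pass a single model over the blocks one at a time, in order."""
    for X, y in blocks:
        model = model.partial_fit(X, y)
    return model


# Three blocks of (X, y) data, standing in for the chunks of a Dask collection
blocks = [
    ([[0], [1]], [1.0, 2.0]),
    ([[2], [3]], [3.0, 4.0]),
    ([[4]], [5.0]),
]
model = incremental_fit(RunningMean(), blocks)
print(model.count, model.mean)
```

&lt;p&gt;Each call depends on the model state left by the previous block, which is why
only one block is processed at a time no matter how many workers are
available.&lt;/p&gt;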
&lt;p&gt;There’s ongoing work on how best to combine this with other work like pipelines
and hyper-parameter searches to fill in the extra computation.&lt;/p&gt;
&lt;p&gt;This work was primarily done by &lt;a class="reference external" href="https://tomaugspurger.github.io/"&gt;Tom Augspurger&lt;/a&gt;
with help from &lt;a class="reference external" href="https://stsievert.com/"&gt;Scott Sievert&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dask-user-stories"&gt;
&lt;h1&gt;Dask User Stories&lt;/h1&gt;
&lt;p&gt;Dask developers are often asked “Who uses Dask?”. This is a hard question to
answer because, even though we’re inundated with thousands of requests for
help from various companies and research groups, it’s never fully clear who
minds having their information shared with others.&lt;/p&gt;
&lt;p&gt;We’re now trying to crowdsource this information in a more explicit way by
having users tell their own stories. Hopefully this helps other users in their
field understand how Dask can help and when it might (or might not) be useful
to them.&lt;/p&gt;
&lt;p&gt;We originally collected this information in a &lt;a class="reference external" href="https://goo.gl/forms/JEebEFTOPrWa3P4h1"&gt;Google
Form&lt;/a&gt; but have since moved it to a
&lt;a class="reference external" href="https://github.com/mrocklin/dask-stories"&gt;GitHub repository&lt;/a&gt;. Eventually
we’ll publish this as a &lt;a class="reference external" href="https://github.com/mrocklin/dask-stories/issues/7"&gt;proper web
site&lt;/a&gt; and include it in our
documentation.&lt;/p&gt;
&lt;p&gt;If you use Dask and want to share your story this is a great way to contribute
to the project. Arguably Dask needs more help with spreading the word than it
does with technical solutions.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="hpc-deployments"&gt;
&lt;h1&gt;HPC Deployments&lt;/h1&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://dask-jobqueue.readthedocs.io/en/latest/"&gt;Dask Jobqueue&lt;/a&gt; package for
deploying Dask on traditional HPC machines is nearing another release. We’ve
changed around a lot of the parameters and configuration options in order to
improve the onboarding experience for new users. It has been going very
smoothly in recent engagements with new groups, but will mean a breaking
change for existing users of the sub-project.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/07/08/dask-dev/"/>
    <summary>This work is supported by Anaconda Inc</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-07-08T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/06/26/dask-scaling-limits/</id>
    <title>Dask Scaling Limits</title>
    <updated>2018-06-26T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;section id="history"&gt;

&lt;p&gt;For the first year of its life, Dask focused exclusively on single-node
parallelism. We felt then that efficiently supporting 100+GB datasets on
personal laptops or 1TB datasets on large workstations was a sweet spot for
productivity, especially when avoiding the pain of deploying and configuring
distributed systems. We still believe in the efficiency of single-node
parallelism, but in the years since, Dask has extended itself to support larger
distributed systems.&lt;/p&gt;
&lt;p&gt;After that first year, Dask focused equally on both single-node and distributed
parallelism. We maintain &lt;a class="reference external" href="http://dask.pydata.org/en/latest/scheduling.html"&gt;two entirely separate
schedulers&lt;/a&gt;, one optimized for
each case. This allows Dask to be very simple to use on single machines, but
also scale up to thousand-node clusters and 100+TB datasets when needed with
the same API.&lt;/p&gt;
&lt;p&gt;Dask’s distributed system has a single central scheduler and many distributed
workers. This is a common architecture today that scales out to a few thousand
nodes. Roughly speaking Dask scales about the same as a system like Apache
Spark, but less well than a high-performance system like MPI.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="an-example"&gt;
&lt;h1&gt;An Example&lt;/h1&gt;
&lt;p&gt;Most Dask examples in blogposts or talks are on modestly sized datasets,
usually in the 10-50GB range. This, combined with Dask’s history with
medium data on single nodes, may have given people a more humble impression of
Dask than is appropriate.&lt;/p&gt;
&lt;p&gt;As a small nudge, here is an example using Dask to interact with 50 36-core
nodes on an artificial terabyte dataset.&lt;/p&gt;
&lt;iframe width="700"
        height="394"
        src="https://www.youtube.com/embed/nH_AQo8WdKw"
        frameborder="0"
        allow="autoplay; encrypted-media"
        allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;This is a common size for a typical modestly sized Dask cluster. We usually
see Dask deployment sizes either in the tens of machines (usually with Hadoop
style or ad-hoc enterprise clusters), or in the few-thousand range (usually
with high performance computers or cloud deployments). We’re showing the
modest case here just due to lack of resources. Everything in that example
should work fine scaling out a couple extra orders of magnitude.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="challenges-to-scaling-out"&gt;
&lt;h1&gt;Challenges to Scaling Out&lt;/h1&gt;
&lt;p&gt;For the rest of the article we’ll talk about common causes that we see today
that get in the way of scaling out. These are collected from experience
working both with people in the open source community and through private
contracts.&lt;/p&gt;
&lt;section id="simple-map-reduce-style"&gt;
&lt;h2&gt;Simple Map-Reduce style&lt;/h2&gt;
&lt;p&gt;If you’re doing simple map-reduce style parallelism then things will be pretty
smooth out to a large number of nodes. However, there are still some
limitations to keep in mind:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The scheduler will have at least one, and possibly a few connections open
to each worker. You’ll want to ensure that your machines can have many
open file handles at once. Some Linux distributions cap this at 1024 by
default, but it is easy to change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The scheduler has an overhead of around 200 microseconds per task.
So if each task takes one second then your scheduler can saturate 5000
cores, but if each task takes only 100ms then your scheduler can only
saturate around 500 cores, and so on. Task duration imposes an inversely
proportional constraint on scaling.&lt;/p&gt;
&lt;p&gt;If you want to scale beyond this, each task needs to do more work so that
scheduler overhead stays small relative to compute time. Often this means
moving inner for loops inside tasks rather than spreading them out across many
tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
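&lt;p&gt;The arithmetic in the second point can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes a fixed per-task overhead equal to the 200 microsecond figure quoted above and ignores every other cost. The function name is made up for illustration.&lt;/p&gt;

```python
# Rough estimate only: assumes a constant scheduler overhead per task
# (the ~200 microsecond figure quoted above) and nothing else.
SCHEDULER_OVERHEAD_SECONDS = 200e-6


def saturated_cores(task_duration_seconds, overhead_seconds=SCHEDULER_OVERHEAD_SECONDS):
    """Approximate number of cores a single scheduler can keep busy."""
    tasks_per_second = 1.0 / overhead_seconds
    return round(tasks_per_second * task_duration_seconds)


for duration in (1.0, 0.1, 0.01):
    print(f"{duration:5.2f} s tasks: about {saturated_cores(duration)} cores")
```

&lt;p&gt;Making each task ten times longer raises the estimate tenfold, which is why moving inner loops inside tasks helps.&lt;/p&gt;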
&lt;/section&gt;
&lt;section id="more-complex-algorithms"&gt;
&lt;h2&gt;More complex algorithms&lt;/h2&gt;
&lt;p&gt;If you’re doing more complex algorithms (which is common among Dask users) then
many more things can break along the way. High performance computing isn’t
about doing any one thing well, it’s about doing &lt;em&gt;nothing badly&lt;/em&gt;. This section
lists a few issues that arise for larger deployments:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Dask collection algorithms may be suboptimal.&lt;/p&gt;
&lt;p&gt;The parallel algorithms in Dask-array/bag/dataframe/ml are &lt;em&gt;pretty&lt;/em&gt; good,
but as Dask scales out to larger clusters and its algorithms are used by
more domains we invariably find that small corners of the API fail beyond a
certain point. Luckily these are usually pretty easy to fix after they are
reported.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The graph size may grow too large for the scheduler&lt;/p&gt;
&lt;p&gt;The metadata describing your computation has to all fit on a single
machine, the Dask scheduler. This metadata, the task graph, can grow big
if you’re not careful. It’s nice to have a scheduler process with at least
a few gigabytes of memory if you’re going to be processing million-node
task graphs. A task takes up around 1kB of memory &lt;em&gt;if&lt;/em&gt; you’re careful to
avoid closing over any unnecessary local data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The graph serialization time may become annoying for interactive use&lt;/p&gt;
&lt;p&gt;Again, if you have million-node task graphs you’re going to be serializing
them and passing them from the client to the scheduler. This is &lt;em&gt;fine&lt;/em&gt;,
assuming they fit at both ends, but can take up some time and limit
interactivity. If you press &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute&lt;/span&gt;&lt;/code&gt; and nothing shows up on the
dashboard for a minute or two, this is what’s happening.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The interactive dashboard plots stop being as useful&lt;/p&gt;
&lt;p&gt;Those beautiful plots on the dashboard were mostly designed for deployments
with 1-100 nodes, but not 1000s. Seeing the start and stop time of every
task of a million-task computation just isn’t something that our brains can
fully understand.&lt;/p&gt;
&lt;p&gt;This is something that we would like to improve. If anyone out there is
interested in scalable performance diagnostics, please get involved.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other components that you rely on, like distributed storage, may also start
to break&lt;/p&gt;
&lt;p&gt;Dask provides users more power than they’re accustomed to.
It’s easy for them to accidentally clobber some other component of their
systems, like distributed storage, a local database, the network, and so
on, with too many requests.&lt;/p&gt;
&lt;p&gt;Many of these systems provide abstractions that are very well tested and
stable for normal single-machine use, but that quickly become brittle when
you have a thousand machines acting on them with the full creativity of a
novice user. Dask provides some primitives like distributed locks and
queues to help control access to these resources, but it’s on the user to
use them well and not break things.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
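&lt;p&gt;To get a feel for the graph-size point, here is a rough sizing sketch. It assumes the figure of about 1kB per task quoted above, which only holds when tasks don’t close over extra data, so treat the result as a lower bound. The helper name is hypothetical.&lt;/p&gt;

```python
# Assumption: about 1 kB of scheduler memory per task, as quoted above.
# Tasks that capture local data can be far larger than this.
BYTES_PER_TASK = 1_000


def graph_memory_gb(n_tasks, bytes_per_task=BYTES_PER_TASK):
    """Estimate the scheduler memory used by task metadata, in gigabytes."""
    return n_tasks * bytes_per_task / 1e9


for n_tasks in (100_000, 1_000_000, 10_000_000):
    print(f"{n_tasks} tasks: roughly {graph_memory_gb(n_tasks):.1f} GB")
```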
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Dask scales happily out to tens of nodes, like in the example above, or to
thousands of nodes, which I’m not showing here simply due to lack of resources.&lt;/p&gt;
&lt;p&gt;Dask provides this scalability while still maintaining the flexibility and
freedom to build custom systems that has defined the project since it began.
However, the combination of scalability and freedom makes it hard for Dask to
fully protect users from breaking things. It’s much easier to protect users
when you can constrain what they can do. When users stick to standard
workflows like Dask dataframe or Dask array they’ll probably be ok, but when
operating with full creativity at the thousand-node scale some expertise will
invariably be necessary. We try hard to provide the diagnostics and tools
necessary to investigate issues and control operation. The project is getting
better at this every day, in large part due to some expert users out there.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="a-call-for-examples"&gt;
&lt;h1&gt;A Call for Examples&lt;/h1&gt;
&lt;p&gt;Do you use Dask on more than one machine to do interesting work?
We’d love to hear about it either in the comments below, or in this &lt;a class="reference external" href="https://goo.gl/forms/ueIMoGl6ZPl529203"&gt;online
form&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/06/26/dask-scaling-limits/"/>
    <summary>This work is supported by Anaconda Inc.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-06-26T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/06/14/dask-0.18.0/</id>
    <title>Dask Release 0.18.0</title>
    <updated>2018-06-14T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’m pleased to announce the release of Dask version 0.18.0. This is a major
release with breaking changes and new features.
The last release was 0.17.5 on May 4th.
This blogpost outlines notable changes since the last release blogpost for
0.17.2 on March 21st.&lt;/p&gt;
&lt;p&gt;You can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install dask[complete] --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We list some breaking changes below, followed up by changes that are less
important, but still fun.&lt;/p&gt;
&lt;section id="context"&gt;

&lt;p&gt;The Dask core library is nearing a 1.0 release.
Before that happens, we need to do some housecleaning.
This release starts that process,
replaces some existing interfaces,
and builds up some needed infrastructure.
Almost all of the changes in this release include clean deprecation warnings,
but future releases will remove the old functionality, so now would be a good
time to check in.&lt;/p&gt;
&lt;p&gt;As happens with any release that starts breaking things,
many other smaller breaks get added on as well.
I’m personally very happy with this release because many aspects of using Dask
now feel a lot cleaner; however, heavy users of Dask will likely experience
mild friction. Hopefully this post helps explain some of the larger changes.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="notable-breaking-changes"&gt;
&lt;h1&gt;Notable Breaking changes&lt;/h1&gt;
&lt;section id="centralized-configuration"&gt;
&lt;h2&gt;Centralized configuration&lt;/h2&gt;
&lt;p&gt;Taking full advantage of Dask sometimes requires user configuration, especially
in a distributed setting. This might be to control logging verbosity, specify
cluster configuration, provide credentials for security, or any of several
other options that arise in production.&lt;/p&gt;
&lt;p&gt;We’ve found that different computing cultures like to specify configuration in
several different ways:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Configuration files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Environment variables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Directly within Python code&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Previously this was handled with a variety of different solutions among the
different dask subprojects. The dask-distributed project had one system,
dask-kubernetes had another, and so on.&lt;/p&gt;
&lt;p&gt;Now we centralize configuration in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.config&lt;/span&gt;&lt;/code&gt; module, which collects
configuration from config files, environment variables, and runtime code, and
makes it centrally available to all Dask subprojects. A number of Dask
subprojects (dask.distributed,
&lt;a class="reference external" href="http://dask-kubernetes.readthedocs.io/en/latest/"&gt;dask-kubernetes&lt;/a&gt;, and
&lt;a class="reference external" href="http://dask-jobqueue.readthedocs.io/en/latest/"&gt;dask-jobqueue&lt;/a&gt;), are being
co-released at the same time to take advantage of this.&lt;/p&gt;
&lt;p&gt;If you were actively using Dask.distributed’s configuration files some things
have changed:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The configuration is now namespaced and more heavily nested. Here is an
example from the &lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/distributed/distributed.yaml"&gt;dask.distributed default config
file&lt;/a&gt;
today:&lt;/p&gt;
&lt;div class="highlight-yaml notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;distributed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;allowed-failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# number of retries before a task is considered bad&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;work-stealing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;True&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# workers should steal tasks from each other&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;worker-ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# like &amp;#39;60s&amp;#39;. Workers must heartbeat faster than this&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;multiprocessing-method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;forkserver&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;use-file-locking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;True&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The default configuration location has moved from &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;~/.dask/config.yaml&lt;/span&gt;&lt;/code&gt; to
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;~/.config/dask/distributed.yaml&lt;/span&gt;&lt;/code&gt;, where it will live alongside several
other files like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;kubernetes.yaml&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;jobqueue.yaml&lt;/span&gt;&lt;/code&gt;, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However, your old configuration files will still be found and their values
will be used appropriately. We don’t make any attempt to migrate your old
config values to the new location though. You may want to delete the
auto-generated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;~/.dask/config.yaml&lt;/span&gt;&lt;/code&gt; file at some point, if you feel like being
particularly clean.&lt;/p&gt;
&lt;p&gt;You can learn more about Dask’s configuration in &lt;a class="reference external" href="http://dask.pydata.org/en/latest/configuration.html"&gt;Dask’s configuration
documentation&lt;/a&gt;.&lt;/p&gt;
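&lt;p&gt;One consequence of the namespaced, nested layout is that a setting is identified by its path through the hierarchy. The sketch below illustrates the idea with a plain dictionary and a hypothetical &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;lookup&lt;/span&gt;&lt;/code&gt; helper. It is not Dask’s implementation, and the dictionary merely stands in for a parsed YAML file.&lt;/p&gt;

```python
# Illustrative stand-in for a parsed configuration file; the values mirror
# the YAML example above. This is not how Dask loads its configuration.
config = {
    "distributed": {
        "scheduler": {"allowed-failures": 3, "work-stealing": True},
        "worker": {"multiprocessing-method": "forkserver"},
    }
}


def lookup(mapping, dotted_key):
    """Walk a nested dict, one level per dot-separated name (hypothetical helper)."""
    value = mapping
    for part in dotted_key.split("."):
        value = value[part]
    return value


print(lookup(config, "distributed.scheduler.allowed-failures"))
print(lookup(config, "distributed.worker.multiprocessing-method"))
```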
&lt;/section&gt;
&lt;section id="replaced-the-common-get-keyword-with-scheduler"&gt;
&lt;h2&gt;Replaced the common get= keyword with scheduler=&lt;/h2&gt;
&lt;p&gt;Dask can execute code with a variety of scheduler backends based on threads,
processes, single-threaded execution, or distributed clusters.&lt;/p&gt;
&lt;p&gt;Previously, users selected between these backends using the somewhat
generically named &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;get=&lt;/span&gt;&lt;/code&gt; keyword:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threaded&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_sync&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We’ve replaced this with a newer, and hopefully clearer, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;scheduler=&lt;/span&gt;&lt;/code&gt; keyword:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;processes&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;single-threaded&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;get=&lt;/span&gt;&lt;/code&gt; keyword has been deprecated and will raise a warning. It will be
removed entirely in the next major release.&lt;/p&gt;
&lt;p&gt;For more information, see &lt;a class="reference external" href="http://dask.pydata.org/en/latest/scheduling.html"&gt;documentation on selecting different schedulers&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="replaced-dask-set-options-with-dask-config-set"&gt;
&lt;h2&gt;Replaced dask.set_options with dask.config.set&lt;/h2&gt;
&lt;p&gt;Related to the configuration changes, we now include runtime state in the
configuration. Previously people set runtime state with the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.set_options&lt;/span&gt;&lt;/code&gt; context manager. Now we recommend using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.config.set&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# Before&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# After&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.set_options&lt;/span&gt;&lt;/code&gt; function is now an alias to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.config.set&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
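&lt;p&gt;As a small end-to-end sketch, the following runs a toy computation under the context manager and checks that the setting only applies inside the block. It assumes Dask is installed. The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;inc&lt;/span&gt;&lt;/code&gt; function and the variable names are made up for illustration.&lt;/p&gt;

```python
import dask


@dask.delayed
def inc(x):
    # Toy task used only for this illustration.
    return x + 1


# Lazily sum the results of four small tasks.
total = dask.delayed(sum)([inc(i) for i in range(4)])

before = dask.config.get("scheduler", None)
with dask.config.set(scheduler="single-threaded"):
    inside = dask.config.get("scheduler")
    result = total.compute()  # picks up the scheduler set above
after = dask.config.get("scheduler", None)

print(inside, result, after == before)
```

&lt;p&gt;The value seen inside the block is the one passed to the context manager, and the previous value is restored on exit.&lt;/p&gt;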
&lt;/section&gt;
&lt;section id="removed-the-dask-array-learn-subpackage"&gt;
&lt;h2&gt;Removed the dask.array.learn subpackage&lt;/h2&gt;
&lt;p&gt;This was unadvertised and saw very little use. All functionality (and much
more) is now available in &lt;a class="reference external" href="http://dask-ml.readthedocs.io/en/latest/"&gt;Dask-ML&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="other"&gt;
&lt;h2&gt;Other&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We’ve removed the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;token=&lt;/span&gt;&lt;/code&gt; keyword from map_blocks and moved the
functionality to the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;name=&lt;/span&gt;&lt;/code&gt; keyword.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.distributed.worker_client&lt;/span&gt;&lt;/code&gt; automatically rejoins the threadpool when
you close the context manager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Dask.distributed protocol now interprets msgpack arrays as tuples
rather than lists.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="fun-new-features"&gt;
&lt;h1&gt;Fun new features&lt;/h1&gt;
&lt;section id="arrays"&gt;
&lt;h2&gt;Arrays&lt;/h2&gt;
&lt;section id="generalized-universal-functions"&gt;
&lt;h3&gt;Generalized Universal Functions&lt;/h3&gt;
&lt;p&gt;Dask.array now supports Numpy-style
&lt;a class="reference external" href="https://docs.scipy.org/doc/numpy-1.13.0/reference/c-api.generalized-ufuncs.html"&gt;Generalized Universal Functions (gufuncs)&lt;/a&gt;
transparently.
This means that you can apply normal Numpy GUFuncs, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;eig&lt;/span&gt;&lt;/code&gt; in the example
below, directly onto Dask arrays:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Apply a Numpy GUFunc, eig, directly onto a Dask array&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_umath_linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dtypes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# w and v are dask arrays with eig applied along the latter two axes&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Numpy implements many of its internal functions as gufuncs, but has not
yet exposed them through its public API.
Additionally we can define GUFuncs with other projects, like Numba:&lt;/p&gt;
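&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numba&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;float64&lt;/span&gt;  &lt;span class="c1"&gt;# type used in the signature below&lt;/span&gt;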

&lt;span class="nd"&gt;@numba&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# if x and y are dask arrays, then z will be too&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What I like about this is that Dask and Numba developers didn’t coordinate
at all on this feature, it’s just that they both support the Numpy GUFunc
protocol, so you get interactions like this for free.&lt;/p&gt;
&lt;p&gt;For more information see &lt;a class="reference external" href="http://dask.pydata.org/en/latest/array-gufunc.html"&gt;Dask’s GUFunc documentation&lt;/a&gt;. This work was done by &lt;a class="reference external" href="https://github.com/magonser"&gt;Markus Gonser (&amp;#64;magonser)&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="new-auto-value-for-rechunking"&gt;
&lt;h3&gt;New “auto” value for rechunking&lt;/h3&gt;
&lt;p&gt;Dask arrays now accept a value, “auto”, wherever a chunk value would previously
be accepted. This asks Dask to rechunk those dimensions to achieve a good
default chunk size.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# single chunk in this dimension&lt;/span&gt;
  &lt;span class="c1"&gt;# 1: 100e6 / x.dtype.itemsize / x.shape[0],  # before we had to calculate manually&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;      &lt;span class="c1"&gt;# Now we allow this dimension to respond to get ideal chunk size&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# or&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This also checks the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;array.chunk-size&lt;/span&gt;&lt;/code&gt; config value for optimal chunk sizes:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;array.chunk-size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;&amp;#39;128MiB&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To be clear, this doesn’t support “automatic chunking”, which is a very hard
problem in general. Users still need to be aware of their computations and how
they want to chunk; this just makes it marginally easier to make good
decisions.&lt;/p&gt;
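&lt;p&gt;For comparison, the manual calculation that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;/code&gt; replaces can be sketched in plain Python.
This only illustrates the arithmetic from the commented-out line in the example above;
the helper name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chunk_length&lt;/span&gt;&lt;/code&gt;, the 100 MB target, and the array shape are assumptions made for this example and are not part of Dask.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;def chunk_length(target_bytes, itemsize, fixed_length):
    """Length of the free dimension so that one chunk holds about target_bytes."""
    # hypothetical helper: divide the byte budget by the bytes in one slice
    return max(1, int(target_bytes // (itemsize * fixed_length)))


# float64 data (8 bytes per element) with 10,000 rows kept in a single chunk
columns_per_chunk = chunk_length(100e6, 8, 10_000)
print(columns_per_chunk)  # 1250
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here 1250 columns of 10,000 eight-byte values come to 100 MB per chunk, which is the kind of bookkeeping that no longer has to be done by hand.&lt;/p&gt;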
&lt;/section&gt;
&lt;section id="algorithmic-improvements"&gt;
&lt;h3&gt;Algorithmic improvements&lt;/h3&gt;
&lt;p&gt;Dask.array gained a full &lt;a class="reference external" href="http://dask.pydata.org/en/latest/array-api.html#dask.array.einsum"&gt;einsum&lt;/a&gt; implementation thanks to &lt;a class="reference external" href="https://github.com/sjperkins"&gt;Simon Perkins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, Dask.array’s QR decompositions have become nicer in two ways:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;They support &lt;a class="reference external" href="http://dask.pydata.org/en/latest/array-api.html#dask.array.linalg.sfqr"&gt;short-and-fat arrays&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="http://dask.pydata.org/en/latest/array-api.html#dask.array.linalg.tsqr"&gt;tall-and-skinny&lt;/a&gt;
variant now operates more robustly in less memory. Here is a friendly GIF
of execution:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;img src="https://user-images.githubusercontent.com/306380/41350133-175cac7e-6ee0-11e8-9a0e-785c6e846409.gif" width="40%"&gt;
&lt;p&gt;This work is greatly appreciated and was done by &lt;a class="reference external" href="https://github.com/convexset"&gt;Jeremy Chen&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Native support for the &lt;a class="reference external" href="http://zarr.readthedocs.io/en/stable/"&gt;Zarr format&lt;/a&gt; for
chunked n-dimensional arrays landed thanks to &lt;a class="reference external" href="https://github.com/martindurant"&gt;Martin
Durant&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/jakirkham"&gt;John A
Kirkham&lt;/a&gt;. Zarr has been especially useful due to
its speed, simple spec, support of the full NetCDF style conventions, and
amenability to cloud storage.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="dataframes-and-pandas-0-23"&gt;
&lt;h2&gt;Dataframes and Pandas 0.23&lt;/h2&gt;
&lt;p&gt;As usual, Dask Dataframes had many small improvements. Of note is continued
compatibility with the just-released Pandas 0.23, and some new data ingestion
formats.&lt;/p&gt;
&lt;p&gt;Dask.dataframe is consistent with changes in the recent Pandas 0.23 release
thanks to &lt;a class="reference external" href="https://github.com/TomAugspurger"&gt;Tom Augspurger&lt;/a&gt;.&lt;/p&gt;
&lt;section id="orc-support"&gt;
&lt;h3&gt;ORC support&lt;/h3&gt;
&lt;p&gt;Dask.dataframe has grown a reader for the &lt;a class="reference external" href="https://orc.apache.org/"&gt;Apache ORC&lt;/a&gt; format.&lt;/p&gt;
&lt;p&gt;ORC is a format for tabular data storage that is common in the Hadoop ecosystem.
The new
&lt;a class="reference external" href="http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_orc"&gt;dd.read_orc&lt;/a&gt;
function parallelizes around similarly new ORC functionality within PyArrow.
Thanks to &lt;a class="reference external" href="https://github.com/jcrist"&gt;Jim Crist&lt;/a&gt; for the work on the Arrow side
and &lt;a class="reference external" href="https://github.com/martindurant"&gt;Martin Durant&lt;/a&gt; for parallelizing it with
Dask.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="read-json-support"&gt;
&lt;h3&gt;Read_json support&lt;/h3&gt;
&lt;p&gt;Dask.dataframe has also grown a reader for JSON files.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_json"&gt;dd.read_json&lt;/a&gt;
function matches most of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pandas.read_json&lt;/span&gt;&lt;/code&gt; API.&lt;/p&gt;
&lt;p&gt;This came about shortly after a recent &lt;a class="reference external" href="https://www.youtube.com/watch?v=X4YHGKj3V5M"&gt;PyCon 2018 talk comparing Spark and
Dask dataframe&lt;/a&gt; where &lt;a class="reference external" href="https://github.com/j-bennet"&gt;Irina
Truong&lt;/a&gt; mentioned that it was missing. Thanks to
&lt;a class="reference external" href="https://github.com/martindurant"&gt;Martin Durant&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/j-bennet"&gt;Irina
Truong&lt;/a&gt; for this contribution.&lt;/p&gt;
&lt;p&gt;See the &lt;a class="reference external" href="http://dask.pydata.org/en/latest/dataframe-create.html"&gt;dataframe data ingestion documentation&lt;/a&gt;
for more information about JSON, ORC, or any of the other formats
supported by Dask.dataframe.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="joblib"&gt;
&lt;h2&gt;Joblib&lt;/h2&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://pythonhosted.org/joblib/"&gt;Joblib&lt;/a&gt; library for parallel computing within
Scikit-Learn has had a &lt;a class="reference external" href="http://dask-ml.readthedocs.io/en/latest/joblib.html"&gt;Dask backend&lt;/a&gt;
for a while now. While it has always been pretty easy to use, it’s now
becoming much easier to use well without much expertise. After using this in
practice for a while together with the Scikit-Learn developers, we’ve
identified and smoothed over a number of usability issues. These changes will
only be fully available after the next Scikit-Learn release (hopefully soon) at
which point we’ll probably release a new blogpost dedicated to the topic.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="related-projects"&gt;
&lt;h1&gt;Related projects&lt;/h1&gt;
&lt;p&gt;This release is timed with the following packages:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;dask&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;distributed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dask-kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There is also a new repository for deploying applications on YARN (a job
scheduler common in Hadoop environments) called
&lt;a class="reference external" href="https://jcrist.github.io/skein/"&gt;skein&lt;/a&gt;. Early adopters welcome.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;Since March 21st, the following people have contributed to the following repositories:&lt;/p&gt;
&lt;p&gt;The core Dask repository for parallel algorithms:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Andrethrill&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Beomi&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brendan Martin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Christopher Ren&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guido Imperiale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Diane Trout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;fjetter&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frederick&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Henry Doupe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jeremy Chen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John A Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jon Mease&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jörg Dietrich&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kevin Mader&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ksenia Bobrova&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Larsr&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marc Pfister&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Markus Gonser&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matt Lee&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pierre-Bartet&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simon Perkins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stefan van der Walt&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stephan Hoyer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uwe L. Korn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yu Feng&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dask/distributed repository for distributed computing:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Bmaisonn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grant Jenks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Henry Doupe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Irene Rodriguez&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Irina Truong&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John A Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joseph Atkins-Turkish&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kenneth Koski&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marius van Niekerk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Olivier Grisel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Russ Bubley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tony Lorenzo&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dask-kubernetes repository for deploying Dask on Kubernetes:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Brendan Martin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;J Gerard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Olivier Grisel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yuvi Panda&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dask-jobqueue repository for deploying Dask on HPC job schedulers:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Guillaume Eynard-Bontemps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;jgerardsimcock&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joseph Hamman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ray Bell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rich Signell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shawn Taylor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spencer Clark&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dask-ml repository for scalable machine learning:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Christopher Ren&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jeremy Chen&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="id1"&gt;
&lt;h2&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;Thanks to Scott Sievert and James Bourbeau for their help editing this article.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/06/14/dask-0.18.0/"/>
    <summary>This work is supported by Anaconda Inc.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-06-14T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/05/27/beyond-numpy/</id>
    <title>Beyond Numpy Arrays in Python</title>
    <updated>2018-05-27T00:00:00+00:00</updated>
    <content type="html">
&lt;section id="executive-summary"&gt;

&lt;p&gt;In recent years Python’s array computing ecosystem has grown organically to support
GPUs, sparse, and distributed arrays.
This is wonderful and a great example of the growth that can occur in decentralized open source development.&lt;/p&gt;
&lt;p&gt;However to solidify this growth and apply it across the ecosystem we now need to do some central planning
to move from a pair-wise model where packages need to know about each other
to an ecosystem model where packages can negotiate by developing and adhering to community-standard protocols.&lt;/p&gt;
&lt;p&gt;With moderate effort we can define a subset of the Numpy API that works well across all of them,
allowing the ecosystem to more smoothly transition between hardware.
This post describes the opportunities and challenges to accomplish this.&lt;/p&gt;
&lt;p&gt;We start by discussing two kinds of libraries:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Libraries that &lt;em&gt;implement&lt;/em&gt; the Numpy API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Libraries that &lt;em&gt;consume&lt;/em&gt; the Numpy API and build new functionality on top
of it&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="libraries-that-implement-the-numpy-api"&gt;
&lt;h1&gt;Libraries that Implement the Numpy API&lt;/h1&gt;
&lt;p&gt;The Numpy array is one of the foundations of the numeric Python ecosystem,
and serves as the standard model for similar libraries in other languages.
Today it is used to analyze satellite and biomedical imagery, financial models,
genomes, oceans and the atmosphere, super-computer simulations,
and data from thousands of other domains.&lt;/p&gt;
&lt;p&gt;However, Numpy was designed several years ago,
and its implementation is no longer optimal for some modern hardware,
particularly multi-core workstations, many-core GPUs, and distributed clusters.&lt;/p&gt;
&lt;p&gt;Fortunately other libraries implement the Numpy array API on these other architectures:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://cupy.chainer.org/"&gt;CuPy&lt;/a&gt;: implements the Numpy API on GPUs with CUDA&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://sparse.pydata.org/"&gt;Sparse&lt;/a&gt;: implements the Numpy API for sparse arrays that are mostly zeros&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org/"&gt;Dask array&lt;/a&gt;: implements the Numpy API in parallel for multi-core workstations or distributed clusters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So even when the Numpy implementation is no longer ideal,
the &lt;em&gt;Numpy API&lt;/em&gt; lives on in successor projects.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: the Numpy implementation remains ideal most of the time.
Dense in-memory arrays are still the common case.
This blogpost is about the minority of cases where Numpy is not ideal.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So today we can write similar code across all of
Numpy, GPU, sparse, and parallel arrays:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Runs on a single CPU&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cp&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Runs on a GPU&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Runs on many CPUs&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Additionally, each of the deep learning frameworks
(TensorFlow, PyTorch, MXNet)
has a Numpy-like thing that is &lt;em&gt;similar-ish&lt;/em&gt; to Numpy’s API,
but definitely not trying to be an exact match.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="libraries-that-consume-and-extend-the-numpy-api"&gt;
&lt;h1&gt;Libraries that consume and extend the Numpy API&lt;/h1&gt;
&lt;p&gt;At the same time as the development of Numpy APIs for different hardware,
many libraries today build algorithmic functionality on top of the Numpy API:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://xarray.pydata.org/en/stable/"&gt;XArray&lt;/a&gt;
for labeled and indexed collections of arrays&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/hips/autograd"&gt;Autograd&lt;/a&gt; and
&lt;a class="reference external" href="https://github.com/google/tangent/"&gt;Tangent&lt;/a&gt;:
for automatic differentiation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://tensorly.org/stable/index.html"&gt;TensorLy&lt;/a&gt;
for higher order array factorizations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org"&gt;Dask array&lt;/a&gt;
which coordinates many Numpy-like arrays into a logical parallel array&lt;/p&gt;
&lt;p&gt;(dask array both &lt;em&gt;consumes&lt;/em&gt; and &lt;em&gt;implements&lt;/em&gt; the Numpy API)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://optimized-einsum.readthedocs.io/en/latest/"&gt;Opt Einsum&lt;/a&gt;
for more efficient Einstein summation operations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These projects and more enhance array computing in Python,
building on new features beyond what Numpy itself provides.&lt;/p&gt;
&lt;p&gt;There are also projects like Pandas, Scikit-Learn, and SciPy,
that use Numpy’s in-memory internal representation.
We’re going to ignore these libraries for this blogpost
and focus on those libraries that only use the high-level Numpy API
and not the low-level representation.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="opportunities-and-challenges"&gt;
&lt;h1&gt;Opportunities and Challenges&lt;/h1&gt;
&lt;p&gt;Given the two groups of projects:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;New libraries that &lt;em&gt;implement&lt;/em&gt; the Numpy API
(CuPy, Sparse, Dask array)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;New libraries that &lt;em&gt;consume&lt;/em&gt; and &lt;em&gt;extend&lt;/em&gt; the Numpy API
(XArray, Autograd/Tangent, TensorLy, Opt Einsum)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We want to use them together, applying Autograd to CuPy, TensorLy to Sparse,
and so on, including all future implementations that might follow.
This is challenging.&lt;/p&gt;
&lt;p&gt;Unfortunately,
while all of the array implementations’ APIs are &lt;em&gt;very similar&lt;/em&gt; to Numpy’s API,
they use different functions.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;
&lt;span class="go"&gt;False&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This creates problems for the consumer libraries,
because now they need to switch out which functions they use
depending on which array-like objects they’ve been given.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Today each array project implements a custom plugin system
that they use to switch between some of the array options.
Links to these plugin mechanisms are below if you’re interested:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/pydata/xarray/blob/c346d3b7bcdbd6073cf96fdeb0710467a284a611/xarray/core/duck_array_ops.py"&gt;xarray/core/duck_array_ops.py&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/tensorly/tensorly/tree/af0700af61ca2cd104e90755d5e5033e23fd4ec4/tensorly/backend"&gt;tensorly/backend&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/HIPS/autograd/blob/bd3f92fcd4d66424be5fb6b6d3a7f9195c98eebf/autograd/numpy/numpy_vspaces.py"&gt;autograd/numpy/numpy_vspaces.py&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/google/tangent/blob/bc64848bba964c632a6da4965fb91f2f61a3cdd4/tangent/template.py"&gt;tangent/template.py&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/8f164773cb3717b3c5ad856341205f605b8404cf/dask/array/core.py#L59-L62"&gt;dask/array/core.py#L59-L62&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dgasmith/opt_einsum/blob/32c1b0adb50511da1b86dc98bcf169d79b44efce/opt_einsum/backends.py"&gt;opt_einsum/backends.py&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example XArray can use either Numpy arrays or Dask arrays.
This has been hugely beneficial to users of that project,
who today seamlessly transition from small in-memory datasets on their laptops
to 100TB datasets on clusters,
all using the same programming model.
However when considering adding sparse or GPU arrays to XArray’s plugin system,
it quickly became clear that this would be expensive today.&lt;/p&gt;
&lt;p&gt;Building, maintaining, and extending these plugin mechanisms is &lt;em&gt;costly&lt;/em&gt;.
The plugin systems in each project are not alike,
so any new array implementation has to go to each library and build the same code several times.
Similarly, any new algorithmic library must build plugins to every ndarray implementation.
Each library has to explicitly import and understand each other library,
and has to adapt as those libraries change over time.
This coverage is not complete,
and so users lack confidence that their applications are portable between hardware.&lt;/p&gt;
&lt;p&gt;Pair-wise plugin mechanisms make sense for a single project,
but are not an efficient choice for the full ecosystem.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="solutions"&gt;
&lt;h1&gt;Solutions&lt;/h1&gt;
&lt;p&gt;I see two solutions today:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Build a new library that holds dispatch-able versions of all of the relevant Numpy functions
and convince everyone to use it instead of Numpy internally&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build this dispatch mechanism into Numpy itself&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each has challenges.&lt;/p&gt;
&lt;section id="build-a-new-centralized-plugin-library"&gt;
&lt;h2&gt;Build a new centralized plugin library&lt;/h2&gt;
&lt;p&gt;We can build a new library,
here called &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;arrayish&lt;/span&gt;&lt;/code&gt;,
that holds dispatch-able versions of all of the relevant Numpy functions.
We then convince everyone to use it instead of Numpy internally.&lt;/p&gt;
&lt;p&gt;So in each array-like library’s codebase we write code like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# inside numpy&amp;#39;s codebase&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;arrayish&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;
&lt;span class="nd"&gt;@arrayish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@arrayish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@arrayish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# inside cupy&amp;#39;s codebase&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;arrayish&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="nd"&gt;@arrayish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@arrayish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@arrayish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;and so on for Dask, Sparse, and any other Numpy-like libraries.&lt;/p&gt;
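&lt;p&gt;To make the registration idea concrete, here is a minimal sketch of a dispatch-able function built on the standard library’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;functools.singledispatch&lt;/span&gt;&lt;/code&gt;.
This is not how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;arrayish&lt;/span&gt;&lt;/code&gt; is implemented; the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Degrees&lt;/span&gt;&lt;/code&gt; type is a hypothetical stand-in for an externally developed array type,
and single dispatch only looks at the first argument, so a two-argument function like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dot&lt;/span&gt;&lt;/code&gt; would need something richer.&lt;/p&gt;

```python
from functools import singledispatch
import math


@singledispatch
def sin(x):
    """Dispatch-able function that a central library would own."""
    raise TypeError("no sin implementation registered for " + type(x).__name__)


# A backend registers an existing implementation for its own type.
sin.register(float, math.sin)


class Degrees:
    """Hypothetical type from some other library (an assumption for this sketch)."""

    def __init__(self, value):
        self.value = value


# Another backend registers its implementation without the central
# library ever importing it.
@sin.register(Degrees)
def _sin_degrees(x):
    return math.sin(math.radians(x.value))


print(sin(0.0))          # handled by math.sin
print(sin(Degrees(90)))  # handled by _sin_degrees
```

&lt;p&gt;Algorithmic code would call only the central &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sin&lt;/span&gt;&lt;/code&gt; and never needs to know which backends exist.&lt;/p&gt;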
&lt;p&gt;In all of the algorithm libraries (like XArray, autograd, TensorLy, …)
we use arrayish instead of Numpy:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# inside XArray&amp;#39;s codebase&lt;/span&gt;
&lt;span class="c1"&gt;# import numpy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;arrayish&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is the same plugin solution as before,
but now we build a community standard plugin system
that hopefully all of the projects can agree to use.&lt;/p&gt;
&lt;p&gt;This reduces the big &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt; &lt;span class="pre"&gt;by&lt;/span&gt; &lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt; cost of maintaining several plugin systems,
to a more manageable &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n&lt;/span&gt; &lt;span class="pre"&gt;plus&lt;/span&gt; &lt;span class="pre"&gt;m&lt;/span&gt;&lt;/code&gt; cost of using a single plugin system in each library.
This centralized project would also, perhaps,
be better maintained than any individual project’s plugin system is likely to be on its own.&lt;/p&gt;
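&lt;p&gt;As a quick worked example of that arithmetic, the sketch below counts integrations for the libraries named in this post.
The particular grouping of libraries is only illustrative.&lt;/p&gt;

```python
# Libraries named in this post, grouped for illustration only.
array_libs = ["numpy", "cupy", "dask", "sparse"]      # n array implementations
algorithm_libs = ["xarray", "autograd", "tensorly"]   # m algorithmic libraries

# Pair-wise plugins: every algorithmic library adapts to every array library.
pairwise_integrations = len(array_libs) * len(algorithm_libs)

# Central plugin system: every library integrates once with the shared layer.
central_integrations = len(array_libs) + len(algorithm_libs)

print(pairwise_integrations, central_integrations)  # 12 7
```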
&lt;p&gt;However this has costs:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Getting many different projects to agree on a new standard is hard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Algorithmic projects will need to start using arrayish internally,
adding new imports like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;arrayish&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And this will certainly cause some complications internally.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Someone needs to build and maintain the central infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/hameerabbasi"&gt;Hameer Abbasi&lt;/a&gt;
put together a rudimentary prototype for arrayish here:
&lt;a class="reference external" href="https://github.com/hameerabbasi/arrayish"&gt;github.com/hameerabbasi/arrayish&lt;/a&gt;.
There has been some discussion about this topic, using XArray+Sparse as an example, in
&lt;a class="reference external" href="https://github.com/pydata/sparse/issues/1"&gt;pydata/sparse #1&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dispatch-from-within-numpy"&gt;
&lt;h2&gt;Dispatch from within Numpy&lt;/h2&gt;
&lt;p&gt;Alternatively, the central dispatching mechanism could live within Numpy itself.&lt;/p&gt;
&lt;p&gt;Numpy functions could learn to hand control over to their arguments,
allowing the array implementations to take over when possible.
This would allow existing Numpy code to work on externally developed array implementations.&lt;/p&gt;
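&lt;p&gt;The idea of a function handing control to its argument can be sketched in a few lines.
Everything here is hypothetical: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total&lt;/span&gt;&lt;/code&gt; stands in for a Numpy function,
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__toy_total__&lt;/span&gt;&lt;/code&gt; is an invented hook name, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RunLengthArray&lt;/span&gt;&lt;/code&gt; is a made-up external container.&lt;/p&gt;

```python
def total(values):
    """Toy stand-in for a library function that first offers control to its argument."""
    override = getattr(type(values), "__toy_total__", None)  # hypothetical hook name
    if override is not None:
        return override(values)  # the argument's own implementation takes over
    return sum(values)           # default implementation


class RunLengthArray:
    """Hypothetical externally developed container storing (value, count) runs."""

    def __init__(self, runs):
        self.runs = list(runs)

    def __toy_total__(self):
        # Computes the answer without expanding the compressed data.
        return sum(value * count for value, count in self.runs)


print(total([1, 2, 3]))                         # default path
print(total(RunLengthArray([(2, 3), (5, 1)])))  # container takes over
```

&lt;p&gt;Code written against &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total&lt;/span&gt;&lt;/code&gt; keeps working unchanged when a new container type appears.&lt;/p&gt;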
&lt;p&gt;There is precedent for this.
The &lt;a class="reference external" href="https://docs.scipy.org/doc/numpy/reference/arrays.classes.html#numpy.class.__array_ufunc__"&gt;&lt;strong&gt;array_ufunc&lt;/strong&gt;&lt;/a&gt; protocol
allows any class that defines the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_ufunc__&lt;/span&gt;&lt;/code&gt; method
to take control of any Numpy ufunc like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;np.sin&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;np.exp&lt;/span&gt;&lt;/code&gt;.
Numpy reductions like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;np.sum&lt;/span&gt;&lt;/code&gt; already look for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.sum&lt;/span&gt;&lt;/code&gt; methods on their arguments and defer to them if possible.&lt;/p&gt;
&lt;p&gt;Some array projects, like Dask and Sparse, already implement the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_ufunc__&lt;/span&gt;&lt;/code&gt; protocol.
There is also &lt;a class="reference external" href="https://github.com/cupy/cupy/pull/1247"&gt;an open PR for CuPy&lt;/a&gt;.
Here is an example showing Numpy functions operating cleanly on Dask arrays.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;  &lt;span class="c1"&gt;# A Dask array&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;             &lt;span class="c1"&gt;# Apply Numpy function to a Dask array&lt;/span&gt;
&lt;span class="go"&gt;dask.array&amp;lt;sum-aggregate, shape=(), dtype=float64, chunksize=()&amp;gt;  # get a Dask array&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;I recommend that all Numpy-API compatible array projects implement the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_ufunc__&lt;/span&gt;&lt;/code&gt; protocol.&lt;/em&gt;&lt;/p&gt;
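&lt;p&gt;For projects that have not yet done so, a minimal sketch of the protocol might look like the following.
The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;LoggedArray&lt;/span&gt;&lt;/code&gt; class is hypothetical and handles only plain ufunc calls;
a real implementation would also need to deal with methods like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;reduce&lt;/span&gt;&lt;/code&gt; and with the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;out&lt;/span&gt;&lt;/code&gt; keyword.&lt;/p&gt;

```python
import numpy as np


class LoggedArray:
    """Hypothetical array wrapper that records which ufuncs were applied to it."""

    def __init__(self, data, log=None):
        self.data = np.asarray(data)
        self.log = [] if log is None else log

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls such as np.sin(x) are handled in this sketch.
        if method != "__call__":
            return NotImplemented
        # Unwrap our own type, run the ufunc on the raw data, then rewrap.
        raw = [x.data if isinstance(x, LoggedArray) else x for x in inputs]
        result = ufunc(*raw, **kwargs)
        return LoggedArray(result, self.log + [ufunc.__name__])


x = LoggedArray([0.0, 1.0, 2.0])
y = np.exp(np.sin(x))    # Numpy ufuncs hand control to LoggedArray
print(type(y).__name__)  # LoggedArray
print(y.log)             # ['sin', 'exp']
```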
&lt;p&gt;This works for many functions, but not all.
Other operations like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;tensordot&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;concatenate&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;stack&lt;/span&gt;&lt;/code&gt;
occur frequently in algorithmic code but are not covered here.&lt;/p&gt;
&lt;p&gt;This solution avoids the community challenges of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;arrayish&lt;/span&gt;&lt;/code&gt; solution above.
Everyone is accustomed to aligning themselves to Numpy’s decisions,
and relatively little code would need to be rewritten.&lt;/p&gt;
&lt;p&gt;The challenge with this approach is that historically
Numpy has moved more slowly than the rest of the ecosystem.
For example the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__array_ufunc__&lt;/span&gt;&lt;/code&gt; protocol mentioned above
was discussed for several years before it was merged.
Fortunately Numpy has recently
&lt;a class="reference external" href="https://www.numfocus.org/blog/numpy-receives-first-ever-funding-thanks-to-moore-foundation"&gt;received&lt;/a&gt;
&lt;a class="reference external" href="https://bids.berkeley.edu/news/bids-receives-sloan-foundation-grant-contribute-numpy-development"&gt;funding&lt;/a&gt;
to help it make changes like this more rapidly.
The full-time developers hired under this funding have only just started, though,
and it’s not clear how high a priority this work will be for them at first.&lt;/p&gt;
&lt;p&gt;For what it’s worth I’d prefer to see this Numpy protocol solution take hold.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/05/27/beyond-numpy.md&lt;/span&gt;, line 312)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;In recent years Python’s array computing ecosystem has grown organically to support
GPUs, sparse, and distributed arrays.
This is wonderful and a great example of the growth that can occur in decentralized open source development.&lt;/p&gt;
&lt;p&gt;However to solidify this growth and apply it across the ecosystem we now need to do some central planning
to move from a pair-wise model where packages need to know about each other
to an ecosystem model where packages can negotiate by developing and adhering to community-standard protocols.&lt;/p&gt;
&lt;p&gt;The community has done this transition before
(Numeric + Numarray -&amp;gt; Numpy, the Scikit-Learn fit/predict API, etc.)
usually with surprisingly positive results.&lt;/p&gt;
&lt;p&gt;The open questions I have today are the following:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;How quickly can Numpy adapt to this demand for protocols
while still remaining stable for its existing role as the foundation of the ecosystem?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What algorithmic domains can be written in a cross-hardware way
that depends only on the high-level Numpy API,
and doesn’t require specialization at the data structure level?
Clearly some domains exist (XArray, automatic differentiation),
but how common are these?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once a standard protocol is in place,
what other array-like implementations might arise?
In-memory compression? Probabilistic? Symbolic?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/05/27/beyond-numpy.md&lt;/span&gt;, line 339)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="update"&gt;
&lt;h1&gt;Update&lt;/h1&gt;
&lt;p&gt;After discussing this topic at the
&lt;a class="reference external" href="https://scisprints.github.io/#may-numpy-developer-sprint"&gt;May NumPy Developer Sprint&lt;/a&gt;
at &lt;a class="reference external" href="https://bids.berkeley.edu/"&gt;BIDS&lt;/a&gt;
a few of us have drafted a Numpy Enhancement Proposal (NEP)
&lt;a class="reference external" href="https://github.com/numpy/numpy/pull/11189"&gt;available here&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/05/27/beyond-numpy/"/>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-05-27T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/03/21/dask-0.17.2/</id>
    <title>Dask Release 0.17.2</title>
    <updated>2018-03-21T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc.&lt;/a&gt;
and the Data Driven Discovery Initiative from the &lt;a class="reference external" href="https://www.moore.org/"&gt;Moore
Foundation&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’m pleased to announce the release of Dask version 0.17.2. This is a minor
release with new features and stability improvements.
This blogpost outlines notable changes since the 0.17.0 release on February
12th.&lt;/p&gt;
&lt;p&gt;You can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install dask[complete] --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some notable changes follow:&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/03/21/dask-0.17.2.md&lt;/span&gt;, line 32)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="tornado-5-0"&gt;

&lt;p&gt;Tornado is a popular framework for concurrent network programming that Dask
relies on heavily. Tornado recently released a major version update that
included some major features for Dask as well as a couple of bugs.&lt;/p&gt;
&lt;p&gt;The new &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;IOStream.read_into&lt;/span&gt;&lt;/code&gt; method allows Dask communications (or anyone using
this API) to move large datasets more efficiently over the network with
fewer copies. This enables Dask to take advantage of high performance
networking available on modern super-computers. On the Cheyenne system, where
we tested this, we were able to get the full 3GB/s bandwidth available through
the Infiniband network with this change (when using a few worker processes).&lt;/p&gt;
&lt;p&gt;Many thanks to &lt;a class="reference external" href="https://github.com/pitrou"&gt;Antoine Pitrou&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/bdarnell"&gt;Ben
Darnell&lt;/a&gt; for their efforts on this.&lt;/p&gt;
&lt;p&gt;At the same time there were some unforeseen issues in the update to Tornado 5.0.
More pervasive use of bytearrays over bytes caused issues with compression
libraries like Snappy, and with Python 2, neither of which expected these
types. There is a brief window, in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed.__version__&lt;/span&gt; &lt;span class="pre"&gt;==&lt;/span&gt; &lt;span class="pre"&gt;1.21.3&lt;/span&gt;&lt;/code&gt;, in which this
functionality is enabled whenever Tornado 5.0 is present but misbehaves if
Snappy is also present.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/03/21/dask-0.17.2.md&lt;/span&gt;, line 55)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="http-file-system"&gt;
&lt;h1&gt;HTTP File System&lt;/h1&gt;
&lt;p&gt;Dask leverages a &lt;a class="reference external" href="https://github.com/dask/dask/issues/2880"&gt;file-system-like protocol&lt;/a&gt;
for access to remote data.
This is what makes commands like the following work:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://...&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hdfs://...&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gcs://...&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We have now added http and https file systems for reading data directly from
web servers. These also support random access if the web server supports range
queries.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;https://...&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;As with S3, HDFS, GCS, … you can also use these tools outside of Dask
development. Here we read the first twenty bytes of the Pandas license:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.bytes.http&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTTPFileSystem&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HTTPFileSystem&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;https://raw.githubusercontent.com/pandas-dev/pandas/master/LICENSE&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BSD 3-Clause License&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
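&lt;p&gt;The random access mentioned above depends on HTTP range requests.
As a rough illustration, and not Dask’s actual implementation, the sketch below builds (but does not send) a request for a byte range using only the standard library.
The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;range_request&lt;/span&gt;&lt;/code&gt; helper and the example URL are hypothetical.&lt;/p&gt;

```python
from urllib.request import Request


def range_request(url, start, length):
    """Build, but do not send, a GET request for `length` bytes starting at `start`."""
    last = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": "bytes={}-{}".format(start, last)})


# Placeholder URL, used for illustration only.
req = range_request("https://example.com/data.parquet", 0, 20)
print(req.get_header("Range"))  # bytes=0-19
```

&lt;p&gt;A server that supports range queries answers such a request with only the requested bytes, which is what lets a reader seek within a large remote file.&lt;/p&gt;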
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/martindurant"&gt;Martin Durant&lt;/a&gt; who did this work
and manages Dask’s byte handling generally. See &lt;a class="reference external" href="http://dask.pydata.org/en/latest/remote-data-services.html"&gt;remote data documentation&lt;/a&gt; for more information.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/03/21/dask-0.17.2.md&lt;/span&gt;, line 94)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="fixed-a-correctness-bug-in-dask-dataframe-s-shuffle"&gt;
&lt;h1&gt;Fixed a correctness bug in Dask dataframe’s shuffle&lt;/h1&gt;
&lt;p&gt;We identified and resolved a correctness bug in dask.dataframe’s shuffle that
resulted in some rows being dropped during complex operations like joins and
groupby-applies with many partitions.&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/dask/dask/pull/3201"&gt;dask/dask #3201&lt;/a&gt; for more information.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/03/21/dask-0.17.2.md&lt;/span&gt;, line 102)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="cluster-super-class-and-intelligent-adaptive-deployments"&gt;
&lt;h1&gt;Cluster super-class and intelligent adaptive deployments&lt;/h1&gt;
&lt;p&gt;There are many Python subprojects that help you deploy Dask on different
cluster resource managers like Yarn, SGE, Kubernetes, PBS, and more. These
have all converged to have more-or-less the same API that we have now combined
into a consistent interface that downstream projects can inherit from in
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed.deploy.Cluster&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now that we have a consistent interface we have started to invest more in
improving the interface and intelligence of these systems as a group. This
includes both pleasant IPython widgets like the following:&lt;/p&gt;
&lt;img src="/images/dask-kubernetes-widget.png" width="70%"&gt;
&lt;p&gt;as well as improved logic around adaptive deployments. Adaptive deployments
allow clusters to scale themselves automatically based on current workload. If
you have recently submitted a lot of work the scheduler will estimate its
duration and ask for an appropriate number of workers to finish the computation
quickly. When the computation has finished the scheduler will release the
workers back to the system to free up resources.&lt;/p&gt;
&lt;p&gt;The logic here has improved substantially including the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;You can specify minimum and maximum limits on your adaptivity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The scheduler estimates computation duration and asks for workers
appropriately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is some additional delay in giving back workers to avoid thrashing,
that is, cases where we repeatedly ask for and return workers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
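&lt;p&gt;The three points above can be sketched roughly as follows.
This is a simplified illustration and not the scheduler’s actual logic;
the names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;recommend_workers&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ReleaseDelay&lt;/span&gt;&lt;/code&gt; and all of their parameters are hypothetical.&lt;/p&gt;

```python
import math


def recommend_workers(total_task_seconds, target_seconds, minimum, maximum):
    """Workers needed to finish the queued work in about target_seconds, clamped to the limits."""
    wanted = math.ceil(total_task_seconds / target_seconds)
    return max(minimum, min(maximum, wanted))


class ReleaseDelay:
    """Release workers only after several consecutive scale-down recommendations."""

    def __init__(self, wait_count):
        self.wait_count = wait_count
        self.streak = 0

    def should_release(self, wants_fewer):
        if not wants_fewer:
            self.streak = 0  # demand came back, so start counting again
            return False
        self.streak += 1
        if self.streak == self.wait_count:
            self.streak = 0
            return True
        return False


print(recommend_workers(600, 60, 1, 20))   # 10 workers for ten minutes of queued work
print(recommend_workers(6000, 60, 1, 20))  # capped at the maximum of 20
```

&lt;p&gt;Requiring several consecutive scale-down recommendations before releasing anything keeps a brief lull in work from triggering a release that is immediately followed by a new request.&lt;/p&gt;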
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/03/21/dask-0.17.2.md&lt;/span&gt;, line 131)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="related-projects"&gt;
&lt;h1&gt;Related projects&lt;/h1&gt;
&lt;p&gt;Some news from related projects:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The young daskernetes project was renamed to &lt;a class="reference external" href="http://dask-kubernetes.readthedocs.io/en/latest/"&gt;dask-kubernetes&lt;/a&gt;. This displaces a previous project (that had not been released) for launching Dask on Google Cloud Platform. That project has been renamed to &lt;a class="reference external" href="https://github.com/dask/dask-gke"&gt;dask-gke&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new project, &lt;a class="reference external" href="https://github.com/dask/dask-jobqueue/"&gt;dask-jobqueue&lt;/a&gt;, was
started to handle launching Dask clusters on traditional batch queuing
systems like PBS, SLURM, SGE, TORQUE, etc. This project grew out of the &lt;a class="reference external" href="https://pangeo-data.github.io/"&gt;Pangeo&lt;/a&gt; collaboration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Dask Helm chart has been added to Helm’s stable channel&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2018/03/21/dask-0.17.2.md&lt;/span&gt;, line 141)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;The following people contributed to the dask/dask repository since the 0.17.0
release on February 12th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Anderson Banihirwe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dan Collins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dieter Weber&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gabriele Lanaro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Julien Lhermitte&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Max Epstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nkhadka&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;okkez&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pangeran Bottor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rich Postelnik&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott M. Edenbaum&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simon Perkins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thrasibule&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tor E Hagemann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uwe L. Korn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wes Roach&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following people contributed to the dask/distributed repository since the
1.21.0 release on February 12th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Alexander Ford&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Andy Jones&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Antoine Pitrou&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brett Naul&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Hamman&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loïc Estève&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matti Lyra&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sven Kreiss&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thrasibule&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/03/21/dask-0.17.2/"/>
    <summary>This work is supported by Anaconda Inc.
and the Data Driven Discovery Initiative from the Moore
Foundation.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-03-21T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/02/28/minimal-bug-reports/</id>
    <title>Craft Minimal Bug Reports</title>
    <updated>2018-02-28T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;Following up on a post on &lt;a class="reference internal" href="#../../../2016/08/25/supporting-users"&gt;&lt;span class="xref myst"&gt;supporting users in open source&lt;/span&gt;&lt;/a&gt;
this post lists some suggestions on how to ask a maintainer to help you with a problem.&lt;/p&gt;
&lt;p&gt;You don’t have to follow these suggestions. They are optional.
They make it more likely that a project maintainer will spend time helping you.
It’s important to remember that their willingness to support you for free is optional too.&lt;/p&gt;
&lt;p&gt;Crafting minimal bug reports is essential for the life and maintenance of community-driven open source projects.
Doing this well is an incredible service to the community.&lt;/p&gt;
&lt;section id="minimal-complete-verifiable-examples"&gt;

&lt;p&gt;I strongly recommend following Stack Overflow’s guidelines on &lt;a class="reference external" href="https://stackoverflow.com/help/mcve"&gt;Minimal Complete Verifiable Examples&lt;/a&gt;. I’ll include brief highlights here:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;… code should be …&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Minimal – Use as little code as possible that still produces the same problem&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complete – Provide all parts needed to reproduce the problem&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verifiable – Test the code you’re about to provide to make sure it reproduces the problem&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Let’s be clear, this is &lt;em&gt;hard&lt;/em&gt; and takes time.&lt;/p&gt;
&lt;p&gt;As a question-asker I find that creating an MCVE often takes 10-30 minutes for a simple problem.
Fortunately this work is usually straightforward,
even if I don’t know very much about the package I’m having trouble with.
Most of the work to create a minimal example is about removing all of the code that was specific to my application,
and as the question-asker I am probably the most qualified person to do that.&lt;/p&gt;
&lt;p&gt;When answering questions I often point people to StackOverflow’s MCVE document.
They sometimes come back with a better-but-not-yet-minimal example.
This post clarifies a few common issues.&lt;/p&gt;
&lt;p&gt;As a running example I’m going to use Pandas dataframe problems.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="don-t-post-data"&gt;
&lt;h1&gt;Don’t post data&lt;/h1&gt;
&lt;p&gt;You shouldn’t post the file that you’re working with.
Instead, try to see if you can reproduce the problem with just a few lines of data rather than the whole thing.&lt;/p&gt;
&lt;p&gt;Having to download a file, unzip it, etc. makes it much less likely that someone will actually run your example in their free time.&lt;/p&gt;
&lt;section id="don-t"&gt;
&lt;h2&gt;Don’t&lt;/h2&gt;
&lt;p&gt;I’ve uploaded my data to Dropbox and you can get it here: &lt;a class="reference external" href="https://example.com"&gt;my-data.csv.gz&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;my-data.csv.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="do"&gt;
&lt;h2&gt;Do&lt;/h2&gt;
&lt;p&gt;You should be able to copy-paste the following to get enough of my data to cause the problem:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;account-start&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2017-02-03&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2017-03-03&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2017-01-01&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;client&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Alice Anders&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bob Baker&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Charlie Chaplin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;balance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1432.32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.43&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;30000.00&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;db-id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2424&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;251&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;proxy-id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;525&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1525&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2542&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;rank&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;525&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="o"&gt;...&lt;/span&gt;
                   &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="actually-don-t-include-your-data-at-all"&gt;
&lt;h1&gt;Actually don’t include your data at all&lt;/h1&gt;
&lt;p&gt;Actually, your data probably has lots of information that is very specific to
your application. Your eyes gloss over it but a maintainer doesn’t know what
is relevant and what isn’t, so it will take them time to digest it if you
include it. Instead see if you can reproduce your same failure with artificial
or random data.&lt;/p&gt;
&lt;section id="id1"&gt;
&lt;h2&gt;Don’t&lt;/h2&gt;
&lt;p&gt;Here is enough of my data to reproduce the problem&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;account-start&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2017-02-03&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2017-03-03&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2017-01-01&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;client&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Alice Anders&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bob Baker&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Charlie Chaplin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;balance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1432.32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.43&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;30000.00&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;db-id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2424&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;251&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;proxy-id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;525&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1525&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2542&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;rank&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;525&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="o"&gt;...&lt;/span&gt;
                   &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="id2"&gt;
&lt;h2&gt;Do&lt;/h2&gt;
&lt;p&gt;My actual problem is about finding the best ranked employee over a certain time period,
but we can reproduce the problem with this simpler dataset.
Notice that the dates are &lt;em&gt;out of order&lt;/em&gt; in this data (2000-01-02 comes after 2000-01-03).
I found that this was critical to reproducing the error.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;account-start&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2000-01-01&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2000-01-03&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2000-01-02&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;db-id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Alice&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bob&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Charlie&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;As we shrink down our example problem we often discover a lot about what causes the problem.
This discovery is valuable
and something that only the question-asker is capable of doing efficiently.&lt;/p&gt;
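&lt;p&gt;If the failure doesn’t depend on particular values, random data can serve just as well.
Here is a minimal sketch, assuming NumPy is installed; the column names, sizes, and seed are made up for illustration.
Fixing the seed means everyone who runs it sees the same values.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
import numpy as np
import pandas as pd

# Fixed seed so that every reader generates identical data.
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "datetime": pd.date_range("2000-01-01", periods=5, freq="D"),
    "id": rng.integers(0, 3, size=5),
    "value": rng.normal(size=5),
})

# Reverse the rows so the dates are out of order, as in the example above.
df = df.iloc[::-1].reset_index(drop=True)
print(df)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;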
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="see-how-small-you-can-make-things"&gt;
&lt;h1&gt;See how small you can make things&lt;/h1&gt;
&lt;p&gt;To make it even easier, see how small you can make your data.
For example if working with tabular data (like Pandas),
then how many columns do you actually need to reproduce the failure?
How many rows do you actually need to reproduce the failure?
Do the columns need to be named as you have them now or could they be just “A” and “B”
or descriptive of the types within?&lt;/p&gt;
&lt;section id="id3"&gt;
&lt;h2&gt;Do&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;datetime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2000-01-03&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2000-01-02&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                   &lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
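&lt;p&gt;If you aren’t sure which rows matter, you can shrink the data mechanically.
The sketch below is one possible approach, not a standard tool: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;shrink&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;still_fails&lt;/span&gt;&lt;/code&gt; are hypothetical names,
and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;still_fails&lt;/span&gt;&lt;/code&gt; stands in for “run my code and check whether the error appears”.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
def shrink(rows, still_fails):
    """Drop rows one at a time, keeping each removal that still fails."""
    rows = list(rows)
    i = 0
    while i != len(rows):
        candidate = rows[:i] + rows[i + 1:]
        if candidate and still_fails(candidate):
            rows = candidate  # row i was not needed
        else:
            i += 1            # row i is needed, keep it
    return rows


# Hypothetical failure check: pretend the bug appears
# whenever the dates are out of order.
def still_fails(rows):
    dates = [row["datetime"] for row in rows]
    return dates != sorted(dates)


rows = [
    {"datetime": "2000-01-01", "id": 1},
    {"datetime": "2000-01-03", "id": 2},
    {"datetime": "2000-01-02", "id": 3},
    {"datetime": "2000-01-04", "id": 4},
]
small = shrink(rows, still_fails)
print(small)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With this made-up check, only the two out-of-order rows survive, which you can then paste into a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pd.DataFrame&lt;/span&gt;&lt;/code&gt; as above.&lt;/p&gt;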
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="remove-unnecessary-steps"&gt;
&lt;h1&gt;Remove unnecessary steps&lt;/h1&gt;
&lt;p&gt;Is every line in your example absolutely necessary to reproduce the error?
If you’re able to delete a line of code then please do.
Because you already understand your problem you are &lt;em&gt;much more efficient&lt;/em&gt; at doing this than the maintainer is.
They probably know more about the tool, but you know more about your code.&lt;/p&gt;
&lt;section id="id4"&gt;
&lt;h2&gt;Don’t&lt;/h2&gt;
&lt;p&gt;The groupby step below is raising an error that I don’t understand&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- this produces the error&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="id5"&gt;
&lt;h2&gt;Do&lt;/h2&gt;
&lt;p&gt;The groupby step below is raising an error that I don’t understand&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- this produces the error&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="use-syntax-highlighting"&gt;
&lt;h1&gt;Use Syntax Highlighting&lt;/h1&gt;
&lt;p&gt;When using GitHub you can enclose code blocks in triple backticks (the
backtick is the character at the top-left of US-standard QWERTY keyboards)
and name the language after the opening backticks to get syntax highlighting.
It looks like this:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;```python
x = 1
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="provide-complete-tracebacks"&gt;
&lt;h1&gt;Provide complete tracebacks&lt;/h1&gt;
&lt;p&gt;You know all of that stuff between your code and the exception that is hard to
make sense of? You should include it.&lt;/p&gt;
&lt;section id="id6"&gt;
&lt;h2&gt;Don’t&lt;/h2&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="id7"&gt;
&lt;h2&gt;Do&lt;/h2&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

```python-traceback
ZeroDivisionError                         Traceback (most recent call last)
&amp;lt;ipython-input-4-7b96263abbfa&amp;gt; in &amp;lt;module&amp;gt;()
----&amp;gt; 1 div(1, 0)

&amp;lt;ipython-input-3-7685f97b4ce5&amp;gt; in div(x, y)
      1 def div(x, y):
----&amp;gt; 2     return x / y
      3

ZeroDivisionError: division by zero
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If the traceback is long, that’s OK. If you really want to be clean you can put
it inside &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;lt;details&amp;gt;&lt;/span&gt;&lt;/code&gt; tags, which collapse it by default.&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;I get a ZeroDivisionError from the following code:

```python
def div(x, y):
    return x / y

div(1, 0)
```

### Traceback

&amp;lt;details&amp;gt;

```python-traceback
ZeroDivisionError                         Traceback (most recent call last)
&amp;lt;ipython-input-4-7b96263abbfa&amp;gt; in &amp;lt;module&amp;gt;()
----&amp;gt; 1 div(1, 0)

&amp;lt;ipython-input-3-7685f97b4ce5&amp;gt; in div(x, y)
      1 def div(x, y):
----&amp;gt; 2     return x / y
      3

ZeroDivisionError: division by zero
```

&amp;lt;/details&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
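&lt;p&gt;If your code runs somewhere that hides the traceback (behind a logger or a bare &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;except&lt;/span&gt;&lt;/code&gt;, say),
the standard-library &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;traceback&lt;/span&gt;&lt;/code&gt; module can capture the full text so that you can paste it.
Here is a minimal sketch reusing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;div&lt;/span&gt;&lt;/code&gt; from above; the variable name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;report&lt;/span&gt;&lt;/code&gt; is only illustrative.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
```python
import traceback


def div(x, y):
    return x / y


try:
    div(1, 0)
except ZeroDivisionError:
    # format_exc() returns the whole traceback as a string,
    # including the frames between your call and the exception.
    report = traceback.format_exc()

print(report)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;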
&lt;/section&gt;
&lt;section id="ask-questions-in-public-places"&gt;
&lt;h2&gt;Ask Questions in Public Places&lt;/h2&gt;
&lt;p&gt;When raising issues you often have a few possible locations:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;GitHub issue tracker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stack Overflow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Project mailing list&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Project Chat room&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;E-mail maintainers directly (never do this)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Different projects handle this differently, but they usually have a page in
their documentation about where to go for help. This is often labeled
“Community”, “Support” or “Where to ask for help”. Here are the
recommendations from the
&lt;a class="reference external" href="https://pandas.pydata.org/community.html"&gt;Pandas community&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Generally it’s good to ask questions where many maintainers can see your
question and help, and where other users can find your question and answer if
they encounter a similar bug in the future.&lt;/p&gt;
&lt;p&gt;While your goal may be to solve your problem, the maintainer’s goal is likely
to create a record of how to solve problems like yours. This helps many more
users who will have a similar problem in the future, see your well-crafted bug
report, and learn from the resulting conversation.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="my-personal-preferences"&gt;
&lt;h2&gt;My personal preferences&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;For user questions like “What is the right way to do X?” I prefer Stack Overflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For bug reports like “I did X, I’m pretty confident that it should work, but I
get this error” I prefer GitHub issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For general chit-chat I prefer Gitter, though actually, I personally spend
almost no time in Gitter because it isn’t easily searchable by future
users. If you’ve asked me a question in Gitter I will almost certainly
not respond to it, except to direct you to GitHub, Stack Overflow, or this
blogpost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I only like personal e-mail if someone is proposing to fund or seriously
support the project in some way&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But again, different projects do this differently and have different policies.
You should check the documentation of the project you’re dealing with to learn
how they like to support users.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/02/28/minimal-bug-reports/"/>
    <summary>Following up on a post on supporting users in open source
this post lists some suggestions on how to ask a maintainer to help you with a problem.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="scipy" label="scipy"/>
    <published>2018-02-28T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/02/12/dask-0.17.0/</id>
    <title>Dask Release 0.17.0</title>
    <updated>2018-02-12T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc.&lt;/a&gt;
and the Data Driven Discovery Initiative from the &lt;a class="reference external" href="https://www.moore.org/"&gt;Moore
Foundation&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’m pleased to announce the release of Dask version 0.17.0. This is a significant
major release with new features, breaking changes, and stability improvements.
This blogpost outlines notable changes since the 0.16.0 release on November
21st.&lt;/p&gt;
&lt;p&gt;You can conda install Dask:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;conda install dask -c conda-forge
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or pip install from PyPI:&lt;/p&gt;
&lt;div class="highlight-none notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pip install dask[complete] --upgrade
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Full changelogs are available here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/blob/master/docs/source/changelog.rst"&gt;dask/dask&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/blob/master/docs/source/changelog.rst"&gt;dask/distributed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some notable changes follow.&lt;/p&gt;
&lt;section id="deprecations"&gt;
&lt;h1&gt;Deprecations&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Removed &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe.rolling_*&lt;/span&gt;&lt;/code&gt; methods, which were previously deprecated both in dask.dataframe and in pandas. These are replaced with the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;rolling.*&lt;/span&gt;&lt;/code&gt; namespace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve generally stopped maintaining the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-ec2&lt;/span&gt;&lt;/code&gt; project, which launches Dask clusters on Amazon’s EC2 using Salt. We generally recommend Kubernetes instead, both for Amazon’s EC2 and for Google and Azure.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://dask.pydata.org/en/latest/setup/kubernetes.html"&gt;dask.pydata.org/en/latest/setup/kubernetes.html&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Internal state of the distributed scheduler has changed significantly. This may affect advanced users who were inspecting this state for debugging or diagnostics.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="task-ordering"&gt;
&lt;h1&gt;Task Ordering&lt;/h1&gt;
&lt;p&gt;As Dask encounters more complex problems from more domains
we continually run into problems where its current heuristics do not perform optimally.
This release includes a rewrite of our static task prioritization heuristics.
This will improve Dask’s ability to traverse complex computations
in a way that keeps memory use low.&lt;/p&gt;
&lt;p&gt;To aid debugging we also integrated these heuristics into the GraphViz-style plots
that come from the visualize method.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;order&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RdBu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="https://user-images.githubusercontent.com/306380/35012109-86df75fa-fad6-11e7-9fa8-a43a697a4a17.png"&gt;
  &lt;img src="https://user-images.githubusercontent.com/306380/35012109-86df75fa-fad6-11e7-9fa8-a43a697a4a17.png"
     width="80%"
     align="center"&gt;&lt;/a&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/3066"&gt;dask/dask #3066&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/dask/pull/3057"&gt;dask/dask #3057&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="nested-joblib"&gt;
&lt;h1&gt;Nested Joblib&lt;/h1&gt;
&lt;p&gt;Dask supports parallelizing Scikit-Learn
by extending Scikit-Learn’s underlying library for parallelism,
&lt;a class="reference external" href="http://tomaugspurger.github.io/distributed-joblib.html"&gt;Joblib&lt;/a&gt;.
This allows Dask to distribute &lt;em&gt;some&lt;/em&gt; SKLearn algorithms across a cluster
just by wrapping them with a context manager.&lt;/p&gt;
&lt;p&gt;This relationship has been strengthened,
with particular attention paid to
nesting one parallel computation within another,
as happens when you train a parallel estimator, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomForest&lt;/span&gt;&lt;/code&gt;,
within another parallel computation, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt;.
Previously this would result in spawning too many threads/processes
and generally oversubscribing hardware.&lt;/p&gt;
&lt;p&gt;Due to recent combined development within both Joblib and Dask,
these sorts of situations can now be resolved efficiently by handing them off to Dask,
providing speedups even in single-machine cases:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.externals&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed.joblib&lt;/span&gt;  &lt;span class="c1"&gt;# register the dask joblib backend&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParallelEstimator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;gs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parallel_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;See Tom Augspurger’s recent post with more details about this work:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://tomaugspurger.github.io/distributed-joblib.html"&gt;http://tomaugspurger.github.io/distributed-joblib.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/joblib/joblib/pull/595"&gt;joblib/joblib #595&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/1705"&gt;dask/distributed #1705&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/joblib/joblib/pull/613"&gt;joblib/joblib #613&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/TomAugspurger"&gt;Tom Augspurger&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/jcrist"&gt;Jim Crist&lt;/a&gt;, and
&lt;a class="reference external" href="https://github.com/ogrisel"&gt;Olivier Grisel&lt;/a&gt; who did most of this work.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="scheduler-internal-refactor"&gt;
&lt;h1&gt;Scheduler Internal Refactor&lt;/h1&gt;
&lt;p&gt;The distributed scheduler has been significantly refactored to change it from a forest of dictionaries:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
&lt;span class="n"&gt;nbytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To a bunch of objects:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="o"&gt;=...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="o"&gt;=...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[])}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;(There is &lt;em&gt;much&lt;/em&gt; more state than what is listed above,
but hopefully the examples above are clear.)&lt;/p&gt;
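&lt;p&gt;The change of layout can be sketched in plain Python. This is only an illustration under stated assumptions: the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Task&lt;/span&gt;&lt;/code&gt; dataclass below is hypothetical, carries only the three fields shown above, and stores dependencies as key strings for simplicity. It is not the scheduler’s actual class.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical per-task record; field names follow the example above."""
    key: str
    priority: int = 0
    nbytes: int = 0
    dependencies: set = field(default_factory=set)


# Old layout: one dictionary per kind of state, all keyed by task name.
priority = {"a": 1, "b": 2, "c": 3}
dependencies = {"a": {"b"}, "b": {"c"}, "c": set()}
nbytes = {"a": 1000, "b": 1000, "c": 28}

# New layout: one object per task, holding all of that task's state together.
tasks = {
    key: Task(key, priority=priority[key], nbytes=nbytes[key],
              dependencies=dependencies[key])
    for key in priority
}

print(tasks["a"])
```

&lt;p&gt;With the object layout, everything known about a task is reachable from a single lookup instead of one lookup per dictionary.&lt;/p&gt;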
&lt;p&gt;There were a few motivations for this:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;We wanted to try out Cython and PyPy, for which objects like this might be more effective than dictionaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We believe that this is probably a bit easier for developers new to the schedulers to understand. The proliferation of state dictionaries was not highly discoverable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Goal one ended up not working out.
We have not yet been able to make the scheduler significantly faster under Cython or PyPy with this new layout. There is even a slight memory increase with these changes.
However we have been happy with the results in code readability, and we hope that others find this useful as well.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://github.com/pitrou"&gt;Antoine Pitrou&lt;/a&gt;,
who did most of the work here.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="user-priorities"&gt;
&lt;h1&gt;User Priorities&lt;/h1&gt;
&lt;p&gt;You can now submit tasks with different priorities.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Higher priority preferred&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Lower priority happens later&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To be clear, Dask has always had priorities; they just weren’t easily user-settable.
Higher priorities are given precedence. The default priority for all tasks is zero.
You can also submit priorities for collections (like arrays and dataframes):&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# give this computation higher priority.&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
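&lt;p&gt;The rule that higher priorities are given precedence, with a default of zero, can be illustrated with a toy queue built on the standard library. This is a sketch, not Dask’s scheduler: the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;submit&lt;/span&gt;&lt;/code&gt; helper and the submission-order tie-breaking are assumptions made for the example.&lt;/p&gt;

```python
import heapq
import itertools

# Toy priority queue, not Dask's scheduler: larger priority values run first.
# Assumption for this sketch: equal priorities run in submission order.
counter = itertools.count()
queue = []


def submit(name, priority=0):
    # heapq is a min-heap, so the priority is negated to pop the largest first.
    heapq.heappush(queue, (-priority, next(counter), name))


submit("y", priority=-10)
submit("default")           # priority defaults to zero
submit("x", priority=10)

# Drain the queue; range(len(queue)) is evaluated once, before any pops.
run_order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(run_order)
```

&lt;p&gt;Although the low-priority task was submitted first, the queue releases the tasks in the order x, default, y.&lt;/p&gt;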
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/distributed/pull/1651"&gt;dask/distributed #1651&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="related-projects"&gt;
&lt;h1&gt;Related projects&lt;/h1&gt;
&lt;p&gt;Several related projects are also undergoing releases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tornado is updating to version 5.0 (there is a beta out now).
This is a major change that will put Tornado on the Asyncio event loop in Python 3.
It also includes many performance enhancements for high-bandwidth networks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bokeh 0.12.14 was just released.&lt;/p&gt;
&lt;p&gt;Note that you will need to update Dask to work with this version of Bokeh.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://daskernetes.readthedocs.io/en/latest/"&gt;Daskernetes&lt;/a&gt;, a new project for launching Dask on Kubernetes clusters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;The following people contributed to the dask/dask repository since the 0.16.0
release on November 14th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Albert DeFusco&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apostolos Vlachopoulos&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;castalheiro&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;James Bourbeau&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jon Mease&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ian Hopkinson&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jakub Nowacki&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;John A Kirkham&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joseph Lin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keisuke Fujii&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martijn Arts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Markus Gonser&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nir&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rich Signell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Roman Yurchak&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S. Andrew Sheppard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;sephib&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stephan Hoyer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uwe L. Korn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wei Ji&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Xander Johnson&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following people contributed to the dask/distributed repository since the
1.20.0 release on November 14th:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Alexander Ford&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Antoine Pitrou&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brett Naul&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brian Broll&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bruce Merry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cornelius Riemenschneider&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daniel Li&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jim Crist&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kelvin Yang&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Min RK&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;rqx&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Russ Bubley&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scott Sievert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tom Augspurger&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Xander Johnson&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/02/12/dask-0.17.0/"/>
    <summary>This work is supported by Anaconda Inc.
and the Data Driven Discovery Initiative from the Moore
Foundation.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-02-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/02/09/credit-models-with-dask/</id>
    <title>Credit Modeling with Dask</title>
    <updated>2018-02-09T00:00:00+00:00</updated>
    <author>
      <name>Richard Postelnik</name>
    </author>
    <content type="html">&lt;p&gt;This post explores a real-world use case calculating complex credit models in Python using Dask.
It is an example of a complex parallel system that is well outside of the traditional “big data” workloads.&lt;/p&gt;
&lt;section id="this-is-a-guest-post"&gt;

&lt;p&gt;Hi All,&lt;/p&gt;
&lt;p&gt;This is a guest post from &lt;a class="reference external" href="https://github.com/postelrich"&gt;Rich Postelnik&lt;/a&gt;,
an Anaconda employee who works with a large retail bank on their credit modeling system.
They’re doing interesting work with Dask to manage complex computations
(see task graph below).
This is a nice example of using Dask for complex problems that are neither a big dataframe nor a big array, but are still highly parallel.
Rich was kind enough to write up this description of their problem and share it here.&lt;/p&gt;
&lt;p&gt;Thanks Rich!&lt;/p&gt;
&lt;a href="/images/credit_models/simple-model.svg"&gt;
  &lt;img src="/images/credit_models/simple-model.svg"
       alt="zoomed model section"
       width="100%"&gt;&lt;/a&gt;
&lt;p&gt;&lt;em&gt;This is cross-posted at &lt;a class="reference external" href="https://www.anaconda.com/blog/developer-blog/credit-modeling-with-dask/"&gt;Anaconda’s Developer Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;P.S. If others have similar solutions and would like to share them I’d love to host those on this blog as well.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="the-problem"&gt;
&lt;h1&gt;The Problem&lt;/h1&gt;
&lt;p&gt;When someone applies for a loan, such as a credit card, mortgage, or auto loan, we want to estimate the likelihood of default and the profit (or loss) to be gained. The models behind these estimates are composed of a complex set of equations that depend on each other. There can be hundreds of equations, each of which could have up to 20 inputs and yield 20 outputs. That is a lot of information to keep track of! We want to avoid manually keeping track of the dependencies, as well as messy code like the following Python function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;final_equation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;equation1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out2_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out2_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out2_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;equation2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out3_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out3_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;equation3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out2_3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;out_final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;equation_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out_final&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This boils down to a dependency and ordering problem known as task scheduling.&lt;/p&gt;
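&lt;p&gt;As a rough illustration of the ordering problem, the sketch below hands a dependency map to the standard library’s topological sorter. The map is hypothetical: the edges are read off the first three equations in the function above, and the real models are far larger.&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each equation lists the equations it needs.
# The edges mirror the first three calls in the function above.
needs = {
    "equation1": set(),
    "equation2": {"equation1"},
    "equation3": {"equation1", "equation2"},
}

# static_order() yields every node after all of the nodes it depends on.
order = list(TopologicalSorter(needs).static_order())
print(order)
```

&lt;p&gt;Given only the dependencies, the sorter recovers a valid execution order, which is the bookkeeping we would rather not do by hand.&lt;/p&gt;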
&lt;/section&gt;
&lt;section id="dags-to-the-rescue"&gt;
&lt;h1&gt;DAGs to the rescue&lt;/h1&gt;
&lt;img style="margin: 0 auto; display: block;" src="/images/credit_models/snatch.jpg" alt="snatch joke"&gt;
&lt;p&gt;A &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;directed acyclic graph&lt;/a&gt; (DAG) is commonly used to solve task scheduling problems. Dask is a library for delayed task computation that makes use of directed graphs at its core. &lt;a class="reference external" href="http://dask.pydata.org/en/latest/delayed.html"&gt;dask.delayed&lt;/a&gt; is a simple decorator that turns a Python function into a graph vertex. If I pass the output from one delayed function as a parameter to another delayed function, Dask creates a directed edge between them. Let’s look at an example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So here we have a function to add two numbers together. Let’s see what happens when we wrap it with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delayed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;
&lt;span class="go"&gt;Delayed(&amp;#39;add-f6204fac-b067-40aa-9d6a-639fc719c3ce&amp;#39;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;add&lt;/span&gt;&lt;/code&gt; now returns a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; object. We can pass this as an argument back into our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt; function to start building out a chain of computation.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;four&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;four&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;4&lt;/span&gt;

&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;four&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Below we can see how the DAG starts to come together.&lt;/p&gt;
&lt;img style="margin: 0 auto; display: block;" src="/images/credit_models/four.png" alt="four graph"&gt;
&lt;/section&gt;
&lt;section id="mock-credit-example"&gt;
&lt;h1&gt;Mock credit example&lt;/h1&gt;
&lt;p&gt;Let’s assume I’m a mortgage bank and have 10 people applying for a mortgage. I want to estimate the group’s average likelihood to default based on years of credit history and income.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;hist_yrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;incomes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Let’s also assume that default is a function of the square of the incremented years of credit history plus half the income. This could be written as a single function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;income&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I know that in the future I will need the incremented history for another calculation, and I want to reuse the code and avoid doing the computation twice. So instead, I can break those functions out:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;delayed&lt;/span&gt;

&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;halve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note how I wrapped the functions with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;delayed&lt;/span&gt;&lt;/code&gt;. Now instead of returning a number these functions will return a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; object. Even better is that these functions can also take &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; objects as inputs. It is this passing of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; objects as inputs to other &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;delayed&lt;/span&gt;&lt;/code&gt; functions that allows Dask to construct the task graph. I can now call these functions on my data in the style of normal Python code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;inc_hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hist_yrs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;halved_income&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;halve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;incomes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;estimated_defaults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inc_hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;halved_income&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you look at these variables, you will see that nothing has actually been calculated yet. They are all lists of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; objects.&lt;/p&gt;
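&lt;p&gt;A quick way to check this is sketched below. It redefines a small &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;increment&lt;/span&gt;&lt;/code&gt; helper so that it runs on its own, and uses &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.compute&lt;/span&gt;&lt;/code&gt; to evaluate several &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; objects at once. The three-element input is only an illustration, not part of the mortgage example.&lt;/p&gt;

```python
from dask import compute, delayed
from dask.delayed import Delayed


@delayed
def increment(x):
    return x + 1


# Illustrative input: three applicants instead of ten.
inc_hist = [increment(n) for n in range(3)]

# Each element is a lazy placeholder, not a number.
all_lazy = all(isinstance(d, Delayed) for d in inc_hist)

# dask.compute evaluates several Delayed objects together and returns a tuple.
values = compute(*inc_hist)
print(all_lazy, values)
```

&lt;p&gt;Nothing runs until &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;compute&lt;/span&gt;&lt;/code&gt; is called, which is what lets Dask see the whole graph before doing any work.&lt;/p&gt;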
&lt;p&gt;Now, to get the average, I could just take the sum of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;estimated_defaults&lt;/span&gt;&lt;/code&gt; and divide by the number of applicants, but I want this to scale (and make a more interesting graph), so let’s do a merge-style reduction.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;
    &lt;span class="n"&gt;middle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;middle&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;middle&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;

&lt;span class="n"&gt;default_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimated_defaults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;At this point &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;default_sum&lt;/span&gt;&lt;/code&gt; is a list of length 1 and that first element is the sum of estimated default for all applicants. To get the average, we divide by the number of applicants and call compute:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;avg_default&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_sum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;avg_default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# 40.75&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
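&lt;p&gt;As a sanity check on the reduction, here is a plain-Python sketch that does not use Dask. It is a simplified restatement of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;merge&lt;/span&gt;&lt;/code&gt; that takes the aggregation function as a parameter (an assumption made for illustration, not how the code above is written). Passing a tuple-building function shows how items get paired, and passing addition reproduces the average eagerly.&lt;/p&gt;

```python
def merge(seq, agg):
    """Pairwise (tree) reduction; returns a list with at most one element."""
    if len(seq) in (0, 1):  # nothing left to pair up
        return list(seq)
    middle = len(seq) // 2
    left = merge(seq[:middle], agg)
    right = merge(seq[middle:], agg)
    return [agg(left[0], right[0])]


# Pairing structure for five items: each tuple stands in for one agg call.
shape = merge([0, 1, 2, 3, 4], lambda x, y: (x, y))[0]
print(shape)  # ((0, 1), (2, (3, 4)))

# Same arithmetic as the mortgage example, evaluated eagerly.
defaults = [(h + 1) ** 2 + i / 2 for h, i in zip(range(10), range(10))]
avg_default = merge(defaults, lambda x, y: x + y)[0] / len(defaults)
print(avg_default)  # 40.75
```

&lt;p&gt;Each level of the tree only depends on the level below it, so independent pairs can be aggregated in parallel once the functions are wrapped with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;delayed&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;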
&lt;p&gt;To see the computation graph that Dask will use, we call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;visualize&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;avg_default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="/images/credit_models/dummy_graph.png"
     style="margin: 0 auto; display: block;"
     alt="default graph"
     width="100%"&gt;&lt;/p&gt;
&lt;p&gt;And that is how Dask can be used to construct a complex system of equations with reusable intermediary calculations.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="how-we-used-dask-in-practice"&gt;
&lt;h1&gt;How we used Dask in practice&lt;/h1&gt;
&lt;p&gt;For our credit modeling problem, we used Dask to make a custom data structure to represent the individual equations. Using the default example above, this looked something like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Equation&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;inc_hist&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;halved_income&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;defaults&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@delayed&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;equation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inc_hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;halved_income&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inc_hist&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;halved_income&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This allows us to write each equation as its own isolated function and mark its inputs and outputs. With this set of equation objects, we can determine the order of computation (with a &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Topological_sorting"&gt;topological sort&lt;/a&gt;) and let Dask handle the graph generation and computation. This eliminates the onerous task of manually passing around the arguments in the code base. Below is an example task graph for one particular model that the bank actually runs.&lt;/p&gt;
&lt;a href="/images/credit_models/simple.svg"&gt;
  &lt;img src="/images/credit_models/simple.svg"
       alt="calc task graph"
       width="100%"&gt;
  &lt;/a&gt;
&lt;p&gt;This graph was a bit too large to render with the normal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;my_task.visualize()&lt;/span&gt;&lt;/code&gt; method, so instead we rendered it with &lt;a class="reference external" href="https://gephi.org"&gt;Gephi&lt;/a&gt; to make the pretty colored graph above. The chaotic upper region of this graph is the individual equation calculations. Zooming in we can see the entry point, our input pandas DataFrame, as the large orange circle at the top and how it gets fed into many of the equations.&lt;/p&gt;
&lt;a href="/images/credit_models/simple-model.svg"&gt;
  &lt;img src="/images/credit_models/simple-model.svg"
       alt="zoomed model section"
       width="100%"&gt;&lt;/a&gt;
&lt;p&gt;The output of the model is about 100 times the size of the input so we do some aggregation at the end via tree reduction. This accounts for the more structured bottom half of the graph. The large green node at the bottom is our final output.&lt;/p&gt;
&lt;a href="/images/credit_models/simple-agg.svg"&gt;
  &lt;img src="/images/credit_models/simple-agg.svg"
       alt="zoomed agg section"
       width="100%"&gt;&lt;/a&gt;
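&lt;p&gt;The ordering step can be sketched with the standard library’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;graphlib&lt;/span&gt;&lt;/code&gt; module (Python 3.9 or later). The equation names and the dictionary layout below are hypothetical stand-ins for our equation classes, not the actual engine code. The idea is just to map each equation to the equations that produce its inputs and sort on that.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical equation specs: name mapped to (inputs, outputs).
equations = {
    "Default": (["inc_hist", "halved_income"], ["defaults"]),
    "Increment": (["hist_yrs"], ["inc_hist"]),
    "Halve": (["incomes"], ["halved_income"]),
}

# Which equation produces each named output.
producer = {
    out: name for name, (_, outputs) in equations.items() for out in outputs
}

# For each equation, the set of equations it must wait for.
# Inputs with no producer (raw data such as hist_yrs) add no dependency.
depends_on = {
    name: {producer[i] for i in inputs if i in producer}
    for name, (inputs, _) in equations.items()
}

# static_order() yields every equation after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

&lt;p&gt;Calling the equations in this order means that every input already exists as a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Delayed&lt;/span&gt;&lt;/code&gt; object by the time the equation that needs it is wired into the graph.&lt;/p&gt;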
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h1&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;With our Dask-based data structure, we spend more of our time writing model code than maintaining the engine itself. This allows a clean separation between the analysts who design and write our models and the computational system that runs them. Dask also offers a number of advantages not covered above. For example, with Dask you also get access to &lt;a class="reference external" href="https://distributed.readthedocs.io/en/latest/web.html"&gt;diagnostics&lt;/a&gt; such as time spent running each task and resources used. You can also distribute your computation with &lt;a class="reference external" href="https://distributed.readthedocs.io/en/latest/"&gt;dask distributed&lt;/a&gt; with relative ease. Now if we want to run our model across larger-than-memory data or on a distributed cluster, we don’t have to worry about rewriting our code to incorporate something like Spark. Finally, Dask allows you to give pandas-capable business analysts or less technical folks access to large datasets with the &lt;a class="reference external" href="http://dask.pydata.org/en/latest/dataframe.html"&gt;dask dataframe&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="full-example"&gt;
&lt;h1&gt;Full Example&lt;/h1&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;delayed&lt;/span&gt;


&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;


&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;halve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;


&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;


&lt;span class="nd"&gt;@delayed&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;
    &lt;span class="n"&gt;middle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;middle&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;middle&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;


&lt;span class="n"&gt;hist_yrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;incomes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;inc_hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hist_yrs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;halved_income&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;halve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;incomes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;estimated_defaults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inc_hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;halved_income&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;default_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimated_defaults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;avg_default&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_sum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;avg_default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;avg_default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# requires graphviz and python-graphviz to be installed&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;Special thanks to Matt Rocklin, Michael Grant, Gus Cavanagh, and Rory Merritt for their feedback when writing this article.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/02/09/credit-models-with-dask/"/>
    <summary>This post explores a real-world use case calculating complex credit models in Python using Dask.
It is an example of a complex parallel system that is well outside of the traditional “big data” workloads.</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <published>2018-02-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2018/01/22/pangeo-2/</id>
    <title>Pangeo: JupyterHub, Dask, and XArray on the Cloud</title>
    <updated>2018-01-22T00:00:00+00:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;This work is supported by &lt;a class="reference external" href="http://anaconda.com"&gt;Anaconda Inc&lt;/a&gt;, the NSF
EarthCube program, and UC Berkeley BIDS&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A few weeks ago a few of us stood up &lt;a class="reference external" href="http://pangeo.pydata.org"&gt;pangeo.pydata.org&lt;/a&gt;,
an experimental deployment of JupyterHub, Dask, and XArray on Google Container Engine (GKE)
to support atmospheric and oceanographic data analysis on large datasets.
This follows on &lt;a class="reference internal" href="../../2017/09/18/pangeo-1/"&gt;&lt;span class="doc std std-doc"&gt;recent work&lt;/span&gt;&lt;/a&gt; to deploy Dask and XArray for the same workloads on supercomputers.
This system is a proof of concept that has taught us a great deal about how to move forward.
This blogpost briefly describes the problem,
the system,
and the collaboration,
and then discusses a number of challenges that we’ll be working on in the coming months.&lt;/p&gt;
&lt;section id="the-problem"&gt;

&lt;p&gt;Atmospheric and oceanographic sciences collect (with satellites) and generate (with simulations) large datasets
that they would like to analyze with distributed systems.
Libraries like Dask and XArray already solve this problem computationally if scientists have their own clusters,
but we seek to expand access by deploying on cloud-based systems.
We build a system to which people can log in, get Jupyter Notebooks, and launch Dask clusters without much hassle.
We hope that this increases access, and connects more scientists with more cloud-based datasets.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="the-system"&gt;
&lt;h1&gt;The System&lt;/h1&gt;
&lt;p&gt;We integrate several pre-existing technologies to build a system where people can log in,
get access to a Jupyter notebook,
launch distributed compute clusters using Dask,
and analyze large datasets stored in the cloud.
They have a full user environment available to them through a website,
can leverage thousands of cores for computation,
and use existing APIs and workflows that look familiar to how they work on their laptop.&lt;/p&gt;
&lt;p&gt;A video walk-through follows below:&lt;/p&gt;
&lt;iframe width="560"
        height="315"
        src="https://www.youtube.com/embed/rSOJKbfNBNk"
        frameborder="0"
        allow="autoplay; encrypted-media"
        allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;We assembled this system from a number of pieces and technologies:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://jupyterhub.readthedocs.io/en/latest/"&gt;JupyterHub&lt;/a&gt;: Launches single-user notebook servers
and handles user management for us.
In particular we use the KubeSpawner and the excellent documentation at &lt;a class="reference external" href="https://zero-to-jupyterhub.readthedocs.io/en/latest"&gt;Zero to JupyterHub&lt;/a&gt;,
which we recommend to anyone interested in this area.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/jupyterhub/kubespawner"&gt;KubeSpawner&lt;/a&gt;: A JupyterHub spawner that makes it easy to launch single-user notebook servers on Kubernetes systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://jupyterlab-tutorial.readthedocs.io/en/latest/"&gt;JupyterLab&lt;/a&gt;: The newer version of the classic notebook,
which we use to provide a richer remote user interface,
complete with terminals, file management, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://xarray.pydata.org"&gt;XArray&lt;/a&gt;: Provides computation on NetCDF-style data.
XArray extends NumPy and Pandas to enable scientists to express complex computations on complex datasets
in ways that they find intuitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://dask.pydata.org"&gt;Dask&lt;/a&gt;: Provides the parallel computation behind XArray&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask/daskernetes"&gt;Daskernetes&lt;/a&gt;: Makes it easy to launch Dask clusters on Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;: In case it’s not already clear, all of this is based on Kubernetes,
which manages launching programs (like Jupyter notebook servers or Dask workers) on different machines,
while handling load balancing, permissions, and so on&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://cloud.google.com/kubernetes-engine/"&gt;Google Container Engine&lt;/a&gt;: Google’s managed Kubernetes service.
Every major cloud provider now has such a system,
which makes us happy about not relying too heavily on one system&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://gcsfs.readthedocs.io/en/latest/"&gt;GCSFS&lt;/a&gt;: A Python library providing intuitive access to Google Cloud Storage,
either through Python file interfaces or through a &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace"&gt;FUSE&lt;/a&gt; file system&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://zarr.readthedocs.io/en/stable/"&gt;Zarr&lt;/a&gt;: A chunked array storage format that is suitable for the cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
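&lt;p&gt;As a rough illustration of why chunked storage suits object stores, the sketch below maps an array selection to the chunks that overlap it, so that only those chunks need to be fetched. It assumes each chunk is stored under its own key; the function chunk_keys and the dot-separated key names are illustrative and are not Zarr's API.&lt;/p&gt;

```python
from itertools import product


def chunk_keys(selection, chunk_shape):
    """Return keys of the chunks that overlap a selection.

    selection is a list of (start, stop) pairs, one per dimension,
    with stop exclusive. Key names here are illustrative only.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # index of the first chunk touched
        last = (stop - 1) // size      # index of the last chunk touched
        per_dim.append(range(first, last + 1))
    return [".".join(str(i) for i in combo) for combo in product(*per_dim)]


# A 2-D array stored in 100 x 100 chunks; read rows 150-249, columns 0-99.
keys = chunk_keys([(150, 250), (0, 100)], chunk_shape=(100, 100))
print(keys)
```

&lt;p&gt;In this example the selection touches two chunks, so a reader would issue two independent requests rather than scanning a whole file.&lt;/p&gt;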
&lt;/section&gt;
&lt;section id="collaboration"&gt;
&lt;h1&gt;Collaboration&lt;/h1&gt;
&lt;p&gt;We were able to build, deploy, and use this system to answer real science questions in a couple of weeks.
We feel that this result is significant in its own right,
and that it came about largely because we collaborated widely.
This project required the expertise of several individuals across several projects, institutions, and funding sources.
Here are a few examples of who did what from which organization.
We list institutions and positions mostly to show the roles involved.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Alistair Miles, Professor, Oxford:
Helped to optimize Zarr for XArray on GCS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jacob Tomlinson, Staff, UK Met Informatics Lab:
Developed original JADE deployment and early Dask-Kubernetes work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Hamman, Postdoc, National Center for Atmospheric Research:
Provided scientific use case, data, and workflow.
Tuned XArray and Zarr for efficient data storing and saving.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Durant, Software developer, Anaconda Inc.:
Tuned GCSFS for many-access workloads. Also provided FUSE system for NetCDF support&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matt Pryor, Staff, Centre for Environmental Data Analysis:
Extended original JADE deployment and early Dask-Kubernetes work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matthew Rocklin, Software Developer, Anaconda Inc.:
Integration and performance testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ryan Abernathey, Assistant Professor, Columbia University:
XArray + Zarr support, scientific use cases, coordination&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stephan Hoyer, Software engineer, Google:
XArray support&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yuvi Panda, Staff, UC Berkeley BIDS and Data Science Education Program:
Provided assistance configuring JupyterHub with KubeSpawner.
Also prototyped the Daskernetes Dask + Kubernetes tool.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice the mix of academic and for-profit institutions.
Also notice the mix of scientists, staff, and professional software developers.
We believe that this mixture helps ensure the efficient construction of useful solutions.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="lessons"&gt;
&lt;h1&gt;Lessons&lt;/h1&gt;
&lt;p&gt;This experiment has taught us a few things that we hope to explore further:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Users can launch Kubernetes deployments from Kubernetes pods,
such as launching Dask clusters from their JupyterHub single-user notebooks.&lt;/p&gt;
&lt;p&gt;To do this well we need to start defining user roles more explicitly within JupyterHub.
We need to give users a safe and isolated space on the cluster to use without affecting their neighbors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HDF5 and NetCDF on cloud storage is an open question&lt;/p&gt;
&lt;p&gt;The file formats used for this sort of data are pervasive,
but not particularly convenient or efficient on cloud storage.
In particular, the libraries used to read them make many small reads,
each of which is costly when operating on cloud object storage.&lt;/p&gt;
&lt;p&gt;I see a few options:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Use FUSE file systems,
but tune them with tricks like read-ahead and caching
in order to compensate for HDF’s access patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the HDF group’s proposed HSDS service,
which promises to resolve these issues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adopt new file formats that are more cloud friendly.
Zarr is one such example that has so far performed admirably,
but certainly doesn’t have the long history of trust that HDF and NetCDF have earned.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Environment customization is important and tricky, especially when adding distributed computing.&lt;/p&gt;
&lt;p&gt;As soon as we show this to science groups, they want to try it out with their own software environments.
They can do this easily in their notebook session with tools like pip or conda,
but applying those same changes to their Dask workers is more challenging,
especially when those workers come and go dynamically.&lt;/p&gt;
&lt;p&gt;We have solutions for this.
They can build and publish Docker images.
They can add environment variables to specify extra pip or conda packages.
They can deploy their own pangeo deployment for their own group.&lt;/p&gt;
&lt;p&gt;However, all of these have taken some work to do well so far.
We hope that some combination of Binder-like publishing and small modification tricks like environment variables resolves this problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Our docker images are very large.
This means that users sometimes need to wait a minute or more for their session or their dask workers to start up
(less after things have warmed up a bit).&lt;/p&gt;
&lt;p&gt;It is surprising how much of this comes from conda and node packages.
We hope to resolve this both by improving our Docker hygiene
and by engaging packaging communities to audit package size.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore other clouds&lt;/p&gt;
&lt;p&gt;We started with Google just because their Kubernetes support has been around the longest,
but all major cloud providers (Google, AWS, Azure) now provide some level of managed Kubernetes support.
Everything we’ve done has been cloud-vendor agnostic, and various groups with data already on other clouds have reached out and are starting deployment on those systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Combine efforts with other groups&lt;/p&gt;
&lt;p&gt;We’re actually not the first group to do this.
The UK Met Informatics Lab quietly built a similar prototype, JADE (Jupyter and Dask Environment), many months ago.
We’re now collaborating to merge efforts.&lt;/p&gt;
&lt;p&gt;It’s also worth mentioning that they prototyped the first iteration of Daskernetes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reach out to other communities&lt;/p&gt;
&lt;p&gt;While we started our collaboration with atmospheric and oceanographic scientists,
these same solutions apply to many other disciplines.
We should investigate other fields and start collaborations with those communities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve Dask + XArray algorithms&lt;/p&gt;
&lt;p&gt;When we try new problems in new environments we often uncover new opportunities to improve Dask’s internal scheduling algorithms.
This case is no different :)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Much of this upcoming work is happening in the upstream projects,
so this experimentation is of concrete use to ongoing scientific research
and of broader use to the open source communities that these projects serve.&lt;/p&gt;
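&lt;p&gt;To make the read-ahead and caching option from lesson 2 concrete, here is a minimal sketch in plain Python. It is not how GCSFS or any FUSE layer is implemented; CountingStore and ReadAheadFile are hypothetical names, and the in-memory store only counts requests as a stand-in for round trips to object storage.&lt;/p&gt;

```python
class CountingStore:
    """Hypothetical stand-in for a cloud object: counts range requests."""

    def __init__(self, data):
        self.data = data
        self.requests = 0

    def read_range(self, start, length):
        # Each call models one round trip to object storage.
        self.requests += 1
        return self.data[start:start + length]


class ReadAheadFile:
    """Serve small reads from larger blocks, each fetched in one request."""

    def __init__(self, store, block_size):
        self.store = store
        self.block_size = block_size
        self.blocks = {}  # block index to bytes, kept as a simple cache

    def read(self, start, length):
        if length == 0:
            return b""
        first = start // self.block_size
        last = (start + length - 1) // self.block_size
        parts = []
        for index in range(first, last + 1):
            if index not in self.blocks:
                # Cache miss: fetch the whole block, not just the bytes asked for.
                self.blocks[index] = self.store.read_range(
                    index * self.block_size, self.block_size
                )
            parts.append(self.blocks[index])
        joined = b"".join(parts)
        offset = start - first * self.block_size
        return joined[offset:offset + length]


payload = bytes(range(256)) * 64  # 16 KiB of sample data

# Direct access: every small read is its own request.
direct = CountingStore(payload)
direct_bytes = b"".join(direct.read_range(i * 16, 16) for i in range(128))

# Read-ahead access: the same reads are served from 4 KiB blocks.
backing = CountingStore(payload)
cached = ReadAheadFile(backing, block_size=4096)
cached_bytes = b"".join(cached.read(i * 16, 16) for i in range(128))

print(direct.requests, backing.requests)
```

&lt;p&gt;With these numbers, 128 sixteen-byte reads cost 128 requests when made directly, but a single request when served through the 4 KiB read-ahead block. Real access patterns are less sequential, so the benefit depends on block size and locality.&lt;/p&gt;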
&lt;/section&gt;
&lt;section id="community-uptake"&gt;
&lt;h1&gt;Community uptake&lt;/h1&gt;
&lt;p&gt;We presented this at a couple of conferences over the past week.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;American Meteorological Society, Python Symposium, Keynote. Slides: &lt;a class="reference external" href="http://matthewrocklin.com/slides/ams-2018.html#/"&gt;http://matthewrocklin.com/slides/ams-2018.html#/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Earth Science Information Partners Winter Meeting. Video: &lt;a class="reference external" href="https://www.youtube.com/watch?v=mDrjGxaXQT4"&gt;https://www.youtube.com/watch?v=mDrjGxaXQT4&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We found that this project aligns well with current efforts from many government agencies to publish large datasets on cloud stores (mostly S3).
Many of these data publication endeavors seek a computational system to enable access for the scientific public.
Our project seems to complement these needs without significant coordination.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="disclaimers"&gt;
&lt;h1&gt;Disclaimers&lt;/h1&gt;
&lt;p&gt;While we encourage people to try out &lt;a class="reference external" href="http://pangeo.pydata.org"&gt;pangeo.pydata.org&lt;/a&gt; we also warn you that this system is immature.
In particular it has the following issues:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;it is insecure, so please do not host sensitive data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;it is unstable, and may be taken down at any time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;it is small: we only have a handful of cores deployed at any time, mostly for experimentation purposes&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However it is also &lt;em&gt;open&lt;/em&gt;, and instructions to deploy your own &lt;a class="reference external" href="https://github.com/pangeo-data/pangeo/tree/master/gce"&gt;live here&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="come-help"&gt;
&lt;h1&gt;Come help&lt;/h1&gt;
&lt;p&gt;We are a growing group of technologists, scientists, and open source projects from many institutions.
There is plenty to do and plenty to discuss.
Please engage with us at &lt;a class="reference external" href="https://github.com/pangeo-data/pangeo/issues/new"&gt;github.com/pangeo-data/pangeo/issues/new&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2018/01/22/pangeo-2/"/>
    <summary>This work is supported by Anaconda Inc, the NSF
EarthCube program, and UC Berkeley BIDS</summary>
    <category term="Programming" label="Programming"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <category term="scipy" label="scipy"/>
    <published>2018-01-22T00:00:00+00:00</published>
  </entry>
</feed>
