<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged dask array</title>
  <updated>2026-03-05T15:05:26.112477+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/dask-array/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2024/11/21/dask-detrending/</id>
    <title>Improving GroupBy.map with Dask and Xarray</title>
    <updated>2024-11-21T00:00:00+00:00</updated>
    <author>
      <name>Patrick Hoefler</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This post was originally published on the &lt;a class="reference external" href="https://xarray.dev/blog/dask-detrending"&gt;Xarray blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Running large-scale GroupBy-Map patterns with Xarray that are backed by &lt;a class="reference external" href="https://docs.dask.org/en/stable/array.html"&gt;Dask arrays&lt;/a&gt; is
an essential part of a lot of typical geospatial workloads. Detrending is a very common
operation where this pattern is needed.&lt;/p&gt;
&lt;p&gt;In this post, we will explore how and why this caused so many pitfalls for Xarray users in
the past and how we improved performance and scalability with a few changes to how Dask
subselects data.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/11/21/dask-detrending.md&lt;/span&gt;, line 20)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-is-groupby-map"&gt;

&lt;p&gt;&lt;a class="reference external" href="https://docs.xarray.dev/en/stable/generated/xarray.core.groupby.DatasetGroupBy.map.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GroupBy.map&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; lets you apply a User Defined Function (UDF)
that accepts and returns Xarray objects. The UDF will receive an Xarray object (either a Dataset or a DataArray) containing Dask arrays corresponding to one single group.
&lt;a class="reference external" href="https://docs.xarray.dev/en/stable/generated/xarray.core.groupby.DatasetGroupBy.reduce.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Groupby.reduce&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; is quite similar
in that it applies a UDF, but in this case the UDF will receive the underlying Dask arrays, &lt;em&gt;not&lt;/em&gt; Xarray objects.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/11/21/dask-detrending.md&lt;/span&gt;, line 27)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-application"&gt;
&lt;h1&gt;The Application&lt;/h1&gt;
&lt;p&gt;Consider a typical workflow where you want to apply a detrending step. You may want to smooth out
the data by removing the trends over time. This is a common operation in climate science
and normally looks roughly like this:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;detrending_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataArray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataArray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# important: the rolling operation is applied within a group&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_periods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;time.dayofyear&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detrending_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are grouping by the day of the year and then are calculating the rolling average over
30-year windows for a particular day.&lt;/p&gt;
&lt;p&gt;Our example will run on a 1 TiB array, 64 years worth of data and the following structure:&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Python repr output of 1 TiB Dask array with shape (1801, 3600, 233376) split into 5460, 250 MiB chunks of (300, 300, 365)" src="/images/dask-detrending/input-array.png" style="width: 800px;"/&gt;
&lt;/figure&gt;
&lt;p&gt;The array isn’t overly huge and the chunks are reasonably sized.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/11/21/dask-detrending.md&lt;/span&gt;, line 52)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="the-problem"&gt;
&lt;h1&gt;The Problem&lt;/h1&gt;
&lt;p&gt;The general application seems straightforward. Group by the day of the year and apply a UDF
to every group. There are a few pitfalls in this application that can make the result of
this operation unusable. Our array is sorted by time, which means that we have to pick
entries from many different areas in the array to create a single group (corresponding to a single day of the year).
Picking the same day of every year is basically a slicing operation with a step size of 365.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Schematic showing an array sorted by time, where data is selected from many different areas in the array to create a single group (corresponding to a specific day of the year)." src="/images/dask-detrending/indexing-data-selection.png" title="Data Selection Pattern" style="width: 800px;"/&gt;
&lt;/figure&gt;
&lt;p&gt;Our example has a year worth of data in a single chunk along the time axis. The general problem
exists for any workload where you have to access random entries of data. This
particular access pattern means that we have to pick one value per chunk, which is pretty
inefficient. The right side shows the individual groups that we are operating on.&lt;/p&gt;
&lt;p&gt;One of the main issues with this pattern is that Dask will create a single output chunk per time
entry, e.g. each group will consist of as many chunks as we have year.&lt;/p&gt;
&lt;p&gt;This results in a huge increase in the number of chunks:&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Python repr output of a 1 TiB Dask array with nearly 2 million, 700 kiB chunks." src="/images/dask-detrending/output-array-old.png" style="width: 800px;"/&gt;
&lt;/figure&gt;
&lt;p&gt;This simple operation increases the number of chunks from 5000 to close to 2 million. Each
chunk only has a few hundred kilobytes of data. &lt;strong&gt;This is pretty bad!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dask computations generally scale along the number of chunks you have. Increasing the chunks by such
a large factor is catastrophic. Each follow-up operation, as simple as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;a-b&lt;/span&gt;&lt;/code&gt; will create 2 million
additional tasks.&lt;/p&gt;
&lt;p&gt;The only workaround for users was to rechunk to something more sensible afterward, but it
still keeps the incredibly expensive indexing operation in the graph.&lt;/p&gt;
&lt;p&gt;Note this is the underlying problem that is &lt;a class="reference external" href="https://xarray.dev/blog/flox"&gt;solved by flox&lt;/a&gt; for aggregations like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.mean()&lt;/span&gt;&lt;/code&gt;
using parallel-native algorithms to avoid the expense of indexing out each group.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/11/21/dask-detrending.md&lt;/span&gt;, line 91)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="improvements-to-the-data-selection-algorithm"&gt;
&lt;h1&gt;Improvements to the Data Selection algorithm&lt;/h1&gt;
&lt;p&gt;The method of how Dask selected the data was objectively pretty bad.
A rewrite of the underlying algorithm enabled us to achieve a much more robust result. The new
algorithm is a lot smarter about how to pick values from each individual chunk, but most importantly,
it will try to preserve the input chunksize as closely as possible.&lt;/p&gt;
&lt;p&gt;For our initial example, it will put every group into a single chunk. This means that we will
end up with the number of chunks along the time axis being equal to the number of groups, i.e. 365.&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Python repr output of a 1 TiB Dask array with 31164, 43 MiB chunks" src="/images/dask-detrending/output-array-new.png" style="width: 800px;"/&gt;
&lt;/figure&gt;
&lt;p&gt;The algorithm reduces the number of chunks from 2 million to roughly 30 thousand, which is a huge improvement
and a scale that Dask can easily handle. The graph is now much smaller, and the follow-up operations
will run a lot faster as well.&lt;/p&gt;
&lt;p&gt;This improvement will help every operation that we listed above and make the scale a lot more
reliably than before. The algorithm is used very widely across Dask and Xarray and thus, influences
many methods.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2024/11/21/dask-detrending.md&lt;/span&gt;, line 113)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-s-next"&gt;
&lt;h1&gt;What’s next?&lt;/h1&gt;
&lt;p&gt;Xarray selects one group at a time for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(...).map(...)&lt;/span&gt;&lt;/code&gt;, i.e. this requires one operation
per group. This will hurt scalability if the dataset has a very large number of groups, because
the computation will create a very expensive graph. There is currently an effort to implement alternative
APIs that are shuffle-based to circumvent that problem. A current PR is available &lt;a class="reference external" href="https://github.com/pydata/xarray/pull/9320"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The fragmentation of the output chunks by indexing is something that will hurt every workflow that is selecting data in a random
pattern. This also includes:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.sel&lt;/span&gt;&lt;/code&gt; if you aren’t using slices explicitly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.isel&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.sortby&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(...).quantile()&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and many more.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We expect all of these workloads to be substantially improved now.&lt;/p&gt;
&lt;p&gt;Additionally, &lt;a class="reference external" href="https://docs.dask.org/en/stable/changelog.html#v2024-11-1"&gt;Dask improved a lot of things&lt;/a&gt; related to either increasing chunksizes or fragmentation
of chunks over the cycle of a workload with more improvements to come. This will help a lot of
users to get better and more reliable performance.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2024/11/21/dask-detrending/"/>
    <summary>This post was originally published on the Xarray blog.</summary>
    <category term="daskarray" label="dask array"/>
    <category term="xarray" label="xarray"/>
    <published>2024-11-21T00:00:00+00:00</published>
  </entry>
</feed>
