<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts by Benjamin Zaitlen &amp; James Bourbeau</title>
  <updated>2026-03-05T15:05:19.486867+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/author/benjamin-zaitlen-james-bourbeau/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/10/08/df-groupby/</id>
    <title>DataFrame Groupby Aggregations</title>
    <updated>2019-10-08T00:00:00+00:00</updated>
    <author>
      <name>Benjamin Zaitlen &amp; James Bourbeau</name>
    </author>
    <content type="html">
&lt;section id="groupby-aggregations-with-dask"&gt;

&lt;p&gt;In this post we’ll dive into how Dask computes groupby aggregations. These are commonly used operations for ETL and analysis in which we split data into groups, apply a function to each group independently, and then combine the results back together. In the PyData/R world this is often referred to as the split-apply-combine strategy (a term coined by &lt;a class="reference external" href="https://www.jstatsoft.org/article/view/v040i01"&gt;Hadley Wickham&lt;/a&gt;) and is used widely throughout the &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html"&gt;Pandas ecosystem&lt;/a&gt;.&lt;/p&gt;
&lt;div align="center"&gt;
  &lt;a href="/images/split-apply-combine.png"&gt;
    &lt;img src="/images/split-apply-combine.png" width="80%" align="center"&gt;
  &lt;/a&gt;
  &lt;p align="center"&gt;&lt;i&gt;Image courtesy of swcarpentry.github.io&lt;/i&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Dask leverages this idea using a similarly catchy name: apply-concat-apply or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; for short. Here we’ll explore the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; strategy in both simple and complex operations.&lt;/p&gt;
&lt;p&gt;First, recall that a Dask DataFrame is a &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#internal-design"&gt;collection&lt;/a&gt; of DataFrame objects (e.g. each &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-design.html#partitions"&gt;partition&lt;/a&gt; of a Dask DataFrame is a Pandas DataFrame). For example, let’s say we have the following Pandas DataFrame:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;                       &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
&lt;span class="go"&gt;     a   b   c&lt;/span&gt;
&lt;span class="go"&gt;0    1   1   2&lt;/span&gt;
&lt;span class="go"&gt;1    1   3   4&lt;/span&gt;
&lt;span class="go"&gt;2    2  10   5&lt;/span&gt;
&lt;span class="go"&gt;3    3   3   2&lt;/span&gt;
&lt;span class="go"&gt;4    3   2   3&lt;/span&gt;
&lt;span class="go"&gt;5    1   1   5&lt;/span&gt;
&lt;span class="go"&gt;6    1   3   2&lt;/span&gt;
&lt;span class="go"&gt;7    2  10   3&lt;/span&gt;
&lt;span class="go"&gt;8    3   3   9&lt;/span&gt;
&lt;span class="go"&gt;9    3   3   2&lt;/span&gt;
&lt;span class="go"&gt;10  99  12  44&lt;/span&gt;
&lt;span class="go"&gt;11  10   0  33&lt;/span&gt;
&lt;span class="go"&gt;12   1   9   2&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To create a Dask DataFrame with three partitions from this data, we could partition &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; between the indices of: (0, 4), (5, 9), and (10, 12). We can perform this partitioning with Dask by using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_pandas&lt;/span&gt;&lt;/code&gt; function with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;npartitions=3&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 3 partitions are simply 3 individual Pandas DataFrames:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;   a   b  c&lt;/span&gt;
&lt;span class="go"&gt;0  1   1  2&lt;/span&gt;
&lt;span class="go"&gt;1  1   3  4&lt;/span&gt;
&lt;span class="go"&gt;2  2  10  5&lt;/span&gt;
&lt;span class="go"&gt;3  3   3  2&lt;/span&gt;
&lt;span class="go"&gt;4  3   2  3&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="apply-concat-apply"&gt;
&lt;h1&gt;Apply-concat-apply&lt;/h1&gt;
&lt;p&gt;When Dask applies a function or algorithm (e.g. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;, etc.) to a Dask DataFrame, it does so by applying that operation to all the constituent partitions independently, collecting (or concatenating) the outputs into intermediary results, and then applying the operation again to the intermediary results to produce a final result. Dask re-uses this apply-concat-apply methodology for many of its internal DataFrame calculations.&lt;/p&gt;
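&lt;p&gt;The three steps described above can be sketched with plain Pandas. This is only a toy illustration under stated assumptions, not Dask’s implementation: the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;toy_aca&lt;/span&gt;&lt;/code&gt; helper and its &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chunk&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggregate&lt;/span&gt;&lt;/code&gt; parameters are hypothetical names, and the partitions are hand-made slices of a small Series:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import pandas as pd


def toy_aca(partitions, chunk, aggregate):
    """Hypothetical helper: apply `chunk` to each partition, concatenate
    the partial results, then apply `aggregate` to the concatenation."""
    partials = [chunk(part) for part in partitions]  # apply
    combined = pd.concat(partials)                   # concat
    return aggregate(combined)                       # apply again


s = pd.Series([2, 4, 5, 2, 3, 5, 2, 3, 9, 2, 44, 33, 2], name="c")
partitions = [s[:5], s[5:10], s[10:]]

# Each chunk returns a one-element Series so the partials can be concatenated.
total = toy_aca(
    partitions,
    chunk=lambda part: pd.Series([part.sum()]),
    aggregate=lambda combined: combined.sum(),
)

# The partitioned sum agrees with summing the whole Series at once.
assert total == s.sum()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt;, the chunk and aggregate steps happen to be the same operation; other reductions may need different functions for each step.&lt;/p&gt;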
&lt;p&gt;Let’s break down how Dask computes &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf.groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; by going through each step in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; process. We’ll begin by splitting our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; Pandas DataFrame into three partitions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;df_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;section id="apply"&gt;
&lt;h2&gt;Apply&lt;/h2&gt;
&lt;p&gt;Next we perform the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; operation on each of our three partitions:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;These operations each produce a Series with a &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html"&gt;MultiIndex&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr1
a  b
1  1     2
   3     4
2  10    5
3  2     3
   3     2
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr2
a  b
1  1      5
   3      2
2  10     3
3  3     11
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; sr3
a   b
1   9      2
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="the-concat"&gt;
&lt;h2&gt;The conCat!&lt;/h2&gt;
&lt;p&gt;After the first &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;apply&lt;/span&gt;&lt;/code&gt;, the next step is to concatenate the intermediate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr1&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr2&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr3&lt;/span&gt;&lt;/code&gt; results. This is fairly straightforward to do using the Pandas &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;concat&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr_concat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sr1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;sr_concat&lt;/span&gt;
&lt;span class="go"&gt;a   b&lt;/span&gt;
&lt;span class="go"&gt;1   1      2&lt;/span&gt;
&lt;span class="go"&gt;    3      4&lt;/span&gt;
&lt;span class="go"&gt;2   10     5&lt;/span&gt;
&lt;span class="go"&gt;3   2      3&lt;/span&gt;
&lt;span class="go"&gt;    3      2&lt;/span&gt;
&lt;span class="go"&gt;1   1      5&lt;/span&gt;
&lt;span class="go"&gt;    3      2&lt;/span&gt;
&lt;span class="go"&gt;2   10     3&lt;/span&gt;
&lt;span class="go"&gt;3   3     11&lt;/span&gt;
&lt;span class="go"&gt;1   9      2&lt;/span&gt;
&lt;span class="go"&gt;10  0     33&lt;/span&gt;
&lt;span class="go"&gt;99  12    44&lt;/span&gt;
&lt;span class="go"&gt;Name: c, dtype: int64&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
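&lt;p&gt;One way to see why a second apply is needed is to check the concatenated index for repeated entries. The following is a minimal, self-contained sketch that rebuilds the example data; the variable names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partials&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;repeated&lt;/span&gt;&lt;/code&gt; are introduced here for illustration:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 2, 3, 3, 1, 1, 2, 3, 3, 99, 10, 1],
                       b=[1, 3, 10, 3, 2, 1, 3, 10, 3, 3, 12, 0, 9],
                       c=[2, 4, 5, 2, 3, 5, 2, 3, 9, 2, 44, 33, 2]))

# Per-partition groupby sums, then a plain concatenation.
partials = [part.groupby(["a", "b"]).c.sum()
            for part in (df[:5], df[5:10], df[-3:])]
sr_concat = pd.concat(partials)

# The same (a, b) pair can appear once per partition,
# so the concatenated index is not unique.
print(sr_concat.index.is_unique)

# Entries whose (a, b) pair occurs in more than one partition.
repeated = sr_concat[sr_concat.index.duplicated(keep=False)]
print(repeated)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;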
&lt;/section&gt;
&lt;section id="apply-redux"&gt;
&lt;h2&gt;Apply Redux&lt;/h2&gt;
&lt;p&gt;Our final step is to apply the same &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;groupby(['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']).c.sum()&lt;/span&gt;&lt;/code&gt; operation again on the concatenated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sr_concat&lt;/span&gt;&lt;/code&gt; Series. However, we no longer have columns &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;a&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;b&lt;/span&gt;&lt;/code&gt;, so how should we proceed?&lt;/p&gt;
&lt;p&gt;Zooming out a bit, our goal is to add the values in the column which have the same index. For example, there are two rows with the index &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;(1,&lt;/span&gt; &lt;span class="pre"&gt;1)&lt;/span&gt;&lt;/code&gt; with corresponding values: 2, 5. So how can we group by the indices with the same value? A MultiIndex uses &lt;a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex"&gt;levels&lt;/a&gt; to define what the value is at a given index. Dask &lt;a class="reference external" href="https://github.com/dask/dask/blob/973c6e1b2e38c2d9d6e8c75fb9b4ab7a0d07e6a7/dask/dataframe/groupby.py#L69-L75"&gt;determines&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dask/dask/blob/973c6e1b2e38c2d9d6e8c75fb9b4ab7a0d07e6a7/dask/dataframe/groupby.py#L1065"&gt;uses these levels&lt;/a&gt; in the final apply step of the apply-concat-apply calculation. In our case, the level is &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;[0,&lt;/span&gt; &lt;span class="pre"&gt;1]&lt;/span&gt;&lt;/code&gt;; that is, we want the index at both the 0th and the 1st level. If we group by both, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;0,&lt;/span&gt; &lt;span class="pre"&gt;1&lt;/span&gt;&lt;/code&gt;, we will have effectively grouped the same indices together:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sr_concat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; total
a   b
1   1      7
    3      6
    9      2
2   10     8
3   2      3
    3     13
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; ddf.groupby(['a', 'b']).c.sum().compute()
a   b
1   1      7
    3      6
2   10     8
3   2      3
    3     13
1   9      2
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
    &lt;th&gt;
      &lt;pre&gt;
&gt;&gt;&gt; df.groupby(['a', 'b']).c.sum()
a   b
1   1      7
    3      6
    9      2
2   10     8
3   2      3
    3     13
10  0     33
99  12    44
Name: c, dtype: int64
      &lt;/pre&gt;
    &lt;/th&gt;
  &lt;/tr&gt;
&lt;/table&gt;
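&lt;p&gt;The whole walk-through can be condensed into one self-contained check. The sketch below uses plain Pandas only (so it imitates, rather than calls, Dask), and it groups on the index level names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;['a',&lt;/span&gt; &lt;span class="pre"&gt;'b']&lt;/span&gt;&lt;/code&gt; instead of the positions &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;[0,&lt;/span&gt; &lt;span class="pre"&gt;1]&lt;/span&gt;&lt;/code&gt;, which Pandas also accepts for the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;level&lt;/span&gt;&lt;/code&gt; argument:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 2, 3, 3, 1, 1, 2, 3, 3, 99, 10, 1],
                       b=[1, 3, 10, 3, 2, 1, 3, 10, 3, 3, 12, 0, 9],
                       c=[2, 4, 5, 2, 3, 5, 2, 3, 9, 2, 44, 33, 2]))

# Apply: the groupby-sum runs on each partition independently.
partials = [part.groupby(["a", "b"]).c.sum()
            for part in (df[:5], df[5:10], df[-3:])]

# Concat, then apply again, this time grouping on the index levels.
total = pd.concat(partials).groupby(level=["a", "b"]).sum()

# The partitioned result matches a single-pass Pandas groupby.
expected = df.groupby(["a", "b"]).c.sum()
assert total.to_dict() == expected.to_dict()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;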
&lt;p&gt;Additionally, we can easily examine the steps of this apply-concat-apply calculation by &lt;a class="reference external" href="https://docs.dask.org/en/latest/graphviz.html"&gt;visualizing the task graph&lt;/a&gt; for the computation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/sum.svg"&gt;
  &lt;img src="/images/sum.svg" width="80%"&gt;
&lt;/a&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; is a rather straightforward calculation. What about something a bit more complex like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;a href="/images/mean.svg"&gt;
  &lt;img src="/images/mean.svg" width="80%"&gt;
&lt;/a&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Mean&lt;/span&gt;&lt;/code&gt; is a good example of an operation which doesn’t directly fit in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; model: concatenating &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; values and taking the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt; again will generally yield incorrect results, because each partition can hold a different number of rows for a given group. As with any style of computation (vectorization, Map/Reduce, etc.), we sometimes need to creatively fit the computation to the style/mode. In the case of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aca&lt;/span&gt;&lt;/code&gt; we can often break down the calculation into constituent parts. For &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;mean&lt;/span&gt;&lt;/code&gt;, this would be &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sum&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;count&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[ \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}\]&lt;/div&gt;
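&lt;p&gt;The same formula can be written per partition. As a worked sketch, assume the data is split into &lt;span class="math notranslate nohighlight"&gt;\(p\)&lt;/span&gt; partitions and that partition &lt;span class="math notranslate nohighlight"&gt;\(k\)&lt;/span&gt; contributes a partial sum &lt;span class="math notranslate nohighlight"&gt;\(s_k\)&lt;/span&gt; over &lt;span class="math notranslate nohighlight"&gt;\(n_k\)&lt;/span&gt; values (these symbols are introduced here for illustration and are not taken from the Dask source):&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[ \bar{x} = \frac{s_1+s_2+\cdots +s_p}{n_1+n_2+\cdots +n_p}\]&lt;/div&gt;
&lt;p&gt;For the group (3, 3) in our example, the first partition holds the single value 2 and the second holds 9 and 2, so the partial sums are 2 and 11 and the counts are 1 and 2:&lt;/p&gt;
&lt;div class="math notranslate nohighlight"&gt;
\[ \bar{x}_{(3,3)} = \frac{2+11}{1+2} = \frac{13}{3} \approx 4.33\]&lt;/div&gt;
&lt;p&gt;Averaging the two partition means instead would give (2 + 5.5) / 2 = 3.75, which is not the mean of the three values.&lt;/p&gt;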
&lt;p&gt;From the task graph above, we can see two independent tasks for each partition: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-count-chunk&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-sum-chunk&lt;/span&gt;&lt;/code&gt;. The results are then aggregated into two final nodes, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-count-agg&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;series-groupby-sum-agg&lt;/span&gt;&lt;/code&gt;, and finally we calculate the mean: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total&lt;/span&gt; &lt;span class="pre"&gt;sum&lt;/span&gt; &lt;span class="pre"&gt;/&lt;/span&gt; &lt;span class="pre"&gt;total&lt;/span&gt; &lt;span class="pre"&gt;count&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
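&lt;p&gt;The sum-and-count decomposition can be imitated in plain Pandas. This is a hedged sketch of the idea rather than Dask’s own code: the partitions are hand-made slices, and the names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sums&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;counts&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;naive&lt;/span&gt;&lt;/code&gt; are chosen here for illustration. It also computes the mean of per-partition means to show how that shortcut differs:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 2, 3, 3, 1, 1, 2, 3, 3, 99, 10, 1],
                       b=[1, 3, 10, 3, 2, 1, 3, 10, 3, 3, 12, 0, 9],
                       c=[2, 4, 5, 2, 3, 5, 2, 3, 9, 2, 44, 33, 2]))
parts = (df[:5], df[5:10], df[-3:])
keys = ["a", "b"]

# Chunk step: per-partition sums and counts for each (a, b) group.
sums = pd.concat([p.groupby(keys).c.sum() for p in parts])
counts = pd.concat([p.groupby(keys).c.count() for p in parts])

# Aggregate step: total sum / total count gives the mean per group.
mean = sums.groupby(level=keys).sum() / counts.groupby(level=keys).sum()

# Naive alternative: averaging per-partition means ignores group sizes.
partition_means = pd.concat([p.groupby(keys).c.mean() for p in parts])
naive = partition_means.groupby(level=keys).mean()

# Compare both against a single-pass Pandas groupby for the group (3, 3).
expected = df.groupby(keys).c.mean()
print(mean.loc[(3, 3)], naive.loc[(3, 3)], expected.loc[(3, 3)])
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For the group (3, 3), the sum/count route agrees with the single-pass result, while the mean of means does not.&lt;/p&gt;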
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/10/08/df-groupby/"/>
    <summary>In this post we’ll dive into how Dask computes groupby aggregations.</summary>
    <category term="dask" label="dask"/>
    <category term="dataframe" label="dataframe"/>
    <published>2019-10-08T00:00:00+00:00</published>
  </entry>
</feed>
