<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posted in 2023</title>
  <updated>2026-03-05T15:05:21.111544+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/2023/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2023/08/25/dask-expr-introduction/</id>
    <title>High Level Query Optimization in Dask</title>
    <updated>2023-08-25T00:00:00+00:00</updated>
    <author>
      <name>Patrick Hoefler</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work was engineered and supported by &lt;a class="reference external" href="https://coiled.io/?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-expr"&gt;Coiled&lt;/a&gt; and &lt;a class="reference external" href="https://www.nvidia.com/"&gt;NVIDIA&lt;/a&gt;. Thanks to &lt;a class="reference external" href="https://github.com/phofl"&gt;Patrick Hoefler&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/rjzamora"&gt;Rick Zamora&lt;/a&gt;, in particular. Original version of this post appears on &lt;a class="reference external" href="https://blog.coiled.io/blog/dask-expr-introduction.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-expr"&gt;blog.coiled.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Expression tree encoded by dask-expr" src="/images/dask_expr/dask-expr-introduction-title.png" style="width: 70%;"/&gt;
&lt;/figure&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 16)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="introduction"&gt;

&lt;p&gt;Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would).
This means that users waste a lot of computation. Let’s look at a common example
which looks ok at first glance, but is actually pretty inefficient.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# unnecessarily reads all rows and columns&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hvfhs_license_num&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# could push the filter into the read parquet call&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                                &lt;span class="c1"&gt;# should read only necessary columns&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We can make this run much faster with a few simple steps:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;==&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tips&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Currently, Dask DataFrame wouldn’t optimize this for you, but a new effort that is built around
logical query planning in Dask DataFrame will do this for you. This article introduces some of
those changes that are developed in &lt;a class="reference external" href="https://github.com/dask-contrib/dask-expr"&gt;dask-expr&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can install and try &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; with:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are using the &lt;a class="reference external" href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page"&gt;NYC Taxi&lt;/a&gt;
dataset in this post.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 60)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dask-expressions"&gt;
&lt;h1&gt;Dask Expressions&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/dask-contrib/dask-expr"&gt;Dask expressions&lt;/a&gt; provides a logical query planning layer on
top of Dask DataFrames. Let’s look at our initial example and investigate how we can improve the efficiency
through a query optimization layer. As noted initially, there are a couple of things that aren’t ideal:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;We are reading all rows into memory instead of filtering while reading the parquet files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We are reading all columns into memory instead of only the columns that are necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We are applying the filter and the aggregation onto all columns instead of only &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query optimization layer from &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; can help us with that. It will look at this expression
and determine that not all rows are needed. An intermediate layer will transpile the filter into
a valid filter-expression for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;==&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This still reads every column into memory and will compute the sum of every numeric column. The
next optimization step is to push the column selection into the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; call as well.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;==&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;HV0003&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a basic example that you could rewrite by hand. Use cases that are closer to real
workflows might potentially have hundreds of columns, which makes rewriting them very strenuous
if you need a non-trivial subset of them.&lt;/p&gt;
&lt;p&gt;Let’s take a look at how we can achieve this. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; records the expression as given by the
user in an expression tree:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;Projection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tips&amp;#39;&lt;/span&gt;
  &lt;span class="n"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;ReadParquet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://coiled-datasets/uber-lyft-tlc/&amp;#39;&lt;/span&gt;
      &lt;span class="n"&gt;EQ&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HV0003&amp;#39;&lt;/span&gt;
        &lt;span class="n"&gt;Projection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hvfhs_license_num&amp;#39;&lt;/span&gt;
          &lt;span class="n"&gt;ReadParquet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://coiled-datasets/uber-lyft-tlc/&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This tree represents the expression as is. We can observe that we would read the whole dataset into
memory before we apply the projections and filters. One observation of note: It seems like we
are reading the dataset twice, but Dask is able to fuse tasks that are doing the same to avoid
computing these things twice. Let’s reorder the expression to make it more efficient:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;simplify&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="n"&gt;ReadParquet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://coiled-datasets/uber-lyft-tlc/&amp;#39;&lt;/span&gt;
               &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tips&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
               &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hvfhs_license_num&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;==&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;HV0003&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This looks quite a bit simpler. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; reordered the query and pushed the filter and the column
projection into the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; call. We were able to remove quite a few steps from our expression
tree and make the remaining expressions more efficient as well. This represents the steps that
we did manually in the beginning. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; performs these steps for arbitrary many columns without
increasing the burden on the developers.&lt;/p&gt;
&lt;p&gt;These are only the two most common and easy to illustrate optimization techniques from &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt;.
Some other useful optimizations are already available:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;len(...)&lt;/span&gt;&lt;/code&gt; will only use the Index to compute the length; additionally we can ignore many operations
that won’t change the shape of a DataFrame, like a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;replace&lt;/span&gt;&lt;/code&gt; call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;set_index&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sort_values&lt;/span&gt;&lt;/code&gt; won’t eagerly trigger computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Better informed selection of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;merge&lt;/span&gt;&lt;/code&gt; algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;…&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are still adding more optimization techniques to make Dask DataFrame queries more efficient.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 145)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="try-it-out"&gt;
&lt;h1&gt;Try it out&lt;/h1&gt;
&lt;p&gt;The project is in a state where interested users should try it out. We published a couple of
releases. The API covers a big chunk of the Dask DataFrame API, and we keep adding more.
We have already observed very impressive performance improvements for workflows that would benefit
from query optimization. Memory usage is down for these workflows as well.&lt;/p&gt;
&lt;p&gt;We are very much looking for feedback and potential avenues to improve the library. Please give it
a shot and share your experience with us.&lt;/p&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-expr&lt;/span&gt;&lt;/code&gt; is not integrated into the main Dask DataFrame implementation yet. You can install it
with:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The API is very similar to what Dask DataFrame provides. It exposes mostly the same methods as
Dask DataFrame does. You can use the same methods in most cases.&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_expr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You can find a list of supported operations in the
&lt;a class="reference external" href="https://github.com/dask-contrib/dask-expr#api-coverage"&gt;Readme&lt;/a&gt;. This project is still very much
in progress. The API might change without warning. We are aiming for weekly releases to push new
features out as fast as possible.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 174)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-are-we-adding-this-now"&gt;
&lt;h1&gt;Why are we adding this now?&lt;/h1&gt;
&lt;p&gt;Historically, Dask focused on flexibility and smart scheduling instead of query optimization.
The distributed scheduler built into Dask uses sophisticated algorithms to ensure ideal scheduling
of individual tasks. It tries to ensure that your resources are utilized as efficient as possible.
The graph construction process enables Dask users to build very
flexible and complicated graphs that reach beyond SQL operations. The flexibility that is provided
by the &lt;a class="reference external" href="https://docs.dask.org/en/latest/futures.html"&gt;Dask futures API&lt;/a&gt; requires very intelligent
algorithms, but it enables users to build highly sophisticated graphs. The following picture shows
the graph for a credit risk model:&lt;/p&gt;
&lt;a href="/images/dask_expr/graph_credit_risk_model.png"&gt;
&lt;img src="/images/dask_expr/graph_credit_risk_model.png"
     width="70%"
     alt="Computation graph representing a credit risk model"&gt;&lt;/a&gt;
&lt;p&gt;The nature of the powerful scheduler and the physical optimizations enables us to build very
complicated programs that will then run efficiently. Unfortunately, the nature of these optimizations
does not enable us to avoid scheduling work that is not necessary. This is where the current effort
to build high level query optimization into Dask comes in.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/08/25/dask-expr-introduction.md&lt;/span&gt;, line 195)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Dask comes with a very smart distributed scheduler but without much logical query planning. This
is something we are rectifying now through building a high level query optimizer into Dask
DataFrame. We expect to improve performance and reduce memory usage for an average Dask workflow.&lt;/p&gt;
&lt;p&gt;This API is read for interested users to play around with. It covers a good chunk of the DataFrame
API. The library is under active development, we expect to add many more interesting things over
the coming weeks and months.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/08/25/dask-expr-introduction/"/>
    <summary>This work was engineered and supported by Coiled and NVIDIA. Thanks to Patrick Hoefler and Rick Zamora, in particular. Original version of this post appears on blog.coiled.io</summary>
    <category term="dask" label="dask"/>
    <category term="performance" label="performance"/>
    <category term="queryoptimizer" label="query optimizer"/>
    <published>2023-08-25T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/04/18/dask-upstream-testing/</id>
    <title>Upstream testing in Dask</title>
    <updated>2023-04-18T00:00:00+00:00</updated>
    <author>
      <name>James Bourbeau</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;Original version of this post appears on &lt;a class="reference external" href="https://blog.coiled.io/blog/dask-upstream-testing.html?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-upstream-testing"&gt;blog.coiled.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.dask.org/"&gt;Dask&lt;/a&gt; has deep integrations with other libraries in the PyData ecosystem like NumPy, pandas, Zarr, PyArrow, and more.
Part of providing a good experience for Dask users is making sure that Dask continues to work well with this community
of libraries as they push out new releases. This post walks through how Dask maintainers proactively ensure Dask
continuously works with its surrounding ecosystem.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/18/dask-upstream-testing.md&lt;/span&gt;, line 17)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="nightly-testing"&gt;

&lt;p&gt;Dask has a &lt;a class="reference external" href="https://github.com/dask/dask/blob/834a19eaeb6a5d756ca4ea90b56ca9ac943cb051/.github/workflows/upstream.yml"&gt;dedicated CI build&lt;/a&gt;
that runs Dask’s normal test suite once a day with unreleased, nightly versions of
several libraries installed. This lets us check whether or not a recent change in a library like NumPy or pandas
breaks some aspect of Dask’s functionality.&lt;/p&gt;
&lt;p&gt;To increase visibility when such a breakage occurs, as part of the upstream CI build, an
issue is automatically opened that provides a summary of what tests failed and links to the build
logs for the corresponding failure (&lt;a class="reference external" href="https://github.com/dask/dask/issues/9736"&gt;here’s an example issue&lt;/a&gt;).&lt;/p&gt;
&lt;div align="center"&gt;
&lt;img src="/images/example-upstream-CI-issue.png" style="max-width: 100%;" width="100%" /&gt;
&lt;/div&gt;
&lt;p&gt;This makes it less likely that a failing upstream build goes unnoticed.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/18/dask-upstream-testing.md&lt;/span&gt;, line 36)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-things-can-break-and-are-fixed"&gt;
&lt;h1&gt;How things can break and are fixed&lt;/h1&gt;
&lt;p&gt;There are usually two different ways in which things break. Either:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A library made an intentional change in behavior and a corresponding compatibility change needs to be made in
Dask (the next section has an example of this case).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There was some unintentional consequence of a change made in a library that resulted in a breakage in Dask.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When the latter case occurs, Dask maintainers can then engage with other library maintainers to resolve the
unintended breakage. This all happens before any libraries push out a new release, so no user code breaks.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/18/dask-upstream-testing.md&lt;/span&gt;, line 47)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="example-pandas-2-0"&gt;
&lt;h1&gt;Example: pandas 2.0&lt;/h1&gt;
&lt;p&gt;One specific example of this process in action is the recent pandas 2.0 release.
This is a major version release and contains significant breaking changes like removing deprecated functionality.&lt;/p&gt;
&lt;p&gt;As these breaking changes were merged into pandas, we started seeing related failures in Dask’s upstream CI build.
Dask maintainers were then able to add
&lt;a class="reference external" href="https://github.com/dask/dask/pulls?q=is%3Apr+%22pandas+2.0%22+in%3Atitle"&gt;a variety of compatibility changes&lt;/a&gt; so
that Dask works well with pandas 2.0 immediately.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/18/dask-upstream-testing.md&lt;/span&gt;, line 57)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;Special thanks to &lt;a class="reference external" href="https://github.com/keewis"&gt;Justus Magin&lt;/a&gt; for his work on the
&lt;a class="reference external" href="https://github.com/xarray-contrib/issue-from-pytest-log"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;xarray-contrib/issue-from-pytest-log&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; GitHub action.
We’ve found this to be really convenient for easily opening up GitHub issues when test failures occur.&lt;/p&gt;
&lt;p&gt;Also, thanks to &lt;a class="reference external" href="https://github.com/j-bennet"&gt;Irina Truong&lt;/a&gt; (&lt;a class="reference external" href="https://www.coiled.io/?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-upstream-testing"&gt;Coiled&lt;/a&gt;), &lt;a class="reference external" href="https://github.com/phofl"&gt;Patrick Hoefler&lt;/a&gt; (&lt;a class="reference external" href="https://www.coiled.io/?utm_source=dask-blog&amp;amp;amp;utm_medium=dask-upstream-testing"&gt;Coiled&lt;/a&gt;),
and &lt;a class="reference external" href="https://github.com/mroeschke"&gt;Matthew Roeschke&lt;/a&gt; (&lt;a class="reference external" href="https://rapids.ai/"&gt;NVIDIA&lt;/a&gt;) for their efforts ensuring pandas and Dask
continue to work together.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/04/18/dask-upstream-testing/"/>
    <summary>Original version of this post appears on blog.coiled.io</summary>
    <category term="dask" label="dask"/>
    <category term="ecosystem" label="ecosystem"/>
    <category term="pydata" label="pydata"/>
    <published>2023-04-18T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/04/14/scheduler-environment-requirements/</id>
    <title>Do you need consistent environments between the client, scheduler and workers?</title>
    <updated>2023-04-14T00:00:00+00:00</updated>
    <author>
      <name>Florian Jetter</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;Update May 3rd 2023: &lt;a class="reference external" href="https://github.com/dask/dask-blog/pull/166"&gt;Clarify GPU recommendations&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;With the release &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;2023.4.0&lt;/span&gt;&lt;/code&gt; of dask and distributed we are making a change which may require the Dask scheduler to have consistent software and hardware capabilities as the client and workers.&lt;/p&gt;
&lt;p&gt;It has always been recommended that your client and workers have a consistent software and hardware environment so that data structures and dependencies can be pickled and passed between them. However recent changes to the Dask scheduler mean that we now also require your scheduler to have the same consistent environment as everything else.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/14/scheduler-environment-requirements.md&lt;/span&gt;, line 15)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-does-this-mean-for-me"&gt;

&lt;p&gt;For most users, this change should go unnoticed as it is common to run all Dask components in the same conda environment or docker image and typically on homogenous machines.&lt;/p&gt;
&lt;p&gt;However, for folks who may have optimized their schedulers to use cut-down environments, or for users with specialized hardware such as GPUs available on their client/workers but not the scheduler there may be some impact.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/14/scheduler-environment-requirements.md&lt;/span&gt;, line 21)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-will-the-impact-be"&gt;
&lt;h1&gt;What will the impact be?&lt;/h1&gt;
&lt;p&gt;If you run into errors such as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;quot;RuntimeError:&lt;/span&gt; &lt;span class="pre"&gt;Error&lt;/span&gt; &lt;span class="pre"&gt;during&lt;/span&gt; &lt;span class="pre"&gt;deserialization&lt;/span&gt; &lt;span class="pre"&gt;of&lt;/span&gt; &lt;span class="pre"&gt;the&lt;/span&gt; &lt;span class="pre"&gt;task&lt;/span&gt; &lt;span class="pre"&gt;graph.&lt;/span&gt; &lt;span class="pre"&gt;This&lt;/span&gt; &lt;span class="pre"&gt;frequently&lt;/span&gt; &lt;span class="pre"&gt;occurs&lt;/span&gt; &lt;span class="pre"&gt;if&lt;/span&gt; &lt;span class="pre"&gt;the&lt;/span&gt; &lt;span class="pre"&gt;Scheduler&lt;/span&gt; &lt;span class="pre"&gt;and&lt;/span&gt; &lt;span class="pre"&gt;Client&lt;/span&gt; &lt;span class="pre"&gt;have&lt;/span&gt; &lt;span class="pre"&gt;different&lt;/span&gt; &lt;span class="pre"&gt;environments.&amp;quot;&lt;/span&gt;&lt;/code&gt; please ensure your software environment is consistent between your client, scheduler and workers.&lt;/p&gt;
&lt;p&gt;If you are passing GPU objects between the client and workers we now recommend that your scheduler has a GPU too. This recommendation is just so that GPU-backed objects contained in Dask graphs can be deserialized on the scheduler if necessary. Typically the GPU available to the scheduler doesn’t need to be as powerful as long as it has &lt;a class="reference external" href="https://en.wikipedia.org/wiki/CUDA#GPUs_supported"&gt;similar CUDA compute capabilities&lt;/a&gt;. For example for cost optimization reasons you may want to use A100s on your client and workers and a T4 on your scheduler.&lt;/p&gt;
&lt;p&gt;Users who do not have a GPU on the client and are leveraging GPU workers shouldn’t run into this as the GPU objects will only exist on the workers.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/14/scheduler-environment-requirements.md&lt;/span&gt;, line 29)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-are-we-doing-this"&gt;
&lt;h1&gt;Why are we doing this?&lt;/h1&gt;
&lt;p&gt;The reason we now suggest that you have the same hardware/software capabilities on the scheduler is that we are giving the scheduler the ability to deserialize graphs before distributing them to the workers. This will allow the scheduler to make smarter scheduling decisions in the future by having a better understanding of the operation it is performing.&lt;/p&gt;
&lt;p&gt;The downside to this is that graphs can contain complex Python objects created by any number of dependencies on the client side, so in order for the scheduler to deserialize them it needs to have the same libraries installed. Equally, if the client-side packages create GPU objects then the scheduler will also need one.&lt;/p&gt;
&lt;p&gt;We are sure you’ll agree that this breakage for a small percentage of users will be worth it for the long-term improvements to Dask.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/04/14/scheduler-environment-requirements/"/>
    <summary>Update May 3rd 2023: Clarify GPU recommendations.</summary>
    <category term="IO" label="IO"/>
    <category term="dataframe" label="dataframe"/>
    <published>2023-04-14T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/04/12/from-map/</id>
    <title>Deep Dive into creating a Dask DataFrame Collection with from_map</title>
    <updated>2023-04-12T00:00:00+00:00</updated>
    <author>
      <name>Rick Zamora</name>
    </author>
    <content type="html">&lt;p&gt;&lt;a class="reference external" href="https://docs.dask.org/en/stable/dataframe.html"&gt;Dask DataFrame&lt;/a&gt; provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_csv&lt;/span&gt;&lt;/code&gt;) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, &lt;a class="reference external" href="https://docs.dask.org/en/stable/generated/dask.dataframe.from_map.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; now makes custom collection creation both easier and more performant.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 11)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-is-from-map"&gt;

&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API was added to Dask DataFrame in v2022.05.1 with the intention of replacing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_delayed&lt;/span&gt;&lt;/code&gt; as the recommended means of custom DataFrame creation. At its core, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; simply converts each element of an iterable object (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;inputs&lt;/span&gt;&lt;/code&gt;) into a distinct Dask DataFrame partition, using a common function (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The overall behavior is essentially the Dask DataFrame equivalent of the standard-Python &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that both &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map&lt;/span&gt;&lt;/code&gt; actually support an arbitrary number of iterable inputs. However, we will only focus on the use of a single iterable argument in this post.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 27)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="a-simple-example"&gt;
&lt;h1&gt;A simple example&lt;/h1&gt;
&lt;p&gt;To better understand the behavior of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;, let’s consider the simple case that we want to interact with Feather-formatted data created with the following Pandas code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./data.0.feather&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;./data.1.feather&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;xyz&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;index&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Since Dask does not yet offer a dedicated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; function (as of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-2023.3.1&lt;/span&gt;&lt;/code&gt;), most users would assume that the only option to create a Dask DataFrame collection is to use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt;. The “best practice” for creating a collection in this case, however, is to wrap &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pd.read_feather&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf.read_feather&lt;/span&gt;&lt;/code&gt; in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; call like so:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   A       B  index&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;               int64  object  int64&lt;/span&gt;
&lt;span class="go"&gt;                 ...     ...    ...&lt;/span&gt;
&lt;span class="go"&gt;                 ...     ...    ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: read_feather, 1 graph layer&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Which produces the following Pandas (or cuDF) object after computation:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;   A  B  index&lt;/span&gt;
&lt;span class="go"&gt;0  0  x      0&lt;/span&gt;
&lt;span class="go"&gt;1  0  y      1&lt;/span&gt;
&lt;span class="go"&gt;2  0  z      2&lt;/span&gt;
&lt;span class="go"&gt;0  1  x      3&lt;/span&gt;
&lt;span class="go"&gt;1  1  y      4&lt;/span&gt;
&lt;span class="go"&gt;2  1  z      5&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Although the same output can be achieved using the conventional &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dd.from_delayed&lt;/span&gt;&lt;/code&gt; strategy, using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; will improve the available opportunities for task-graph optimization within Dask.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 74)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="performance-considerations-specifying-meta-and-divisions"&gt;
&lt;h1&gt;Performance considerations: Specifying &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;Although &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;iterable&lt;/span&gt;&lt;/code&gt; are the only &lt;em&gt;required&lt;/em&gt; arguments to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;, one can significantly improve the overall performance of a workflow by specifying optional arguments like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Due to the lazy nature of Dask DataFrame, each collection is required to carry around a schema (column name and dtype information) in the form of an empty Pandas (or cuDF) object. If &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; is not directly provided to the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; function, the schema will need to be populated by eagerly materializing the first partition, which can increase the apparent latency of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API call itself. For this reason, it is always recommended to specify an explicit &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; argument if the expected column names and dtypes are known a priori.&lt;/p&gt;
&lt;p&gt;While passing in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; argument is likely to reduce the&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API call latency, passing in a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt; argument makes it possible to reduce the end-to-end compute time. This is because, by specifying &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt;, we are allowing Dask DataFrame to track useful per-partition min/max statistics. Therefore, if the overall workflow involves grouping or joining on the index, Dask can avoid the need to perform unnecessary shuffling operations.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 82)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="using-from-map-to-implement-a-custom-api"&gt;
&lt;h1&gt;Using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; to implement a custom API&lt;/h1&gt;
&lt;p&gt;Although it is currently difficult to automatically extract division information from the metadata of an arbitrary Feather dataset, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; makes it relatively easy to implement your own highly-functional &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; API using &lt;a class="reference external" href="https://arrow.apache.org/docs/python/index.html"&gt;PyArrow&lt;/a&gt;. For example, the following code is all that one needs to enable lazy Feather IO with both column projection and index selection:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;(Optional) Utility to enforce &amp;#39;backend&amp;#39; configuration&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cudf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cudf&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Create a Dask DataFrame from Feather files&lt;/span&gt;

&lt;span class="sd"&gt;    Example of a &amp;quot;custom&amp;quot; `from_map` IO function&lt;/span&gt;

&lt;span class="sd"&gt;    Parameters&lt;/span&gt;
&lt;span class="sd"&gt;    ----------&lt;/span&gt;
&lt;span class="sd"&gt;    paths: list&lt;/span&gt;
&lt;span class="sd"&gt;        List of Feather-formatted paths. Each path will&lt;/span&gt;
&lt;span class="sd"&gt;        be mapped to a distinct DataFrame partition.&lt;/span&gt;
&lt;span class="sd"&gt;    columns: list or None, default None&lt;/span&gt;
&lt;span class="sd"&gt;        Optional list of columns to select from each file.&lt;/span&gt;
&lt;span class="sd"&gt;    index: str or None, default None&lt;/span&gt;
&lt;span class="sd"&gt;        Optional column name to set as the DataFrame index.&lt;/span&gt;

&lt;span class="sd"&gt;    Returns&lt;/span&gt;
&lt;span class="sd"&gt;    -------&lt;/span&gt;
&lt;span class="sd"&gt;    dask.dataframe.DataFrame&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pyarrow.dataset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;ds&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 1: Extract `meta` from the dataset&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;feather&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_table&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Define the `func` argument&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a Pandas DataFrame from a dataset fragment&lt;/span&gt;
        &lt;span class="c1"&gt;# NOTE: In practice, this function should&lt;/span&gt;
        &lt;span class="c1"&gt;# always be defined outside `read_feather`&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;read_columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Define the `iterable` argument&lt;/span&gt;
    &lt;span class="n"&gt;iterable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_fragments&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Call `from_map`&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# `func` kwarg&lt;/span&gt;
        &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# `func` kwarg&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here we see that using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; to enable completely-lazy collection creation only requires four steps. First, we use &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow.dataset&lt;/span&gt;&lt;/code&gt; to define a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; argument for &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;, so that we can avoid the unnecessary overhead of an eager read operation. For some file formats and/or applications, it may also be possible to calculate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;divisions&lt;/span&gt;&lt;/code&gt; at this point. However, as explained above, such information is not readily available for this particular example.&lt;/p&gt;
&lt;p&gt;The second step is to define the underlying function (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;) that we will use to produce each of our final DataFrame partitions. Third, we define one or more iterable objects containing the unique information needed to produce each partition (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;iterable&lt;/span&gt;&lt;/code&gt;). In this case, the only iterable object corresponds to a generator of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow.dataset&lt;/span&gt;&lt;/code&gt; fragments, which is essentially a wrapper around the input path list.&lt;/p&gt;
&lt;p&gt;The fourth and final step is to use the final &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;interable&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;meta&lt;/span&gt;&lt;/code&gt; information to call the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; API. Note that we also use this opportunity to specify additional key-word arguments, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;columns&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;index&lt;/span&gt;&lt;/code&gt;. In contrast to the iterable positional arguments, which are always mapped to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt;, these key-word arguments will be broadcasted.&lt;/p&gt;
&lt;p&gt;Using the&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; implementation above, it becomes both easy and efficient to convert an arbitrary Feather dataset into a lazy Dask DataFrame collection:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;index&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;Dask DataFrame Structure:&lt;/span&gt;
&lt;span class="go"&gt;                   A&lt;/span&gt;
&lt;span class="go"&gt;npartitions=2&lt;/span&gt;
&lt;span class="go"&gt;               int64&lt;/span&gt;
&lt;span class="go"&gt;                 ...&lt;/span&gt;
&lt;span class="go"&gt;                 ...&lt;/span&gt;
&lt;span class="go"&gt;Dask Name: func, 1 graph layer&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;       A&lt;/span&gt;
&lt;span class="go"&gt;index&lt;/span&gt;
&lt;span class="go"&gt;0      0&lt;/span&gt;
&lt;span class="go"&gt;1      0&lt;/span&gt;
&lt;span class="go"&gt;2      0&lt;/span&gt;
&lt;span class="go"&gt;3      1&lt;/span&gt;
&lt;span class="go"&gt;4      1&lt;/span&gt;
&lt;span class="go"&gt;5      1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 183)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="advanced-enhancing-column-projection"&gt;
&lt;h1&gt;Advanced: Enhancing column projection&lt;/h1&gt;
&lt;p&gt;Although a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; implementation like the one above is likely to meet the basic needs of most applications, it is certainly possible that users will often leave out the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;column&lt;/span&gt;&lt;/code&gt; argument in practice. For example:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For code like this, as the implementation currently stands, each IO task would be forced to read in an entire Feather file, and then select the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;”A”&lt;/span&gt;&lt;/code&gt; column from a Pandas/cuDF DataFrame only after it had already been read into memory. The additional overhead is insignificant for the toy-dataset used here. However, avoiding this kind of unnecessary IO can lead to dramatic performance improvements in real-world applications.&lt;/p&gt;
&lt;p&gt;So, how can we modify our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_feather&lt;/span&gt;&lt;/code&gt; implementation to take advantage of external column-projection operations (like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ddf[&amp;quot;A&amp;quot;]&lt;/span&gt;&lt;/code&gt;)? The good news is that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; is already equipped with the necessary graph-optimization hooks to handle this, so long as the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;func&lt;/span&gt;&lt;/code&gt; object satisfies the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DataFrameIOFunction&lt;/span&gt;&lt;/code&gt; protocol:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@runtime_checkable&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;DataFrameIOFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;DataFrame IO function with projectable columns&lt;/span&gt;
&lt;span class="sd"&gt;    Enables column projection in ``DataFrameIOLayer``.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return the current column projection&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;project_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return a new DataFrameIOFunction object&lt;/span&gt;
&lt;span class="sd"&gt;        with a new column projection&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return a new DataFrame partition&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;That is, all we need to do is change “Step 2” of our implementation to use the following code instead:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe.io.utils&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrameIOFunction&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ReadFeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DataFrameIOFunction&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Create a Pandas/cuDF DataFrame from a dataset fragment&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;

        &lt;span class="nd"&gt;@property&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_columns&lt;/span&gt;

        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;project_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Replace this object with one that will only read `columns`&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ReadFeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;

        &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Same logic as original `func`&lt;/span&gt;
            &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;read_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;read_columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

    &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReadFeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/04/12/from-map.md&lt;/span&gt;, line 251)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;It is now easier than ever to create a Dask DataFrame collection from an arbitrary data source. Although the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.delayed&lt;/span&gt;&lt;/code&gt; API has already enabled similar functionality for many years, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt; now makes it possible to implement a custom IO function without sacrificing any of the high-level graph optimizations leveraged by the rest of the Dask DataFrame API.&lt;/p&gt;
&lt;p&gt;Start experimenting with &lt;a class="reference external" href="https://docs.dask.org/en/stable/generated/dask.dataframe.from_map.html"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_map&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; today, and let us know how it goes!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/04/12/from-map/"/>
    <summary>Dask DataFrame provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g read_csv) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, from_map now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, from_map now makes custom collection creation both easier and more performant.</summary>
    <category term="IO" label="IO"/>
    <category term="dataframe" label="dataframe"/>
    <published>2023-04-12T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/03/15/shuffling-large-data-at-constant-memory/</id>
    <title>Shuffling large data at constant memory in Dask</title>
    <updated>2023-03-15T00:00:00+00:00</updated>
    <author>
      <name>Hendrik Makait</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This work was engineered and supported by &lt;a class="reference external" href="https://coiled.io?utm_source=dask-blog&amp;amp;amp;utm_medium=shuffling-large-data-at-constant-memory"&gt;Coiled&lt;/a&gt;. In particular, thanks to &lt;a class="reference external" href="https://github.com/fjetter"&gt;Florian Jetter&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/gjoseph92"&gt;Gabe Joseph&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/hendrikmakait"&gt;Hendrik Makait&lt;/a&gt;, and &lt;a class="reference external" href="https://matthewrocklin.com/?utm_source=dask-blog&amp;amp;amp;utm_medium=shuffling-large-data-at-constant-memory"&gt;Matt Rocklin&lt;/a&gt;. Original version of this post appears on &lt;a class="reference external" href="https://blog.coiled.io/blog/shuffling-large-data-at-constant-memory.html?utm_source=dask-blog&amp;amp;amp;utm_medium=shuffling-large-data-at-constant-memory"&gt;blog.coiled.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;With release &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;2023.2.1&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe&lt;/span&gt;&lt;/code&gt; introduces a new shuffling method called P2P, making sorts, merges, and joins faster and using constant memory.
Benchmarks show impressive improvements:&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="P2P shuffling uses constant memory while task-based shuffling scales linearly." src="/images/p2p_benchmark_memory_usage.png" style="width: 100%;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;P2P shuffling (blue) uses constant memory while task-based shuffling (orange) scales linearly.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This article describes the problem, the new solution, and the impact on performance.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 24)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-is-shuffling"&gt;

&lt;p&gt;Shuffling is a key primitive in data processing systems.
It is used whenever we move a dataset around in an all-to-all fashion, such as occurs in sorting, dataframe joins, or array rechunking.
Shuffling is a challenging computation to run efficiently, with lots of small tasks sharding the data.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 30)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="task-based-shuffling-scales-poorly"&gt;
&lt;h1&gt;Task-based shuffling scales poorly&lt;/h1&gt;
&lt;p&gt;While systems like distributed databases and Apache Spark use a dedicated shuffle service to move data around the cluster, Dask builds on task-based scheduling.
Task-based systems are more flexible for general-purpose parallel computing but less suitable for all-to-all shuffling.
This results in three main issues:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduler strain:&lt;/strong&gt; The scheduler hangs due to the sheer amount of tasks required for shuffling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full dataset materialization:&lt;/strong&gt; Workers materialize the entire dataset causing the cluster to run out of memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Many small operations:&lt;/strong&gt; Intermediate tasks are too small and bring down hardware performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Together, these issues make large-scale shuffles inefficient, causing users to over-provision clusters.
Fortunately, we can design a system to address these concerns.
Early work on this started back &lt;a class="reference external" href="https://www.coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept?utm_source=dask-blog&amp;amp;amp;utm_medium=shuffling-large-data-at-constant-memory"&gt;in 2021&lt;/a&gt; and has now matured into the P2P shuffling system:&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 44)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="p2p-shuffling"&gt;
&lt;h1&gt;P2P shuffling&lt;/h1&gt;
&lt;p&gt;With release &lt;a class="reference external" href="https://distributed.dask.org/en/stable/changelog.html#v2023-2-1"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;2023.2.1&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;, Dask introduces a new shuffle method called P2P, making sorts, merges, and joins run faster and in constant memory.&lt;/p&gt;
&lt;p&gt;This system is designed with three aspects, mirroring the problems listed above:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Peer-to-Peer communication:&lt;/strong&gt; Reduce the involvement of the scheduler&lt;/p&gt;
&lt;p&gt;Shuffling becomes an O(n) operation from the scheduler’s perspective, removing a key bottleneck.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="P2P shuffling uses fewer tasks than task-based shuffling." src="/images/shuffling_tasks.png" style="width: 100%;"/&gt;
&lt;/figure&gt;
&lt;ol class="arabic" start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk by default:&lt;/strong&gt; Store data as it arrives on disk, efficiently.&lt;/p&gt;
&lt;p&gt;Dask can now shuffle datasets of arbitrary size in small memory, reducing the need to fiddle with right-sizing clusters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Buffer network and disk with memory:&lt;/strong&gt; Avoid many small writes by buffering sensitive hardware with in-memory stores.&lt;/p&gt;
&lt;p&gt;Shuffling involves CPU, network, and disk, each of which brings its own bottlenecks when dealing with many small pieces of data.
We use memory judiciously to trade off between these bottlenecks, balancing network and disk I/O to maximize the overall throughput.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In addition to these three aspects, P2P shuffling implements many minor optimizations to improve performance, and it relies on &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow&amp;gt;=7.0&lt;/span&gt;&lt;/code&gt; for efficient data handling and (de)serialization.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 69)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="results"&gt;
&lt;h1&gt;Results&lt;/h1&gt;
&lt;p&gt;To evaluate P2P shuffling, we benchmark against task-based shuffling on common workloads from our benchmark suite.
For more information on this benchmark suite, see the &lt;a class="reference external" href="https://github.com/coiled/coiled-runtime/tree/main/tests/benchmarks"&gt;GitHub repository&lt;/a&gt; or the &lt;a class="reference external" href="https://benchmarks.coiled.io/"&gt;latest test results&lt;/a&gt;.&lt;/p&gt;
&lt;section id="memory-stability"&gt;
&lt;h2&gt;Memory stability&lt;/h2&gt;
&lt;p&gt;The biggest benefit of P2P shuffling is constant memory usage.
Memory usage drops and stays constant across all workloads:&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="P2P shuffling uses significantly less memory than task-based shuffling on all workloads." src="/images/p2p_benchmark_suite.png" style="width: 100%;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;P2P shuffling (blue) uses constant memory while task-based shuffling (orange) scales linearly.&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;For the tested workloads, we saw up to &lt;strong&gt;10x lower memory&lt;/strong&gt;.
For even larger workloads, this gap only increases.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="performance-and-speed"&gt;
&lt;h2&gt;Performance and Speed&lt;/h2&gt;
&lt;p&gt;In the above plot, we can two performance improvements:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster execution:&lt;/strong&gt; Workloads run up to 45% faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quicker startup:&lt;/strong&gt; Smaller graphs mean P2P shuffling starts sooner.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Comparing scheduler dashboards for P2P and task-based shuffling side-by-side
helps to understand what causes these performance gains:&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="Task-based shuffling shows bigger gaps in the task stream and spilling compared to P2P shuffling." src="/images/dashboard-comparison.png" style="width: 100%;"/&gt;
&lt;figcaption&gt;
&lt;em&gt;Task-based shuffling (left) shows bigger gaps in the task stream and spilling compared to P2P shuffling (right).&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph size:&lt;/strong&gt; 10x fewer tasks are faster to deploy and easier on the scheduler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Controlled I/O:&lt;/strong&gt; P2P shuffling handles I/O explicitly, which avoids less performant spilling by Dask.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fewer interruptions:&lt;/strong&gt; I/O is well balanced, so we see fewer gaps in work in the task stream.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overall, P2P shuffling leverages our hardware far more effectively.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 112)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="changing-defaults"&gt;
&lt;h1&gt;Changing defaults&lt;/h1&gt;
&lt;p&gt;These results benefit most users, so P2P shuffling is now the default starting from release &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;2023.2.1&lt;/span&gt;&lt;/code&gt;,
as long as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pyarrow&amp;gt;=7.0.0&lt;/span&gt;&lt;/code&gt; is installed. Dataframe operations like the following will benefit:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.set_index(...)&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.merge(...)&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df.groupby(...).apply(...)&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;section id="share-your-experience"&gt;
&lt;h2&gt;Share your experience&lt;/h2&gt;
&lt;p&gt;To understand how P2P performs on the vast range of workloads out there in the wild, and to ensure that making it the default was the right choice, we would love to learn about your experiences running it.
We have opened a &lt;a class="reference external" href="https://github.com/dask/distributed/discussions/7509"&gt;discussion issue&lt;/a&gt; on GitHub for feedback on this change.
Please let us know how it helps (or doesn’t).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="keep-old-behavior"&gt;
&lt;h2&gt;Keep old behavior&lt;/h2&gt;
&lt;p&gt;If P2P shuffling does not work for you, you can deactivate it by setting the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dataframe.shuffle.method&lt;/span&gt;&lt;/code&gt; config value to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;quot;tasks&amp;quot;&lt;/span&gt;&lt;/code&gt; or explicitly setting a keyword argument, for example:&lt;/p&gt;
&lt;p&gt;with the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;yaml&lt;/span&gt;&lt;/code&gt; config&lt;/p&gt;
&lt;div class="highlight-yaml notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;tasks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;or when using a cluster manager&lt;/p&gt;
&lt;div class="highlight-python3 notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;

&lt;span class="c1"&gt;# The dataframe.shuffle.method config is available since 2023.3.1&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.shuffle.method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;tasks&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
    &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# many cluster managers send current dask config automatically&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For more information on deactivating P2P shuffle, see the &lt;a class="reference external" href="https://github.com/dask/distributed/discussions/7509"&gt;discussion #7509&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 152)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="what-about-arrays"&gt;
&lt;h1&gt;What about arrays?&lt;/h1&gt;
&lt;p&gt;While the original motivation was to optimize large-scale dataframe joins, the P2P system is useful for all problems requiring lots of communication between tasks.
For array workloads, this often occurs when rechunking data, such as when organizing a matrix by rows when it was stored by columns.
Similar to dataframe joins, array rechunking has been inefficient in the past, which has become such a problem that the array community built specialized tools like &lt;a class="reference external" href="https://github.com/pangeo-data/rechunker"&gt;rechunker&lt;/a&gt; to avoid it entirely.&lt;/p&gt;
&lt;p&gt;There is a &lt;a class="reference external" href="https://github.com/dask/distributed/pull/7534"&gt;naive implementation&lt;/a&gt; of array rechunking using the P2P system available for experimental use.
Benchmarking this implementation shows mixed results:&lt;/p&gt;
&lt;figure class="align-center"&gt;
&lt;img alt="P2P rechunking uses constant memory." src="/images/p2p_benchmark_arrays.png" style="width: 100%;"/&gt;
&lt;/figure&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;👍 &lt;strong&gt;Constant memory use:&lt;/strong&gt; As with dataframe operations, memory use is constant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;❓ &lt;strong&gt;Variable runtime:&lt;/strong&gt; The runtime of workloads may increase with P2P.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;👎 &lt;strong&gt;Memory overhead:&lt;/strong&gt; There is a large memory overhead for many small partitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The constant memory use is a very promising result.
There are several ways to tackle the current limitations.
We expect this to improve as we work with collaborators from the array computing community.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 173)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="next-steps"&gt;
&lt;h1&gt;Next steps&lt;/h1&gt;
&lt;p&gt;Development on P2P shuffling is not done yet.
For the future we plan the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.array&lt;/span&gt;&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;While the early prototype of array rechunking is promising, it’s not there yet.
We plan to do the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Intelligently select which algorithm to use (task-based rechunking is better sometimes)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Work with collaborators on rechunking to improve performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seek out other use cases, like &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;map_overlap&lt;/span&gt;&lt;/code&gt;, where this might be helpful&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failure recovery&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Make P2P shuffling resilient to worker loss; currently, it has to restart entirely.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance tuning&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Performance today is good, but not yet at peak hardware speeds.
We can improve this in a few ways:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Hiding disk I/O&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using more memory when appropriate for smaller shuffles&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improved batching of small operations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More efficient serialization&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To follow along with the development, subscribe to the &lt;a class="reference external" href="https://github.com/dask/distributed/issues/7507"&gt;tracking issue&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/03/15/shuffling-large-data-at-constant-memory.md&lt;/span&gt;, line 203)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="summary"&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;Shuffling data is a common operation in dataframe workloads.
Since &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;2023.2.1&lt;/span&gt;&lt;/code&gt;, Dask defaults to P2P shuffling for distributed clusters and shuffles data faster and at constant memory.
This improvement unlocks previously un-runnable workload sizes and efficiently uses your clusters.
Finally, P2P shuffling demonstrates extending Dask to add new paradigms while leveraging the foundations of its distributed engine.&lt;/p&gt;
&lt;p&gt;Share results in &lt;a class="reference external" href="https://github.com/dask/distributed/issues/7507"&gt;this discussion thread&lt;/a&gt; or follow development at this &lt;a class="reference external" href="https://github.com/dask/distributed/issues/7507"&gt;tracking issue&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/03/15/shuffling-large-data-at-constant-memory/"/>
    <summary>This work was engineered and supported by Coiled. In particular, thanks to Florian Jetter, Gabe Joseph, Hendrik Makait, and Matt Rocklin. Original version of this post appears on blog.coiled.io</summary>
    <category term="dask" label="dask"/>
    <category term="distributed" label="distributed"/>
    <category term="p2p" label="p2p"/>
    <category term="shuffling" label="shuffling"/>
    <published>2023-03-15T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/02/13/dask-on-flyte/</id>
    <title>Managing dask workloads with Flyte</title>
    <updated>2023-02-13T00:00:00+00:00</updated>
    <author>
      <name>Bernhard Stadlbauer</name>
    </author>
    <content type="html">&lt;p&gt;It is now possible to manage &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; workloads using &lt;a class="reference external" href="https://flyte.org/"&gt;Flyte&lt;/a&gt; 🎉!&lt;/p&gt;
&lt;p&gt;The major advantages are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Each &lt;a class="reference external" href="https://docs.flyte.org/projects/flytekit/en/latest/generated/flytekit.task.html"&gt;Flyte &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;task&lt;/span&gt;&lt;/code&gt;&lt;/a&gt; spins up its own ephemeral &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; cluster using a Docker image tailored to the task, ensuring consistency in the Python environment across the client, scheduler, and workers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flyte will use the existing &lt;a class="reference external" href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; infrastructure to spin up &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; clusters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spot/Preemtible instances are natively supported.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The whole &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; task can be cached.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enabling &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; support in an already running Flyte setup can be done in just a few minutes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what a Flyte &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;task&lt;/span&gt;&lt;/code&gt; backed by a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; cluster with four workers looks like:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;typing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;flytekit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Resources&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;flytekitplugins.dask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkerGroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;


&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Dask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WorkerGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;number_of_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;32Gi&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;increment_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This task can run locally using a standard &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;distributed.Client()&lt;/span&gt;&lt;/code&gt; and can scale to arbitrary cluster sizes once &lt;a class="reference external" href="https://docs.flyte.org/projects/cookbook/en/latest/getting_started/package_register.html"&gt;registered with Flyte&lt;/a&gt;.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/13/dask-on-flyte.md&lt;/span&gt;, line 52)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="what-is-flyte"&gt;

&lt;p&gt;&lt;a class="reference external" href="https://flyte.org/"&gt;Flyte&lt;/a&gt; is a Kubernetes native workflow orchestration engine. Originally developed at &lt;a class="reference external" href="https://www.lyft.com/"&gt;Lyft&lt;/a&gt;, it is now an open-source (&lt;a class="reference external" href="https://github.com/flyteorg/flyte"&gt;Github&lt;/a&gt;) and a graduate project under the Linux Foundation. It stands out among similar tools such as &lt;a class="reference external" href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt; or &lt;a class="reference external" href="https://argoproj.github.io/"&gt;Argo&lt;/a&gt; due to its key features, which include:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Caching/Memoization of previously executed tasks for improved performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes native&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workflow definitions in Python, not e.g.,&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;YAML&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong typing between tasks and workflows using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;Protobuf&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dynamic generation of workflow DAGs at runtime&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ability to run workflows locally&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A simple workflow would look something like the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;typing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;flytekit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Resources&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;flytekitplugins.dask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkerGroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;


&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Dask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WorkerGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;number_of_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;32Gi&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;expensive_data_preparation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Expensive, highly parallel `dask` code&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Some large DataFrame, Flyte will handle serialization&lt;/span&gt;


&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Model training, can also use GPU, etc.&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;s3://path-to-model&amp;quot;&lt;/span&gt;


&lt;span class="nd"&gt;@workflow&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prepared_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expensive_data_preparation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prepared_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In the above, both &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;expensive_data_preparation()&lt;/span&gt;&lt;/code&gt; as well as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;train()&lt;/span&gt;&lt;/code&gt; would be run in their own Pod(s) in Kubernetes, while the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;train_model()&lt;/span&gt;&lt;/code&gt; workflow is a DSL which creates a Directed Acyclic Graph (DAG) of the workflow. It will determine the order of tasks based on their inputs and outputs. Input and output types (based on the type hints) will be validated at registration time to avoid surprises at runtime.&lt;/p&gt;
&lt;p&gt;After registration with Flyte, the workflow can be started from the UI:&lt;/p&gt;
&lt;img src="/images/dask-flyte-workflow.png" alt="Dask workflow in the Flyte UI" style="max-width: 100%;" width="100%" /&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/13/dask-on-flyte.md&lt;/span&gt;, line 110)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-use-the-dask-plugin-for-flyte"&gt;
&lt;h1&gt;Why use the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; plugin for Flyte?&lt;/h1&gt;
&lt;p&gt;At first glance, Flyte and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; look similar in what they are trying to achieve, both capable of creating a DAG from user functions, managing inputs and outputs, etc. However, the major conceptual difference lies in their approach. While &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; has long-lived workers to run tasks, a Flyte task is a designated Kubernetes Pod that creates a significant overhead in task-runtime.&lt;/p&gt;
&lt;p&gt;While &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; tasks incur an overhead of around one millisecond (refer to the &lt;a class="reference external" href="https://distributed.dask.org/en/stable/efficiency.html#use-larger-tasks"&gt;docs&lt;/a&gt;), spinning up a new Kubernetes pod takes several seconds. The long-lived nature of the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; workers allows for optimization of the DAG, running tasks that operate on the same data on the same node, reducing the need for inter-worker data serialization (known as shuffling). With Flyte tasks being ephemeral, this optimization is not possible, and task outputs are serialized to a blob storage instead.&lt;/p&gt;
&lt;p&gt;Given the limitations discussed above, why use Flyte? Flyte is not intended to replace tools such as dask or Apache Spark, but rather provides an orchestration layer on top. While workloads can be run directly in Flyte, such as training a single GPU model, Flyte offers &lt;a class="reference external" href="https://flyte.org/integrations"&gt;numerous integrations&lt;/a&gt; with other popular data processing tools.&lt;/p&gt;
&lt;p&gt;With Flyte managing the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; cluster lifecycle, each &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; Flyte task will run on its own dedicated &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; cluster made up of Kubernetes pods. When the Flyte task is triggered from the UI, Flyte will spin up a dask cluster tailored to the task, which will then be used to execute the user code. This enables the use of different Docker images with varying dependencies for different tasks, whilst always ensuring that the dependencies of the client, scheduler, and workers are consistent.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/13/dask-on-flyte.md&lt;/span&gt;, line 120)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="what-prerequisites-are-required-to-run-dask-tasks-in-flyte"&gt;
&lt;h1&gt;What prerequisites are required to run &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; tasks in Flyte?&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;The Kubernetes cluster needs to have the &lt;a class="reference external" href="https://kubernetes.dask.org/en/latest/operator.html"&gt;dask operator&lt;/a&gt; installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flyte version &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;1.3.0&lt;/span&gt;&lt;/code&gt; or higher is required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; plugin needs to be enabled in the Flyte propeller config. (refer to the &lt;a class="reference external" href="https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#specify-plugin-configuration"&gt;docs&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Docker image associated with the task must have the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;flytekitplugins-dask&lt;/span&gt;&lt;/code&gt; package installed in its Python environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/13/dask-on-flyte.md&lt;/span&gt;, line 127)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="how-do-things-work-under-the-hood"&gt;
&lt;h1&gt;How do things work under the hood?&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Note: The following is for reference only and is not necessary for users who only use the plugin. However, it could be useful for easier debugging.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On a high-level overview, the following steps occur when a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; task is initiated in Flyte:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;FlyteWorkflow&lt;/span&gt;&lt;/code&gt; &lt;a class="reference external" href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"&gt;Custom Resource (CR)&lt;/a&gt; is created in Kubernetes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flyte Propeller, a &lt;a class="reference external" href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/"&gt;Kubernetes Operator&lt;/a&gt;), detects the creation of the workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator inspects the task’s spec and identifies it as a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; task. It verifies if it has the required plugin associated with it and locates the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; plugin.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; plugin within Flyte Propeller picks up the task defintion and creates a &lt;a class="reference external" href="https://kubernetes.dask.org/en/latest/operator_resources.html#daskjob"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DaskJob&lt;/span&gt;&lt;/code&gt; Custom Resource&lt;/a&gt; using the &lt;a class="reference external" href="https://github.com/bstadlbauer/dask-k8s-operator-go-client/"&gt;dask-k8s-operator-go-client&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://kubernetes.dask.org/en/latest/operator.html"&gt;dask operator&lt;/a&gt; picks up the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DaskJob&lt;/span&gt;&lt;/code&gt; resource and runs the job accordingly. It spins up a pod to run the client/job-runner, one for the scheduler, and additional worker pods as designated in the Flyte task decorator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;While the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; task is running, Flyte Propeller continuously monitors the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;DaskJob&lt;/span&gt;&lt;/code&gt; resource, waiting on it to report success or failure. Once the job has finished or the Flyte task has been terminated, all &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask&lt;/span&gt;&lt;/code&gt; related resources will be cleaned up.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/13/dask-on-flyte.md&lt;/span&gt;, line 140)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="useful-links"&gt;
&lt;h1&gt;Useful links&lt;/h1&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.flyte.org/en/latest/"&gt;Flyte documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://flyte.org/community"&gt;Flyte community&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.flyte.org/projects/cookbook/en/latest/auto/integrations/kubernetes/k8s_dask/index.html"&gt;flytekitplugins-dask user documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html"&gt;flytekitplugins-dask deployment documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://kubernetes.dask.org/en/latest/"&gt;dask-kubernetes documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="http://blog.dask.org/2022/11/09/dask-kubernetes-operator"&gt;Blog post on the dask kubernetes operator&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In case there are any questions or concerns, don’t hesitate to reach out. You can contact Bernhard Stadlbauer via the &lt;a class="reference external" href="https://slack.flyte.org/"&gt;Flyte Slack&lt;/a&gt; or via &lt;a class="reference external" href="https://github.com/bstadlbauer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I would like to give shoutouts to &lt;a class="reference external" href="https://jacobtomlinson.dev/"&gt;Jacob Tomlinson&lt;/a&gt; (Dask) and &lt;a class="reference external" href="https://github.com/hamersaw"&gt;Dan Rammer&lt;/a&gt; (Flyte) for all of the help I’ve received. This would not have been possible without your support!&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/02/13/dask-on-flyte/"/>
    <summary>It is now possible to manage dask workloads using Flyte 🎉!</summary>
    <category term="Flyte" label="Flyte"/>
    <category term="Kubernetes" label="Kubernetes"/>
    <category term="Python" label="Python"/>
    <category term="dask" label="dask"/>
    <published>2023-02-13T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2023/02/02/easy-cpu-gpu/</id>
    <title>Easy CPU/GPU Arrays and Dataframes</title>
    <updated>2023-02-02T00:00:00+00:00</updated>
    <author>
      <name>Ben Zaitlen (NVIDIA)</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;This article was originally posted on the &lt;a class="reference external" href="https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d"&gt;RAPIDS blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It’s now easy to switch between CPU (NumPy / Pandas) and GPU (CuPy / cuDF) in Dask.
As of Dask 2022.10.0, users can optionally select the backend engine for input IO and data creation. In the short-term, the goal of the backend-configuration system is to enable Dask users to write code that will run on both CPU and GPU systems.&lt;/p&gt;
&lt;p&gt;The preferred backend can be configured using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;array.backend&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dataframe.backend&lt;/span&gt;&lt;/code&gt; options with the standard Dask configuration system:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;array.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cupy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cudf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/02/easy-cpu-gpu.md&lt;/span&gt;, line 22)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="dispatching-for-array-creation"&gt;

&lt;p&gt;To see how users can easily switch between NumPy and CuPy, let’s start by creating an array of ones:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;array.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cupy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;darr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;  &lt;span class="c1"&gt;# Get cupy-backed collection&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;darr&lt;/span&gt;
&lt;span class="go"&gt;dask.array&amp;lt;ones_like, shape=(10,), dtype=float64, chunksize=(5,), chunktype=cupy.ndarray&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;chunktype&lt;/span&gt;&lt;/code&gt; informs us that the array is constructed with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cupy.ndarray&lt;/span&gt;&lt;/code&gt;
objects instead of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;numpy.ndarray&lt;/span&gt;&lt;/code&gt; objects.&lt;/p&gt;
&lt;p&gt;We’ve also improved the user experience for random array creation. Previously, if a user wanted to create a CuPy-backed Dask array, they were required to define an explicit &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomState&lt;/span&gt;&lt;/code&gt; object in Dask using CuPy. For example, the following code worked prior to Dask 2022.10.0, but seems rather verbose:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cupy&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;da&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;rs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cupy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;darr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;darr&lt;/span&gt;
&lt;span class="go"&gt;dask.array&amp;lt;randint, shape=(10, 20), dtype=int64, chunksize=(2, 5), chunktype=cupy.ndarray&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, we can leverage the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;array.backend&lt;/span&gt;&lt;/code&gt; configuration to create a CuPy-backed dask array for random data:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;array.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cupy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;darr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Get cupy-backed collection&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;darr&lt;/span&gt;
&lt;span class="go"&gt;dask.array&amp;lt;randint, shape=(10, 20), dtype=int64, chunksize=(2, 5), chunktype=cupy.ndarray&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;array.backend&lt;/span&gt;&lt;/code&gt; is significantly easier and much more ergonomic – it supports all basic array creation methods including: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;ones&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;zeros&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;empty&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;full&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;arange&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;random&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_array&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_zarr&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_tiledb&lt;/span&gt;&lt;/code&gt; have not yet been implemented
with this functionality&lt;/em&gt;&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/02/easy-cpu-gpu.md&lt;/span&gt;, line 64)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="dispatching-for-dataframe-creation"&gt;
&lt;h1&gt;Dispatching for Dataframe Creation&lt;/h1&gt;
&lt;p&gt;When creating Dask Dataframes backed by either Pandas or cuDF, the beginning is often the input I/O methods: read_csv, read_parquet, etc. We’ll first start by constructing a dataframe on the fly with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;from_dict&lt;/span&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cudf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;dask_cudf.DataFrame | 2 tasks | 2 npartitions&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here we can tell we have a cuDF backed dataframe and we are using &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-cudf&lt;/span&gt;&lt;/code&gt;
because the repr shows us the type: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;&amp;lt;dask_cudf.DataFrame&lt;/span&gt; &lt;span class="pre"&gt;|&lt;/span&gt; &lt;span class="pre"&gt;2&lt;/span&gt; &lt;span class="pre"&gt;tasks&lt;/span&gt; &lt;span class="pre"&gt;|&lt;/span&gt; &lt;span class="pre"&gt;2&lt;/span&gt; &lt;span class="pre"&gt;npartitions&amp;gt;&lt;/span&gt;&lt;/code&gt;.
Let’s also demonstrate the read functionality by generating CSV and
Parquet data.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;example.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;single_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;example.parquet&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now we are simply repeating the config setting but instead using the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_csv&lt;/span&gt;&lt;/code&gt;
and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_parquet&lt;/span&gt;&lt;/code&gt; methods:&lt;/p&gt;
&lt;div class="highlight-default notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cudf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;example.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;class &amp;#39;dask_cudf.core.DataFrame&amp;#39;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.backend&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;cudf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;ddf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;example.parquet&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ddf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;class &amp;#39;dask_cudf.core.DataFrame&amp;#39;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2023/02/02/easy-cpu-gpu.md&lt;/span&gt;, line 104)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="why-is-this-useful"&gt;
&lt;h1&gt;Why is this Useful ?&lt;/h1&gt;
&lt;p&gt;As hardware changes in exciting and exotic ways with: GPUs, TPUs, IPUs,
etc., we want to provide the same interface and treat hardware as an
abstraction. For example, many PyTorch workflows start with the following:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;cuda&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;cpu&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And what follows is typically standard hardware agnostic PyTorch. This is
incredibly powerful as the user (in most cases) should not care what hardware
underlies the source. As such, it enables the user to develop PyTorch anywhere
and everywhere. The new Dask backend selection configurations gives users a
similar freedom.&lt;/p&gt;
&lt;p&gt;Conclusion&lt;/p&gt;
&lt;p&gt;Our long-term goal of this feature is to enable Dask users to use any backend library in dask.array and dask.dataframe, as long as that library conforms to the minimal “array” or “dataframe” standard defined by the data-api consortium, respectively.&lt;/p&gt;
&lt;p&gt;The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU-acceleration to your project, please reach out on Github or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.&lt;/p&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2023/02/02/easy-cpu-gpu/"/>
    <summary>This article was originally posted on the RAPIDS blog.</summary>
    <category term="GPU" label="GPU"/>
    <published>2023-02-02T00:00:00+00:00</published>
  </entry>
</feed>
