<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged Pandas</title>
  <updated>2026-03-05T15:05:23.256225+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/pandas/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/01/13/dask-cudf-first-steps/</id>
    <title>Dask, Pandas, and GPUs: first steps</title>
    <updated>2019-01-13T00:00:00+00:00</updated>
    <author>
      <name>Matthew Rocklin</name>
    </author>
    <content type="html">
&lt;section id="executive-summary"&gt;

&lt;p&gt;We’re building a distributed GPU Pandas dataframe out of
&lt;a class="reference external" href="https://github.com/rapidsai/cudf"&gt;cuDF&lt;/a&gt; and
&lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;Dask Dataframe&lt;/a&gt;.
This effort is young.&lt;/p&gt;
&lt;p&gt;This post describes the current situation
and our general approach,
and gives examples of what does and doesn’t work today.
We end with some notes on scaling performance.&lt;/p&gt;
&lt;p&gt;You can also view the experiment in this post as
&lt;a class="reference external" href="https://gist.github.com/mrocklin/4b1b80d1ae07ec73f75b2a19c8e90e2e"&gt;a notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And here is a table of results:&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
    &lt;th&gt;Bandwidth&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 3min 14s &lt;/td&gt;
      &lt;td&gt; 50 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight CPU Cores &lt;/th&gt;
      &lt;td&gt; 58s &lt;/td&gt;
      &lt;td&gt; 170 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 35s &lt;/td&gt;
      &lt;td&gt; 285 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 11s &lt;/td&gt;
      &lt;td&gt; 900 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 5s &lt;/td&gt;
      &lt;td&gt; 2000 MB/s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/section&gt;
&lt;section id="building-blocks-cudf-and-dask"&gt;
&lt;h1&gt;Building Blocks: cuDF and Dask&lt;/h1&gt;
&lt;p&gt;Building a distributed GPU-backed dataframe is a large endeavor.
Fortunately we’re starting on a good foundation and
can assemble much of this system from existing components:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://github.com/rapidsai/cudf"&gt;cuDF&lt;/a&gt; library aims to implement the
Pandas API on the GPU. It gets good speedups on standard operations like
reading CSV files, filtering and aggregating columns, joins, and so on.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;cudf&lt;/span&gt;  &lt;span class="c1"&gt;# looks and feels like Pandas, but runs on the GPU&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;myfile.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Alice&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;cuDF is part of the growing &lt;a class="reference external" href="https://rapids.ai"&gt;RAPIDS initiative&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe.html"&gt;Dask Dataframe&lt;/a&gt;
library provides parallel algorithms around the Pandas API. It composes
large operations like distributed groupbys or distributed joins from a task
graph of many smaller single-node groupbys or joins (and likewise for many
&lt;a class="reference external" href="https://docs.dask.org/en/latest/dataframe-api.html"&gt;other operations&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;  &lt;span class="c1"&gt;# looks and feels like Pandas, but runs in parallel&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;myfile.*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Alice&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a class="reference external" href="https://distributed.dask.org"&gt;Dask distributed task scheduler&lt;/a&gt;
provides general-purpose parallel execution given complex task graphs.
It’s good for adding multi-node computing into an existing codebase.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
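&lt;p&gt;To make the second building block more concrete, here is a minimal pure-Python sketch of how a
grouped mean can be composed from many small per-partition groupbys.
It is only an illustration of the idea, not Dask Dataframe’s actual implementation:
the function names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_groupby&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;combine&lt;/span&gt;&lt;/code&gt; and the sample data are hypothetical.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from collections import defaultdict


def partial_groupby(rows):
    """Per-partition step: sum and count of value for each id."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def combine(partials):
    """Final step: merge per-partition sums and counts into means."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


# Two hypothetical partitions of (id, value) rows
partitions = [
    [(1, 10.0), (2, 4.0)],
    [(1, 20.0), (2, 6.0), (2, 8.0)],
]
means = combine(partial_groupby(p) for p in partitions)
print(means)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each per-partition step only needs a single-node groupby, which is why the same
algorithm can in principle run on either Pandas or cuDF partitions.&lt;/p&gt;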
&lt;p&gt;Given these building blocks,
our approach is to make the cuDF API close enough to Pandas that
we can reuse the Dask Dataframe algorithms.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="benefits-and-challenges-to-this-approach"&gt;
&lt;h1&gt;Benefits and Challenges to this approach&lt;/h1&gt;
&lt;p&gt;This approach has a few benefits:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;We get to reuse the parallel algorithms found in Dask Dataframe originally designed for Pandas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It consolidates the development effort within a single codebase so that
future effort spent on CPU Dataframes benefits GPU Dataframes and vice
versa. Maintenance costs are shared.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By building code that works equally with two DataFrame implementations
(CPU and GPU) we establish conventions and protocols that will
make it easier for other projects to do the same, either with these two
Pandas-like libraries, or with future Pandas-like libraries.&lt;/p&gt;
&lt;p&gt;This approach also aims to demonstrate that the ecosystem should support Pandas-like
libraries, rather than just Pandas. For example, if
(when?) the Arrow library develops a computational system then we’ll be in
a better place to roll that in as well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When doing any refactor we tend to clean up existing code.&lt;/p&gt;
&lt;p&gt;For example, to make dask dataframe ready for a new GPU Parquet reader
we ended up &lt;a class="reference external" href="https://github.com/dask/dask/pull/4336"&gt;refactoring and simplifying our Parquet I/O logic&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The approach also has some drawbacks. Namely, it places pressure on the cuDF API to match Pandas:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;Slight differences in API now cause larger problems, such as these:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/251"&gt;Join column ordering differs rapidsai/cudf #251&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/483#issuecomment-453218151"&gt;Groupby aggregation column ordering differs rapidsai/cudf #483&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cuDF has some pressure on it to repeat what some believe to be mistakes in
the Pandas API.&lt;/p&gt;
&lt;p&gt;For example, cuDF today supports missing values arguably more sensibly than
Pandas. Should cuDF have to revert to the old way of doing things
just to match Pandas semantics? Dask Dataframe will probably need
to be more flexible in order to handle evolution and small differences
in semantics.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
&lt;section id="alternatives"&gt;
&lt;h1&gt;Alternatives&lt;/h1&gt;
&lt;p&gt;We could also write a new dask-dataframe-style project around cuDF that deviates
from the Pandas/Dask Dataframe API. Until recently this
has actually been the approach, and the
&lt;a class="reference external" href="https://github.com/rapidsai/dask-cudf"&gt;dask-cudf&lt;/a&gt; project did exactly this.
This was probably a good choice early on to get started and prototype things.
The project was able to implement a wide range of functionality including
groupby-aggregations, joins, and so on using
&lt;a class="reference external" href="https://docs.dask.org/en/latest/delayed.html"&gt;dask delayed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We’re redoing this now on top of dask dataframe though, which means that we’re
losing some functionality that dask-cudf already had, but hopefully the
functionality that we add now will be more stable and established on a firmer
base.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="status-today"&gt;
&lt;h1&gt;Status Today&lt;/h1&gt;
&lt;p&gt;Today very little works, but what does is decently smooth.&lt;/p&gt;
&lt;p&gt;Here is a simple example that reads some data from many CSV files,
picks out a column,
and does an aggregation.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cuda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_cudf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.distributed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LocalCUDACluster&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# runs on eight local GPUs&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dask_cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data/nyc/many/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# wrap around many CSV files&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;184464740&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Also note, NYC Taxi ridership is significantly less than it was a few years ago&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-i-m-excited-about-in-the-example-above"&gt;
&lt;h1&gt;What I’m excited about in the example above&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;All of the infrastructure surrounding the cuDF code, like the cluster setup,
diagnostics, JupyterLab environment, and so on, came for free, like any
other new Dask project.&lt;/p&gt;
&lt;p&gt;Here is an image of my JupyterLab setup&lt;/p&gt;
&lt;a href="https://matthewrocklin.com/blog/images/dask-cudf-environment.png"&gt;
  &lt;img src="https://matthewrocklin.com/blog/images/dask-cudf-environment.png"
       alt="Dask + CUDA + cuDF JupyterLab environment"
       width="70%"&gt;
&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Our &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;df&lt;/span&gt;&lt;/code&gt; object is actually just a normal Dask Dataframe. We didn’t have to
write new &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__repr__&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;__add__&lt;/span&gt;&lt;/code&gt;, or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.sum()&lt;/span&gt;&lt;/code&gt; implementations, and probably
many functions we didn’t think about work well today (though also many
don’t).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’re tightly integrated and more connected to other systems. For example, if
we wanted to convert our dask-cudf-dataframe to a dask-pandas-dataframe then
we would just use the cuDF &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;to_pandas&lt;/span&gt;&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_partitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We don’t have to write anything special like a separate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;.to_dask_dataframe&lt;/span&gt;&lt;/code&gt;
method or handle other special cases.&lt;/p&gt;
&lt;p&gt;Dask parallelism is orthogonal to the choice of CPU or GPU.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s easy to switch hardware. By avoiding separate &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask-cudf&lt;/span&gt;&lt;/code&gt; code paths
it’s easier to add cuDF to an existing Dask+Pandas codebase to run on GPUs,
or to remove cuDF and use Pandas if we want our code to be runnable without GPUs.&lt;/p&gt;
&lt;p&gt;There are more examples of this in the scaling section below.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
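&lt;p&gt;The last point above, switching hardware, can be sketched as follows.
This is a minimal sketch and not part of the original experiment:
the helper names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;backend_name&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dataframe_backend&lt;/span&gt;&lt;/code&gt;, and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total_passengers&lt;/span&gt;&lt;/code&gt; are hypothetical,
and it assumes only that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask.dataframe.read_csv&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;dask_cudf.read_csv&lt;/span&gt;&lt;/code&gt;,
both used elsewhere in this post, accept the same glob pattern.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import importlib


def backend_name(use_gpu):
    """Name of the module that provides read_csv for the chosen hardware."""
    return "dask_cudf" if use_gpu else "dask.dataframe"


def dataframe_backend(use_gpu):
    """Import the backend lazily so CPU-only machines never import dask_cudf."""
    return importlib.import_module(backend_name(use_gpu))


def total_passengers(pattern, use_gpu=False):
    """Sum the passenger_count column with either Pandas or cuDF partitions."""
    backend = dataframe_backend(use_gpu)
    df = backend.read_csv(pattern)
    # The rest of the code is identical for both backends
    return df.passenger_count.sum().compute()
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Calling &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;total_passengers('data/nyc/many/*.csv',&lt;/span&gt; &lt;span class="pre"&gt;use_gpu=True)&lt;/span&gt;&lt;/code&gt; would require a GPU and an installed
dask-cudf. With the default &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;use_gpu=False&lt;/span&gt;&lt;/code&gt; the same code path uses Pandas partitions.&lt;/p&gt;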
&lt;/section&gt;
&lt;section id="what-s-wrong-with-the-example-above"&gt;
&lt;h1&gt;What’s wrong with the example above&lt;/h1&gt;
&lt;p&gt;In general the answer is &lt;strong&gt;many small things&lt;/strong&gt;.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;cudf.read_csv&lt;/span&gt;&lt;/code&gt; function doesn’t yet support reading chunks from a
single CSV file, and so doesn’t work well with very large CSV files. We
had to split our large CSV files into many smaller CSV files first with
normal Dask+Pandas:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask.dataframe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;few-large/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;npartitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;many-small/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;(See &lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/568"&gt;rapidsai/cudf #568&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Many operations that used to work in dask-cudf like groupby-aggregations
and joins no longer work. We’re going to need to slightly modify many cuDF
APIs over the next couple of months to more closely match their Pandas
equivalents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I ran the timing cell twice because it currently takes a few seconds to
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;import&lt;/span&gt; &lt;span class="pre"&gt;cudf&lt;/span&gt;&lt;/code&gt;.
(See &lt;a class="reference external" href="https://github.com/rapidsai/cudf/issues/627"&gt;rapidsai/cudf #627&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We had to make Dask Dataframe a bit more flexible and assume less about its
constituent dataframes being exactly Pandas dataframes. (see
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4359"&gt;dask/dask #4359&lt;/a&gt; and
&lt;a class="reference external" href="https://github.com/dask/dask/pull/4375"&gt;dask/dask #4375&lt;/a&gt; for examples).
I suspect that there will be many more small changes like
these necessary in the future.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These problems are representative of dozens more similar issues. They are
all fixable and indeed, many are actively being fixed today by the &lt;a class="reference external" href="https://github.com/rapidsai/cudf/graphs/contributors"&gt;good folks
working on RAPIDS&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="near-term-schedule"&gt;
&lt;h1&gt;Near Term Schedule&lt;/h1&gt;
&lt;p&gt;The RAPIDS group is currently busy working to release 0.5, which includes some
of the fixes necessary to run the example above, and also many unrelated
stability improvements. This will probably keep them busy for a week or two
during which I don’t expect to see much Dask + cuDF work going on other than
planning.&lt;/p&gt;
&lt;p&gt;After that, Dask parallelism support will be a top priority, so
I look forward to seeing some rapid progress here.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="scaling-results"&gt;
&lt;h1&gt;Scaling Results&lt;/h1&gt;
&lt;p&gt;In &lt;a class="reference internal" href="../../../2019/01/03/dask-array-gpus-first-steps/"&gt;&lt;span class="doc std std-doc"&gt;my last post about combining Dask Array with CuPy&lt;/span&gt;&lt;/a&gt;,
a GPU-accelerated Numpy,
we saw impressive speedups from using many GPUs on a simple problem that
manipulated some simple random data.&lt;/p&gt;
&lt;section id="dask-array-cupy-on-random-data"&gt;
&lt;h2&gt;Dask Array + CuPy on Random Data&lt;/h2&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 2hr 39min &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 11min 30s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 1min 37s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 19s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That exercise was easy to scale because it was almost entirely bound by the
computation of creating random data.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="dask-dataframe-cudf-on-csv-data"&gt;
&lt;h2&gt;Dask DataFrame + cuDF on CSV data&lt;/h2&gt;
&lt;p&gt;We did a similar study on the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;read_csv&lt;/span&gt;&lt;/code&gt; example above, which is bound mostly
by reading CSV data from disk and then parsing it. A notebook is
available
&lt;a class="reference external" href="https://gist.github.com/mrocklin/4b1b80d1ae07ec73f75b2a19c8e90e2e"&gt;here&lt;/a&gt;. We
have similar (though less impressive) numbers to present.&lt;/p&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
  &lt;tr&gt;
    &lt;th&gt;Architecture&lt;/th&gt;
    &lt;th&gt;Time&lt;/th&gt;
    &lt;th&gt;Bandwidth&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt; Single CPU Core &lt;/th&gt;
      &lt;td&gt; 3min 14s &lt;/td&gt;
      &lt;td&gt; 50 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight CPU Cores &lt;/th&gt;
      &lt;td&gt; 58s &lt;/td&gt;
      &lt;td&gt; 170 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Forty CPU Cores &lt;/th&gt;
      &lt;td&gt; 35s &lt;/td&gt;
      &lt;td&gt; 285 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; One GPU &lt;/th&gt;
      &lt;td&gt; 11s &lt;/td&gt;
      &lt;td&gt; 900 MB/s &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt; Eight GPUs &lt;/th&gt;
      &lt;td&gt; 5s &lt;/td&gt;
      &lt;td&gt; 2000 MB/s &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;The bandwidth numbers were computed by noting that the data was around 10 GB on disk&lt;/em&gt;&lt;/p&gt;
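&lt;p&gt;As a rough check of that arithmetic, the following sketch recomputes the bandwidth column from the timings in the table above.
It assumes decimal units (1 GB = 1000 MB) and takes the “around 10 GB” figure at face value,
so the results only approximately match the rounded values in the table.
The helper name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;bandwidth_mb_per_s&lt;/span&gt;&lt;/code&gt; is hypothetical.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Approximate dataset size: the post says "around 10 GB on disk".
# Decimal units (1 GB = 1000 MB) are an assumption.
DATA_SIZE_MB = 10 * 1000

# Wall-clock times in seconds, copied from the table above
timings_s = {
    "Single CPU Core": 3 * 60 + 14,
    "Eight CPU Cores": 58,
    "Forty CPU Cores": 35,
    "One GPU": 11,
    "Eight GPUs": 5,
}


def bandwidth_mb_per_s(seconds, size_mb=DATA_SIZE_MB):
    """Effective read-and-parse bandwidth in MB/s."""
    return size_mb / seconds


for name, seconds in timings_s.items():
    print(f"{name}: {bandwidth_mb_per_s(seconds):.0f} MB/s")
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;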
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="analysis"&gt;
&lt;h1&gt;Analysis&lt;/h1&gt;
&lt;p&gt;First, I want to emphasize again that it’s easy to test a wide variety of
architectures using this setup because of the Pandas API compatibility between
all of the different projects. We’re seeing a wide range of performance (40x
span) across a variety of different hardware with a wide range of cost points.&lt;/p&gt;
&lt;p&gt;Second, note that this problem scales less well than our
&lt;a class="reference internal" href="../../../2019/01/03/dask-array-gpus-first-steps/"&gt;&lt;span class="doc std std-doc"&gt;previous example with CuPy&lt;/span&gt;&lt;/a&gt;,
both on CPU and GPU.
I suspect that this is because this example is also bound by I/O and not just
number-crunching. While the jump from single-CPU to single-GPU is large, the
jump from single-CPU to many-CPU or single-GPU to many-GPU is not as large as
we would have liked. For GPUs, for example, we got around a 2x speedup when we
added 8x as many GPUs.&lt;/p&gt;
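&lt;p&gt;To quantify this, here is a small sketch that computes speedup and parallel efficiency
(speedup divided by the number of workers) from the CSV timings reported in this post.
The function names &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;speedup&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;efficiency&lt;/span&gt;&lt;/code&gt; are hypothetical, and efficiency here is the
usual textbook definition rather than anything measured separately.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Timings in seconds from the CSV results table, keyed by worker count
timings = {
    "cpu": {1: 194, 8: 58, 40: 35},  # 3min 14s, 58s, 35s
    "gpu": {1: 11, 8: 5},
}


def speedup(baseline_s, parallel_s):
    """How many times faster the parallel run is than the baseline."""
    return baseline_s / parallel_s


def efficiency(baseline_s, parallel_s, workers):
    """Fraction of ideal linear scaling achieved (1.0 is perfect)."""
    return speedup(baseline_s, parallel_s) / workers


for kind, runs in timings.items():
    base = runs[1]
    for workers, seconds in runs.items():
        if workers == 1:
            continue
        s = speedup(base, seconds)
        e = efficiency(base, seconds, workers)
        print(f"{kind} x{workers}: speedup {s:.1f}x, efficiency {e:.0%}")
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For the eight-GPU run this gives a speedup of 11s / 5s, or about 2.2x, well short of the ideal 8x.&lt;/p&gt;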
&lt;p&gt;At first one might think that this is because we’re saturating disk read speeds.
However, two pieces of evidence go against that guess:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;NVIDIA folks familiar with my current hardware inform me that they’re able to get
much higher I/O throughput when they’re careful&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPU scaling is similarly poor, despite the fact that it’s obviously not
reaching full I/O bandwidth&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Instead, it’s likely that we’re just not treating our disks and I/O pipelines
carefully.&lt;/p&gt;
&lt;p&gt;We might consider working to think more carefully about data locality within a
single machine. Alternatively, we might just choose to use a smaller machine,
or many smaller machines. My team has been asking me to start playing with
some systems cheaper than a DGX, and I may experiment with those soon. It may be
that for data-loading and pre-processing workloads the previous wisdom of “pack
as much computation as you can into a single box” no longer holds
(without us doing more work, that is).&lt;/p&gt;
&lt;section id="come-help"&gt;
&lt;h2&gt;Come help&lt;/h2&gt;
&lt;p&gt;If the work above sounds interesting to you then come help!
There is a lot of low-hanging, high-impact work to do.&lt;/p&gt;
&lt;p&gt;If you’re interested in being paid to focus more on these topics, then consider
applying for a job. NVIDIA’s RAPIDS team is looking to hire engineers for Dask
development with GPUs and other data analytics library development projects.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Austin/Senior-Library-Software-Engineer---RAPIDS_JR1919608-1"&gt;Senior Library Software Engineer - RAPIDS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.dask.org/2019/01/13/dask-cudf-first-steps/"/>
    <summary>We’re building a distributed GPU Pandas dataframe out of cuDF and Dask Dataframe. This effort is young.</summary>
    <category term="GPU" label="GPU"/>
    <category term="Pandas" label="Pandas"/>
    <published>2019-01-13T00:00:00+00:00</published>
  </entry>
</feed>
