Dask Working Notes
Improving GroupBy.map with Dask and Xarray — *Nov 21, 2024*
This post was originally published on the Xarray blog.
Dask DataFrame is Fast Now — *May 30, 2024*
This work was engineered and supported by Coiled and NVIDIA. Thanks to Patrick Hoefler and Rick Zamora, in particular. Original version of this post appears on docs.coiled.io
High Level Query Optimization in Dask — *Aug 25, 2023*
This work was engineered and supported by Coiled and NVIDIA. Thanks to Patrick Hoefler and Rick Zamora, in particular. Original version of this post appears on blog.coiled.io
Upstream testing in Dask — *Apr 18, 2023*
Original version of this post appears on blog.coiled.io
Do you need consistent environments between the client, scheduler and workers? — *Apr 14, 2023*
Update May 3rd 2023: Clarify GPU recommendations.
Deep Dive into creating a Dask DataFrame Collection with from_map — *Apr 12, 2023*
Dask DataFrame provides dedicated IO functions for several popular tabular-data formats, like CSV and Parquet. If you are working with a supported format, then the corresponding function (e.g. read_csv) is likely to be the most reliable way to create a new Dask DataFrame collection. For other workflows, from_map now offers a convenient way to define a DataFrame collection as an arbitrary function mapping. While these kinds of workflows have historically required users to adopt the Dask Delayed API, from_map now makes custom collection creation both easier and more performant.
Shuffling large data at constant memory in Dask — *Mar 15, 2023*
This work was engineered and supported by Coiled. In particular, thanks to Florian Jetter, Gabe Joseph, Hendrik Makait, and Matt Rocklin. Original version of this post appears on blog.coiled.io
Managing dask workloads with Flyte — *Feb 13, 2023*
It is now possible to manage dask workloads using Flyte 🎉!
Easy CPU/GPU Arrays and Dataframes — *Feb 02, 2023*
This article was originally posted on the RAPIDS blog.
Dask Demo Day November 2022 — *Nov 21, 2022*
Once a month, the Dask Community team hosts Dask Demo Day: an informal and fun online hangout where folks can showcase new or lesser-known Dask features and the rest of us can learn about all the things we didn’t know Dask could do 😁
Reducing memory usage in Dask workloads by 80% — *Nov 15, 2022*
Original version of this post appears on https://www.coiled.io/blog/reducing-dask-memory-usage
Dask Kubernetes Operator — *Nov 09, 2022*
We are excited to announce that the Dask Kubernetes Operator is now generally available 🎉!
Understanding Dask’s meta keyword argument — *Aug 09, 2022*
If you have worked with Dask DataFrames or Dask Arrays, you have probably come across the meta keyword argument. Perhaps, while using methods like apply():
Data Proximate Computation on a Dask Cluster Distributed Between Data Centres — *Jul 19, 2022*
This work is a joint venture between the Met Office and the European Weather Cloud, which is a partnership of ECMWF and EUMETSAT.
Documentation Framework — *Jul 15, 2022*
How to run different worker types with the Dask Helm Chart — *Feb 17, 2022*
Reflections on one year as the Dask life science fellow — *Dec 15, 2021*
Mosaic Image Fusion — *Dec 01, 2021*
Choosing good chunk sizes in Dask — *Nov 02, 2021*
CZI EOSS Update — *Oct 20, 2021*
Dask was awarded funding last year in round 2 of the CZI Essential Open Source Software grant program. That funding was used to hire Genevieve Buckley to work on Dask with a focus on life sciences. Last month Dask submitted an interim progress report to CZI, covering the period from February to September 2021. That progress update is published verbatim below, to share with the wider Dask community.
2021 Dask User Survey — *Sep 15, 2021*
This post presents the results of the 2021 Dask User Survey, which ran earlier this year. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.
Google Summer of Code 2021 - Dask Project — *Aug 23, 2021*
High Level Graphs update — *Jul 07, 2021*
Ragged output, how to handle awkward shaped results — *Jul 02, 2021*
Dask Down Under — *Jun 25, 2021*
Dask Survey 2021, early anecdotes — *Jun 18, 2021*
The annual Dask user survey is under way and currently accepting responses at dask.org/survey.
The evolution of a Dask Distributed user — *Jun 01, 2021*
This week was the 2021 Dask Summit and one of the workshops that we ran covered many deployment options for Dask Distributed.
The 2021 Dask User Survey is out now — *May 25, 2021*
The Dask User Survey is out again! Tell us how you use Dask, and help us make it better for everyone.
Life sciences at the 2021 Dask Summit — *May 24, 2021*
Stability of the Dask library — *May 21, 2021*
Dask is moving fast these days. Sometimes we break things as a result.
Skeleton analysis — *May 07, 2021*
Dask with PyTorch for large scale image analysis — *Mar 29, 2021*
Image segmentation with Dask — *Mar 19, 2021*
Measuring Dask memory usage with dask-memusage — *Mar 11, 2021*
Using too many computing resources can get expensive when you’re scaling up in the cloud.
Getting to know the life science community — *Mar 04, 2021*
Dask User Summit 2021 — *Mar 03, 2021*
Dask is organizing a user summit in mid-May. This will be a remote event focused on bringing together developers and users of Dask and the distributed PyData stack in different domains.
Image Analysis Redux — *Nov 12, 2020*
2020 Dask User Survey — *Sep 22, 2020*
This post presents the results of the 2020 Dask User Survey, which ran earlier this summer. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.
Announcing the DaskHub Helm Chart — *Aug 31, 2020*
Today we’re announcing the release of the daskhub helm chart. This is a Helm chart to easily install JupyterHub and Dask for multiple users on a Kubernetes Cluster. If you’re managing deployment for many people who need interactive, scalable computing (say for a class of students, a data science team, or a research lab) then dask/daskhub might be right for you.
Running tutorials — *Aug 21, 2020*
For the last couple of months we’ve been running community tutorials every three weeks or so. The response from the community has been great and we’ve had 50-100 people at each 90 minute session.
Comparing Dask-ML and Ray Tune's Model Selection Algorithms — *Aug 06, 2020*
Hyperparameter optimization is the process of deducing model parameters that can’t be learned from data. This process is often time- and resource-consuming, especially in the context of deep learning. A good description of this process can be found at “Tuning the hyper-parameters of an estimator,” and the issues that arise are concisely summarized in Dask-ML’s documentation of “Hyper Parameter Searches.”
Configuring a Distributed Dask Cluster — *Jul 30, 2020*
Configuring a Dask cluster can seem daunting at first, but the good news is that the Dask project has a lot of built-in heuristics that try their best to anticipate and adapt to your workload based on the machine it is deployed on and the work it receives. Possibly for a long time you can get away with not configuring anything special at all. That being said, if you are looking for some tips to move on from using Dask locally, or have a Dask cluster that you are ready to optimize with some more in-depth configuration, these tips and tricks will help guide you and link you to the best Dask docs on the topic!
The current state of distributed Dask clusters — *Jul 23, 2020*
Dask enables you to build up a graph of the computation you want to perform and then executes it in parallel for you. This is great for making best use of your computer’s hardware. It is also great when you want to expand beyond the limits of a single machine.
Faster Scheduling — *Jul 21, 2020*
Last Year in Review — *Jul 17, 2020*
We recently enjoyed the 2020 SciPy conference from the comfort of our own homes. The 19th annual Scientific Computing with Python conference was held virtually this year due to the global pandemic. The annual SciPy Conference brought together over 1500 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.
Large SVDs — *May 13, 2020*
Dask Summit — *Apr 28, 2020*
In late February members of the Dask community gathered together in Washington, DC. This was a mix of open source project maintainers and active users from a broad range of institutions. This post shares a summary of what happened at this workshop, including slides, images, and lessons learned.
Estimating Users — *Jan 14, 2020*
People often ask me “How many people use Dask?”
Dask Deployment Updates — *Nov 01, 2019*
DataFrame Groupby Aggregations — *Oct 08, 2019*
Better and faster hyperparameter optimization with Dask — *Sep 30, 2019*
Scott Sievert wrote this post. The original post lives at https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/ with better styling. This work is supported by Anaconda, Inc.
Co-locating a Jupyter Server and Dask Scheduler — *Sep 13, 2019*
If you want, you can have Dask set up a Jupyter notebook server for you, co-located with the Dask scheduler. There are many ways to do this, but this blog post lists two.
Dask on HPC: a case study — *Aug 28, 2019*
Dask is deployed on traditional HPC machines with increasing frequency. In the past week I’ve personally helped four different groups get set up. This is a surprisingly individual process, because every HPC machine has its own idiosyncrasies. Each machine uses a job scheduler like SLURM/PBS/SGE/LSF/…, a network file system, and fast interconnect, but each of those sub-systems has slightly different policies on a machine-by-machine basis, which is where things get tricky.
Dask and ITK for large scale image analysis — *Aug 09, 2019*
2019 Dask User Survey — *Aug 05, 2019*
Dask Release 2.2.0 — *Aug 02, 2019*
I’m pleased to announce the release of Dask version 2.2. This is a significant release with bug fixes and new features. The last blogged release was 2.0 on 2019-06-22. This blogpost outlines notable changes since the last post.
Extracting fsspec from Dask — *Jul 23, 2019*
Dask Release 2.0 — *Jun 22, 2019*
Please take the Dask User Survey for 2019. Your response helps to prioritize future work.
Load Large Image Data with Dask Array — *Jun 20, 2019*
Python and GPUs: A Status Update — *Jun 19, 2019*
This blogpost was delivered in talk form at the recent PASC 2019 conference. Slides for that talk are here.
Dask on HPC — *Jun 12, 2019*
We analyze large datasets on HPC systems with Dask, a parallel computing library that integrates well with the existing Python software ecosystem, and works comfortably with native HPC hardware.
Experiments in High Performance Networking with UCX and DGX — *Jun 09, 2019*
This post is about experimental and rapidly changing software. Code examples in this post should not be relied upon to work in the future.
Composing Dask Array with Numba Stencils — *Apr 09, 2019*
In this post we explore four array computing technologies, and how they work together to achieve powerful results.
cuML and Dask hyperparameter optimization — *Mar 27, 2019*
Dask and the __array_function__ protocol — *Mar 18, 2019*
Building GPU Groupby-Aggregations for Dask — *Mar 04, 2019*
Running Dask and MPI programs together — *Jan 31, 2019*
Single-Node Multi-GPU Dataframe Joins — *Jan 29, 2019*
Dask Release 1.1.0 — *Jan 23, 2019*
I’m pleased to announce the release of Dask version 1.1.0. This is a major release with bug fixes and new features. The last release was 1.0.0 on 2018-11-29. This blogpost outlines notable changes since the last release.
Extension Arrays in Dask DataFrame — *Jan 22, 2019*
This work is supported by Anaconda Inc
Dask, Pandas, and GPUs: first steps — *Jan 13, 2019*
GPU Dask Arrays, first steps — *Jan 03, 2019*
The following code creates and manipulates 2 TB of randomly generated data.
Dask Version 1.0 — *Nov 29, 2018*
We are pleased to announce the release of Dask version 1.0.0!
Dask-jobqueue — *Oct 08, 2018*
This work was done in collaboration with Matthew Rocklin (Anaconda), Jim Edwards (NCAR), Guillaume Eynard-Bontemps (CNES), and Loïc Estève (INRIA), and is supported, in part, by the US National Science Foundation Earth Cube program. The dask-jobqueue package is a spinoff of the Pangeo Project. This blogpost was previously published here
Refactor Documentation — *Sep 27, 2018*
This work is supported by Anaconda Inc
Dask Development Log — *Sep 17, 2018*
This work is supported by Anaconda Inc
Dask Release 0.19.0 — *Sep 05, 2018*
This work is supported by Anaconda Inc.
High level performance of Pandas, Dask, Spark, and Arrow — *Aug 28, 2018*
This work is supported by Anaconda Inc
Building SAGA optimization for Dask arrays — *Aug 07, 2018*
This work is supported by ETH Zurich, Anaconda Inc, and the Berkeley Institute for Data Science
Dask Development Log — *Aug 02, 2018*
This work is supported by Anaconda Inc
Pickle isn't slow, it's a protocol — *Jul 23, 2018*
This work is supported by Anaconda Inc
Dask Development Log, Scipy 2018 — *Jul 17, 2018*
This work is supported by Anaconda Inc
Who uses Dask? — *Jul 16, 2018*
This work is supported by Anaconda Inc
Dask Development Log — *Jul 08, 2018*
This work is supported by Anaconda Inc
Dask Scaling Limits — *Jun 26, 2018*
This work is supported by Anaconda Inc.
Dask Release 0.18.0 — *Jun 14, 2018*
This work is supported by Anaconda Inc.
Beyond Numpy Arrays in Python — *May 27, 2018*
Dask Release 0.17.2 — *Mar 21, 2018*
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.
Craft Minimal Bug Reports — *Feb 28, 2018*
Following up on a post on supporting users in open source this post lists some suggestions on how to ask a maintainer to help you with a problem.
Dask Release 0.17.0 — *Feb 12, 2018*
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.
Credit Modeling with Dask — *Feb 09, 2018*
This post explores a real-world use case calculating complex credit models in Python using Dask. It is an example of a complex parallel system that is well outside of the traditional “big data” workloads.
Pangeo: JupyterHub, Dask, and XArray on the Cloud — *Jan 22, 2018*
This work is supported by Anaconda Inc, the NSF EarthCube program, and UC Berkeley BIDS
Dask Development Log — *Dec 06, 2017*
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation
Dask Release 0.16.0 — *Nov 21, 2017*
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.
Optimizing Data Structure Access in Python — *Nov 03, 2017*
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation
Streaming Dataframes — *Oct 16, 2017*
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation
Notes on Kafka in Python — *Oct 10, 2017*
Dask Release 0.15.3 — *Sep 24, 2017*
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.
Fast GeoSpatial Analysis in Python — *Sep 21, 2017*
This work is supported by Anaconda Inc., the Data Driven Discovery Initiative from the Moore Foundation, and NASA SBIR NNX16CG43P
Dask on HPC - Initial Work — *Sep 18, 2017*
This work is supported by Anaconda Inc. and the NSF EarthCube program.
Dask Release 0.15.2 — *Aug 30, 2017*
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.
Dask Benchmarks — *Jul 03, 2017*
This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.
Use Apache Parquet — *Jun 28, 2017*
This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.
Dask Release 0.15.0 — *Jun 15, 2017*
This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.
Dask Release 0.14.3 — *May 08, 2017*
This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.
Dask Development Log — *Apr 28, 2017*
This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation
Asynchronous Optimization Algorithms with Dask — *Apr 19, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
Dask and Pandas and XGBoost — *Mar 28, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Release 0.14.1 — *Mar 23, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
Developing Convex Optimization Algorithms in Dask — *Mar 22, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
Dask Release 0.14.0 — *Feb 27, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Feb 20, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Experiment with Dask and TensorFlow — *Feb 11, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Two Easy Ways to Use Scikit Learn and Dask — *Feb 07, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Jan 30, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Custom Parallel Algorithms on a Cluster with Dask — *Jan 24, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Jan 18, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Distributed NumPy on a Cluster with Dask Arrays — *Jan 17, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Distributed Pandas on a Cluster with Dask DataFrames — *Jan 12, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Release 0.13.0 — *Jan 03, 2017*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Dec 24, 2016*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Dec 18, 2016*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Dec 12, 2016*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Development Log — *Dec 05, 2016*
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask Cluster Deployments — *Sep 22, 2016*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Dask and Celery — *Sep 13, 2016*
This post compares two Python distributed task processing systems, Dask.distributed and Celery.
Dask Distributed Release 1.13.0 — *Sep 12, 2016*
I’m pleased to announce a release of Dask’s distributed scheduler, dask.distributed, version 1.13.0.
Dask for Institutions — *Aug 16, 2016*
Dask and Scikit-Learn -- Model Parallelism — *Jul 12, 2016*
This post was written by Jim Crist. The original post lives at http://jcrist.github.io/dask-sklearn-part-1.html (with better styling)
Ad Hoc Distributed Random Forests — *Apr 20, 2016*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Fast Message Serialization — *Apr 14, 2016*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Distributed Dask Arrays — *Feb 26, 2016*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Pandas on HDFS with Dask Dataframes — *Feb 22, 2016*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Introducing Dask distributed — *Feb 17, 2016*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Dask is one year old — *Dec 21, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Distributed Prototype — *Oct 09, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Caching — *Aug 03, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Custom Parallel Workflows — *Jul 23, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Write Complex Parallel Algorithms — *Jun 26, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Distributed Scheduling — *Jun 23, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
State of Dask — *May 19, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core DataFrames — *Mar 11, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays -- Dask + Toolz = Bag — *Feb 17, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays -- Slicing and Stacking — *Feb 13, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays -- Spilling to Disk — *Jan 16, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays -- Benchmark MatMul — *Jan 14, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays -- Multi-core Scheduling — *Jan 06, 2015*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays -- Frontend — *Dec 30, 2014*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Towards Out-of-core ND-Arrays — *Dec 27, 2014*
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project