<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts tagged dask-ml</title>
  <updated>2026-03-05T15:05:26.158516+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/tag/dask-ml/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2020/08/06/ray-tune/</id>
    <title>Comparing Dask-ML and Ray Tune's Model Selection Algorithms</title>
    <updated>2020-08-06T00:00:00+00:00</updated>
    <author>
      <name>&lt;a href="https://stsievert.com"&gt;Scott Sievert&lt;/a&gt; (University of Wisconsin–Madison)</name>
    </author>
    <content type="html">&lt;p&gt;Hyperparameter optimization is the process of deducing model parameters that
can’t be learned from data. This process is often time- and resource-consuming,
especially in the context of deep learning. A good description of this process
can be found at “&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/grid_search.html"&gt;Tuning the hyper-parameters of an estimator&lt;/a&gt;,” and
the issues that arise are concisely summarized in Dask-ML’s documentation of
“&lt;a class="reference external" href="https://ml.dask.org/hyper-parameter-search.html"&gt;Hyper Parameter Searches&lt;/a&gt;.”&lt;/p&gt;
&lt;p&gt;There’s a host of libraries and frameworks out there to address this problem.
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/grid_search.html"&gt;Scikit-Learn’s module&lt;/a&gt; has been mirrored &lt;a class="reference external" href="https://ml.dask.org/hyper-parameter-search.html"&gt;in Dask-ML&lt;/a&gt; and
&lt;a class="reference external" href="https://automl.github.io/auto-sklearn/master/"&gt;auto-sklearn&lt;/a&gt;, both of which offer advanced hyperparameter optimization
techniques. Other implementations that don’t follow the Scikit-Learn interface
include &lt;a class="reference external" href="https://docs.ray.io/en/master/tune.html"&gt;Ray Tune&lt;/a&gt;, &lt;a class="reference external" href="https://www.automl.org/"&gt;AutoML&lt;/a&gt; and &lt;a class="reference external" href="https://medium.com/optuna/optuna-supports-hyperband-93b0cae1a137"&gt;Optuna&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://docs.ray.io"&gt;Ray&lt;/a&gt; recently provided a wrapper to &lt;a class="reference external" href="https://docs.ray.io/en/master/tune.html"&gt;Ray Tune&lt;/a&gt; that mirrors the Scikit-Learn
API called tune-sklearn (&lt;a class="reference external" href="https://docs.ray.io/en/master/tune/api_docs/sklearn.html"&gt;docs&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/ray-project/tune-sklearn"&gt;source&lt;/a&gt;). &lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;The introduction&lt;/a&gt; of this library
states the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Cutting edge hyperparameter tuning techniques (Bayesian optimization, early
stopping, distributed execution) can provide significant speedups over grid
search and random search.&lt;/p&gt;
&lt;p&gt;However, the machine learning ecosystem is missing a solution that provides
users with the ability to leverage these new algorithms while allowing users
to stay within the Scikit-Learn API. In this blog post, we introduce
tune-sklearn [Ray’s tuning library] to bridge this gap. Tune-sklearn is a
drop-in replacement for Scikit-Learn’s model selection module with
state-of-the-art optimization features.&lt;/p&gt;
&lt;p&gt;—&lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;GridSearchCV 2.0 — New and Improved&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;This claim is inaccurate: for over a year Dask-ML has provided access to
“cutting edge hyperparameter tuning techniques” with a Scikit-Learn compatible
API. To correct their statement, let’s look at each of the features that Ray’s
tune-sklearn provides, and compare them to Dask-ML:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;Here’s what [Ray’s] tune-sklearn has to offer:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency with Scikit-Learn API&lt;/strong&gt; …&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern hyperparameter tuning techniques&lt;/strong&gt; …&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework support&lt;/strong&gt; …&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale up&lt;/strong&gt; … [to] multiple cores and even multiple machines.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;[Ray’s] Tune-sklearn is also &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Dask-ML’s model selection module has every one of the features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency with Scikit-Learn API:&lt;/strong&gt; Dask-ML’s model selection API
mirrors the Scikit-Learn model selection API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern hyperparameter tuning techniques:&lt;/strong&gt; Dask-ML offers state-of-the-art
hyperparameter tuning techniques.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework support:&lt;/strong&gt; Dask-ML model selection supports many libraries
including Scikit-Learn, PyTorch, Keras, LightGBM and XGBoost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale up:&lt;/strong&gt; Dask-ML supports distributed tuning (how could it not?) and
larger-than-memory datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dask-ML is also &lt;strong&gt;fast.&lt;/strong&gt; In “&lt;a class="reference internal" href="#speed"&gt;&lt;span class="xref myst"&gt;Speed&lt;/span&gt;&lt;/a&gt;” we show a benchmark between
Dask-ML, Ray and Scikit-Learn:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2020-model-selection/n_workers=8.png" width="450px"
 /&gt;&lt;/p&gt;
&lt;p&gt;Only time-to-solution is relevant; all of these methods produce similar model
scores. See “&lt;a class="reference internal" href="#speed"&gt;&lt;span class="xref myst"&gt;Speed&lt;/span&gt;&lt;/a&gt;” for details.&lt;/p&gt;
&lt;p&gt;Now, let’s walk through the details on how to use Dask-ML to obtain the 5
features above.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 95)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="consistency-with-the-scikit-learn-api"&gt;

&lt;p&gt;&lt;em&gt;Dask-ML is consistent with the Scikit-Learn API.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here’s how to use Scikit-Learn’s, Dask-ML’s and Ray’s tune-sklearn
hyperparameter optimization:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;## Trimmed example; see appendix for more detail&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tune_sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The definitions of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;model&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;params&lt;/span&gt;&lt;/code&gt; follow the normal Scikit-Learn
definitions as detailed in the &lt;a class="reference internal" href="#full-example-usage"&gt;&lt;span class="xref myst"&gt;appendix&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Clearly, both Dask-ML and Ray’s tune-sklearn are Scikit-Learn compatible. Now
let’s focus on how each search performs and how it’s configured.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 126)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="modern-hyperparameter-tuning-techniques"&gt;
&lt;h1&gt;Modern hyperparameter tuning techniques&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Dask-ML offers state-of-the-art hyperparameter tuning techniques
in a Scikit-Learn interface.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;The introduction&lt;/a&gt; of Ray’s tune-sklearn made this claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;tune-sklearn is the only
Scikit-Learn interface that allows you to easily leverage Bayesian
Optimization, HyperBand and other optimization techniques by simply toggling a few parameters.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;The state-of-the-art in hyperparameter optimization is currently
“&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband&lt;/a&gt;.” Hyperband reduces the amount of computation
required with a &lt;em&gt;principled&lt;/em&gt; early stopping scheme; past that, it’s the same as
Scikit-Learn’s popular &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
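&lt;p&gt;To make the early stopping idea concrete, here is a minimal sketch of a single successive-halving bracket, the building block that Hyperband runs several times with different trade-offs. It is not Dask-ML’s or Ray’s implementation: &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;toy_score&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;successive_halving&lt;/span&gt;&lt;/code&gt; are hypothetical names, and the scoring function is an assumption made up for illustration.&lt;/p&gt;

```python
import random


def toy_score(params, epochs):
    """Hypothetical validation score after `epochs` passes through the data."""
    quality = 1.0 - abs(params["alpha"] - 0.1)  # best score this config can reach
    return quality * epochs / (epochs + 1.0)    # training helps, with diminishing returns


def successive_halving(configs, eta=3):
    """Keep the top 1/eta of the models each round; train survivors eta times longer."""
    survivors = list(configs)
    epochs, trained, cost = 1, 0, 0
    while len(survivors) not in (0, 1):
        cost += (epochs - trained) * len(survivors)  # pay only for the additional epochs
        trained = epochs
        ranked = sorted(survivors, key=lambda p: toy_score(p, epochs), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta
    return survivors[0], cost, trained


rng = random.Random(0)
configs = [{"alpha": rng.uniform(0.0, 1.0)} for _ in range(81)]
best, cost, max_epochs = successive_halving(configs)
random_search_cost = len(configs) * max_epochs  # every model trained for max_epochs
print(best, cost, random_search_cost)
```

&lt;p&gt;With 81 sampled configurations and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;eta=3&lt;/span&gt;&lt;/code&gt;, this bracket spends 243 model-epochs, compared with 2,187 if every model were trained for the same 27 epochs as the final survivor.&lt;/p&gt;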
&lt;p&gt;Hyperband &lt;em&gt;works.&lt;/em&gt; As such, it’s very popular. After the introduction of
Hyperband in 2016 by Li et al., &lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;the paper&lt;/a&gt; has been cited
&lt;a class="reference external" href="https://scholar.google.com/scholar?cites=10473284631669296057&amp;amp;as_sdt=5,39&amp;amp;sciodt=0,39&amp;amp;hl=en"&gt;over 470 times&lt;/a&gt; and has been implemented in many different libraries
including &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html#dask_ml.model_selection.HyperbandSearchCV"&gt;Dask-ML&lt;/a&gt;, &lt;a class="reference external" href="https://docs.ray.io/en/master/tune/api_docs/schedulers.html#asha-tune-schedulers-ashascheduler"&gt;Ray Tune&lt;/a&gt;, &lt;a class="reference external" href="https://keras-team.github.io/keras-tuner/documentation/tuners/#hyperband-class"&gt;Keras Tuner&lt;/a&gt;, &lt;a class="reference external" href="https://medium.com/optuna/optuna-supports-hyperband-93b0cae1a137"&gt;Optuna&lt;/a&gt;,
&lt;a class="reference external" href="https://www.automl.org/"&gt;AutoML&lt;/a&gt;,&lt;a class="footnote-reference brackets" href="#automl" id="id1" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;1&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and &lt;a class="reference external" href="https://nni.readthedocs.io/en/latest/Tuner/HyperbandAdvisor.html"&gt;Microsoft’s NNI&lt;/a&gt;. The original paper shows a
rather drastic improvement over all the relevant
implementations,&lt;a class="footnote-reference brackets" href="#hyperband-figs" id="id2" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;2&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and this drastic improvement persists in
follow-up works.&lt;a class="footnote-reference brackets" href="#follow-up" id="id3" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;3&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Some illustrative results from Hyperband are
below:&lt;/p&gt;
&lt;p&gt;&lt;img width="80%" src="/images/2020-model-selection/hyperband-fig-7-8.png"
 style="display: block; margin-left: auto; margin-right: auto;" /&gt;&lt;/p&gt;
&lt;div style="max-width: 80%; word-wrap: break-word;" style="text-align: center;"&gt;
&lt;p&gt;&lt;sup&gt;All algorithms are configured to do the same amount of work except “random
2x” which does twice as much work. “hyperband (finite)” is similar to Dask-ML’s
default implementation, and “bracket s=4” is similar to Ray’s default
implementation. “random” is a random search. SMAC,&lt;a class="footnote-reference brackets" href="#smac" id="id4" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;4&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;
spearmint,&lt;a class="footnote-reference brackets" href="#spearmint" id="id5" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;5&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and TPE&lt;a class="footnote-reference brackets" href="#tpe" id="id6" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;6&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; are popular Bayesian algorithms. &lt;/sup&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Hyperband is undoubtedly a “cutting edge” hyperparameter optimization
technique. Dask-ML and Ray offer Scikit-Learn implementations of this algorithm
that rely on similar implementations, and Dask-ML’s implementation also has a
&lt;a class="reference external" href="https://ml.dask.org/hyper-parameter-search.html#hyperband-parameters-rule-of-thumb"&gt;rule of thumb&lt;/a&gt; for configuration. Both Dask-ML’s and Ray’s documentation
encourage the use of Hyperband.&lt;/p&gt;
&lt;p&gt;Ray does support using their Hyperband implementation on top of a technique
called Bayesian sampling. This changes the hyperparameter sampling scheme for
model initialization. This can be used in conjunction with Hyperband’s early
stopping scheme. Adding this option to Dask-ML’s Hyperband implementation is
future work for Dask-ML.&lt;/p&gt;
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 222)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="framework-support"&gt;
&lt;h1&gt;Framework support&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Dask-ML model selection supports many libraries including Scikit-Learn, PyTorch, Keras, LightGBM and XGBoost.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Ray’s tune-sklearn supports these frameworks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;tune-sklearn is used primarily for tuning
Scikit-Learn models, but it also supports and provides examples for many
other frameworks with Scikit-Learn wrappers such as Skorch (Pytorch),
KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost).&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Clearly, both Dask-ML and Ray support many of the same libraries.&lt;/p&gt;
&lt;p&gt;However, both Dask-ML and Ray have some qualifications. Certain libraries don’t
offer an implementation of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;,&lt;a class="footnote-reference brackets" href="#ray-pf" id="id7" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;7&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; so not all of the modern
hyperparameter optimization techniques can be offered. Here’s a table comparing
different libraries and their support in Dask-ML’s model selection and Ray’s
tune-sklearn:&lt;/p&gt;
&lt;div class="pst-scrollable-table-container"&gt;&lt;table class="table"&gt;
&lt;thead&gt;
&lt;tr class="row-odd"&gt;&lt;th class="head text-center"&gt;&lt;p&gt;Model Library&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Dask-ML support&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Ray support&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Dask-ML: early stopping?&lt;/p&gt;&lt;/th&gt;
&lt;th class="head text-center"&gt;&lt;p&gt;Ray: early stopping?&lt;/p&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class="row-even"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://scikit-learn.org/"&gt;Scikit-Learn&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔*&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔*&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-odd"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; (via &lt;a class="reference external" href="https://skorch.readthedocs.io/"&gt;Skorch&lt;/a&gt;)&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-even"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://keras.io/"&gt;Keras&lt;/a&gt; (via &lt;a class="reference external" href="https://github.com/adriangb/scikeras"&gt;SciKeras&lt;/a&gt;)&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔**&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔**&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-odd"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://lightgbm.readthedocs.io/"&gt;LightGBM&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="row-even"&gt;&lt;td class="text-center"&gt;&lt;p&gt;&lt;a class="reference external" href="https://xgboost.ai/"&gt;XGBoost&lt;/a&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;✔&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;td class="text-center"&gt;&lt;p&gt;❌&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;sup&gt;* Only for &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/computing.html#incremental-learning"&gt;the models that implement &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;&lt;/a&gt;.&lt;/sup&gt;&lt;br&gt;
&lt;sup&gt;** Thanks to work by the Dask developers around &lt;a class="reference external" href="https://github.com/adriangb/scikeras/issues/24"&gt;scikeras#24&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;By this measure, Dask-ML and Ray model selection have the same level of
framework support. Of course, Dask has tangential integration with LightGBM and
XGBoost through &lt;a class="reference external" href="https://ml.dask.org/xgboost.html"&gt;Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;xgboost&lt;/span&gt;&lt;/code&gt; module&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/dask/dask-lightgbm"&gt;dask-lightgbm&lt;/a&gt;.&lt;/p&gt;
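&lt;p&gt;The role of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; in the table above can be sketched in plain Python. Early stopping needs to pause a model, score it, and resume training later, which is only possible when the model can be trained incrementally. The classes and the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;supports_early_stopping&lt;/span&gt;&lt;/code&gt; helper below are hypothetical; this illustrates the duck-typing idea and is not necessarily how Dask-ML or Ray detect support.&lt;/p&gt;

```python
class IncrementalModel:
    """Hypothetical estimator that can be trained one batch at a time."""

    def __init__(self):
        self.n_seen = 0

    def partial_fit(self, X, y):
        self.n_seen += len(X)  # a real model would update its weights here
        return self

    def fit(self, X, y):
        self.n_seen = 0
        return self.partial_fit(X, y)


class BatchOnlyModel:
    """Hypothetical estimator that must see all of its training data at once."""

    def fit(self, X, y):
        return self


def supports_early_stopping(model):
    """Resuming training between scoring rounds requires a `partial_fit` method."""
    return callable(getattr(model, "partial_fit", None))


for model in (IncrementalModel(), BatchOnlyModel()):
    print(type(model).__name__, supports_early_stopping(model))
```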
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 272)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="scale-up"&gt;
&lt;h1&gt;Scale up&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Dask-ML supports distributed tuning (how could it not?), aka parallelization
across multiple machines/cores. In addition, it also supports
larger-than-memory data.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;div&gt;&lt;p&gt;[Ray’s] Tune-sklearn leverages Ray Tune, a library for distributed
hyperparameter tuning, to efficiently and transparently parallelize cross
validation on multiple cores and even multiple machines.&lt;/p&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;Naturally, Dask-ML also scales to multiple cores/machines because it relies on
Dask. Dask has wide support for &lt;a class="reference external" href="https://docs.dask.org/en/latest/setup.html"&gt;different deployment options&lt;/a&gt; that span
from your personal machine to supercomputers. Dask will very likely work on top
of any computing system you have available, including Kubernetes, SLURM, YARN
and Hadoop clusters as well as your personal machine.&lt;/p&gt;
&lt;p&gt;Dask-ML’s model selection also scales to larger-than-memory datasets, and is
thoroughly tested. Support for larger-than-memory data is untested in Ray, and
there are no examples detailing how to use Ray Tune with the distributed
dataset implementations in PyTorch/Keras.&lt;/p&gt;
&lt;p&gt;In addition, I have benchmarked Dask-ML’s model selection module to see how the
time-to-solution is affected by the number of Dask workers in “&lt;a class="reference external" href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt"&gt;Better and
faster hyperparameter optimization with Dask&lt;/a&gt;.” That is, how does the
time to reach a particular accuracy scale with the number of workers &lt;span class="math notranslate nohighlight"&gt;\(P\)&lt;/span&gt;? At
first, it’ll scale like &lt;span class="math notranslate nohighlight"&gt;\(1/P\)&lt;/span&gt; but with a large number of workers the serial
portion will dictate time to solution according to &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Amdahl%27s_law"&gt;Amdahl’s Law&lt;/a&gt;. Briefly, I
found Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; speedup started to saturate around 24
workers for a particular search.&lt;/p&gt;
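&lt;p&gt;As a rough illustration of that saturation, Amdahl’s Law gives a speedup of &lt;span class="math notranslate nohighlight"&gt;\(1 / (s + (1 - s)/P)\)&lt;/span&gt; when a fraction &lt;span class="math notranslate nohighlight"&gt;\(s\)&lt;/span&gt; of the work is serial. The sketch below evaluates that formula; the serial fraction of 2% is an assumption chosen for illustration, not a value measured from Dask-ML.&lt;/p&gt;

```python
def amdahl_speedup(n_workers, serial_fraction):
    """Speedup over one worker when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


serial_fraction = 0.02  # illustrative assumption, not a measured value
for n_workers in (1, 2, 4, 8, 16, 24, 32, 64):
    speedup = amdahl_speedup(n_workers, serial_fraction)
    print(f"P={n_workers:3d}  speedup={speedup:5.2f}  ideal={n_workers}")
```

&lt;p&gt;For small &lt;span class="math notranslate nohighlight"&gt;\(P\)&lt;/span&gt; the speedup stays close to ideal, and as &lt;span class="math notranslate nohighlight"&gt;\(P\)&lt;/span&gt; grows it approaches the ceiling of &lt;span class="math notranslate nohighlight"&gt;\(1/s\)&lt;/span&gt;.&lt;/p&gt;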
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 311)&lt;/p&gt;
&lt;p&gt;Document headings start at H3, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="speed"&gt;
&lt;h1&gt;Speed&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Both Dask-ML and Ray are much faster than Scikit-Learn.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Ray’s tune-sklearn runs some benchmarks in &lt;a class="reference external" href="https://medium.com/distributed-computing-with-ray/gridsearchcv-2-0-new-and-improved-ee56644cbabf"&gt;the introduction&lt;/a&gt; with the
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;GridSearchCV&lt;/span&gt;&lt;/code&gt; class found in Scikit-Learn and Dask-ML. A more fair benchmark
would be use Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; because it is almost the same as the
algorithm in Ray’s tune-sklearn. To be specific, I’m interested in comparing
these methods:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Scikit-Learn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;. This is a popular implementation, one
that I’ve bootstrapped myself with a custom model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dask-ML’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. This is an early stopping technique for
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ray tune-sklearn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;TuneSearchCV&lt;/span&gt;&lt;/code&gt;. This is a slightly different early
stopping technique than &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;’s.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each search is configured to perform the same task: sample 100 parameters and
train for no longer than 100 “epochs” or passes through the
data.&lt;a class="footnote-reference brackets" href="#random-search" id="id8" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;8&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Each estimator is configured as their respective
documentation suggests. Each search uses 8 workers with a single cross
validation split, and a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call takes one second with 50,000
examples. The complete setup can be found in &lt;a class="reference internal" href="#appendix"&gt;&lt;span class="xref myst"&gt;the appendix&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here’s how long each library takes to complete the same search:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2020-model-selection/n_workers=8.png" width="450px"
 /&gt;&lt;/p&gt;
&lt;p&gt;Notably, we didn’t improve the Dask-ML codebase for this benchmark, and ran the
code as it’s been for the last year.&lt;a class="footnote-reference brackets" href="#priority-impl" id="id9" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;9&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Regardless, it’s possible that
other artifacts from &lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks"&gt;biased benchmarks&lt;/a&gt; crept into this benchmark.&lt;/p&gt;
&lt;p&gt;Clearly, Ray and Dask-ML offer similar performance for 8 workers when compared
with Scikit-Learn. To Ray’s credit, their implementation is ~15% faster than
Dask-ML’s with 8 workers. We suspect that this performance boost comes from the
fact that Ray implements an asynchronous variant of Hyperband. We should
investigate this difference between Dask and Ray, and how each balances the
tradeoff between the number of FLOPs and time-to-solution. This will vary with the number
of workers: the asynchronous variant of Hyperband provides no benefit if used
with a single worker.&lt;/p&gt;
&lt;p&gt;Dask-ML reaches scores quickly in serial environments, or when the number of
workers is small. Dask-ML prioritizes fitting high scoring models: if there are
100 models to fit and only 4 workers available, Dask-ML selects the models that
have the highest score. This is most relevant in serial
environments;&lt;a class="footnote-reference brackets" href="#priority" id="id10" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;10&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; see “&lt;a class="reference external" href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt"&gt;Better and faster hyperparameter optimization
with Dask&lt;/a&gt;” for benchmarks. This feature is omitted from this
benchmark, which only focuses on time to solution.&lt;/p&gt;
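&lt;p&gt;The prioritization rule described above can be sketched as follows. This is not Dask-ML’s scheduler; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;pick_models_to_train&lt;/span&gt;&lt;/code&gt; is a hypothetical helper and the scores are made-up values, used only to show how the highest-scoring models would be chosen when there are more models than workers.&lt;/p&gt;

```python
import heapq


def pick_models_to_train(scores, n_workers):
    """Return the IDs of the `n_workers` models with the highest current scores."""
    return heapq.nlargest(n_workers, scores, key=scores.get)


# Hypothetical validation scores for 100 partially trained models.
scores = {f"model-{i}": (i * 37 % 100) / 100 for i in range(100)}
print(pick_models_to_train(scores, n_workers=4))
```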
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2020/08/06/ray-tune.md&lt;/span&gt;, line 377)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Dask-ML and Ray offer the same features for model selection: state-of-the-art
features with a Scikit-Learn compatible API, and both implementations have
fairly wide support for different frameworks and rely on backends that can
scale to many machines.&lt;/p&gt;
&lt;p&gt;In addition, the Ray implementation has provided motivation for further
development, specifically on the following items:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adding support for more libraries, including Keras&lt;/strong&gt; (&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/696"&gt;dask-ml#696&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/713"&gt;dask-ml#713&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/adriangb/scikeras/issues/24"&gt;scikeras#24&lt;/a&gt;). SciKeras is a Scikit-Learn wrapper for
Keras that (now) works with Dask-ML model selection because SciKeras models
implement the Scikit-Learn model API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better documenting the models that Dask-ML supports&lt;/strong&gt;
(&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/699"&gt;dask-ml#699&lt;/a&gt;). Dask-ML supports any model that implement the
Scikit-Learn interface, and there are wrappers for Keras, PyTorch, LightGBM
and XGBoost. Now, &lt;a class="reference external" href="https://ml.dask.org"&gt;Dask-ML’s documentation&lt;/a&gt; prominently highlights this
fact.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Ray implementation has also helped motivate and clarify future work.
Dask-ML should include the following implementations:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Bayesian sampling scheme for the Hyperband implementation&lt;/strong&gt; that’s
similar to Ray’s and BOHB’s (&lt;a class="reference external" href="https://github.com/dask/dask-ml/issues/697"&gt;dask-ml#697&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A configuration of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; that’s well-suited for
exploratory hyperparameter searches.&lt;/strong&gt; An initial implementation is in
&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/532"&gt;dask-ml#532&lt;/a&gt;, which should be benchmarked against Ray.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Luckily, all of these pieces of development are straightforward modifications
because the Dask-ML model selection framework is pretty flexible.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Thank you &lt;a class="reference external" href="https://github.com/TomAugspurger"&gt;Tom Augspurger&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/mrocklin"&gt;Matthew Rocklin&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/jsignell"&gt;Julia Signell&lt;/a&gt;, and &lt;a class="reference external" href="https://github.com/quasiben"&gt;Benjamin
Zaitlen&lt;/a&gt; for your feedback, suggestions and edits.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="appendix"&gt;
&lt;h1&gt;Appendix&lt;/h1&gt;
&lt;section id="benchmark-setup"&gt;
&lt;h2&gt;Benchmark setup&lt;/h2&gt;
&lt;p&gt;This is the complete setup for the benchmark between Dask-ML, Scikit-Learn and
Ray. Complete details can be found at
&lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s create a dummy model that takes 1 second for a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call with
50,000 examples. This is appropriate for this benchmark; we’re only interested
in the time required to finish the search, not how well the models do.
Scikit-Learn, Ray and Dask-ML have very similar methods of choosing
hyperparameters to evaluate; they differ in their early stopping techniques.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;scipy.stats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;benchmark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConstantFunction&lt;/span&gt;  &lt;span class="c1"&gt;# custom module&lt;/span&gt;

&lt;span class="c1"&gt;# This model sleeps for `latency * len(X)` seconds before&lt;/span&gt;
&lt;span class="c1"&gt;# reporting a score of `value`.&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConstantFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;50e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="c1"&gt;# This dummy dataset mirrors the MNIST dataset&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;60e3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
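&lt;p&gt;The &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;benchmark&lt;/span&gt;&lt;/code&gt; module is custom, so here is a rough sketch of what such a
dummy model might look like. It only assumes what is stated above (sleep for
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;latency&lt;/span&gt; &lt;span class="pre"&gt;*&lt;/span&gt; &lt;span class="pre"&gt;len(X)&lt;/span&gt;&lt;/code&gt; seconds in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;, then report a score of
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;value&lt;/span&gt;&lt;/code&gt;). The class name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;SleepyConstantModel&lt;/span&gt;&lt;/code&gt; and its details are hypothetical;
this is not the code used in the benchmark.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;import time


class SleepyConstantModel:
    """Hypothetical stand-in for the custom ConstantFunction model."""

    def __init__(self, value=0.0, latency=1 / 50e3):
        self.value = value      # the score this model always reports
        self.latency = latency  # seconds of simulated work per example
        self.n_calls_ = 0       # number of partial_fit calls so far

    def partial_fit(self, X, y=None, **kwargs):
        # Simulate a training cost that grows linearly with the batch size.
        time.sleep(self.latency * len(X))
        self.n_calls_ += 1
        return self

    def score(self, X, y=None):
        # The reported score ignores the data entirely.
        return self.value
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;A version usable in the searches below would also need the Scikit-Learn
parameter API (for example by subclassing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sklearn.base.BaseEstimator&lt;/span&gt;&lt;/code&gt;) so
that it can be cloned and have its parameters set.&lt;/p&gt;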
&lt;p&gt;This model will take 2 minutes to train for 100 epochs (aka passes through the
data). Details can be found at &lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;.&lt;/p&gt;
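&lt;p&gt;As a quick sanity check on that figure, the arithmetic can be written out.
This sketch assumes each epoch is one pass over all 60,000 examples and ignores
any scheduling overhead; the variable names are only illustrative.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;latency = 1 / 50e3        # seconds per example, as configured above
n_samples = int(60e3)     # examples in the dummy dataset
epochs = 100              # passes through the data

seconds_per_epoch = latency * n_samples
total_minutes = seconds_per_epoch * epochs / 60
print(round(seconds_per_epoch, 2), round(total_minutes, 2))  # 1.2 2.0
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;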
&lt;p&gt;Let’s configure our searches to use 8 workers with a single cross-validation
split:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShuffleSplit&lt;/span&gt;
&lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ShuffleSplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 20.88 minutes&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="n"&gt;dask_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggressiveness&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tune_sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;
&lt;span class="n"&gt;ray_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;early_stopping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dask_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2.93 minutes&lt;/span&gt;
&lt;span class="n"&gt;ray_search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2.49 minutes&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;section id="full-example-usage"&gt;
&lt;h2&gt;Full example usage&lt;/h2&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;scipy.stats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loguniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alpha&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;loguniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;l1_ratio&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;tune_sklearn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TuneSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;hr class="docutils" /&gt;
&lt;hr class="footnotes docutils" /&gt;
&lt;aside class="footnote-list brackets"&gt;
&lt;aside class="footnote brackets" id="automl" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id1"&gt;1&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Their implementation of Hyperband in &lt;a class="reference external" href="https://github.com/automl/HpBandSter"&gt;HpBandSter&lt;/a&gt; is included in &lt;a class="reference external" href="https://www.automl.org/wp-content/uploads/2018/09/chapter7-autonet.pdf"&gt;Auto-PyTorch&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/automl/BOAH"&gt;BOAH&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="hyperband-figs" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id2"&gt;2&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Figures 4, 7 and 8 in “&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization&lt;/a&gt;.”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="follow-up" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id3"&gt;3&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Figure 1 of &lt;a class="reference external" href="http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf"&gt;the BOHB paper&lt;/a&gt; and &lt;a class="reference external" href="https://arxiv.org/pdf/1801.01596.pdf"&gt;a paper&lt;/a&gt; from an augmented reality company.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="smac" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id4"&gt;4&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;SMAC is described in “&lt;a class="reference external" href="https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf"&gt;Sequential Model-Based Optimization for General Algorithm Configuration&lt;/a&gt;,” and is available &lt;a class="reference external" href="https://www.automl.org/automated-algorithm-design/algorithm-configuration/smac/"&gt;in AutoML&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="spearmint" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id5"&gt;5&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Spearmint is described in “&lt;a class="reference external" href="https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf"&gt;Practical Bayesian Optimization of Machine Learning Algorithms&lt;/a&gt;,” and is available in &lt;a class="reference external" href="https://github.com/HIPS/Spearmint"&gt;HIPS/spearmint&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="tpe" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id6"&gt;6&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;TPE is described in Section 4 of “&lt;a class="reference external" href="http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf"&gt;Algorithms for Hyperparameter Optimization&lt;/a&gt;,” and is available &lt;a class="reference external" href="http://hyperopt.github.io/hyperopt/"&gt;through Hyperopt&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="ray-pf" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id7"&gt;7&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;From &lt;a class="reference external" href="https://github.com/ray-project/tune-sklearn/blob/31f228e21ef632a89a74947252d8ad5323cbd043/README.md"&gt;Ray’s README.md&lt;/a&gt;: “If the estimator does not support &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;, a warning will be shown saying early stopping cannot be done and it will simply run the cross-validation on Ray’s parallel back-end.”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="random-search" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id8"&gt;8&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;I chose to benchmark random searches instead of grid searches because random searches produce better results: grid searches require estimating how important each parameter is. For more detail, see “&lt;a class="reference external" href="http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf"&gt;Random Search for Hyperparameter Optimization&lt;/a&gt;” by Bergstra and Bengio.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="priority-impl" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id9"&gt;9&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Despite a relevant implementation in &lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/527"&gt;dask-ml#527&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="priority" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id10"&gt;10&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Because priority is meaningless if there are an infinite number of workers.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="bohb-exps" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;11&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Details are in “&lt;a class="reference external" href="http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf"&gt;BOHB: Robust and Efficient Hyperparameter Optimization at Scale&lt;/a&gt;.”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="nlp-future" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;12&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Future work is combining this with Dask-ML’s Hyperband implementation.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="openai" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;13&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Computing &lt;a class="reference external" href="https://en.wikipedia.org/wiki/N-gram"&gt;n-grams&lt;/a&gt; requires a ton of memory and computation. For OpenAI, NLP preprocessing took 8 GPU-months! (&lt;a class="reference external" href="https://openai.com/blog/language-unsupervised/#drawbacks"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="stopping" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;14&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Hyperband’s theory answers “how many models should be stopped?” and “when should they be stopped?”&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="bohb-parallel" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;15&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;In Section 4.2 of &lt;a class="reference external" href="http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf"&gt;their paper&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;/aside&gt;
</content>
    <link href="https://blog.dask.org/2020/08/06/ray-tune/"/>
    <summary>Hyperparameter optimization is the process of deducing model parameters that
can’t be learned from data. This process is often time- and resource-consuming,
especially in the context of deep learning. A good description of this process
can be found at “Tuning the hyper-parameters of an estimator,” and
the issues that arise are concisely summarized in Dask-ML’s documentation of
“Hyper Parameter Searches.”</summary>
    <category term="dask" label="dask"/>
    <category term="dask-ml" label="dask-ml"/>
    <category term="machine-learning" label="machine-learning"/>
    <category term="ray" label="ray"/>
    <published>2020-08-06T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://blog.dask.org/2019/09/30/dask-hyperparam-opt/</id>
    <title>Better and faster hyperparameter optimization with Dask</title>
    <updated>2019-09-30T00:00:00+00:00</updated>
    <author>
      <name>&lt;a href="http://stsievert.com"&gt;Scott Sievert&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;&lt;em&gt;&lt;a class="reference external" href="https://stsievert.com"&gt;Scott Sievert&lt;/a&gt; wrote this post. The original post lives at
&lt;a class="reference external" href="https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/"&gt;https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/&lt;/a&gt; with better
styling. This work is supported by Anaconda, Inc.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt;’s machine learning package, &lt;a class="reference external" href="https://ml.dask.org/"&gt;Dask-ML&lt;/a&gt;, now implements Hyperband, an
advanced “hyperparameter optimization” algorithm that performs rather well.
This post will&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;describe “hyperparameter optimization”, a common problem in machine learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;describe Hyperband’s benefits and why it works&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;show how to use Hyperband via example alongside performance comparisons&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, I’ll walk through a practical example and highlight key portions
of the paper “&lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;Better and faster hyperparameter optimization with Dask&lt;/a&gt;”, which is also
summarized in a &lt;a class="reference external" href="https://www.youtube.com/watch?v=x67K9FiPFBQ"&gt;~25 minute SciPy 2019 talk&lt;/a&gt;.&lt;/p&gt;
&lt;!--More--&gt;
&lt;section id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;

&lt;p&gt;Machine learning requires data, an untrained model and “hyperparameters”,
which are parameters chosen before training begins that help with cohesion
between the model and the data. The user needs to specify values
for these hyperparameters in order to use the model. A good example is
adapting ridge regression or LASSO to the amount of noise in the
data with the regularization parameter.&lt;a class="footnote-reference brackets" href="#alpha" id="id1" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;1&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Model performance strongly depends on the hyperparameters provided. A fairly
complex example is the visualization tool &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html"&gt;t-SNE&lt;/a&gt;. This tool requires (at least) three
hyperparameters, and its performance depends radically on them. In fact, the
first section in “&lt;a class="reference external" href="https://distill.pub/2016/misread-tsne/"&gt;How to Use t-SNE
Effectively&lt;/a&gt;” is titled “Those hyperparameters really matter”.&lt;/p&gt;
&lt;p&gt;Finding good values for these hyperparameters is critical and has an entire
Scikit-learn documentation page, “&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/grid_search.html"&gt;Tuning the hyperparameters of an
estimator&lt;/a&gt;.” Briefly, finding decent values of hyperparameters
is difficult and requires guessing or searching.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How can these hyperparameters be found quickly and efficiently with an
advanced task scheduler like Dask?&lt;/strong&gt; Parallelism will pose some challenges, but
the Dask architecture enables some advanced algorithms.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this post presumes knowledge of Dask basics. This material is covered in
Dask’s documentation on &lt;a class="reference external" href="https://docs.dask.org/en/latest/why.html"&gt;Why Dask?&lt;/a&gt;, a ~15 minute &lt;a class="reference external" href="https://www.youtube.com/watch?v=ods97a5Pzw0"&gt;video introduction to
Dask&lt;/a&gt;, a &lt;a class="reference external" href="https://www.youtube.com/watch?v=tQBovBvSDvA"&gt;video introduction to Dask-ML&lt;/a&gt; and &lt;a class="reference external" href="https://stsievert.com/blog/2016/09/09/dask-cluster/"&gt;a
blog post I wrote&lt;/a&gt; on my first use of Dask.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="contributions"&gt;
&lt;h1&gt;Contributions&lt;/h1&gt;
&lt;p&gt;Dask-ML can quickly find high-performing hyperparameters. I will back this
claim with intuition and experimental evidence.&lt;/p&gt;
&lt;p&gt;Specifically, this is because Dask-ML now implements an algorithm
introduced by Li et al. in “&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband: A novel
bandit-based approach to hyperparameter optimization&lt;/a&gt;”.
Pairing Dask and Hyperband enables some exciting new performance opportunities,
especially because Hyperband has a simple implementation and Dask is an
advanced task scheduler.&lt;a class="footnote-reference brackets" href="#first" id="id2" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;2&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let’s go
through the basics of Hyperband then illustrate its use and performance with
an example.
This will highlight some key points of &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;the corresponding paper&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="hyperband-basics"&gt;
&lt;h1&gt;Hyperband basics&lt;/h1&gt;
&lt;p&gt;The motivation for Hyperband is to find high-performing hyperparameters with minimal
training. Given this goal, it makes sense to spend more time training
high-performing models: why spend more training time on a model if it’s done poorly in the past?&lt;/p&gt;
&lt;p&gt;One method to spend more time on high performing models is to initialize many
models, start training all of them, and then stop training low performing models
before training is finished. That’s what Hyperband does. At the most basic
level, Hyperband is a (principled) early-stopping scheme for
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"&gt;RandomizedSearchCV&lt;/a&gt;.&lt;/p&gt;
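&lt;p&gt;As a rough illustration of that early-stopping idea (a toy sketch, not Dask-ML’s
implementation), the loop below trains every candidate a little, keeps the
best-scoring fraction, and repeats with a larger budget. The names
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;successive_halving&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;toy_score&lt;/span&gt;&lt;/code&gt; are hypothetical, and the scoring
function is made up so the example runs without any data:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
def successive_halving(configs, train_and_score, n_rounds, eta=3):
    """Train all configs briefly, keep the top 1/eta of them, and repeat."""
    survivors = list(configs)
    budget = 1  # training budget given to each surviving model this round
    for _ in range(n_rounds):
        scored = [(train_and_score(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        n_keep = max(1, len(scored) // eta)
        survivors = [c for _, c in scored[:n_keep]]
        budget *= eta  # survivors earn more training
    return survivors[0]


def toy_score(learning_rate, budget):
    # Made-up score: best near learning_rate=0.3, improving with more training
    quality = 1 - abs(learning_rate - 0.3)
    return quality * (1 - 1 / (budget + 1))


candidates = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9
best = successive_halving(candidates, toy_score, n_rounds=3)
print(best)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most of the total budget ends up on the few candidates that survive every round,
which is the behavior the paragraph above describes.&lt;/p&gt;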
&lt;p&gt;Deciding when to stop the training of models depends on how strongly
the training data affects the score. There are two extremes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;when only the training data matter&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;i.e., when the hyperparameters don’t influence the score at all&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;when only the hyperparameters matter&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;i.e., when the training data don’t influence the score at all&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Hyperband balances these two extremes by sweeping over how frequently
models are stopped. This sweep allows a mathematical proof that Hyperband
will find the best model possible with minimal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;
calls&lt;a class="footnote-reference brackets" href="#qual" id="id3" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;3&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Hyperband has significant parallelism because it has two “embarrassingly
parallel” for-loops – Dask can exploit this. Hyperband has been implemented
in Dask, specifically in Dask’s machine learning library Dask-ML.&lt;/p&gt;
&lt;p&gt;How well does it perform? Let’s illustrate via example. Some setup is required
before the performance comparison in &lt;em&gt;&lt;a class="reference internal" href="#performance"&gt;&lt;span class="xref myst"&gt;Performance&lt;/span&gt;&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="example"&gt;
&lt;h1&gt;Example&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Note: want to try &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; out yourself? Dask has &lt;a class="reference external" href="https://examples.dask.org/machine-learning/hyperparam-opt.html"&gt;an example use&lt;/a&gt;.
It can even be run in-browser!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’ll illustrate with a synthetic example. Let’s build a dataset with 4 classes:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;experiment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_circles&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_circles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/dataset.png"
style="max-width: 100%;"
width="200px" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this content is pulled from
&lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;, with slight modifications.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Let’s build a fully connected neural net with 24 neurons for classification:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.neural_network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MLPClassifier&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLPClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Building the neural net with PyTorch is also possible&lt;a class="footnote-reference brackets" href="#skorch" id="id4" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;4&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; (and what I used in development).&lt;/p&gt;
&lt;p&gt;This neural net’s behavior is dictated by 7 hyperparameters. Only one controls
the model architecture (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hidden_layer_sizes&lt;/span&gt;&lt;/code&gt;, the number of
neurons in each layer). The rest control finding the best model of that
architecture. Details on the hyperparameters are in the
&lt;em&gt;&lt;a class="reference internal" href="#appendix"&gt;&lt;span class="xref myst"&gt;Appendix&lt;/span&gt;&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# details in appendix&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;dict_keys([&amp;#39;hidden_layer_sizes&amp;#39;, &amp;#39;alpha&amp;#39;, &amp;#39;batch_size&amp;#39;, &amp;#39;learning_rate&amp;#39;,&lt;/span&gt;
&lt;span class="go"&gt;           &amp;#39;learning_rate_init&amp;#39;, &amp;#39;power_t&amp;#39;, &amp;#39;momentum&amp;#39;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hidden_layer_sizes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# always 24 neurons&lt;/span&gt;
&lt;span class="go"&gt;[(24, ), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I chose these hyperparameters to have a complex search space that mimics the
searches performed for most neural networks. These searches typically involve
hyperparameters like “dropout”, “learning rate”, “momentum” and “weight
decay”.&lt;a class="footnote-reference brackets" href="#user-facing" id="id5" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;5&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;
End users don’t care about hyperparameters like these; they don’t change the
model architecture, they only affect finding the best model of a particular architecture.&lt;/p&gt;
&lt;p&gt;How can high performing hyperparameter values be found quickly?&lt;/p&gt;
&lt;/section&gt;
&lt;section id="finding-the-best-parameters"&gt;
&lt;h1&gt;Finding the best parameters&lt;/h1&gt;
&lt;p&gt;First, let’s look at the parameters required for Dask-ML’s implementation
of Hyperband (which is in the class &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;).&lt;/p&gt;
&lt;section id="hyperband-parameters-rule-of-thumb"&gt;
&lt;h2&gt;Hyperband parameters: rule-of-thumb&lt;/h2&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; has two inputs:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;max_iter&lt;/span&gt;&lt;/code&gt;, which determines how many times to call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the chunk size of the Dask array, which determines how much data each
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call receives.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These fall out pretty naturally once it’s known how long to train the best
model and very approximately how many parameters to sample:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;n_examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 50 passes through dataset for best model&lt;/span&gt;
&lt;span class="n"&gt;n_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;299&lt;/span&gt;  &lt;span class="c1"&gt;# sample about 300 parameters&lt;/span&gt;

&lt;span class="c1"&gt;# inputs to hyperband&lt;/span&gt;
&lt;span class="n"&gt;max_iter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;
&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_examples&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The inputs to this rule-of-thumb are exactly what the user cares about:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;a measure of how complex the search space is (via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_params&lt;/span&gt;&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;how long to train the best model (via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notably, there’s no tradeoff between &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_params&lt;/span&gt;&lt;/code&gt; like with
Scikit-learn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt; because &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; is only for &lt;em&gt;some&lt;/em&gt;
models, not for &lt;em&gt;all&lt;/em&gt; models. There are more details on this
rule-of-thumb in the “Notes” section of the &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html#dask_ml.model_selection.HyperbandSearchCV"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;
docs&lt;/a&gt;.&lt;/p&gt;
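&lt;p&gt;To make the arithmetic concrete, here is the same rule-of-thumb wrapped in a small
function. The helper name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hyperband_inputs&lt;/span&gt;&lt;/code&gt; and the training-set size of 50,000
examples are assumptions made for illustration; they are not taken from the experiment:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
def hyperband_inputs(n_train, n_passes, n_params):
    """Rule-of-thumb: turn what the user cares about into Hyperband's inputs."""
    n_examples = n_passes * n_train  # examples seen by the longest-trained model
    max_iter = n_params
    chunk_size = n_examples // n_params
    return max_iter, chunk_size


# Assumed sizes, for illustration only: 50,000 training examples, 50 passes
max_iter, chunk_size = hyperband_inputs(n_train=50_000, n_passes=50, n_params=299)
print(max_iter, chunk_size)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With these assumed sizes each &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call would receive a chunk of 8,361
examples, so a model trained for all 299 calls sees roughly 50 passes through the data.&lt;/p&gt;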
&lt;p&gt;With these inputs a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; object can easily be created.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="finding-the-best-performing-hyperparameters"&gt;
&lt;h2&gt;Finding the best performing hyperparameters&lt;/h2&gt;
&lt;p&gt;This model selection algorithm Hyperband is implemented in the class
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Let’s create an instance of that class:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggressiveness&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness&lt;/span&gt;&lt;/code&gt; defaults to 3. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness=4&lt;/span&gt;&lt;/code&gt; is chosen because this is an
&lt;em&gt;initial&lt;/em&gt; search; I know nothing about this search space, so the search
should be more aggressive in culling bad models.&lt;/p&gt;
&lt;p&gt;Hyperband hides some details from the user (which enables the mathematical
guarantees), specifically the details on the amount of training and
the number of models created. These details are available in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;metadata&lt;/span&gt;&lt;/code&gt;
attribute:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;n_models&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;378&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;partial_fit_calls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;5721&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
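&lt;p&gt;The model count can be checked by hand. The sketch below follows the bracket
arithmetic from the Hyperband paper (each bracket &lt;em&gt;s&lt;/em&gt; starts with about
(s_max + 1) * eta**s / (s + 1) models), with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;max_iter&lt;/span&gt;&lt;/code&gt; playing the role of R and
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness&lt;/span&gt;&lt;/code&gt; the role of eta. The helper name &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hyperband_brackets&lt;/span&gt;&lt;/code&gt; is hypothetical,
and this is an illustration of the formula rather than Dask-ML’s internal code:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
def hyperband_brackets(max_iter, aggressiveness=3):
    """Return {bracket: (n_models, initial partial_fit calls)}, per Li et al."""
    R, eta = max_iter, aggressiveness
    s_max = 0  # largest s such that eta**s is no larger than R
    while R // eta ** (s_max + 1):
        s_max += 1
    brackets = {}
    for s in range(s_max, -1, -1):
        # n = ceil((s_max + 1) * eta**s / (s + 1)), using integer arithmetic
        n_models = -(-(s_max + 1) * eta ** s // (s + 1))
        n_calls = R // eta ** s  # R * eta**-s, rounded down
        brackets[s] = (n_models, n_calls)
    return brackets


brackets = hyperband_brackets(max_iter=299, aggressiveness=4)
n_models = sum(n for n, _ in brackets.values())
print(brackets)
print(n_models)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;max_iter=299&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness=4&lt;/span&gt;&lt;/code&gt; the five brackets start with 256, 80, 27, 10
and 5 models, which sums to the 378 models reported in the metadata.&lt;/p&gt;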
&lt;p&gt;Now that we have some idea on how long the computation will take, let’s ask it
to find the best set of hyperparameters:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The dashboard will be active during this time&lt;a class="footnote-reference brackets" href="#dashboard" id="id6" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;6&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;
&lt;video width="600" style="max-width: 100%;" autoplay loop controls &gt;
  &lt;source src="/images/2019-hyperband/dashboard-compress.mp4" type="video/mp4" &gt;
  Your browser does not support the video tag.
&lt;/video&gt;
&lt;/p&gt;
&lt;p&gt;How well do these hyperparameters perform?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;span class="go"&gt;0.9019221418447483&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; mirrors Scikit-learn’s API for &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"&gt;RandomizedSearchCV&lt;/a&gt;, so it
has access to all the expected attributes and methods:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;span class="go"&gt;{&amp;quot;batch_size&amp;quot;: 64, &amp;quot;hidden_layer_sizes&amp;quot;: [6, 6, 6, 6], ...}&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;0.8989070100111217&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;
&lt;span class="go"&gt;MLPClassifier(...)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Details on the attributes and methods are in the &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html"&gt;HyperbandSearchCV
documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="performance"&gt;
&lt;h1&gt;Performance&lt;/h1&gt;
&lt;!--
Plot 1: how well does it do?
Plot 2: how does this scale?
Plot 3: what opportunities does Dask enable?
--&gt;
&lt;p&gt;I ran this 200 times on my personal laptop with 4 cores.
Let’s look at the distribution of final validation scores:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/final-acc.svg"
style="max-width: 100%;"
 width="400px"/&gt;&lt;/p&gt;
&lt;p&gt;The “passive” comparison is really &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt; configured so it takes
an equal amount of work as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Let’s see how this does over
time:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/val-acc.svg"
style="max-width: 100%;"
 width="400px"/&gt;&lt;/p&gt;
&lt;p&gt;This graph shows the mean score over the 200 runs with the solid line, and the
shaded region represents the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Interquartile_range"&gt;interquartile range&lt;/a&gt;. The dotted green
line indicates the data required to train 4 models to completion.
“Passes through the dataset” is a good proxy
for “time to solution” because there are only 4 workers.&lt;/p&gt;
&lt;p&gt;This graph shows that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; will find parameters at least 3 times
quicker than &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;section id="dask-opportunities"&gt;
&lt;h2&gt;Dask opportunities&lt;/h2&gt;
&lt;p&gt;What opportunities does combining Hyperband and Dask create?
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; has a lot of internal parallelism and Dask is an advanced task
scheduler.&lt;/p&gt;
&lt;p&gt;The most obvious opportunity involves job prioritization. Hyperband fits many
models in parallel and Dask might not have that many
workers available. This means some jobs have to wait for other jobs
to finish. Of course, Dask can prioritize jobs&lt;a class="footnote-reference brackets" href="#prior" id="id7" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;7&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and choose which models
to fit first.&lt;/p&gt;
&lt;p&gt;Let’s assign the priority for fitting a certain model to be the model’s most
recent score. How does this prioritization scheme influence the score? Let’s
compare the prioritization schemes in
a single run of the 200 above:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/priority.svg"
style="max-width: 100%;"
     width="400px" /&gt;&lt;/p&gt;
&lt;p&gt;These two lines are the same in every way except for
the prioritization scheme.
This graph compares the “high scores” prioritization scheme and Dask’s
default prioritization scheme (“fifo”).&lt;/p&gt;
&lt;p&gt;This graph is certainly helped by the fact that it is run with only 4 workers.
Job priority does not matter if every job can be run right away (there’s
nothing to assign priority to!).&lt;/p&gt;
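&lt;p&gt;The difference between the two schemes can be sketched with a plain priority queue.
This toy example uses only the standard library and does not touch Dask’s scheduler;
the function &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;run_order&lt;/span&gt;&lt;/code&gt;, the model names and the scores are all made up for illustration:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;
import heapq


def run_order(jobs, scheme):
    """Return model names in the order a single free worker would fit them."""
    if scheme == "fifo":
        return [name for name, _ in jobs]
    # "high scores": the model with the highest most-recent score goes first.
    # heapq is a min-heap, so scores are negated; the index breaks ties.
    heap = [(-score, i, name) for i, (name, score) in enumerate(jobs)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(jobs))]


# (model name, most recent validation score), in submission order
jobs = [("model-a", 0.61), ("model-b", 0.83), ("model-c", 0.47), ("model-d", 0.78)]
fifo_order = run_order(jobs, "fifo")
high_scores_order = run_order(jobs, "high scores")
print(fifo_order)
print(high_scores_order)
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With more free workers than waiting jobs, both orderings would start every job
immediately, which is why the effect is most visible on a small cluster.&lt;/p&gt;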
&lt;/section&gt;
&lt;section id="amenability-to-parallelism"&gt;
&lt;h2&gt;Amenability to parallelism&lt;/h2&gt;
&lt;p&gt;How does Hyperband scale with the number of workers?&lt;/p&gt;
&lt;p&gt;I ran a separate experiment to measure this. This experiment is described more in the &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;corresponding
paper&lt;/a&gt;, but the relevant difference is that a &lt;a class="reference external" href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; neural network is used
through &lt;a class="reference external" href="https://skorch.readthedocs.io/en/stable/"&gt;skorch&lt;/a&gt; instead of Scikit-learn’s MLPClassifier.&lt;/p&gt;
&lt;p&gt;I ran the &lt;em&gt;same&lt;/em&gt; experiment with a different number of Dask
workers.&lt;a class="footnote-reference brackets" href="#same" id="id8" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;8&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Here’s how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; scales:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/image-denoising/scaling-patience.svg" width="400px"
style="max-width: 100%;"
/&gt;&lt;/p&gt;
&lt;p&gt;Training one model to completion requires 243 seconds (which is marked by the
white line). This is a comparison with &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt;, which stops training models
if their scores aren’t increasing enough. Functionally, this is very useful
because the user might accidentally specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; to be too large.&lt;/p&gt;
&lt;p&gt;It looks like the speedups start to saturate somewhere
between 16 and 24 workers, at least for this example.
Of course, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt; doesn’t work as well for a large number of
workers.&lt;a class="footnote-reference brackets" href="#scale-worker" id="id9" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;9&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future work&lt;/h1&gt;
&lt;p&gt;There are some ongoing pull requests to improve &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. The most
significant of these involves tweaking some Hyperband internals so &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;
works better with initial or very exploratory searches (&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/532"&gt;dask/dask-ml #532&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The biggest improvement I see is treating &lt;em&gt;dataset size&lt;/em&gt; as the scarce resource
that needs to be preserved instead of &lt;em&gt;training time&lt;/em&gt;. This would allow
Hyperband to work with any model, instead of only models that implement
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Serialization is an important part of the distributed Hyperband implementation
in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Scikit-learn and PyTorch can easily handle this because
they support the Pickle protocol&lt;a class="footnote-reference brackets" href="#pickle-post" id="id10" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;10&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;, but
Keras/Tensorflow/MXNet present challenges. The use of &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; could
be increased by resolving this issue.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="appendix"&gt;
&lt;h1&gt;Appendix&lt;/h1&gt;
&lt;p&gt;I chose to tune 7 hyperparameters, which are&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hidden_layer_sizes&lt;/span&gt;&lt;/code&gt;, which controls the number of neurons in each
layer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;alpha&lt;/span&gt;&lt;/code&gt;, which controls the amount of regularization&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The other five hyperparameters control the optimizer that searches for the best neural network:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;batch_size&lt;/span&gt;&lt;/code&gt;, which controls the number of examples the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;optimizer&lt;/span&gt;&lt;/code&gt; uses to
approximate the gradient&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;learning_rate&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;learning_rate_init&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;power_t&lt;/span&gt;&lt;/code&gt;, which are basic
hyperparameters of the SGD optimizer I’ll be using&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;momentum&lt;/span&gt;&lt;/code&gt;, a more advanced hyperparameter for SGD with Nesterov’s momentum.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
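&lt;p&gt;The seven hyperparameters listed above can be written down as a search space with one entry per name. The sketch below does that with plain Python and draws one random configuration. The candidate values are assumptions chosen for illustration and are not the ranges used in this post. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;sample_config&lt;/span&gt;&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import random

# Illustrative search space for the seven hyperparameters listed above.
# The candidate values are assumptions for this sketch, not the post's ranges.
search_space = {
    "hidden_layer_sizes": [(24,), (12, 12), (6, 6, 6, 6)],
    "alpha": [10 ** e for e in (-6, -5, -4, -3)],
    "batch_size": [16, 32, 64, 128, 256, 512],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "learning_rate_init": [10 ** e for e in (-4, -3, -2)],
    "power_t": [0.1, 0.3, 0.5, 0.7, 0.9],
    "momentum": [0.0, 0.3, 0.6, 0.9],
}


def sample_config(space, rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)
config = sample_config(search_space, rng)
```

&lt;p&gt;A dictionary of this shape, mapping parameter names to candidate values, is the usual input format for Scikit-learn-style randomized searches.&lt;/p&gt;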
&lt;/section&gt;
&lt;hr class="footnotes docutils" /&gt;
&lt;aside class="footnote-list brackets"&gt;
&lt;aside class="footnote brackets" id="alpha" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id1"&gt;1&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Which amounts to choosing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;alpha&lt;/span&gt;&lt;/code&gt; in Scikit-learn’s &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html"&gt;Ridge&lt;/a&gt; or &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html"&gt;LASSO&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="first" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id2"&gt;2&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;To the best of my knowledge, this is the first implementation of Hyperband with an advanced task scheduler&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="qual" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id3"&gt;3&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;More accurately, Hyperband will find close to the best model possible with &lt;span class="math notranslate nohighlight"&gt;\(N\)&lt;/span&gt; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; calls in expected score with high probability, where “close” means “within log terms of the upper bound on score”. For details, see Corollary 1 of the &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;corresponding paper&lt;/a&gt; or Theorem 5 of &lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband’s paper&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="skorch" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id4"&gt;4&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;through the Scikit-learn API wrapper &lt;a class="reference external" href="https://skorch.readthedocs.io/en/stable/"&gt;skorch&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="user-facing" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id5"&gt;5&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;There’s less tuning for adaptive step size methods like &lt;a class="reference external" href="https://arxiv.org/abs/1412.6980"&gt;Adam&lt;/a&gt; or &lt;a class="reference external" href="http://jmlr.org/papers/v12/duchi11a.html"&gt;Adagrad&lt;/a&gt;, but they might under-perform on the test data (see “&lt;a class="reference external" href="https://arxiv.org/abs/1705.08292"&gt;The Marginal Value of Adaptive Gradient Methods for Machine Learning&lt;/a&gt;”)&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="dashboard" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id6"&gt;6&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;But it probably won’t be this fast: the video is sped up by a factor of 3.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="prior" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id7"&gt;7&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Dask’s documentation on &lt;a class="reference external" href="https://distributed.dask.org/en/latest/priority.html"&gt;Prioritizing Work&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="same" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id8"&gt;8&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Everything is the same between different runs: the hyperparameters sampled, the model’s internal random state, the data passed for fitting. Only the number of workers varies.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="scale-worker" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id9"&gt;9&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;There’s no time benefit to stopping jobs early if there are infinite workers; there’s never a queue of jobs waiting to be run&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="pickle-post" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id10"&gt;10&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;“&lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle"&gt;Pickle isn’t slow, it’s a protocol&lt;/a&gt;” by Matthew Rocklin&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="regularization" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;11&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Performance comparison: Scikit-learn’s visualization of tuning a Support Vector Machine’s (SVM) regularization parameter: &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html"&gt;Scaling the regularization parameter for SVMs&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="new" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;12&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;It’s been around since 2016… and some call that “old news.”&lt;/p&gt;
&lt;/aside&gt;
&lt;/aside&gt;
</content>
    <link href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt/"/>
    <summary>Scott Sievert wrote this post. The original post lives at
https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/ with better
styling. This work is supported by Anaconda, Inc.</summary>
    <category term="dask-ml" label="dask-ml"/>
    <category term="machine-learning" label="machine-learning"/>
    <published>2019-09-30T00:00:00+00:00</published>
  </entry>
</feed>
