<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://blog.dask.org</id>
  <title>Dask Working Notes - Posts by Scott Sievert</title>
  <updated>2026-03-05T15:05:19.373710+00:00</updated>
  <link href="https://blog.dask.org"/>
  <link href="https://blog.dask.org/blog/author/a-hrefhttpstsievertcomscott-sieverta/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://blog.dask.org/2019/09/30/dask-hyperparam-opt/</id>
    <title>Better and faster hyperparameter optimization with Dask</title>
    <updated>2019-09-30T00:00:00+00:00</updated>
  <author>
    <name>Scott Sievert</name>
  </author>
    <content type="html">&lt;p&gt;&lt;em&gt;&lt;a class="reference external" href="https://stsievert.com"&gt;Scott Sievert&lt;/a&gt; wrote this post. The original post lives at
&lt;a class="reference external" href="https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/"&gt;https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/&lt;/a&gt; with better
styling. This work is supported by Anaconda, Inc.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://dask.org"&gt;Dask&lt;/a&gt;’s machine learning package, &lt;a class="reference external" href="https://ml.dask.org/"&gt;Dask-ML&lt;/a&gt;, now implements Hyperband, an
advanced “hyperparameter optimization” algorithm that performs rather well.
This post will&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;describe “hyperparameter optimization”, a common problem in machine learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;describe Hyperband’s benefits and why it works&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;show how to use Hyperband via example alongside performance comparisons&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, I’ll walk through a practical example and highlight key portions
of the paper “&lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;Better and faster hyperparameter optimization with Dask&lt;/a&gt;”, which is also
summarized in a &lt;a class="reference external" href="https://www.youtube.com/watch?v=x67K9FiPFBQ"&gt;~25 minute SciPy 2019 talk&lt;/a&gt;.&lt;/p&gt;
&lt;!--More--&gt;
&lt;section id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;Machine learning requires data, an untrained model and “hyperparameters”: parameters chosen before training begins
that shape how the model fits the data. The user needs to specify values
for these hyperparameters in order to use the model. A good example is
adapting ridge regression or LASSO to the amount of noise in the
data with the regularization parameter.&lt;a class="footnote-reference brackets" href="#alpha" id="id1" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;1&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
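&lt;p&gt;&lt;em&gt;Aside (not from the original post): the effect of a regularization hyperparameter can be made concrete with a toy, pure-Python version of one-dimensional ridge regression. The data and helper function below are made up for illustration.&lt;/em&gt;&lt;/p&gt;

```python
# Toy illustration of a regularization hyperparameter: for one-dimensional
# ridge regression, the closed-form coefficient is
#     w = sum(x * y) / (sum(x * x) + alpha)
# so larger alpha shrinks the fitted coefficient toward zero.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]  # roughly y = 2x plus noise

def ridge_coef(x, y, alpha):
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + alpha)

print(ridge_coef(x, y, alpha=0.0))   # about 2.03: no regularization
print(ridge_coef(x, y, alpha=30.0))  # about 1.02: heavy shrinkage
```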
&lt;p&gt;Model performance strongly depends on the hyperparameters provided. A fairly complex example is the
visualization tool &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html"&gt;t-SNE&lt;/a&gt;. This tool requires (at least) three
hyperparameters, and performance depends radically on their values. In fact, the first section in “&lt;a class="reference external" href="https://distill.pub/2016/misread-tsne/"&gt;How to Use t-SNE
Effectively&lt;/a&gt;” is titled “Those hyperparameters really matter”.&lt;/p&gt;
&lt;p&gt;Finding good values for these hyperparameters is critical and has an entire
Scikit-learn documentation page, “&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/grid_search.html"&gt;Tuning the hyperparameters of an
estimator&lt;/a&gt;.” Briefly, finding decent values of hyperparameters
is difficult and requires guessing or searching.&lt;/p&gt;
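&lt;p&gt;&lt;em&gt;Aside (not from the original post): the “guessing or searching” idea can be sketched in a few lines of pure Python. The scoring function below is a hypothetical stand-in for “train a model with this hyperparameter and return its validation score”.&lt;/em&gt;&lt;/p&gt;

```python
import random

# A minimal sketch of random search: sample hyperparameter values at
# random, evaluate each one, and keep the best. toy_score is a stand-in
# for an expensive train-and-validate step.
def toy_score(alpha):
    return -(alpha - 0.5) ** 2  # pretend the best alpha is 0.5

random.seed(0)
candidates = [random.uniform(0.0, 2.0) for _ in range(20)]
best_alpha = max(candidates, key=toy_score)
print(best_alpha)  # close to 0.5
```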
&lt;p&gt;&lt;strong&gt;How can these hyperparameters be found quickly and efficiently with an
advanced task scheduler like Dask?&lt;/strong&gt; Parallelism will pose some challenges, but
the Dask architecture enables some advanced algorithms.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this post presumes knowledge of Dask basics. This material is covered in
Dask’s documentation on &lt;a class="reference external" href="https://docs.dask.org/en/latest/why.html"&gt;Why Dask?&lt;/a&gt;, a ~15 minute &lt;a class="reference external" href="https://www.youtube.com/watch?v=ods97a5Pzw0"&gt;video introduction to
Dask&lt;/a&gt;, a &lt;a class="reference external" href="https://www.youtube.com/watch?v=tQBovBvSDvA"&gt;video introduction to Dask-ML&lt;/a&gt; and &lt;a class="reference external" href="https://stsievert.com/blog/2016/09/09/dask-cluster/"&gt;a
blog post I wrote&lt;/a&gt; on my first use of Dask.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id="contributions"&gt;
&lt;h1&gt;Contributions&lt;/h1&gt;
&lt;p&gt;Dask-ML can quickly find high-performing hyperparameters. I will back this
claim with intuition and experimental evidence.&lt;/p&gt;
&lt;p&gt;Specifically, this is because
Dask-ML now
implements an algorithm introduced by Li et al. in “&lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband: A novel
bandit-based approach to hyperparameter optimization&lt;/a&gt;”.
Pairing Dask and Hyperband enables some exciting new performance opportunities,
especially because Hyperband has a simple implementation and Dask is an
advanced task scheduler.&lt;a class="footnote-reference brackets" href="#first" id="id2" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;2&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let’s go
through the basics of Hyperband, then illustrate its use and performance with
an example.
This will highlight some key points of &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;the corresponding paper&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="hyperband-basics"&gt;
&lt;h1&gt;Hyperband basics&lt;/h1&gt;
&lt;p&gt;The motivation for Hyperband is to find high-performing hyperparameters with minimal
training. Given this goal, it makes sense to spend more time training high-performing
models – why waste time training a model that has performed poorly so far?&lt;/p&gt;
&lt;p&gt;One method to spend more time on high-performing models is to initialize many
models, start training all of them, and then stop training low-performing models
before training is finished. That’s what Hyperband does. At the most basic
level, Hyperband is a (principled) early-stopping scheme for
&lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"&gt;RandomizedSearchCV&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Deciding when to stop the training of models depends on how strongly
the training data affect the score. There are two extremes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;when only the training data matter&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;i.e., when the hyperparameters don’t influence the score at all&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;when only the hyperparameters matter&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;i.e., when the training data don’t influence the score at all&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Hyperband balances these two extremes by sweeping over how frequently
models are stopped. This sweep allows a mathematical proof that Hyperband
will find the best model possible with minimal &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;
calls&lt;a class="footnote-reference brackets" href="#qual" id="id3" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;3&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
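&lt;p&gt;&lt;em&gt;Aside (not from the original post): the early-stopping idea can be sketched in pure Python as one “successive halving” bracket, one of the loops Hyperband sweeps over. The scores here are fake; a real implementation would call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; and score on validation data.&lt;/em&gt;&lt;/p&gt;

```python
import random

# Sketch of one successive-halving bracket: train many models a little,
# keep only the best-scoring third, and repeat until one model remains.
random.seed(1)
models = {i: random.random() for i in range(27)}  # model id: (fake) score

while len(models) > 1:
    # "Train" every surviving model a bit more (here: nudge its score).
    models = {i: s + random.random() * 0.1 for i, s in models.items()}
    # Keep only the best-scoring third (at least one model).
    keep = max(len(models) // 3, 1)
    ranked = sorted(models, key=models.get, reverse=True)
    models = {i: models[i] for i in ranked[:keep]}

print(models)  # the single surviving model and its score
```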
&lt;p&gt;Hyperband has significant parallelism because it has two “embarrassingly
parallel” for-loops – Dask can exploit this. Hyperband has been implemented
in Dask, specifically in Dask’s machine learning library Dask-ML.&lt;/p&gt;
&lt;p&gt;How well does it perform? Let’s illustrate via example. Some setup is required
before the performance comparison in &lt;em&gt;&lt;a class="reference internal" href="#performance"&gt;&lt;span class="xref myst"&gt;Performance&lt;/span&gt;&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="example"&gt;
&lt;h1&gt;Example&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Note: want to try &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; out yourself? Dask has &lt;a class="reference external" href="https://examples.dask.org/machine-learning/hyperparam-opt.html"&gt;an example use&lt;/a&gt;.
It can even be run in-browser!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I’ll illustrate with a synthetic example. Let’s build a dataset with 4 classes:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;experiment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_circles&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_circles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/dataset.png"
style="max-width: 100%;"
width="200px" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this content is pulled from
&lt;a class="reference external" href="https://github.com/stsievert/dask-hyperband-comparison"&gt;stsievert/dask-hyperband-comparison&lt;/a&gt;, with slight modifications.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Let’s build a fully connected neural net with 24 neurons for classification:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.neural_network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MLPClassifier&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLPClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Building the neural net with PyTorch is also possible&lt;a class="footnote-reference brackets" href="#skorch" id="id4" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;4&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; (and what I used in development).&lt;/p&gt;
&lt;p&gt;This neural net’s behavior is dictated by 6 hyperparameters. Only one controls
the architecture of the optimal model (&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hidden_layer_sizes&lt;/span&gt;&lt;/code&gt;, the number of
neurons in each layer). The rest control finding the best model of that
architecture. Details on the hyperparameters are in the
&lt;em&gt;&lt;a class="reference internal" href="#appendix"&gt;&lt;span class="xref myst"&gt;Appendix&lt;/span&gt;&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# details in appendix&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;dict_keys([&amp;#39;hidden_layer_sizes&amp;#39;, &amp;#39;alpha&amp;#39;, &amp;#39;batch_size&amp;#39;, &amp;#39;learning_rate&amp;#39;&lt;/span&gt;
&lt;span class="go"&gt;           &amp;#39;learning_rate_init&amp;#39;, &amp;#39;power_t&amp;#39;, &amp;#39;momentum&amp;#39;])&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hidden_layer_sizes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# always 24 neurons&lt;/span&gt;
&lt;span class="go"&gt;[(24, ), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I chose these hyperparameters to create a complex search space that mimics the
searches performed for most neural networks. These searches typically involve
hyperparameters like “dropout”, “learning rate”, “momentum” and “weight
decay”.&lt;a class="footnote-reference brackets" href="#user-facing" id="id5" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;5&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;
End users don’t care about hyperparameters like these; they don’t change the
model architecture, they only affect finding the best model of a particular architecture.&lt;/p&gt;
&lt;p&gt;How can high performing hyperparameter values be found quickly?&lt;/p&gt;
&lt;/section&gt;
&lt;section id="finding-the-best-parameters"&gt;
&lt;h1&gt;Finding the best parameters&lt;/h1&gt;
&lt;p&gt;First, let’s look at the parameters required for Dask-ML’s implementation
of Hyperband (which is in the class &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;).&lt;/p&gt;
&lt;section id="hyperband-parameters-rule-of-thumb"&gt;
&lt;h2&gt;Hyperband parameters: rule-of-thumb&lt;/h2&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; has two inputs:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;max_iter&lt;/span&gt;&lt;/code&gt;, which determines how many times to call &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the chunk size of the Dask array, which determines how much data each
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; call receives.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These fall out pretty naturally once it’s known how long to train the best
model and very approximately how many parameters to sample:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;n_examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 50 passes through dataset for best model&lt;/span&gt;
&lt;span class="n"&gt;n_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;299&lt;/span&gt;  &lt;span class="c1"&gt;# sample about 300 parameters&lt;/span&gt;

&lt;span class="c1"&gt;# inputs to hyperband&lt;/span&gt;
&lt;span class="n"&gt;max_iter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;
&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_examples&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;n_params&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The inputs to this rule-of-thumb are exactly what the user cares about:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;a measure of how complex the search space is (via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_params&lt;/span&gt;&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;how long to train the best model (via &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notably, there’s no tradeoff between &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; and &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_params&lt;/span&gt;&lt;/code&gt; like with
Scikit-learn’s &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt; because &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; is only for &lt;em&gt;some&lt;/em&gt;
models, not for &lt;em&gt;all&lt;/em&gt; models. There’s more details on this
rule-of-thumb in the “Notes” section of the &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html#dask_ml.model_selection.HyperbandSearchCV"&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;
docs&lt;/a&gt;.&lt;/p&gt;
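&lt;p&gt;&lt;em&gt;Aside (not from the original post): plugging concrete numbers into this rule-of-thumb makes it tangible. The training-set size of 60,000 below is a hypothetical value chosen for illustration, not a number from this experiment.&lt;/em&gt;&lt;/p&gt;

```python
# Worked example of the rule-of-thumb above. The training-set size of
# 60,000 is a hypothetical value for illustration.
n_train = 60_000
n_examples = 50 * n_train  # the best model sees 50 passes through the data
n_params = 299             # sample about 300 hyperparameter combinations

max_iter = n_params                  # at most 299 partial_fit calls per model
chunk_size = n_examples // n_params  # examples sent to each partial_fit call

print(max_iter, chunk_size)  # 299 10033
```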
&lt;p&gt;With these inputs a &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; object can easily be created.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="finding-the-best-performing-hyperparameters"&gt;
&lt;h2&gt;Finding the best performing hyperparameters&lt;/h2&gt;
&lt;p&gt;The Hyperband model selection algorithm is implemented in the class
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Let’s create an instance of that class:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperbandSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggressiveness&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness&lt;/span&gt;&lt;/code&gt; defaults to 3. &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;aggressiveness=4&lt;/span&gt;&lt;/code&gt; is chosen because this is an
&lt;em&gt;initial&lt;/em&gt; search; I know nothing about how this search space. Then, this search
should be more aggressive in culling off bad models.&lt;/p&gt;
&lt;p&gt;Hyperband hides some details from the user (which enables the mathematical
guarantees), specifically the details on the amount of training and
the number of models created. These details are available in the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;metadata&lt;/span&gt;&lt;/code&gt;
attribute:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;n_models&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;378&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;partial_fit_calls&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="go"&gt;5721&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now that we have some idea on how long the computation will take, let’s ask it
to find the best set of hyperparameters:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dask_ml.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rechunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The dashboard will be active during this time&lt;a class="footnote-reference brackets" href="#dashboard" id="id6" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;6&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;
&lt;video width="600" style="max-width: 100%;" autoplay loop controls &gt;
  &lt;source src="/images/2019-hyperband/dashboard-compress.mp4" type="video/mp4" &gt;
  Your browser does not support the video tag.
&lt;/video&gt;
&lt;/p&gt;
&lt;p&gt;How well do these hyperparameters perform?&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;span class="go"&gt;0.9019221418447483&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; mirrors Scikit-learn’s API for &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"&gt;RandomizedSearchCV&lt;/a&gt;, so it
has access to all the expected attributes and methods:&lt;/p&gt;
&lt;div class="highlight-python notranslate"&gt;&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;span class="go"&gt;{&amp;quot;batch_size&amp;quot;: 64, &amp;quot;hidden_layer_sizes&amp;quot;: [6, 6, 6, 6], ...}&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;0.8989070100111217&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_model_&lt;/span&gt;
&lt;span class="go"&gt;MLPClassifier(...)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Details on the attributes and methods are in the &lt;a class="reference external" href="https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html"&gt;HyperbandSearchCV
documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="performance"&gt;
&lt;h1&gt;Performance&lt;/h1&gt;
&lt;!--
Plot 1: how well does it do?
Plot 2: how does this scale?
Plot 3: what opportunities does Dask enable?
--&gt;
&lt;p&gt;I ran this 200 times on my personal laptop with 4 cores.
Let’s look at the distribution of final validation scores:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/final-acc.svg"
style="max-width: 100%;"
 width="400px"/&gt;&lt;/p&gt;
&lt;p&gt;The “passive” comparison is really &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt; configured so it takes
an equal amount of work as &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Let’s see how this does over
time:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/val-acc.svg"
style="max-width: 100%;"
 width="400px"/&gt;&lt;/p&gt;
&lt;p&gt;This graph shows the mean score over the 200 runs with the solid line, and the
shaded region represents the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Interquartile_range"&gt;interquartile range&lt;/a&gt;. The dotted green
line indicates the data required to train 4 models to completion.
“Passes through the dataset” is a good proxy
for “time to solution” because there are only 4 workers.&lt;/p&gt;
&lt;p&gt;This graph shows that &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; will find parameters at least 3 times
quicker than &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;section id="dask-opportunities"&gt;
&lt;h2&gt;Dask opportunities&lt;/h2&gt;
&lt;p&gt;What opportunities does combining Hyperband and Dask create?
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; has a lot of internal parallelism and Dask is an advanced task
scheduler.&lt;/p&gt;
&lt;p&gt;The most obvious opportunity involves job prioritization. Hyperband fits many
models in parallel and Dask might not have that many
workers available. This means some jobs have to wait for other jobs
to finish. Of course, Dask can prioritize jobs&lt;a class="footnote-reference brackets" href="#prior" id="id7" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;7&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; and choose which models
to fit first.&lt;/p&gt;
&lt;p&gt;Let’s assign the priority for fitting a certain model to be the model’s most
recent score. How does this prioritization scheme influence the score? Let’s
compare the prioritization schemes in
a single run of the 200 above:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/synthetic/priority.svg"
style="max-width: 100%;"
     width="400px" /&gt;&lt;/p&gt;
&lt;p&gt;These two lines are the same in every way except for
the prioritization scheme.
This graph compares the “high scores” prioritization scheme with Dask’s
default prioritization scheme (“fifo”).&lt;/p&gt;
&lt;p&gt;This graph is certainly helped by the fact that it is run with only 4 workers.
Job priority does not matter if every job can run right away (there’s
nothing to assign priority to!).&lt;/p&gt;
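&lt;p&gt;The “high scores” scheme itself is just a priority queue: fit the model with the best most-recent score first. This toy sketch (the model names and scores are made up) shows only the ordering rule, not Dask’s scheduler:&lt;/p&gt;

```python
import heapq

# Hypothetical most-recent validation scores for partially trained models
latest_scores = {"model-a": 0.81, "model-b": 0.65, "model-c": 0.90}

# heapq is a min-heap, so negate the scores to pop the largest first
queue = [(-score, name) for name, score in latest_scores.items()]
heapq.heapify(queue)

order = []
while queue:
    _, name = heapq.heappop(queue)
    order.append(name)

print(order)  # best-scoring model is fitted first
```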
&lt;/section&gt;
&lt;section id="amenability-to-parallelism"&gt;
&lt;h2&gt;Amenability to parallelism&lt;/h2&gt;
&lt;p&gt;How does Hyperband scale with the number of workers?&lt;/p&gt;
&lt;p&gt;I ran a separate experiment to measure this. The experiment is described in more detail in the &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;corresponding
paper&lt;/a&gt;, but the relevant difference is that a &lt;a class="reference external" href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; neural network is used
through &lt;a class="reference external" href="https://skorch.readthedocs.io/en/stable/"&gt;skorch&lt;/a&gt; instead of Scikit-learn’s MLPClassifier.&lt;/p&gt;
&lt;p&gt;I ran the &lt;em&gt;same&lt;/em&gt; experiment with a different number of Dask
workers.&lt;a class="footnote-reference brackets" href="#same" id="id8" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;8&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt; Here’s how &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; scales:&lt;/p&gt;
&lt;p&gt;&lt;img src="/images/2019-hyperband/image-denoising/scaling-patience.svg" width="400px"
style="max-width: 100%;"
/&gt;&lt;/p&gt;
&lt;p&gt;Training one model to completion requires 243 seconds (which is marked by the
white line). The graph also compares runs with and without &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt;, which stops training models
whose scores aren’t increasing enough. In practice this is very useful
because the user might accidentally specify &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;n_examples&lt;/span&gt;&lt;/code&gt; to be too large.&lt;/p&gt;
&lt;p&gt;It looks like the speedups start to saturate somewhere
between 16 and 24 workers, at least for this example.
Of course, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt; doesn’t work as well for a large number of
workers.&lt;a class="footnote-reference brackets" href="#scale-worker" id="id9" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;9&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
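&lt;p&gt;The stopping rule behind &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;patience&lt;/span&gt;&lt;/code&gt; can be sketched in a few lines. This is a simplified stand-in, not Dask-ML’s exact criterion: stop once the last few scores have failed to improve on the earlier best score by some tolerance.&lt;/p&gt;

```python
def should_stop(scores, patience=5, tol=0.001):
    # Simplified sketch of a patience-based stopping rule: stop when the
    # last `patience` scores have not beaten the earlier best score by `tol`.
    if patience >= len(scores):
        return False
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_before + tol > best_recent

# A plateauing score history is cut off early...
print(should_stop([0.50, 0.70, 0.80, 0.80, 0.80, 0.80, 0.80, 0.80]))  # True
# ...while a still-improving one keeps training
print(should_stop([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70]))  # False
```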
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/30/dask-hyperparam-opt.md&lt;/span&gt;, line 421)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="future-work"&gt;
&lt;h1&gt;Future work&lt;/h1&gt;
&lt;p&gt;There are some ongoing pull requests to improve &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. The most
significant of these involves tweaking some Hyperband internals so &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;
works better with initial or very exploratory searches (&lt;a class="reference external" href="https://github.com/dask/dask-ml/pull/532"&gt;dask/dask-ml #532&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The biggest improvement I see is treating &lt;em&gt;dataset size&lt;/em&gt; as the scarce resource
that needs to be preserved instead of &lt;em&gt;training time&lt;/em&gt;. This would allow
Hyperband to work with any model, instead of only models that implement
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Serialization is an important part of the distributed Hyperband implementation
in &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt;. Scikit-learn and PyTorch can easily handle this because
they support the Pickle protocol&lt;a class="footnote-reference brackets" href="#pickle-post" id="id10" role="doc-noteref"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;10&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/a&gt;, but
Keras/Tensorflow/MXNet present challenges. Resolving this issue would let
&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; be used more widely.&lt;/p&gt;
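&lt;p&gt;Concretely, “supporting the Pickle protocol” means a model survives a serialize/deserialize round trip, which is what lets Dask ship models between workers. A minimal sketch with a toy estimator (not a real Scikit-learn or PyTorch model):&lt;/p&gt;

```python
import pickle

class TinyModel:
    """Toy stand-in for an estimator that supports the pickle protocol."""
    def __init__(self, lr):
        self.lr = lr
        self.coef_ = [0.0, 0.0]

model = TinyModel(lr=0.1)
payload = pickle.dumps(model)     # bytes Dask could send to another worker
restored = pickle.loads(payload)  # the "copy" the other worker would see
print(restored.lr, restored.coef_)
```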
&lt;aside class="system-message"&gt;
&lt;p class="system-message-title"&gt;System Message: WARNING/2 (&lt;span class="docutils literal"&gt;/opt/build/repo/2019/09/30/dask-hyperparam-opt.md&lt;/span&gt;, line 444)&lt;/p&gt;
&lt;p&gt;Document headings start at H2, not H1 [myst.header]&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;section id="appendix"&gt;
&lt;h1&gt;Appendix&lt;/h1&gt;
&lt;p&gt;I chose to tune 7 hyperparameters, which are&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;hidden_layer_sizes&lt;/span&gt;&lt;/code&gt;, which controls the number of neurons in each
hidden layer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;alpha&lt;/span&gt;&lt;/code&gt;, which controls the amount of regularization&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More hyperparameters control finding the best neural network:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;batch_size&lt;/span&gt;&lt;/code&gt;, which controls the number of examples the &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;optimizer&lt;/span&gt;&lt;/code&gt; uses to
approximate the gradient&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;learning_rate&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;learning_rate_init&lt;/span&gt;&lt;/code&gt;, &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;power_t&lt;/span&gt;&lt;/code&gt;, which control some basic
hyperparameters for the SGD optimizer I’ll be using&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;momentum&lt;/span&gt;&lt;/code&gt;, a more advanced hyperparameter for SGD with Nesterov’s momentum.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
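&lt;p&gt;A search space over these 7 hyperparameters might look like the dictionary below. The particular values and ranges are illustrative, not the ones used in the experiments above:&lt;/p&gt;

```python
# Illustrative search space over the 7 hyperparameters above; the exact
# values/distributions are made up, not those from the experiments.
params = {
    "hidden_layer_sizes": [(24,), (12, 12), (6, 6, 6, 6)],
    "alpha": [10 ** e for e in range(-6, -2)],        # regularization
    "batch_size": [16, 32, 64, 128],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "learning_rate_init": [10 ** e for e in range(-4, 0)],
    "power_t": [0.1, 0.25, 0.5, 0.75, 0.9],
    "momentum": [0.5, 0.9, 0.95, 0.99],
}
print(len(params))  # 7 hyperparameters
```

&lt;p&gt;A dictionary like this is what gets passed as the parameter search space to &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;HyperbandSearchCV&lt;/span&gt;&lt;/code&gt; or &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;RandomizedSearchCV&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;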
&lt;/section&gt;
&lt;hr class="footnotes docutils" /&gt;
&lt;aside class="footnote-list brackets"&gt;
&lt;aside class="footnote brackets" id="alpha" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id1"&gt;1&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Which amounts to choosing &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;alpha&lt;/span&gt;&lt;/code&gt; in Scikit-learn’s &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html"&gt;Ridge&lt;/a&gt; or &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html"&gt;LASSO&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="first" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id2"&gt;2&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;To the best of my knowledge, this is the first implementation of Hyperband with an advanced task scheduler&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="qual" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id3"&gt;3&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;More accurately, Hyperband will find close to the best model possible with &lt;span class="math notranslate nohighlight"&gt;\(N\)&lt;/span&gt; &lt;code class="docutils literal notranslate"&gt;&lt;span class="pre"&gt;partial_fit&lt;/span&gt;&lt;/code&gt; calls in expected score with high probability, where “close” means “within log terms of the upper bound on score”. For details, see Corollary 1 of the &lt;a class="reference external" href="http://conference.scipy.org/proceedings/scipy2019/pdfs/scott_sievert.pdf"&gt;corresponding paper&lt;/a&gt; or Theorem 5 of &lt;a class="reference external" href="https://arxiv.org/pdf/1603.06560.pdf"&gt;Hyperband’s paper&lt;/a&gt;.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="skorch" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id4"&gt;4&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;through the Scikit-learn API wrapper &lt;a class="reference external" href="https://skorch.readthedocs.io/en/stable/"&gt;skorch&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="user-facing" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id5"&gt;5&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;There’s less tuning for adaptive step size methods like &lt;a class="reference external" href="https://arxiv.org/abs/1412.6980"&gt;Adam&lt;/a&gt; or &lt;a class="reference external" href="http://jmlr.org/papers/v12/duchi11a.html"&gt;Adagrad&lt;/a&gt;, but they might under-perform on the test data (see “&lt;a class="reference external" href="https://arxiv.org/abs/1705.08292"&gt;The Marginal Value of Adaptive Gradient Methods for Machine Learning&lt;/a&gt;”)&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="dashboard" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id6"&gt;6&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;But it probably won’t be this fast: the video is sped up by a factor of 3.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="prior" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id7"&gt;7&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;See Dask’s documentation on &lt;a class="reference external" href="https://distributed.dask.org/en/latest/priority.html"&gt;Prioritizing Work&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="same" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id8"&gt;8&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Everything is the same between different runs: the hyperparameters sampled, the model’s internal random state, the data passed for fitting. Only the number of workers varies.&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="scale-worker" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id9"&gt;9&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;There’s no time benefit to stopping jobs early if there are infinite workers; there’s never a queue of jobs waiting to be run&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="pickle-post" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;&lt;a role="doc-backlink" href="#id10"&gt;10&lt;/a&gt;&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;“&lt;a class="reference external" href="http://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle"&gt;Pickle isn’t slow, it’s a protocol&lt;/a&gt;” by Matthew Rocklin&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="regularization" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;11&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Performance comparison: Scikit-learn’s visualization of tuning a Support Vector Machine’s (SVM) regularization parameter: &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html"&gt;Scaling the regularization parameter for SVMs&lt;/a&gt;&lt;/p&gt;
&lt;/aside&gt;
&lt;aside class="footnote brackets" id="new" role="doc-footnote"&gt;
&lt;span class="label"&gt;&lt;span class="fn-bracket"&gt;[&lt;/span&gt;12&lt;span class="fn-bracket"&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;It’s been around since 2016… and some call that “old news.”&lt;/p&gt;
&lt;/aside&gt;
&lt;/aside&gt;
</content>
    <link href="https://blog.dask.org/2019/09/30/dask-hyperparam-opt/"/>
    <summary>Scott Sievert wrote this post. The original post lives at
https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/ with better
styling. This work is supported by Anaconda, Inc.</summary>
    <category term="dask-ml" label="dask-ml"/>
    <category term="machine-learning" label="machine-learning"/>
    <published>2019-09-30T00:00:00+00:00</published>
  </entry>
</feed>
