Dask Working Notes - Posts tagged dask-ml

Comparing Dask-ML and Ray Tune's Model Selection Algorithms

2020-08-06T00:00:00+00:00

Hyperparameter optimization is the process of deducing model parameters that can’t be learned from data. This process is often time- and resource-consuming, especially in the context of deep learning. A good description of this process can be found at “Tuning the hyper-parameters of an estimator,” and the issues that arise are concisely summarized in Dask-ML’s documentation of “Hyper Parameter Searches.”

There’s a host of libraries and frameworks out there to address this problem. Scikit-Learn’s module has been mirrored in Dask-ML and auto-sklearn, both of which offer advanced hyperparameter optimization techniques. Other implementations that don’t follow the Scikit-Learn interface include Ray Tune, AutoML and Optuna.

Ray recently provided a wrapper to Ray Tune that mirrors the Scikit-Learn API called tune-sklearn (docs, source). The introduction of this library states the following:

Cutting edge hyperparameter tuning techniques (Bayesian optimization, early stopping, distributed execution) can provide significant speedups over grid search and random search.

However, the machine learning ecosystem is missing a solution that provides users with the ability to leverage these new algorithms while allowing users to stay within the Scikit-Learn API. In this blog post, we introduce tune-sklearn [Ray’s tuning library] to bridge this gap. Tune-sklearn is a drop-in replacement for Scikit-Learn’s model selection module with state-of-the-art optimization features.

—GridSearchCV 2.0 — New and Improved

This claim is inaccurate: for over a year Dask-ML has provided access to “cutting edge hyperparameter tuning techniques” with a Scikit-Learn compatible API. To correct their statement, let’s look at each of the features that Ray’s tune-sklearn provides, and compare them to Dask-ML:

Here’s what [Ray’s] tune-sklearn has to offer:

Consistency with Scikit-Learn API …

Modern hyperparameter tuning techniques …

Framework support …

Scale up … [to] multiple cores and even multiple machines.

[Ray’s] Tune-sklearn is also fast.

Dask-ML’s model selection module has every one of the features:

Consistency with Scikit-Learn API: Dask-ML’s model selection API mirrors the Scikit-Learn model selection API.
Modern hyperparameter tuning techniques: Dask-ML offers state-of-the-art hyperparameter tuning techniques.
Framework support: Dask-ML model selection supports many libraries including Scikit-Learn, PyTorch, Keras, LightGBM and XGBoost.
Scale up: Dask-ML supports distributed tuning (how could it not?) and larger-than-memory datasets.

Dask-ML is also fast. In “Speed” we show a benchmark between Dask-ML, Ray and Scikit-Learn:

Only time-to-solution is relevant; all of these methods produce similar model scores. See “Speed” for details.

Now, let’s walk through the details on how to use Dask-ML to obtain the 5 features above.

Dask-ML is consistent with the Scikit-Learn API.

Here’s how to use Scikit-Learn’s, Dask-ML’s and Ray’s tune-sklearn hyperparameter optimization:

## Trimmed example; see appendix for more detail
from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(model, params, ...)
search.fit(X, y)

from dask_ml.model_selection import HyperbandSearchCV
search = HyperbandSearchCV(model, params, ...)
search.fit(X, y, classes=[0, 1])

from tune_sklearn import TuneSearchCV
search = TuneSearchCV(model, params, ...)
search.fit(X, y, classes=[0, 1])

The definitions of model and params follow the normal Scikit-Learn definitions as detailed in the appendix.

Clearly, both Dask-ML and Ray’s tune-sklearn are Scikit-Learn compatible. Now let’s focus on how each search performs and how it’s configured.

Modern hyperparameter tuning techniques

Dask-ML offers state-of-the-art hyperparameter tuning techniques in a Scikit-Learn interface.

The introduction of Ray’s tune-sklearn made this claim:

tune-sklearn is the only Scikit-Learn interface that allows you to easily leverage Bayesian Optimization, HyperBand and other optimization techniques by simply toggling a few parameters.

The state-of-the-art in hyperparameter optimization is currently “Hyperband.” Hyperband reduces the amount of computation required with a principled early stopping scheme; past that, it’s the same as Scikit-Learn’s popular RandomizedSearchCV.

Hyperband works, and as a result it’s very popular. Since its introduction in 2016 by Li et al., the paper has been cited over 470 times and the algorithm has been implemented in many different libraries including Dask-ML, Ray Tune, keras-tune, Optuna, AutoML,[1] and Microsoft’s NNI. The original paper shows a rather drastic improvement over all the relevant implementations,[2] and this drastic improvement persists in follow-up works.[3] Some illustrative results from Hyperband are below:

^{All algorithms are configured to do the same amount of work except “random
2x” which does twice as much work. “hyperband (finite)” is similar to Dask-ML’s
default implementation, and “bracket s=4” is similar to Ray’s default
implementation. “random” is a random search. SMAC,[4]
spearmint,[5] and TPE[6] are popular Bayesian algorithms.}

Hyperband is undoubtedly a “cutting edge” hyperparameter optimization technique. Dask-ML and Ray offer Scikit-Learn implementations of this algorithm that rely on similar implementations, and Dask-ML’s implementation also has a rule of thumb for configuration. Both Dask-ML’s and Ray’s documentation encourage use of Hyperband.

Ray does support using their Hyperband implementation on top of a technique called Bayesian sampling. This changes the hyperparameter sampling scheme for model initialization. This can be used in conjunction with Hyperband’s early stopping scheme. Adding this option to Dask-ML’s Hyperband implementation is future work for Dask-ML.

Framework support

Dask-ML model selection supports many libraries including Scikit-Learn, PyTorch, Keras, LightGBM and XGBoost.

Ray’s tune-sklearn supports these frameworks:

tune-sklearn is used primarily for tuning Scikit-Learn models, but it also supports and provides examples for many other frameworks with Scikit-Learn wrappers such as Skorch (Pytorch), KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost).

Clearly, both Dask-ML and Ray support many of the same libraries.

However, both Dask-ML and Ray have some qualifications. Certain libraries don’t offer an implementation of partial_fit,[7] so not all of the modern hyperparameter optimization techniques can be offered. Here’s a table comparing different libraries and their support in Dask-ML’s model selection and Ray’s tune-sklearn:

Model Library	Dask-ML support	Ray support	Dask-ML: early stopping?	Ray: early stopping?
Scikit-Learn	✔	✔	✔*	✔*
PyTorch (via Skorch)	✔	✔	✔	✔
Keras (via SciKeras)	✔	✔	✔**	✔**
LightGBM	✔	✔	❌	❌
XGBoost	✔	✔	❌	❌

^{* Only for the models that implement partial_fit.}
^{** Thanks to work by the Dask developers around scikeras#24.}
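The early stopping columns in the table come down to whether a model can be trained incrementally. As a rough illustration (the two toy classes and the helper below are hypothetical and belong to neither Dask-ML nor Ray), a search can check for `partial_fit` before deciding whether early stopping is possible:

```python
class IncrementalMean:
    """Toy estimator that learns the mean of y one chunk at a time."""

    def __init__(self):
        self.n_ = 0
        self.mean_ = 0.0

    def partial_fit(self, X, y):
        # Update the running mean with each new target value.
        for value in y:
            self.n_ += 1
            self.mean_ += (value - self.mean_) / self.n_
        return self


class BatchOnly:
    """Toy estimator that can only be trained in a single call."""

    def fit(self, X, y):
        return self


def supports_early_stopping(model):
    # Early stopping needs training that can be paused and resumed,
    # which in the Scikit-Learn API means a partial_fit method.
    return callable(getattr(model, "partial_fit", None))


incremental = IncrementalMean()
incremental.partial_fit(None, [1, 2])  # first chunk
incremental.partial_fit(None, [3, 4])  # second chunk
```

A model like `BatchOnly` can still be tuned, but each candidate has to be trained to completion before it can be compared with the others.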

By this measure, Dask-ML and Ray model selection have the same level of framework support. Of course, Dask has tangential integration with LightGBM and XGBoost through Dask-ML’s xgboost module and dask-lightgbm.

Scale up

Dask-ML supports distributed tuning (how could it not?), aka parallelization across multiple machines/cores. In addition, it also supports larger-than-memory data.

[Ray’s] Tune-sklearn leverages Ray Tune, a library for distributed hyperparameter tuning, to efficiently and transparently parallelize cross validation on multiple cores and even multiple machines.

Naturally, Dask-ML also scales to multiple cores/machines because it relies on Dask. Dask has wide support for different deployment options that span from your personal machine to supercomputers. Dask will very likely work on top of any computing system you have available, including Kubernetes, SLURM, YARN and Hadoop clusters as well as your personal machine.

Dask-ML’s model selection also scales to larger-than-memory datasets, and is thoroughly tested. Support for larger-than-memory data is untested in Ray, and there are no examples detailing how to use Ray Tune with the distributed dataset implementations in PyTorch/Keras.

In addition, I have benchmarked Dask-ML’s model selection module to see how the time-to-solution is affected by the number of Dask workers in “Better and faster hyperparameter optimization with Dask.” That is, how does the time to reach a particular accuracy scale with the number of workers \(P\)? At first, it’ll scale like \(1/P\) but with a large number of workers the serial portion will dictate time to solution according to Amdahl’s Law. Briefly, I found Dask-ML’s HyperbandSearchCV speedup started to saturate around 24 workers for a particular search.
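To make the Amdahl’s Law argument concrete, here is a small sketch of the speedup formula. The serial fraction of 5% is a made-up value chosen for illustration; it was not measured in the benchmark.

```python
def amdahl_speedup(n_workers, serial_fraction):
    """Speedup over one worker when `serial_fraction` of the work is serial."""
    parallel_fraction = 1 - serial_fraction
    return 1 / (serial_fraction + parallel_fraction / n_workers)


# Hypothetical serial fraction, for illustration only.
serial_fraction = 0.05
speedups = {p: amdahl_speedup(p, serial_fraction) for p in (1, 8, 24, 96)}
upper_bound = 1 / serial_fraction  # the limit as the number of workers grows
```

With few workers the speedup is close to \(P\), but it can never exceed `1 / serial_fraction`, which is why adding workers eventually stops helping.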

Speed

Both Dask-ML and Ray are much faster than Scikit-Learn.

Ray’s tune-sklearn runs some benchmarks in the introduction with the GridSearchCV class found in Scikit-Learn and Dask-ML. A fairer benchmark would use Dask-ML’s HyperbandSearchCV because it is almost the same as the algorithm in Ray’s tune-sklearn. To be specific, I’m interested in comparing these methods:

Scikit-Learn’s RandomizedSearchCV. This is a popular implementation, one that I’ve bootstrapped myself with a custom model.
Dask-ML’s HyperbandSearchCV. This is an early stopping technique for RandomizedSearchCV.
Ray tune-sklearn’s TuneSearchCV. This is a slightly different early stopping technique than HyperbandSearchCV’s.

Each search is configured to perform the same task: sample 100 parameters and train for no longer than 100 “epochs” or passes through the data.[8] Each estimator is configured as their respective documentation suggests. Each search uses 8 workers with a single cross validation split, and a partial_fit call takes one second with 50,000 examples. The complete setup can be found in the appendix.

Here’s how long each library takes to complete the same search:

Notably, we didn’t improve the Dask-ML codebase for this benchmark, and ran the code as it’s been for the last year.[9] Regardless, it’s possible that other artifacts from biased benchmarks crept into this benchmark.

Clearly, Ray and Dask-ML offer similar performance for 8 workers when compared with Scikit-Learn. To Ray’s credit, their implementation is ~15% faster than Dask-ML’s with 8 workers. We suspect that this performance boost comes from the fact that Ray implements an asynchronous variant of Hyperband. We should investigate this difference between Dask and Ray, and how each balances the tradeoff between the number of FLOPs and time-to-solution. This will vary with the number of workers: the asynchronous variant of Hyperband provides no benefit if used with a single worker.

Dask-ML reaches scores quickly in serial environments, or when the number of workers is small. Dask-ML prioritizes fitting high scoring models: if there are 100 models to fit and only 4 workers available, Dask-ML selects the models that have the highest score. This is most relevant in serial environments;[10] see “Better and faster hyperparameter optimization with Dask” for benchmarks. This feature is omitted from this benchmark, which only focuses on time to solution.

Conclusion

Dask-ML and Ray offer the same features for model selection: state-of-the-art features with a Scikit-Learn compatible API, and both implementations have fairly wide support for different frameworks and rely on backends that can scale to many machines.

In addition, the Ray implementation has provided motivation for further development, specifically on the following items:

Adding support for more libraries, including Keras (dask-ml#696, dask-ml#713, scikeras#24). SciKeras is a Scikit-Learn wrapper for Keras that (now) works with Dask-ML model selection because SciKeras models implement the Scikit-Learn model API.
Better documenting the models that Dask-ML supports (dask-ml#699). Dask-ML supports any model that implements the Scikit-Learn interface, and there are wrappers for Keras, PyTorch, LightGBM and XGBoost. Now, Dask-ML’s documentation prominently highlights this fact.

The Ray implementation has also helped motivate and clarify future work. Dask-ML should include the following implementations:

A Bayesian sampling scheme for the Hyperband implementation that’s similar to Ray’s and BOHB’s (dask-ml#697).
A configuration of HyperbandSearchCV that’s well-suited for exploratory hyperparameter searches. An initial implementation is in dask-ml#532, which should be benchmarked against Ray.

Luckily, all of these pieces of development are straightforward modifications because the Dask-ML model selection framework is pretty flexible.

Thank you Tom Augspurger, Matthew Rocklin, Julia Signell, and Benjamin Zaitlen for your feedback, suggestions and edits.

Appendix

Benchmark setup

This is the complete setup for the benchmark between Dask-ML, Scikit-Learn and Ray. Complete details can be found at stsievert/dask-hyperband-comparison.

Let’s create a dummy model that takes 1 second for a partial_fit call with 50,000 examples. This is appropriate for this benchmark; we’re only interested in the time required to finish the search, not how well the models do. Scikit-learn, Ray and Dask-ML have very similar methods of choosing hyperparameters to evaluate; they differ in their early stopping techniques.

from scipy.stats import uniform
from sklearn.datasets import make_classification
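from benchmark import ConstantFunction  # custom module

# Values taken from the benchmark description in "Speed":
# sample 100 parameters and train for at most 100 epochs.
n_params = 100
max_iter = 100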

# This model sleeps for `latency * len(X)` seconds before
# reporting a score of `value`.
model = ConstantFunction(latency=1 / 50e3, max_iter=max_iter)

params = {"value": uniform(0, 1)}
# This dummy dataset mirrors the MNIST dataset
X_train, y_train = make_classification(n_samples=int(60e3), n_features=784)

This model will take 2 minutes to train for 100 epochs (aka passes through the data). Details can be found at stsievert/dask-hyperband-comparison.

Let’s configure our searches to use 8 workers with a single cross-validation split:

from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit
split = ShuffleSplit(test_size=0.2, n_splits=1)
kwargs = dict(cv=split, refit=False)

search = RandomizedSearchCV(model, params, n_jobs=8, n_iter=n_params, **kwargs)
search.fit(X_train, y_train)  # 20.88 minutes

from dask_ml.model_selection import HyperbandSearchCV
dask_search = HyperbandSearchCV(
    model, params, test_size=0.2, max_iter=max_iter, aggressiveness=4
)

from tune_sklearn import TuneSearchCV
ray_search = TuneSearchCV(
    model, params, n_iter=n_params, max_iters=max_iter, early_stopping=True, **kwargs
)

dask_search.fit(X_train, y_train)  # 2.93 minutes
ray_search.fit(X_train, y_train)  # 2.49 minutes
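As a back-of-the-envelope check on the RandomizedSearchCV timing, the sketch below estimates the ideal run time. It assumes perfect parallelism across the 8 workers, that every sampled model is trained for all 100 passes, that each pass covers only the 80% training split, and that scoring time is negligible. None of these assumptions is stated as exact in the benchmark.

```python
n_total = 60_000                   # examples in the dummy dataset
n_train = n_total - n_total // 5   # 80% remains after test_size=0.2
seconds_per_pass = n_train / 50_000  # model sleeps 1 second per 50,000 examples

n_params = 100    # models sampled
max_iter = 100    # passes through the data per model
n_workers = 8

# Without early stopping, every model is trained for every pass.
total_seconds = n_params * max_iter * seconds_per_pass / n_workers
total_minutes = total_seconds / 60
```

Under these assumptions the estimate comes out at roughly 20 minutes, which is in line with the 20.88 minutes measured above. Early stopping is what allows the other two searches to finish in a fraction of that time.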

Full example usage

from sklearn.linear_model import SGDClassifier
from scipy.stats import uniform, loguniform
from sklearn.datasets import make_classification
model = SGDClassifier()
params = {"alpha": loguniform(1e-5, 1e-3), "l1_ratio": uniform(0, 1)}
X, y = make_classification()

from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(model, params, ...)
search.fit(X, y)

from dask_ml.model_selection import HyperbandSearchCV
search = HyperbandSearchCV(model, params, ...)
search.fit(X, y, classes=[0, 1])

from tune_sklearn import TuneSearchCV
search = TuneSearchCV(model, params, ...)
search.fit(X, y, classes=[0, 1])

Better and faster hyperparameter optimization with Dask

2019-09-30T00:00:00+00:00

Scott Sievert wrote this post. The original post lives at https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/ with better styling. This work is supported by Anaconda, Inc.

Dask’s machine learning package, Dask-ML, now implements Hyperband, an advanced “hyperparameter optimization” algorithm that performs rather well. This post will:

describe “hyperparameter optimization”, a common problem in machine learning
describe Hyperband’s benefits and why it works
show how to use Hyperband via example alongside performance comparisons

In this post, I’ll walk through a practical example and highlight key portions of the paper “Better and faster hyperparameter optimization with Dask”, which is also summarized in a ~25 minute SciPy 2019 talk.

Machine learning requires data, an untrained model and “hyperparameters”, parameters that are chosen before training begins that help with cohesion between the model and data. The user needs to specify values for these hyperparameters in order to use the model. A good example is adapting ridge regression or LASSO to the amount of noise in the data with the regularization parameter.[1]

Model performance strongly depends on the hyperparameters provided. A fairly complex example is with a particular visualization tool, t-SNE. This tool requires (at least) three hyperparameters and performance depends radically on the hyperparameters. In fact, the first section in “How to Use t-SNE Effectively” is titled “Those hyperparameters really matter”.

Finding good values for these hyperparameters is critical and has an entire Scikit-learn documentation page, “Tuning the hyperparameters of an estimator.” Briefly, finding decent values of hyperparameters is difficult and requires guessing or searching.

How can these hyperparameters be found quickly and efficiently with an advanced task scheduler like Dask? Parallelism will pose some challenges, but the Dask architecture enables some advanced algorithms.

Note: this post presumes knowledge of Dask basics. This material is covered in Dask’s documentation on Why Dask?, a ~15 minute video introduction to Dask, a video introduction to Dask-ML and a blog post I wrote on my first use of Dask.

Contributions

Dask-ML can quickly find high-performing hyperparameters. I will back this claim with intuition and experimental evidence.

Specifically, this is because Dask-ML now implements an algorithm introduced by Li et al. in “Hyperband: A novel bandit-based approach to hyperparameter optimization”. Pairing Dask and Hyperband enables some exciting new performance opportunities, especially because Hyperband has a simple implementation and Dask is an advanced task scheduler.[2]

Let’s go through the basics of Hyperband then illustrate its use and performance with an example. This will highlight some key points of the corresponding paper.

Hyperband basics

The motivation for Hyperband is to find high performing hyperparameters with minimal training. Given this goal, it makes sense to spend more time training high performing models – why waste more training time on a model if it’s done poorly in the past?

One method to spend more time on high performing models is to initialize many models, start training all of them, and then stop training low performing models before training is finished. That’s what Hyperband does. At the most basic level, Hyperband is a (principled) early-stopping scheme for RandomizedSearchCV.
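The early stopping idea can be sketched with a toy version of successive halving, the building block that Hyperband repeats with different settings. Everything here is hypothetical: the learning curve in `score` is invented, and the code is a simplification for illustration rather than Dask-ML’s implementation.

```python
def score(quality, calls):
    # Hypothetical learning curve: the score approaches `quality`
    # as the model receives more partial_fit calls.
    return quality * (1 - 0.5 ** calls)


def successive_halving(qualities, eta=3):
    """Train all models a little, keep the top 1/eta, and repeat."""
    trained = [0] * len(qualities)  # partial_fit calls received per model
    alive = list(range(len(qualities)))
    calls_per_round = 1
    total_calls = 0
    while True:
        for i in alive:
            trained[i] += calls_per_round
            total_calls += calls_per_round
        if len(alive) == 1:
            return alive[0], total_calls
        # Stop the low performing models; survivors get more training.
        alive.sort(key=lambda i: score(qualities[i], trained[i]), reverse=True)
        alive = alive[: max(1, len(alive) // eta)]
        calls_per_round *= eta


qualities = [0.2, 0.9, 0.5, 0.4, 0.7, 0.1, 0.3, 0.8, 0.6]
best, total_calls = successive_halving(qualities)
```

In this toy run the best model (index 1) receives 13 calls and the whole search uses 27 calls. Training all nine models for 13 calls each would take 117.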

Deciding when to stop the training of models depends on how strongly the training data affects the score. There are two extremes:

when only the training data matter
- i.e., when the hyperparameters don’t influence the score at all
when only the hyperparameters matter
- i.e., when the training data don’t influence the score at all

Hyperband balances these two extremes by sweeping over how frequently models are stopped. This sweep allows a mathematical proof that Hyperband will find the best model possible with minimal partial_fit calls[3].
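The sweep can be made concrete by listing the “brackets” Hyperband runs. The sketch below follows the bracket formulas from the Hyperband paper as I understand them; the function name is hypothetical, `eta` plays the role that Dask-ML calls aggressiveness, and Dask-ML’s own bookkeeping may differ in its details.

```python
def hyperband_brackets(max_iter, eta=3):
    """Return (n_models, initial_calls) for each bracket, most aggressive first."""
    # s_max = floor(log_eta(max_iter)), computed with integers to avoid
    # floating-point error in the logarithm.
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # n_models = ceil((s_max + 1) / (s + 1) * eta**s), via integer ceiling
        numerator = (s_max + 1) * eta ** s
        n_models = -(-numerator // (s + 1))
        # Each model in this bracket starts with this many partial_fit calls.
        initial_calls = max_iter / eta ** s
        brackets.append((n_models, initial_calls))
    return brackets


brackets = hyperband_brackets(max_iter=81, eta=3)
```

The first bracket starts many models with very little training each and stops them often. The last bracket starts only a few models and trains each of them to completion.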

Hyperband has significant parallelism because it has two “embarrassingly parallel” for-loops – Dask can exploit this. Hyperband has been implemented in Dask, specifically in Dask’s machine learning library Dask-ML.

How well does it perform? Let’s illustrate via example. Some setup is required before the performance comparison in Performance.

Example

Note: want to try HyperbandSearchCV out yourself? Dask has an example use. It can even be run in-browser!

I’ll illustrate with a synthetic example. Let’s build a dataset with 4 classes:

>>> from experiment import make_circles
>>> X, y = make_circles(n_classes=4, n_features=6, n_informative=2)
>>> scatter(X[:, :2], color=y)

Note: this content is pulled from stsievert/dask-hyperband-comparison, or makes slight modifications.

Let’s build a fully connected neural net with 24 neurons for classification:

>>> from sklearn.neural_network import MLPClassifier
>>> model = MLPClassifier()

Building the neural net with PyTorch is also possible[4] (and what I used in development).

This neural net’s behavior is dictated by 7 hyperparameters. Only one controls the model of the optimal architecture (hidden_layer_sizes, the number of neurons in each layer). The rest control finding the best model of that architecture. Details on the hyperparameters are in the Appendix.

>>> params = ...  # details in appendix
>>> params.keys()
dict_keys(['hidden_layer_sizes', 'alpha', 'batch_size', 'learning_rate',
           'learning_rate_init', 'power_t', 'momentum'])
>>> params["hidden_layer_sizes"]  # always 24 neurons
[(24, ), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)]

I chose these hyperparameters to have a complex search space that mimics the searches performed for most neural networks. These searches typically involve hyperparameters like “dropout”, “learning rate”, “momentum” and “weight decay”.[5] End users don’t care about hyperparameters like these; they don’t change the model architecture, only finding the best model of a particular architecture.

How can high performing hyperparameter values be found quickly?

Finding the best parameters

First, let’s look at the parameters required for Dask-ML’s implementation of Hyperband (which is in the class HyperbandSearchCV).

Hyperband parameters: rule-of-thumb

HyperbandSearchCV has two inputs:

max_iter, which determines how many times to call partial_fit
the chunk size of the Dask array, which determines how much data each partial_fit call receives.

These fall out pretty naturally once it’s known how long to train the best model and very approximately how many parameters to sample:

n_examples = 50 * len(X_train)  # 50 passes through dataset for best model
n_params = 299  # sample about 300 parameters

# inputs to hyperband
max_iter = n_params
chunk_size = n_examples // n_params

The inputs to this rule-of-thumb are exactly what the user cares about:

a measure of how complex the search space is (via n_params)
how long to train the best model (via n_examples)

Notably, there’s no tradeoff between n_examples and n_params like with Scikit-learn’s RandomizedSearchCV because n_examples is only for some models, not for all models. There are more details on this rule-of-thumb in the “Notes” section of the HyperbandSearchCV docs.

With these inputs a HyperbandSearchCV object can easily be created.

Finding the best performing hyperparameters

This model selection algorithm Hyperband is implemented in the class HyperbandSearchCV. Let’s create an instance of that class:

>>> from dask_ml.model_selection import HyperbandSearchCV
>>>
>>> search = HyperbandSearchCV(
...     model, params, max_iter=max_iter, aggressiveness=4
... )

aggressiveness defaults to 3. aggressiveness=4 is chosen because this is an initial search; I know nothing about this search space. Given that, this search should be more aggressive in culling off bad models.

Hyperband hides some details from the user (which enables the mathematical guarantees), specifically the details on the amount of training and the number of models created. These details are available in the metadata attribute:

>>> search.metadata["n_models"]
378
>>> search.metadata["partial_fit_calls"]
5721

Now that we have some idea on how long the computation will take, let’s ask it to find the best set of hyperparameters:

>>> from dask_ml.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>>
>>> X_train = X_train.rechunk(chunk_size)
>>> y_train = y_train.rechunk(chunk_size)
>>>
>>> search.fit(X_train, y_train)

The dashboard will be active during this time[6]:

[Video: the Dask dashboard during the search]

How well do these hyperparameters perform?

>>> search.best_score_
0.9019221418447483

HyperbandSearchCV mirrors Scikit-learn’s API for RandomizedSearchCV, so it has access to all the expected attributes and methods:

>>> search.best_params_
{"batch_size": 64, "hidden_layer_sizes": [6, 6, 6, 6], ...}
>>> search.score(X_test, y_test)
0.8989070100111217
>>> search.best_estimator_
MLPClassifier(...)

Details on the attributes and methods are in the HyperbandSearchCV documentation.

Performance

I ran this 200 times on my personal laptop with 4 cores. Let’s look at the distribution of final validation scores:

The “passive” comparison is really RandomizedSearchCV configured so it takes an equal amount of work as HyperbandSearchCV. Let’s see how this does over time:

This graph shows the mean score over the 200 runs with the solid line, and the shaded region represents the interquartile range. The dotted green line indicates the data required to train 4 models to completion. “Passes through the dataset” is a good proxy for “time to solution” because there are only 4 workers.

This graph shows that HyperbandSearchCV will find parameters at least 3 times quicker than RandomizedSearchCV.

Dask opportunities

What opportunities does combining Hyperband and Dask create? HyperbandSearchCV has a lot of internal parallelism and Dask is an advanced task scheduler.

The most obvious opportunity involves job prioritization. Hyperband fits many models in parallel and Dask might not have that many workers available. This means some jobs have to wait for other jobs to finish. Of course, Dask can prioritize jobs[7] and choose which models to fit first.

Let’s assign the priority for fitting a certain model to be the model’s most recent score. How does this prioritization scheme influence the score? Let’s compare the prioritization schemes in a single run of the 200 above:

These two lines are the same in every way except for the prioritization scheme. This graph compares the “high scores” prioritization scheme and Dask’s default prioritization scheme (“fifo”).

This graph is certainly helped by the fact that it is run with only 4 workers. Job priority does not matter if every job can be run right away (there’s nothing to assign priority to!).
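The “high scores” scheme can be sketched with a plain priority queue. This is only an illustration of the ordering and the function name is hypothetical. In a real Dask program the same idea would be expressed by passing a `priority` value when submitting tasks to the scheduler.

```python
import heapq


def pick_jobs(latest_scores, n_workers):
    """Return the model ids to fit next, highest latest score first."""
    # heapq is a min-heap, so negate the scores to pop the best model first.
    heap = [(-score, model_id) for model_id, score in latest_scores.items()]
    heapq.heapify(heap)
    n_jobs = min(n_workers, len(heap))
    return [heapq.heappop(heap)[1] for _ in range(n_jobs)]


latest_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7, "e": 0.4}
few_workers = pick_jobs(latest_scores, n_workers=2)
many_workers = pick_jobs(latest_scores, n_workers=10)
```

With 2 workers only the two highest scoring models are fit right away. With 10 workers every model runs immediately, so the ordering no longer changes what gets computed.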

Amenability to parallelism

How does Hyperband scale with the number of workers?

I ran a separate experiment to measure this. This experiment is described more in the corresponding paper, but the relevant difference is that a PyTorch neural network is used through skorch instead of Scikit-learn’s MLPClassifier.

I ran the same experiment with a different number of Dask workers.[8] Here’s how HyperbandSearchCV scales:

Training one model to completion requires 243 seconds (which is marked by the white line). This is a comparison with patience, which stops training models if their scores aren’t increasing enough. Functionally, this is very useful because the user might accidentally specify n_examples to be too large.

It looks like the speedups start to saturate somewhere between 16 and 24 workers, at least for this example. Of course, patience doesn’t work as well for a large number of workers.[9]
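The patience idea mentioned above can be sketched as a simple plateau check. This is a simplified rule written for illustration. The function name and the default values are assumptions, and this is not necessarily the exact criterion that HyperbandSearchCV applies.

```python
def should_stop(scores, patience=3, tol=1e-3):
    """Stop if the last `patience` scores did not beat the earlier best by `tol`."""
    if len(scores) <= patience:
        return False  # not enough history to judge yet
    best_before = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    return recent_best < best_before + tol


still_improving = should_stop([0.5, 0.6, 0.7, 0.8])
plateaued = should_stop([0.5, 0.8, 0.8, 0.8, 0.8])
too_short = should_stop([0.5, 0.6])
```

A check like this caps the cost of setting n_examples too large, because a model whose score has plateaued stops consuming partial_fit calls.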

Future work

There are some ongoing pull requests to improve HyperbandSearchCV. The most significant of these involves tweaking some Hyperband internals so HyperbandSearchCV works better with initial or very exploratory searches (dask/dask-ml #532).

The biggest improvement I see is treating dataset size as the scarce resource that needs to be preserved instead of training time. This would allow Hyperband to work with any model, instead of only models that implement partial_fit.

Serialization is an important part of the distributed Hyperband implementation in HyperbandSearchCV. Scikit-learn and PyTorch can easily handle this because they support the Pickle protocol[10], but Keras/Tensorflow/MXNet present challenges. The use of HyperbandSearchCV could be increased by resolving this issue.

Appendix

I chose to tune 7 hyperparameters, which are

hidden_layer_sizes, which controls the number of neurons in each layer
alpha, which controls the amount of regularization

More hyperparameters control finding the best neural network:

batch_size, which controls the number of examples the optimizer uses to approximate the gradient
learning_rate, learning_rate_init, power_t, which control some basic hyperparameters for the SGD optimizer I’ll be using
momentum, a more advanced hyperparameter for SGD with Nesterov’s momentum.