Dask Working Notes - Posts tagged RAPIDS

cuML and Dask hyperparameter optimization

2019-03-27T00:00:00+00:00

DGX-1 Workstation
Host Memory: 512 GB
GPU Tesla V100 x 8
cudf 0.6
cuml 0.6
dask 1.1.4
Jupyter notebook

TLDR; Hyper-parameter Optimization is functional but slow with cuML

cuML and Dask Hyper-parameter Optimization

cuML is an open source GPU accelerated machine learning library primarily developed at NVIDIA which mirrors the Scikit-Learn API. The current suite of algorithms includes GLMs, Kalman Filtering, clustering, and dimensionality reduction. Many of these machine learning algorithms use hyper-parameters. These are parameters used during the model training process but are not “learned” during the training. Often these parameters are coefficients or penalty thresholds and finding the “best” hyper parameter can be computationally costly. In the PyData community, we often reach to Scikit-Learn’s GridSearchCV or RandomizedSearchCV for easy definition of the search space for hyper-parameters – this is called hyper-parameter optimization. Within the Dask community, Dask-ML has incrementally improved the efficiency of hyper-parameter optimization by leveraging both Scikit-Learn and Dask to use multi-core and distributed schedulers: Grid and RandomizedSearch with DaskML.

With the newly created drop-in replacement for Scikit-Learn, cuML, we experimented with Dask’s GridSearchCV. In the upcoming 0.6 release of cuML, the estimators are serializable and are functional within the Scikit-Learn/dask-ml framework, but slow compared with Scikit-Learn estimators. And while speeds are slow now, we know how to boost performance, have filed several issues, and hope to show performance gains in future releases.

All code and timing measurements can be found in this Jupyter notebook

Fast Fitting!

cuML is fast! But finding that speed requires developing a bit of GPU knowledge and some intuition. For example, there is a non-zero cost of moving data from device to GPU and, when data is “small” there are little to no performance gains. “Small”, currently might mean less than 100MB.

In the following example we use the diabetes data set provided by sklearn and linearly fit the data with RidgeRegression

\[ \min\limits_w ||y - Xw||^2_2 + alpha \* ||w||^2_2\]

alpha is the hyper-parameter and we initially set to 1.

import numpy as np
from cuml import Ridge as cumlRidge
import dask_ml.model_selection as dcv
from sklearn import datasets, linear_model
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)

fit_intercept = True
normalize = False
alpha = np.array([1.0])

ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

ridge.fit(X_train, y_train)
cu_ridge.fit(X_train, y_train)=

The above ran with a single timing measurement of:

Scikit-Learn Ridge: 28 ms
cuML Ridge: 1.12 s

But the data is quite small, ~28KB. Increasing the size to ~2.8GB and re-running we see significant gains:

dup_ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
dup_cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

# move data from host to device
record_data = (('fea%d'%i, dup_data[:,i]) for i in range(dup_data.shape[1]))
gdf_data = cudf.DataFrame(record_data)
gdf_train = cudf.DataFrame(dict(train=dup_train))

#sklearn
dup_ridge.fit(dup_data, dup_train)

# cuml
dup_cu_ridge.fit(gdf_data, gdf_train.train)

With new timing measurements of:

Scikit-Learn Ridge: 4.82 s ± 694 ms
cuML Ridge: 450 ms ± 47.6 ms

With more data we clearly see faster fitting times, but the time to move data to the GPU (through CUDF) was 19.7s. This cost of data movement is one of the reasons why RAPIDS/cuDF was developed – keep data on the GPU and avoid having to move back and forth.

Hyper-Parameter Optimization Experiments

So moving to the GPU can be costly, but once there, with larger data sizes, we gain significant performance optimizations. Naively, we thought, “well, we have GPU machine learning, we have distributed hyper-parameter optimization… we should have distributed, GPU-accelerated, hyper-parameter optimization!”

Scikit-Learn assumes a specific, but well defined API for estimators over which it will perform hyper-parameter optimization. Most estimators/classifiers in Scikit-Learn look like the following:

class DummyEstimator(BaseEstimator):
    def __init__(self, params=...):
        ...

    def fit(self, X, y=None):
        ...

    def predict(self, X):
        ...

    def score(self, X, y=None):
        ...

    def get_params(self):
        ...

    def set_params(self, params...):
        ...

When we started experimenting with hyper-parameter optimization, we found a few API holes missing, these were resolved, mostly handling matching argument structure and various getters/setters.

get_params and set_params (#271)
fix/clf-solver (#318)
map fit_transform to sklearn implementation (#330)
Fea get params small changes (#322)

With holes plugged up we tested again. Using the same diabetes data set, we are now performing hyper-parameter optimization and searching over many alpha parameters for the best scoring alpha.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)

Again, reminding ourselves that the data is small ~28KB, we don’t expect to observe cuml performing faster than sklearn. Instead, we want to demonstrate functionality.

Again, reminding ourselves that the data is small ~28KB, we don’t expect to observe cuml performing faster than Scikit-Learn. Instead, we want to demonstrate functionality. Additionally, we also tried swapping out Dask-ML’s implementation of GridSearchCV (which adheres to the same API as Scikit-Learn) to use all of the GPUs we have available in parallel.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)

Timing Measurements:

GridSearchCV	sklearn-Ridge	cuml-ridge
Scikit-Learn	88.4 ms ± 6.11 ms	6.51 s ± 132 ms
Dask-ML	873 ms ± 347 ms	740 ms ± 142 ms

Unsurprisingly, we see that GridSearchCV and Ridge Regression from Scikit-Learn is the fastest in this context. There is cost to distributing work and data, and as we previously mentioned, moving data from host to device.

How does performance scale as we scale data?

two_dup_data = np.array(np.vstack([X_train]*int(1e2)))
two_dup_train = np.array(np.hstack([y_train]*int(1e2)))
three_dup_data = np.array(np.vstack([X_train]*int(1e3)))
three_dup_train = np.array(np.hstack([y_train]*int(1e3)))

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(two_dup_data, two_dup_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(three_dup_data, three_dup_train)

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(three_dup_data, three_dup_train)

Timing Measurements:

Data (MB)	cuML+Dask-ML	sklearn+Dask-ML
2.8 MB	13.8s
28 MB	1min 17s	4.87 s

cuML + dask-ml (Distributed GridSearchCV) does significantly worse as data sizes increase! Why? Primarily, two reasons:

Non optimized movement of data between host and device compounded by N devices and the size of the parameter space
Scoring methods are not implemented in with cuML

Below is the Dask graph for the GridSearch

There are 50 (cv=5 times 10 parameters for alpha) instances of chunking up our test data set and scoring performance. That means 50 times we are moving data back forth between host and device for fitting and 50 times for scoring. That’s not great, but it’s also very solvable – build scoring functions for GPUs!

Immediate Future Work

We know the problems, GH Issues have been filed, and we are working on these issues – come help!

Built In Scorers (#242)
DeviceNDArray as input data (#369)
Communication with UCX (#2344)

Building GPU Groupby-Aggregations for Dask

2019-03-04T00:00:00+00:00

We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations like the following to work well.

df.groupby('x').y.mean()

This post describes the kind of work we had to do as a model for future development.

Plan

As outlined in a previous post, Dask, Pandas, and GPUs: first steps, our plan to produce distributed GPU dataframes was to combine Dask DataFrame with cudf. In particular, we had to

change Dask DataFrame so that it would parallelize not just around the Pandas DataFrames that it works with today, but around anything that looked enough like a Pandas DataFrame
change cuDF so that it would look enough like a Pandas DataFrame to fit within the algorithms in Dask DataFrame

Changes

On the Dask side this mostly meant replacing

Replacing isinstance(df, pd.DataFrame) checks with is_dataframe_like(df) checks (after defining a suitable is_dataframe_like/is_series_like/is_index_like functions
Avoiding some more exotic functionality in Pandas, and instead trying to use more common functionality that we can expect to be in most DataFrame implementations

On the cuDF side this means making dozens of tiny changes to align the cuDF API to the Pandas API, and to add in missing features.

Dask Changes:
- Remove explicit pandas checks and provide cudf lazy registration #4359
- Replace isinstance(…, pandas) with is_dataframe_like #4375
- Add has_parallel_type
- Lazily register more cudf functions and move to backends file #4396
- Avoid checking against types in is_dataframe_like #4418
- Replace cudf-specific code with dask-cudf import #4470
- Avoid groupby.agg(callable) in groupby-var #4482 – this one is notable in that by simplifying our Pandas usage we actually got a significant speedup on the Pandas side.
cuDF Changes:

I don’t really expect anyone to go through all of those issues, but my hope is that by skimming over the issue titles people will get a sense for the kinds of changes we’re making here. It’s a large number of small things.

Also, kudos to Thomson Comer who solved most of the cuDF issues above.

There are still some pending issues

Square Root #1055, needed for groupby-std
cuDF needs multi-index support for columns #483, needed for:
```
gropuby.agg({'x': ['sum', mean'], 'y': ['min', 'max']})
```

But things mostly work

But generally things work pretty well today:

In [1]: import dask_cudf

In [2]: df = dask_cudf.read_csv('yellow_tripdata_2016-*.csv')

In [3]: df.groupby('passenger_count').trip_distance.mean().compute()
Out[3]: <cudf.Series nrows=10 >

In [4]: _.to_pandas()
Out[4]:
0    0.625424
1    4.976895
2    4.470014
3    5.955262
4    4.328076
5    3.079661
6    2.998077
7    3.147452
8    5.165570
9    5.916169
dtype: float64

Experience

First, most of this work was handled by the cuDF developers (which may be evident from the relative lengths of the issue lists above). When we started this process it felt like a never-ending stream of tiny issues. We weren’t able to see the next set of issues until we had finished the current set. Fortunately, most of them were pretty easy to fix. Additionally, as we went on, it seemed to get a bit easier over time.

Additionally, lots of things work other than groupby-aggregations as a result of the changes above. From the perspective of someone accustomed to Pandas, The cuDF library is starting to feel more reliable. We hit missing functionality less frequently when using cuDF on other operations.

What’s next?

More recently we’ve been working on the various join/merge operations in Dask DataFrame like indexed joins on a sorted column, joins between large and small dataframes (a common special case) and so on. Getting these algorithms from the mainline Dask DataFrame codebase to work with cuDF is resulting in a similar set of issues to what we saw above with groupby-aggregations, but so far the list is much smaller. We hope that this is a trend as we continue on to other sets of functionality into the future like I/O, time-series operations, rolling windows, and so on.