Optimising hyper-parameters efficiently with Scikit-Optimize

One of the best-known techniques for experimenting with various model configurations is Grid Search.

With grid search, you specify a discrete search space (a parameter grid) of all of the parameter values you would like to test. The search works through the grid, testing each combination until all are exhausted. Based on a specified performance metric (e.g. error), you can then select the best parameter combination for your model.
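
To make this concrete, here's a minimal sketch of plain grid search using Scikit-Learn's GridSearchCV (the parameter values are arbitrary placeholders, not tuned recommendations):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# exhaustively test every combination: 3 x 2 = 6 candidates,
# each fitted once per cross-validation fold
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1.0], 'fit_intercept': [True, False]},
    cv=5
)
grid.fit(X, y)
grid.best_params_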

What's wrong with this?

If you have a large parameter grid, this approach scales poorly:

In [1]:
import numpy as np

param_grid = {
    'param_a': [0.01, 0.03, 0.1],
    'param_b': [0, 1, 2],
    'param_c': [1, 50, 100]
}

def num_searches(param_grid):
    return np.prod([len(p) for p in param_grid.values()])
    
num_searches(param_grid)
Out[1]:
27

Now suppose we want to search over four possible values for param_a instead, and add two new parameters:

In [2]:
param_grid = {
    'param_a': [0.01, 0.03, 0.1, 0.3],
    'param_b': [0, 1, 2],
    'param_c': [1, 50, 100],
    'param_d': ["a", "b"],
    'param_e': [0, 1, 2]
}

num_searches(param_grid)
Out[2]:
216

As you can see from the first grid, there are already 27 combinations to try; this jumps to 216 for the larger grid. Depending on the complexity of the model and the amount of data to process, exhaustively testing every combination can very easily become infeasible.

There are a few approaches to solving this, including:

  • breaking down the search into multiple smaller steps (such as searching param_a and param_b first, with defaults for the others, then using the best values found to search the remaining parameters - this can be tricky in practice)
  • searching the parameter space at random, which has the additional benefit of discovering better parameter values when random samples are drawn from a continuous range (see the sketch after this list)
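
As a sketch of the second approach, Scikit-Learn's RandomizedSearchCV can sample parameter values from continuous distributions rather than a fixed grid (this assumes scipy >= 1.4 for loguniform; the ranges are illustrative):

from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# draw 30 random candidates, with C sampled log-uniformly
# from the continuous range [1e-5, 100]
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': loguniform(1e-5, 100)},
    n_iter=30,
    random_state=1234
)
search.fit(X, y)
search.best_params_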

Scikit-Learn itself doesn't offer much beyond these two options, but some clever people have developed a drop-in replacement for Scikit-Learn's GridSearchCV and RandomizedSearchCV, called BayesSearchCV, in a package called Scikit-Optimize. Rather than sampling blindly, it uses Bayesian optimisation: each new parameter combination is chosen based on how previous combinations performed, so promising regions of the search space are explored first.

Let's install Scikit-Optimize and implement BayesSearchCV with a simple example!

Installing Scikit-Optimize

Assuming you already have Anaconda and Jupyter installed, you will need to do the following:

  • pip install scikit-optimize

If you have trouble installing, you may first need to run the following to install one of Scikit-Optimize's dependencies:

  • pip install scikit-garden
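
Once installed, a quick sanity check that the package imports cleanly:

import skopt
print(skopt.__version__)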

Implementing BayesSearchCV

Here's an example implementation using a sample dataset and Logistic Regression.

In [3]:
import warnings
warnings.filterwarnings('ignore')

from skopt import BayesSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# prep some sample data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=1234)

# we're using a logistic regression model
clf = LogisticRegression(random_state=1234, verbose=0)

# this is our search space: lists are treated as categorical
# dimensions, and (low, high, 'log-uniform') tuples as continuous
# ranges sampled on a log scale
param_grid = {
    'solver': ['liblinear', 'saga'],
    'penalty': ['l1', 'l2'],
    'tol': (1e-5, 1e-3, 'log-uniform'),
    'C': (1e-5, 100, 'log-uniform'),
    'fit_intercept': [True, False]
}

# set up our optimiser to find the best params in 30 searches
opt = BayesSearchCV(
    clf,
    param_grid,
    n_iter=30,
    random_state=1234,
    verbose=0
)

opt.fit(X_train, y_train)
In [4]:
print('Best params achieve a test score of', opt.score(X_test, y_test), ':')

opt.best_params_
Best params achieve a test score of 0.958041958042 :
Out[4]:
{'C': 100.0,
 'fit_intercept': True,
 'penalty': 'l1',
 'solver': 'liblinear',
 'tol': 0.00094035472283658726}

By increasing the value of n_iter, you can continue the search to find better parameter combinations. You can also use the fitted optimiser directly for prediction, by calling .predict() or .predict_proba() for probabilities, or extract the best estimator and use it standalone.
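
As a minimal sketch of those prediction calls (re-using the variables from the cells above):

preds = opt.predict(X_test)        # predicted class labels
probs = opt.predict_proba(X_test)  # class probabilities

And here's the best estimator itself: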

In [5]:
opt.best_estimator_
Out[5]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=1234, solver='liblinear',
          tol=0.00094035472283658726, verbose=0, warm_start=False)

You may also find it useful to re-use the best parameters programmatically to define an equivalent model:

In [6]:
LogisticRegression(**opt.best_params_)
Out[6]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear',
          tol=0.00094035472283658726, verbose=0, warm_start=False)
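
Note from the output above that best_params_ doesn't carry over settings like random_state (it's back to None), so pass those explicitly if you need reproducibility. As a final sketch (an assumed workflow, not part of the original run), you could refit this fresh model and confirm it matches the optimiser's test score:

final_clf = LogisticRegression(**opt.best_params_, random_state=1234)
final_clf.fit(X_train, y_train)
final_clf.score(X_test, y_test)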