Building a utility function wrapper for Scikit-Learn models

In a previous post we learned how to access a notebook programmatically using the ipynb package.

This is very powerful, as it allows a data scientist to focus on implementing a re-usable model, with fit and predict methods providing some structure to their code.

In this post, we're going to build a utility wrapper which takes the previous code and adds the following functionality:

  • Serialization, so we don't have to re-fit models if we don't need to
  • Scoring, so we can determine how well our model is performing
  • Feature importance, so we can determine the predictive power of individual features - and provide insight into feature selection

Building the wrapper

Here is the code:

In [1]:
import joblib  # sklearn.externals.joblib has been removed in newer scikit-learn releases
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.exceptions import NotFittedError
import os.path

class ModelUtils(object):
    
    def __init__(self, model, serialize_path=None):
        """
            If serialize_path is specified and valid, load the model from disk.
        """
        self.serialize_path = serialize_path
        
        if self.serialize_path is not None and os.path.exists(self.serialize_path):
            print('Loaded from', self.serialize_path)
            self.clf = joblib.load(self.serialize_path)
            self.is_fitted = True
            return
        
        self.clf = model
        self.is_fitted = False

    def fit(self, X, y):
        """
            Fit our model, saving the model to disk if serialize_path is specified.
        """
        # fit our model
        self.clf.fit(X, y)
        self.is_fitted = True
        
        # serialise to path
        if self.serialize_path is not None:
            joblib.dump(self.clf, self.serialize_path)
            print('Saved to', self.serialize_path)
            
        return self
    
    def predict(self, X):
        """
            Predict on X, raising NotFittedError if the model has not been fitted.
        """
        if not self.is_fitted:
            raise NotFittedError
        return self.clf.predict(X)
    
    def score(self, X, y_true):
        """
            Generates a score for the model based on predicting on X and comparing 
            to y_true.
        """
        y_pred = self.predict(X)
        return roc_auc_score(y_true, y_pred)
    
    def feature_importance(self, X, y, normalize=True):
        """
            To calculate feature importance, we iterate through each feature i, 
            generating a model score with all other features zeroed.
            
            If normalize is True, divide the results by the minimum score, such that
            each score represents "N times better than the worst feature".
        """
        scores = [self.score(self.__zero_except(X, i), y) for i in range(X.shape[1])]
        
        if normalize:
            # cast to a numpy array so the element-wise division is explicit
            return np.array(scores) / min(scores)
        return scores
    
    def __zero_except(self, X, i):
        """
            A helper function to replace all but the ith column with zeroes, and 
            return the result. (There is probably a cleaner way to do this.)
        """
        X_copy = X.copy()
        X_i = X[:,i]
        X_copy.fill(0)
        X_copy[:,i] = X_i
        return X_copy
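
As the __zero_except docstring hints, there is a tidier way to build the masked matrix. One option - a small sketch, assuming X is a NumPy array (this standalone helper isn't part of the original class) - is to start from numpy.zeros_like and copy back the single column we want to keep:

import numpy as np

def zero_except(X, i):
    """
        Return a copy of X with every column except the ith set to zero.
    """
    X_masked = np.zeros_like(X)
    X_masked[:, i] = X[:, i]
    return X_masked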

Using the wrapper

Now that we have our ModelUtils wrapper class, let's import CustomModel as before and put it to work.

When instantiating the wrapper, we specify test.pkl in the current directory as the location to serialize the model.

If serialize_path is configured and valid, the pre-fitted model will be loaded from there, and the predict function will be immediately available. If configured but the file does not exist, ModelUtils will serialize to this location after fitting the model.

In [2]:
from ipynb.fs.defs.model import CustomModel

model = ModelUtils(CustomModel(), serialize_path='test.pkl')

Let's load up the sample data again, fit our model, and then use it to generate some predictions. Note the Saved to test.pkl output.

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility

# fit our model (as before)
model.fit(X_train, y_train)

# generate some predictions (as before)
model.predict(X_test);  # fun fact: the ; character suppresses notebook output
Saved to test.pkl
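
Because test.pkl now exists on disk, constructing another ModelUtils pointing at the same path should load the fitted model straight away. As a quick sketch (not a cell from the original notebook, and reloaded is just an illustrative name):

# re-instantiate the wrapper against the same file; this time the constructor
# loads the fitted model from disk and prints 'Loaded from test.pkl'
reloaded = ModelUtils(CustomModel(), serialize_path='test.pkl')

# predictions are available immediately, without calling fit()
reloaded.predict(X_test);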

Model and feature performance

Finally, here are our two new functions.

First, let's score the performance of our model. We're using a metric called ROC AUC - we won't explain it in any detail in this post, but essentially it measures how well the model can separate the classes in y.
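
As a quick aside (not part of the original notebook), here is roc_auc_score on a toy example: two negatives and two positives, where three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.

from sklearn.metrics import roc_auc_score

# toy labels and predicted scores: higher scores should indicate class 1
roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75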

Then we will calculate relative feature importance for each of the 30 features in the sample dataset. Based on individual scoring performance, this means that the first feature is ~2.12x more powerful than the lowest-performing feature, and the best feature is ~35.86x more powerful.

In [4]:
# score our model
auc = model.score(X_test, y_test)
print('AUC:', auc)

# calculate relative feature importance
importance = model.feature_importance(X_test, y_test)

print('Top feature relative performance:', max(importance))
print(importance)
AUC: 0.989669421488
Top feature relative performance: 35.855513308
[  2.121673     8.6730038   34.85551331   2.06844106  25.14068441
  29.74904943  33.68060837  34.68441065  24.88973384  15.51711027
   4.66920152  18.64638783   4.85931559  34.7148289   16.36121673
  24.82129278  27.2851711   27.96958175  15.74904943  16.24334601   1.
  29.3269962   35.85551331  35.83269962  26.1634981   30.85931559
  33.77186312  35.12927757  27.24714829  23.27376426]
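
To make these numbers easier to interpret, we can pair each score with its feature name from the dataset and rank them. A small sketch (ranked is just an illustrative variable name):

from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names

# pair each feature with its relative importance and sort, best first
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)

# show the five most predictive features by this measure
for name, score in ranked[:5]:
    print(name, round(score, 2))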

👏