Building a utility function wrapper for Scikit-Learn models
In a previous post we learned how to access a notebook programmatically using the ipynb package.
This is very powerful, as it allows a data scientist to focus on implementing a re-usable model, specifying fit and predict methods to provide some structure to their code.
In this post, we're going to build a utility wrapper which takes the previous code and adds the following functionality:
- Serialization, so we don't have to re-fit models if we don't need to
- Scoring, so we can determine how well our model is performing
- Feature importance, so we can determine the predictive power of individual features - and provide insight into feature selection
Building the wrapper
Here is the code:
import os.path

import joblib  # note: older scikit-learn versions exposed this as sklearn.externals.joblib
from sklearn.metrics import roc_auc_score
from sklearn.exceptions import NotFittedError


class ModelUtils(object):

    def __init__(self, model, serialize_path=None):
        """
        If serialize_path is specified and valid, load the model from disk.
        """
        self.serialize_path = serialize_path
        if self.serialize_path is not None and os.path.exists(self.serialize_path):
            print('Loaded from', self.serialize_path)
            self.clf = joblib.load(self.serialize_path)
            self.is_fitted = True
            return
        self.clf = model
        self.is_fitted = False

    def fit(self, X, y):
        """
        Fit our model, saving the model to disk if serialize_path is specified.
        """
        # fit our model
        self.clf.fit(X, y)
        self.is_fitted = True
        # serialise to path
        if self.serialize_path is not None:
            joblib.dump(self.clf, self.serialize_path)
            print('Saved to', self.serialize_path)
        return self

    def predict(self, X):
        """
        Predict on X, raising NotFittedError if the model has not yet been fitted.
        """
        if not self.is_fitted:
            raise NotFittedError
        return self.clf.predict(X)

    def score(self, X, y_true):
        """
        Generates a score for the model based on predicting on X and comparing
        to y_true.
        """
        y_pred = self.predict(X)
        return roc_auc_score(y_true, y_pred)

    def feature_importance(self, X, y, normalize=True):
        """
        To calculate feature importance, we iterate through each feature i,
        generating a model score with all other features zeroed.

        If normalize is True, divide the results by the minimum score, such that
        each score represents "N times better than the worst feature".
        """
        scores = [self.score(self.__zero_except(X, i), y) for i in range(X.shape[1])]
        if normalize:
            worst = min(scores)
            return [s / worst for s in scores]
        return scores

    def __zero_except(self, X, i):
        """
        A helper function to replace all but the ith column with zeroes, and
        return the result. (There is probably a cleaner way to do this.)
        """
        X_copy = X.copy()
        X_i = X[:, i]
        X_copy.fill(0)
        X_copy[:, i] = X_i
        return X_copy
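As the docstring admits, there is probably a cleaner way to implement __zero_except; one possible alternative (just a sketch, using numpy directly rather than copy-and-fill) is to start from an all-zero array and copy across a single column:

import numpy as np

def zero_except(X, i):
    # a standalone alternative to ModelUtils.__zero_except: build an all-zero
    # array with the same shape and dtype as X, then copy in column i only
    X_zeroed = np.zeros_like(X)
    X_zeroed[:, i] = X[:, i]
    return X_zeroed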
Using the wrapper
Now that we have our ModelUtils wrapper class, let's import CustomModel as before and put it to work.
As we instantiate the wrapper, we specify test.pkl in the current directory as the location to serialize the model. If serialize_path is configured and the file already exists, the pre-fitted model is loaded from there and the predict function is immediately available. If it is configured but the file does not exist, ModelUtils will serialize the model to this location after fitting.
from ipynb.fs.defs.model import CustomModel
model = ModelUtils(CustomModel(), serialize_path='test.pkl')
Let's load up the sample data again, fit our model, and then use it to create some predictions. Note the Saved to test.pkl output.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    test_size=0.25,
                                                    random_state=1234)  # more reproducibility
# fit our model (as before)
model.fit(X_train, y_train)
# generate some predictions (as before)
model.predict(X_test); # fun fact: the ; character suppresses notebook output
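To see the load path in action, we could instantiate a second wrapper pointing at the same file; a minimal sketch (model_reloaded is just an illustrative name), which should print Loaded from test.pkl and predict without any call to fit:

# a fresh wrapper pointed at the existing test.pkl picks up the fitted model
model_reloaded = ModelUtils(CustomModel(), serialize_path='test.pkl')
model_reloaded.predict(X_test);  # no call to fit required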
Model and feature performance
Finally, here are our two new functions.
First, let's score the performance of our model. This uses a metric called ROC AUC - we won't explain it in detail in this post, but essentially it measures how well the model can separate the classes in y.
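For a rough intuition, predictions that rank every positive example above every negative one score 1.0, a reversed ranking scores 0.0, and constant predictions score 0.5; a quick toy sketch:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0 - classes fully separated
print(roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1]))  # 0.0 - ranking exactly reversed
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5 - no better than chance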
Then we will calculate the relative feature importance of each of the 30 features in the sample dataset. Based on individual scoring performance, this means that the first feature is ~2.12x more powerful than the lowest-performing feature, and the best feature is ~35.86x more powerful.
# score our model
auc = model.score(X_test, y_test)
print('AUC:', auc)
# calculate relative feature importance
importance = model.feature_importance(X_test, y_test)
print('Top feature relative performance:', max(importance))
print(model.feature_importance(X_test, y_test))
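To make these numbers easier to read, we could pair them with the dataset's feature names (available via load_breast_cancer().feature_names) and sort them; a small sketch, assuming the importance list from above is still in scope:

# pair each relative importance with its feature name and show the strongest first
feature_names = load_breast_cancer().feature_names
ranking = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:5]:
    print(name, round(float(score), 2))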
👏