In a previous post we learned how to access a notebook programmatically using the
This is very powerful as it allows a data scientist to focus on implementing a model which is re-usable, specifying a
predict method to provide some structure to their code.
In this post, we're going to build a utility wrapper which takes the previous code and adds the following functionality:
- Serialization, so we don't have to re-fit models if we don't need to
- Scoring, so we can determine how well our model is performing
- Feature importance, so we can determine the predictive power of individual features and gain insight into feature selection
Building the wrapper
Here is the code:
import os.path

# sklearn.externals.joblib was removed in scikit-learn 0.23;
# the standalone joblib package provides the same dump/load functions
import joblib
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.metrics import roc_auc_score


class ModelUtils(object):
    def __init__(self, model, serialize_path=None):
        """
        If serialize_path is specified and valid, load the model from disk.
        """
        self.serialize_path = serialize_path

        if self.serialize_path is not None and os.path.exists(self.serialize_path):
            print('Loaded from', self.serialize_path)
            self.clf = joblib.load(self.serialize_path)
            self.is_fitted = True
            return

        self.clf = model
        self.is_fitted = False

    def fit(self, X, y):
        """
        Fit our model, saving the model to disk if serialize_path is specified.
        """
        # fit our model
        self.clf.fit(X, y)
        self.is_fitted = True

        # serialise to path
        if self.serialize_path is not None:
            joblib.dump(self.clf, self.serialize_path)
            print('Saved to', self.serialize_path)

        return self

    def predict(self, X):
        if not self.is_fitted:
            raise NotFittedError
        return self.clf.predict(X)

    def score(self, X, y_true):
        """
        Generates a score for the model based on predicting on X and comparing to y_true.
        """
        y_pred = self.predict(X)
        return roc_auc_score(y_true, y_pred)

    def feature_importance(self, X, y, normalize=True):
        """
        To calculate feature importance, we iterate through each feature i,
        generating a model score with all other features zeroed.

        If normalize is True, divide the results by the minimum score, such that
        each score represents "N times better than the worst feature".
        """
        # X.shape[1] is the number of feature columns
        scores = np.array([self.score(self.__zero_except(X, i), y)
                           for i in range(X.shape[1])])

        if normalize:
            return scores / scores.min()
        return scores

    def __zero_except(self, X, i):
        """
        A helper function to replace all but the ith column with zeroes,
        and return the result.
        (There is probably a cleaner way to do this.)
        """
        X_copy = X.copy()
        X_i = X[:, i]       # column i of the original, untouched by fill() below
        X_copy.fill(0)
        X_copy[:, i] = X_i
        return X_copy
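The docstring of __zero_except suspects there is a cleaner way to zero all but one column. One possibility, sketched here as a standalone function rather than a change to the class (the name zero_except and the small demo array are illustrative only), is to start from an all-zero array and copy the single column across:

```python
import numpy as np


def zero_except(X, i):
    """Return a copy of X in which every column except column i is zero."""
    X_zeroed = np.zeros_like(X)   # same shape and dtype as X, all zeros
    X_zeroed[:, i] = X[:, i]      # copy across only the column we keep
    return X_zeroed


# Made-up data, purely for illustration
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
result = zero_except(X_demo, 1)
print(result)
```

This should behave the same as the copy-then-fill version for NumPy arrays, and it leaves the input array unchanged.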
Using the wrapper
Now that we have our ModelUtils wrapper class, let's import CustomModel as before and put it to work.
As we instantiate the wrapper, we're specifying test.pkl in the current directory as the location to serialize the model. If serialize_path is configured and valid, the pre-fitted model will be loaded from there, and the predict function will be immediately available. If configured but the file does not exist, ModelUtils will serialize to this location after fitting the model.
from ipynb.fs.defs.model import CustomModel

model = ModelUtils(CustomModel(), serialize_path='test.pkl')
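The load-or-fit behaviour described above rests on joblib's dump and load functions. As a minimal round-trip sketch, assuming a stand-in scikit-learn estimator (a scaled LogisticRegression, since CustomModel lives in the earlier notebook) and a temporary file instead of test.pkl:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stand-in estimator: an assumption for this sketch, not the post's CustomModel
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.pkl')   # hypothetical file name
    joblib.dump(clf, path)                     # what fit() does when serialize_path is set
    file_existed = os.path.exists(path)        # the check __init__ performs
    restored = joblib.load(path)               # what __init__ does when the file exists

# The restored model should predict exactly as the original does
same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
print('Round trip preserved predictions:', same_predictions)
```

Because the wrapper loads whatever file it finds at serialize_path, a stale file from an earlier experiment would be picked up silently; deleting the file is one way to force a re-fit.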
Let's load the sample data again, fit our model, and then use it to create some predictions. Note the Saved to test.pkl output.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=1234)  # more reproducibility

# fit our model (as before)
model.fit(X_train, y_train)

# generate some predictions (as before)
model.predict(X_test);  # fun fact: the ; character suppresses notebook output
Saved to test.pkl
Model and feature performance
Finally, here are our two new functions.
First, let's score the performance of our model. This uses a metric called ROC AUC. We won't explain it in any detail in this post, but essentially it measures how well the model can separate the classes.
Then we will calculate relative feature importance for each of the 30 features in the sample dataset. Based on individual scoring performance, the first feature is ~2.12x more powerful than the lowest-performing feature, and the best feature is ~35.86x more powerful.
# score our model
auc = model.score(X_test, y_test)
print('AUC:', auc)

# calculate relative feature importance
importance = model.feature_importance(X_test, y_test)
print('Top feature relative performance:', max(importance))
print(importance)
AUC: 0.989669421488
Top feature relative performance: 35.855513308
[  2.121673     8.6730038   34.85551331   2.06844106  25.14068441
  29.74904943  33.68060837  34.68441065  24.88973384  15.51711027
   4.66920152  18.64638783   4.85931559  34.7148289   16.36121673
  24.82129278  27.2851711   27.96958175  15.74904943  16.24334601
   1.          29.3269962   35.85551331  35.83269962  26.1634981
  30.85931559  33.77186312  35.12927757  27.24714829  23.27376426]
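For a little intuition about the AUC figure, without going into detail: ROC AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting as half. The sketch below checks that reading against roc_auc_score on four made-up points (the helper name pairwise_auc and the data are illustrative only, not part of the wrapper).

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Made-up labels and scores, purely for illustration
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

manual = pairwise_auc(y_demo, scores_demo)
reference = roc_auc_score(y_demo, scores_demo)
print(manual, reference)
```

Note that ModelUtils.score passes the output of predict to roc_auc_score, so the result depends on what CustomModel's predict returns. If it returns hard 0/1 labels rather than probabilities, many pairs tie and the score is coarser than one computed from probabilities.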