Automating error analysis with RuleFit models
When building machine learning models, the goal is generally to improve the performance of a model according to some performance metric. One of the simplest metrics is error. Error is simply the complement of model accuracy - so a model with 95% accuracy has 5% error.
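As a toy illustration (made-up numbers, not from any model in this post):
# error is simply the complement of accuracy
accuracy = 0.95
error = 1 - accuracy
print('error:', error)  # 0.05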
There are many ways to improve the performance of a model and, in turn, decrease its error. These include adding more training observations (rows), enriching training observations with more features (columns), modifying the model algorithm, or optimising the algorithm's parameters.
Reducing error by introducing new features¶
In the post linked above we looked at how to optimise the parameters of a given algorithm, so for now we're interested in what we can do with the data itself.
While there are many ways to collect more training observations, this is often infeasible due to considerations of cost (imagine the expense of high-end medical studies) and time (such as waiting for enough events to occur).
The next option we have is to introduce new features to the observations already in our model. There are often lots of different approaches here too, including:
- engineering new features based on existing features (see the brief sketch after this list)
- creating new features from available data not already used
- making more data available (such as from external providers)
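To make the first option concrete, here is a minimal, hypothetical sketch of engineering a new feature from existing columns of the breast cancer data used later in this post (the choice of columns and the new feature name are purely illustrative assumptions):
import pandas as pd
from sklearn.datasets import load_breast_cancer
# load the same dataset used later in this post
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# hypothetical engineered feature: the ratio of two existing columns
df['mean_area_to_perimeter'] = df['mean area'] / df['mean perimeter']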
The goal of this post is not to assess which of these approaches to take, but to explore a method for identifying where the gaps are, so you can weigh up all of the options available to you.
Modelling error analysis¶
To get started, let's install an implementation of RuleFit from GitHub using pip:
pip install git+https://github.com/christophM/rulefit
Now we're going to load up a sample data set to work on, partitioning it into data for training our initial model, and data for testing its performance. Note that feature names will be important for this exercise.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# load our data - we also care about feature names
data = load_breast_cancer()
data.feature_names = list(map(lambda s: s.replace(' ', '_'), data.feature_names))
X, y = data.data, data.target
# split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    test_size=0.25,
                                                    random_state=1234)  # for reproducibility
With our data ready, let's build a quick logistic regression model on the training data. We're also going to generate predictions for our test data as probabilities of the positive class (that is, the likelihood of the class label being True).
from sklearn.linear_model import LogisticRegression
import numpy as np
# define our model
model = LogisticRegression(random_state=1234)
# fit our model
model.fit(X_train, y_train)
# generate some predictions
y_hat = model.predict_proba(X_test)[:,1]
At the start of this post we discussed model error, so let's now calculate this for our model to see how much room for improvement there is.
# calculate the error on each observation in the test set
y_error = np.absolute(y_test - y_hat)
# is there much room for improvement?
print('model error:', y_error.mean())
It looks like there is almost 7% mean absolute error. Maybe we can find some good leads for improving on this?
To do so, we're going to create a new model using the RuleFit class, but instead of targeting the original class label y, we're going to model the absolute error of each observation. The absolute error is the difference between the discrete actual value of y (0 or 1) and the continuous positive-class probability we predicted (0.0 to 1.0).
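As a quick worked example (with made-up numbers, not values from our test set):
# a confident, correct prediction gives a small error
abs(1 - 0.93)   # 0.07
# a confident but wrong prediction gives a large error
abs(1 - 0.10)   # 0.90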
from rulefit import RuleFit
from sklearn.ensemble import GradientBoostingRegressor
# define and fit our shiny new RuleFit model
generator_params = {
    'max_depth': 5,  # control the complexity of our rules
    'n_estimators': 1000,
    'learning_rate': 0.003,
    'random_state': 1234
}
generator = GradientBoostingRegressor(**generator_params)
rf = RuleFit(tree_generator=generator)
rf.fit(X_test, y_error, feature_names=data.feature_names)
With the RuleFit model fitted to our errors, we can generate a set of rules that might help us to isolate areas of our data that need enriching with new features. RuleFit also generates rules for data where the model is performing well (these have negative coefficients against the error), but we can ignore them for error analysis by filtering for coef > 0.
If we multiply the coefficient and support values calculated by RuleFit, we can use that as a rough estimate of how much error is due to that subset of the data. By summing these estimates, we get an approximate amount of error explained by these rules. This will differ from the model error above simply because our rules may not perfectly fit our errors (that is, our error model has its own error).
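One quick way to sanity-check that (a minimal sketch, assuming the fitted rf object above and the rulefit package's predict method) is to compare the error model's predictions against the observed errors:
# how well does our error model fit the errors themselves?
y_error_hat = rf.predict(X_test)
print('error-model MAE:', np.absolute(y_error - y_error_hat).mean())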
import pandas as pd
# get the outputs
rules = rf.get_rules()
# remove the rules we're not interested in - if the coefficient isn't above 0
# the rule is not a good indicator of an area for improvement
rules = rules[(rules.coef > 0.) & (rules.type != 'linear')]
# we can estimate an effect for each rule on the error score from above by
# multiplying the coefficient and support values
rules['effect'] = rules['coef'] * rules['support']
print('modelled error:', rules['effect'].sum())
print('unexplained error:', np.maximum(y_error.mean() - rules['effect'].sum(), 0))
Let's take a look at the top 10 rules:
# display the top 10 rules by effect
pd.set_option('display.max_colwidth', None)  # show the full rule strings
rules.nlargest(10, 'effect')
To wrap up, let's produce a report of the top 3 rules, including up to 10 examples from the data to which the rules apply.
This report can be used in conjunction with subject-matter expertise on the data to isolate areas for feature enrichment, and so improve your model!
from IPython.display import display
import pandas as pd
# prepare a dataframe for use below (we really care about the `query` function)
df = pd.DataFrame(X_test, columns=data.feature_names)
df['y_error'] = y_error
df['y_sq_error'] = y_error**2
for index, rule in rules.nlargest(3, 'effect').iterrows():
    print('rule:', rule['rule'])
    print('support:', rule['support'])
    print('coef:', rule['coef'])
    print('estimated error effect (support x coef):', rule['effect'])
    # it might be useful to compare the local error to the estimated model effect
    print('rule MAE:', df.query(rule['rule'])['y_error'].mean())
    print('rule RMSE:', df.query(rule['rule'])['y_sq_error'].mean()**(1/2))
    # we can use the rule to filter the data
    display(df.query(rule['rule']).nlargest(10, 'y_error'))
✨