Automating error analysis with RuleFit models

When building machine learning models, the goal is generally to improve a model's performance as measured by some metric. One of the simplest metrics is error. For a classifier, error is simply the complement of accuracy, so a model with 95% accuracy has 5% error.
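
As a quick illustration, the sketch below computes accuracy and error for a handful of made-up labels and predictions (hypothetical values, not the data used later in this post):

```python
# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# accuracy is the share of predictions that match the label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# error is the share that do not match, so the two always sum to 1
error = 1 - accuracy

print('accuracy:', accuracy)
print('error:', round(error, 2))
```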

There are many ways to improve the performance of a model and so decrease its error. These include adding more training observations (rows), enriching training observations with more features (columns), modifying the model algorithm, or optimising the algorithm parameters.

Reducing error by introducing new features

In the post linked above we looked at how to optimise the parameters of a given algorithm, so for now we're interested in what we can do with the data itself.

While there are many ways to create more training observations, this is often infeasible due to cost (imagine the cost of high-end medical studies) or time (such as waiting for enough events to occur).

The next option we have is to introduce new features to the observations already in our model. There are often many different approaches here too, including:

  • engineering new features based on existing features
  • creating new features from available data not already used
  • making more data available (such as from external providers)
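
As a small sketch of the first option, a new feature can be derived from two existing ones. The area_ratio feature below is purely hypothetical and is only meant to show the mechanics; whether such a feature helps is something the error analysis and domain knowledge would need to decide.

```python
# Two observations with example values for two existing features;
# the derived feature itself is a hypothetical example
observations = [
    {'mean_area': 431.1, 'worst_area': 553.6},
    {'mean_area': 728.2, 'worst_area': 830.9},
]

# engineer a new feature from two existing ones: how much larger
# the worst-case area is than the mean area
for row in observations:
    row['area_ratio'] = row['worst_area'] / row['mean_area']

print([round(row['area_ratio'], 3) for row in observations])
```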

The goal of this post is not to assess which approach to take, but to explore a method for identifying where the gaps are, to help you assess all of the options available to you.

Modelling error analysis

To get started, let's install an implementation of RuleFit from GitHub using pip:

pip install git+https://github.com/christophM/rulefit

Now we're going to load up a sample data set to work on, partitioning it into data for training our initial model, and data for testing its performance. Note that feature names will be important for this exercise.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load our data - we also care about feature names
data = load_breast_cancer()
data.feature_names = list(map(lambda s: s.replace(' ' , '_'), data.feature_names))
X, y = data.data, data.target

# split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility
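
For a sense of the sizes involved, the sketch below works out how many rows end up in each partition. The 569-row count is the size of the breast cancer data set; the rounding behaviour (test size rounded up, training size rounded down) is an assumption about how scikit-learn resolves fractional sizes.

```python
import math

n_samples = 569      # rows in the breast cancer data set
test_size = 0.25
train_size = 0.75

# assumption: fractional sizes are rounded up for the test set
# and down for the training set
n_test = math.ceil(test_size * n_samples)
n_train = math.floor(train_size * n_samples)

print('train rows:', n_train)
print('test rows:', n_test)
```

The test set size matters later on, because the rules are fitted on the test observations only.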

With our data ready, let's build a quick logistic regression model on the training data. We're also going to generate predictions for our test data (as positive probabilities, or the likelihood of the class label being True).

In [2]:
from sklearn.linear_model import LogisticRegression
import numpy as np

# define our model
model = LogisticRegression(random_state=1234)

# fit our model
model.fit(X_train, y_train)

# generate some predictions
y_hat = model.predict_proba(X_test)[:,1]

At the start of this post we discussed model error, so let's now calculate this for our model to see how much room for improvement there is.

In [3]:
# calculate error on each observation in the test set
y_error = np.absolute(y_test - y_hat)

# is there much room for improvement?
print('model error:', y_error.mean())
model error: 0.0705335674785

It looks like the mean absolute error is about 7%. Maybe we can find some good leads for improving on this?

To do so, we're going to create a new model using the RuleFit class, but instead of targeting the original class label y, we're going to target the absolute error of each observation.

The absolute error is the difference between the discrete actual value of y (0 or 1) and the continuous positive probability we predicted (0.0 to 1.0).
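
A tiny worked example may help here. The labels and probabilities below are made up for illustration; only the calculation mirrors what we did above.

```python
import numpy as np

# hypothetical labels and predicted positive probabilities
y_true = np.array([1, 0, 1, 0])
p_pos = np.array([0.9, 0.2, 0.4, 0.7])

# per-observation absolute error, then its mean
abs_error = np.absolute(y_true - p_pos)
print(abs_error)
print(abs_error.mean())
```

A confident, correct prediction gives an error close to 0, while a confident, wrong prediction gives an error close to 1. This is what makes the absolute error a useful continuous target for a regression model.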

In [4]:
from rulefit import RuleFit
from sklearn.ensemble import GradientBoostingRegressor

# define and fit our shiny new RuleFit model
generator_params = {
    'max_depth': 5,       # control the complexity of our rules
    'n_estimators': 1000,  
    'learning_rate': 0.003,
    'random_state': 1234
}
generator = GradientBoostingRegressor(**generator_params)

rf = RuleFit(generator)
rf.fit(X_test, y_error, feature_names=data.feature_names)
Out[4]:
RuleFit(tree_generator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.003, loss='ls', max_depth=5,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=1234, subsample=1.0, verbose=0,
             warm_start=False))

With the RuleFit model fitted to our errors, we can generate a set of rules that might help us to isolate areas of our data that need enriching with new features.
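
To make the vocabulary concrete, here is a minimal sketch of what a rule and its support look like. The worst_area values are made up, the thresholds are borrowed from one of the rules shown later, and treating support as the fraction of observations a rule applies to is an assumption about how this implementation reports it.

```python
# hypothetical worst_area values for five observations
worst_area = [400.0, 553.6, 830.9, 939.7, 1200.0]

# a rule is a conjunction of threshold conditions on features
def rule_applies(value):
    return value > 548.650024414 and value <= 1086.0

matches = [rule_applies(v) for v in worst_area]

# support is the fraction of observations the rule applies to
support = sum(matches) / len(matches)

print(matches)
print('support:', support)
```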

RuleFit also generates rules describing data where the model performs well (these rules lower the predicted error), but we can ignore them for error analysis by keeping only rules with coef > 0.

If we multiply the coefficient and support values calculated by RuleFit, we can use that as a rough estimate for how much error is due to that subset of the data.
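
As a worked example, the sketch below applies this to the coefficient and support values of the two top rules reported later in this post. It is only the arithmetic, not anything RuleFit does internally.

```python
# (coef, support) pairs for the two top rules reported later in this post
top_rules = {
    883: (0.05616597023836435, 0.21678321678321677),
    1176: (0.26554881554597404, 0.04195804195804196),
}

# effect = coefficient x support
effects = {idx: coef * support for idx, (coef, support) in top_rules.items()}

for idx, effect in effects.items():
    print(idx, round(effect, 6))
```

Note how the two rules end up with similar effects for different reasons: the first applies to many observations with a small coefficient, while the second applies to few observations with a large one.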

By summing these estimates, we get an approximate amount of error explained by these rules. This will differ from the above simply because our rules may not perfectly fit our errors (that is, our error model has its own error).
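
The gap between the two figures can be sketched as follows, using the mean error reported above and the modelled error reported below. Clamping at zero is a choice made here so that the result stays meaningful if the rules ever over-explain the error.

```python
mean_error = 0.0705335674785           # mean absolute error reported above
modelled_error = 0.046811390404179753  # sum of rule effects reported below

# clamp at zero in case the rules over-explain the error
unexplained = max(mean_error - modelled_error, 0)

print('unexplained error:', unexplained)
```

Python's built-in max is used for the clamp because it compares two values directly; np.maximum would do the same element-wise.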

In [5]:
import pandas as pd

# get the outputs
rules = rf.get_rules()

# remove the rules we're not interested in. if the coefficient isn't above 0
# the rule is not a good indicator of an area for improvement
# (copy so that adding a column below doesn't modify a view of the original)
rules = rules[(rules.coef > 0.) & (rules.type != 'linear')].copy()

# we can estimate an effect for each rule on the error score from above by 
# multiplying the coefficient and support values
rules['effect'] = rules['coef'] * rules['support']

print('modelled error:', rules['effect'].sum())
print('unexplained error:', max(y_error.mean() - rules['effect'].sum(), 0))
modelled error: 0.046811390404179753
unexplained error: 0.0237221770743

Let's take a look at the top 10 rules:

In [6]:
# display the top 10 rules by effect
pd.set_option('display.max_colwidth', None)  # show the full rule text
rules.nlargest(10,'effect')
Out[6]:
     | rule | type | coef | support | effect
883  | worst_area <= 976.25 & worst_area > 553.299987793 & radius_error > 0.275350004435 | rule | 0.056166 | 0.216783 | 0.012176
1176 | compactness_error > 0.0203649997711 & worst_fractal_dimension <= 0.113150000572 & area_error > 23.2399997711 & radius_error <= 0.339100003242 | rule | 0.265549 | 0.041958 | 0.011142
769  | worst_area > 548.650024414 & area_error > 23.2350006104 & worst_area <= 976.25 | rule | 0.028937 | 0.223776 | 0.006475
458  | worst_radius > 15.345000267 & worst_concavity > 0.207249999046 & worst_symmetry > 0.203749999404 & worst_symmetry <= 0.560700058937 & worst_radius <= 16.8100013733 | rule | 0.101535 | 0.041958 | 0.004260
1548 | worst_symmetry > 0.203749999404 & area_error <= 33.375 & worst_symmetry <= 0.560700058937 & worst_perimeter > 100.555000305 | rule | 0.024403 | 0.153846 | 0.003754
5361 | radius_error > 0.344399988651 & symmetry_error <= 0.017725000158 & area_error > 23.2399997711 & worst_perimeter <= 105.199996948 | rule | 0.133092 | 0.027972 | 0.003723
357  | worst_radius > 15.5699996948 & worst_symmetry > 0.203749999404 & mean_texture > 19.1049995422 & worst_symmetry <= 0.560700058937 & radius_error <= 0.418200016022 | rule | 0.032694 | 0.069930 | 0.002286
57   | worst_area > 548.650024414 & worst_area <= 1086.0 | rule | 0.002326 | 0.405594 | 0.000943
1433 | worst_area > 548.650024414 & mean_perimeter > 79.2050018311 & mean_texture > 21.1399993896 & area_error <= 36.4049987793 | rule | 0.012087 | 0.069930 | 0.000845
5547 | worst_area > 744.0 & worst_symmetry > 0.203749999404 & mean_texture > 19.1049995422 & worst_symmetry <= 0.560700058937 & radius_error <= 0.418200016022 | rule | 0.004637 | 0.069930 | 0.000324

To wrap up, let's produce a report of the top 3 rules, including up to 10 examples from the data to which the rules apply.

This report can be used in conjunction with subject-matter expertise on the data to isolate areas for feature enrichment, to improve your model!

In [7]:
from IPython.display import display
import pandas as pd

# prepare a dataframe for use below (we really care about the `query` function)
df = pd.DataFrame(X_test, columns=data.feature_names)
df['y_error'] = y_error
df['y_sq_error'] = y_error**2

for index, rule in rules.nlargest(3, 'effect').iterrows():
    print('rule:', rule['rule'])
    print('support:', rule['support'])
    print('coef:', rule['coef'])
    print('estimated error effect (support x coef):', rule['effect'])
    
    # it might be useful to compare the local error to the estimated model effect
    print('rule MAE:', df.query(rule['rule'])['y_error'].mean())
    print('rule RMSE:', df.query(rule['rule'])['y_sq_error'].mean()**(1/2))
    
    # we can use the rule to filter the data
    display(df.query(rule['rule']).nlargest(10, 'y_error'))
rule: worst_area <= 976.25 & worst_area > 553.299987793 & radius_error > 0.275350004435
support: 0.21678321678321677
coef: 0.05616597023836435
estimated error effect (support x coef): 0.012175839702023041
rule MAE: 0.2658110370900938
rule RMSE: 0.4155759177730745
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension y_error y_sq_error
91 11.76 18.14 75.00 431.1 0.09968 0.05914 0.02685 0.03515 0.1619 0.06287 ... 85.10 553.6 0.1137 0.07974 0.0612 0.07160 0.1978 0.06915 0.974325 0.949308
83 15.37 22.76 100.20 728.2 0.09200 0.10360 0.11220 0.07483 0.1717 0.06097 ... 107.50 830.9 0.1257 0.19970 0.2846 0.14760 0.2556 0.06828 0.929329 0.863652
89 14.60 23.29 93.97 664.7 0.08682 0.06636 0.08390 0.05271 0.1627 0.05416 ... 102.20 758.2 0.1312 0.15810 0.2675 0.13590 0.2477 0.06836 0.856296 0.733242
47 13.80 15.79 90.43 584.1 0.10070 0.12800 0.07789 0.05069 0.1662 0.06566 ... 110.30 812.4 0.1411 0.35420 0.2779 0.13830 0.2589 0.10300 0.832785 0.693531
137 14.22 27.85 92.55 623.9 0.08223 0.10390 0.11030 0.04408 0.1342 0.06129 ... 102.50 764.0 0.1081 0.24260 0.3064 0.08219 0.1890 0.07796 0.707592 0.500686
73 11.80 16.58 78.99 432.0 0.10910 0.17000 0.16590 0.07415 0.2678 0.07371 ... 91.93 591.7 0.1385 0.40920 0.4504 0.18650 0.5774 0.10300 0.703641 0.495111
130 14.99 22.11 97.53 693.7 0.08515 0.10250 0.06859 0.03876 0.1944 0.05913 ... 110.20 867.1 0.1077 0.33450 0.3114 0.13080 0.3163 0.09251 0.646361 0.417782
111 13.27 14.76 84.74 551.7 0.07355 0.05055 0.03261 0.02648 0.1386 0.05318 ... 104.50 830.6 0.1006 0.12380 0.1350 0.10010 0.2027 0.06206 0.471311 0.222134
119 16.25 19.51 109.80 815.8 0.10260 0.18930 0.22360 0.09194 0.2151 0.06578 ... 122.10 939.7 0.1377 0.44620 0.5897 0.17750 0.3318 0.09136 0.467160 0.218238
25 13.90 19.24 88.73 602.9 0.07991 0.05326 0.02995 0.02070 0.1579 0.05594 ... 104.40 830.5 0.1064 0.14150 0.1673 0.08150 0.2356 0.07603 0.344350 0.118577

10 rows × 32 columns

rule: compactness_error > 0.0203649997711 & worst_fractal_dimension <= 0.113150000572 & area_error > 23.2399997711 & radius_error <= 0.339100003242
support: 0.04195804195804196
coef: 0.26554881554597404
estimated error effect (support x coef): 0.011141908344586324
rule MAE: 0.7144778689742627
rule RMSE: 0.7290404836131931
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension y_error y_sq_error
83 15.37 22.76 100.20 728.2 0.09200 0.1036 0.11220 0.07483 0.1717 0.06097 ... 107.50 830.9 0.1257 0.1997 0.2846 0.14760 0.2556 0.06828 0.929329 0.863652
47 13.80 15.79 90.43 584.1 0.10070 0.1280 0.07789 0.05069 0.1662 0.06566 ... 110.30 812.4 0.1411 0.3542 0.2779 0.13830 0.2589 0.10300 0.832785 0.693531
137 14.22 27.85 92.55 623.9 0.08223 0.1039 0.11030 0.04408 0.1342 0.06129 ... 102.50 764.0 0.1081 0.2426 0.3064 0.08219 0.1890 0.07796 0.707592 0.500686
73 11.80 16.58 78.99 432.0 0.10910 0.1700 0.16590 0.07415 0.2678 0.07371 ... 91.93 591.7 0.1385 0.4092 0.4504 0.18650 0.5774 0.10300 0.703641 0.495111
130 14.99 22.11 97.53 693.7 0.08515 0.1025 0.06859 0.03876 0.1944 0.05913 ... 110.20 867.1 0.1077 0.3345 0.3114 0.13080 0.3163 0.09251 0.646361 0.417782
119 16.25 19.51 109.80 815.8 0.10260 0.1893 0.22360 0.09194 0.2151 0.06578 ... 122.10 939.7 0.1377 0.4462 0.5897 0.17750 0.3318 0.09136 0.467160 0.218238

6 rows × 32 columns

rule: worst_area > 548.650024414 & area_error > 23.2350006104 & worst_area <= 976.25
support: 0.22377622377622378
coef: 0.02893728555291459
estimated error effect (support x coef): 0.006475476487365502
rule MAE: 0.2578116453233828
rule RMSE: 0.40903437023906347
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension y_error y_sq_error
91 11.76 18.14 75.00 431.1 0.09968 0.05914 0.02685 0.03515 0.1619 0.06287 ... 85.10 553.6 0.1137 0.07974 0.0612 0.07160 0.1978 0.06915 0.974325 0.949308
83 15.37 22.76 100.20 728.2 0.09200 0.10360 0.11220 0.07483 0.1717 0.06097 ... 107.50 830.9 0.1257 0.19970 0.2846 0.14760 0.2556 0.06828 0.929329 0.863652
89 14.60 23.29 93.97 664.7 0.08682 0.06636 0.08390 0.05271 0.1627 0.05416 ... 102.20 758.2 0.1312 0.15810 0.2675 0.13590 0.2477 0.06836 0.856296 0.733242
47 13.80 15.79 90.43 584.1 0.10070 0.12800 0.07789 0.05069 0.1662 0.06566 ... 110.30 812.4 0.1411 0.35420 0.2779 0.13830 0.2589 0.10300 0.832785 0.693531
137 14.22 27.85 92.55 623.9 0.08223 0.10390 0.11030 0.04408 0.1342 0.06129 ... 102.50 764.0 0.1081 0.24260 0.3064 0.08219 0.1890 0.07796 0.707592 0.500686
73 11.80 16.58 78.99 432.0 0.10910 0.17000 0.16590 0.07415 0.2678 0.07371 ... 91.93 591.7 0.1385 0.40920 0.4504 0.18650 0.5774 0.10300 0.703641 0.495111
130 14.99 22.11 97.53 693.7 0.08515 0.10250 0.06859 0.03876 0.1944 0.05913 ... 110.20 867.1 0.1077 0.33450 0.3114 0.13080 0.3163 0.09251 0.646361 0.417782
111 13.27 14.76 84.74 551.7 0.07355 0.05055 0.03261 0.02648 0.1386 0.05318 ... 104.50 830.6 0.1006 0.12380 0.1350 0.10010 0.2027 0.06206 0.471311 0.222134
119 16.25 19.51 109.80 815.8 0.10260 0.18930 0.22360 0.09194 0.2151 0.06578 ... 122.10 939.7 0.1377 0.44620 0.5897 0.17750 0.3318 0.09136 0.467160 0.218238
25 13.90 19.24 88.73 602.9 0.07991 0.05326 0.02995 0.02070 0.1579 0.05594 ... 104.40 830.5 0.1064 0.14150 0.1673 0.08150 0.2356 0.07603 0.344350 0.118577

10 rows × 32 columns
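
As a sanity check on the report, the MAE and RMSE printed for the second rule can be approximately reproduced from the six y_error values shown in its table. The values below are copied from that table, so they are rounded to six decimal places and the results will only match to a few digits.

```python
import math

# y_error values displayed for the six observations matched by the second rule
# (rounded to six decimal places in the table)
y_error = [0.929329, 0.832785, 0.707592, 0.703641, 0.646361, 0.467160]

# mean absolute error and root mean squared error over the rule's subset
mae = sum(y_error) / len(y_error)
rmse = math.sqrt(sum(e ** 2 for e in y_error) / len(y_error))

print('rule MAE:', round(mae, 4))
print('rule RMSE:', round(rmse, 4))
```

Both figures sit far above the overall mean absolute error of about 0.07, which is what we would hope to see from a rule that isolates a small group of observations the model consistently gets wrong.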