Automating error analysis with RuleFit models
When building machine learning models, the goal is generally to improve the performance of a model according to some performance metric. One of the simplest metrics is error. Error is simply the complement of model accuracy - so a model with 95% accuracy has 5% error.
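As a toy illustration (made-up numbers, not from any model in this post):
# error is simply the complement of accuracy
accuracy = 0.95
error = 1 - accuracy
print('error:', error)  # 0.05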
There are many ways to improve the performance of a model and, in turn, decrease its error. These include adding more training observations (rows), enriching training observations with more features (columns), modifying the model algorithm, or optimising the algorithm's parameters.
Reducing error by introducing new features¶
In the post linked above we looked at how to optimise the parameters of a given algorithm, so for now we're interested in what we can do with the data itself.
While there are many ways to collect more training observations, this is often infeasible due to considerations of cost (imagine the expense of high-end medical studies) and time (such as waiting for enough events to occur).
The next option we have is to introduce new features to the observations already in our model. There are often lots of different approaches here too, including:
- engineering new features based on existing features (see the brief sketch after this list)
- creating new features from available data not already used
- making more data available (such as from external providers)
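To make the first option concrete, here is a minimal, hypothetical sketch of engineering a new feature from existing columns of the breast cancer data used later in this post (the choice of columns and the new feature name are purely illustrative assumptions):
import pandas as pd
from sklearn.datasets import load_breast_cancer
# load the same dataset used later in this post
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# hypothetical engineered feature: the ratio of two existing columns
df['mean_area_to_perimeter'] = df['mean area'] / df['mean perimeter']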
The goal of this post is not to assess which of these approaches to take, but to explore a method for identifying where the gaps are, so you can weigh up all of the options available to you.
Modelling error analysis¶
To get started, let's install an implementation of RuleFit from GitHub using pip:
pip install git+https://github.com/christophM/rulefit
Now we're going to load up a sample data set to work on, partitioning it into data for training our initial model, and data for testing its performance. Note that feature names will be important for this exercise.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# load our data - we also care about feature names
data = load_breast_cancer()
data.feature_names = list(map(lambda s: s.replace(' ', '_'), data.feature_names))
X, y = data.data, data.target
# split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    test_size=0.25,
                                                    random_state=1234)  # for reproducibility
With our data ready, let's build a quick logistic regression model on the training data. We're also going to generate predictions for our test data as probabilities of the positive class (that is, the likelihood of the class label being True).
from sklearn.linear_model import LogisticRegression
import numpy as np
# define our model
model = LogisticRegression(random_state=1234)
# fit our model
model.fit(X_train, y_train)
# generate some predictions
y_hat = model.predict_proba(X_test)[:,1]
At the start of this post we discussed model error, so let's now calculate this for our model to see how much room for improvement there is.
# calculate the error on each observation in the test set
y_error = np.absolute(y_test - y_hat)
# is there much room for improvement?
print('model error:', y_error.mean())
It looks like there is almost 7% mean absolute error. Maybe we can find some good leads for improving on this?
To do so, we're going to create a new model using the RuleFit class, but instead of targeting the original class label y, we're going to model the absolute error of each observation. The absolute error is the difference between the discrete actual value of y (0 or 1) and the continuous positive-class probability we predicted (0.0 to 1.0).
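As a quick worked example (with made-up numbers, not values from our test set):
# a confident, correct prediction gives a small error
abs(1 - 0.93)   # 0.07
# a confident but wrong prediction gives a large error
abs(1 - 0.10)   # 0.90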
from rulefit import RuleFit
from sklearn.ensemble import GradientBoostingRegressor
# define and fit our shiny new RuleFit model
generator_params = {
    'max_depth': 5,  # control the complexity of our rules
    'n_estimators': 1000,
    'learning_rate': 0.003,
    'random_state': 1234
}
generator = GradientBoostingRegressor(**generator_params)
rf = RuleFit(tree_generator=generator)
rf.fit(X_test, y_error, feature_names=data.feature_names)
With the RuleFit model fitted to our errors, we can generate a set of rules that might help us to isolate areas of our data that need enriching with new features. RuleFit also generates rules for data where the model is performing well (these have negative coefficients against the error), but we can ignore them for error analysis by filtering for coef > 0.
If we multiply the coefficient and support values calculated by RuleFit, we can use that as a rough estimate of how much error is due to that subset of the data. By summing these estimates, we get an approximate amount of error explained by these rules. This will differ from the model error above simply because our rules may not perfectly fit our errors (that is, our error model has its own error).
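One quick way to sanity-check that (a minimal sketch, assuming the fitted rf object above and the rulefit package's predict method) is to compare the error model's predictions against the observed errors:
# how well does our error model fit the errors themselves?
y_error_hat = rf.predict(X_test)
print('error-model MAE:', np.absolute(y_error - y_error_hat).mean())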
import pandas as pd
# get the outputs
rules = rf.get_rules()
# remove the rules we're not interested in - if the coefficient isn't above 0
# the rule is not a good indicator of an area for improvement
rules = rules[(rules.coef > 0.) & (rules.type != 'linear')]
# we can estimate an effect for each rule on the error score from above by
# multiplying the coefficient and support values
rules['effect'] = rules['coef'] * rules['support']
print('modelled error:', rules['effect'].sum())
print('unexplained error:', np.maximum(y_error.mean() - rules['effect'].sum(), 0))
Let's take a look at the top 10 rules:
# display the top 10 rules by effect
pd.set_option('display.max_colwidth', None)  # show the full rule strings
rules.nlargest(10, 'effect')
To wrap up, let's produce a report of the top 3 rules, including up to 10 examples from the data to which the rules apply.
This report can be used in conjunction with subject-matter expertise on the data to isolate areas for feature enrichment, and so improve your model!
from IPython.display import display
import pandas as pd
# prepare a dataframe for use below (we really care about the `query` function)
df = pd.DataFrame(X_test, columns=data.feature_names)
df['y_error'] = y_error
df['y_sq_error'] = y_error**2
for index, rule in rules.nlargest(3, 'effect').iterrows():
    print('rule:', rule['rule'])
    print('support:', rule['support'])
    print('coef:', rule['coef'])
    print('estimated error effect (support x coef):', rule['effect'])
    # it might be useful to compare the local error to the estimated model effect
    print('rule MAE:', df.query(rule['rule'])['y_error'].mean())
    print('rule RMSE:', df.query(rule['rule'])['y_sq_error'].mean()**(1/2))
    # we can use the rule to filter the data
    display(df.query(rule['rule']).nlargest(10, 'y_error'))
✨