Introduction to Classification using Logistic Regression with Scikit-Learn

This post, along with the source .ipynb notebook file, will serve as a basis for other topics. You can obtain a copy of the source by clicking the Source link at the top of this post.

To keep things simple, we're going to utilise one of the many toy datasets built into Scikit-Learn! (And yes, it is a real dataset.)
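
If you're curious, here's a quick peek at that dataset before we put it to work below (this snippet is just for context, it isn't part of the worked example):

from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset as a Bunch object
data = load_breast_cancer()

print(data.data.shape)         # (569, 30) - 569 samples, 30 numeric features
print(data.target_names)       # ['malignant' 'benign']
print(data.feature_names[:5])  # the first few feature names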

We're also not going to explain how Scikit-Learn's LogisticRegression is implemented in this post.

To structure our code, we will define our model in two parts:

  • The code we need to fit our model
  • The code we need to use our fitted model to generate predictions

When it comes to model building, these are the two main functional components. So, for reasons which will be explained in other posts, we're going to build a Python class called CustomModel, with a method for each of these components:

In [1]:
from sklearn.linear_model import LogisticRegression

class CustomModel(object):
    
    def fit(self, X, y):

        # LogisticRegression accepts a number of parameters; you can read about them here:
        # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
        #
        # With the exception of `random_state`, each of these is the default value.
        
        model_params = {
            'penalty': 'l2',
            'dual': False,
            'tol': 0.0001,
            'C': 1.0,
            'fit_intercept': True,
            'intercept_scaling': 1,
            'class_weight': None,
            'random_state': 1234,    # Fixed to 1234 for reproducibility
            'solver': 'liblinear',
            'max_iter': 100,
            'multi_class': 'ovr',
            'verbose': 0,
            'warm_start': False,
            'n_jobs': 1
        }
    
        self.clf = LogisticRegression(**model_params)
        self.clf.fit(X, y)
        
        return self # fun fact: returning self enables method chaining i.e. .fit().predict()

    def predict(self, X):
        
        # We only want to output the probability of the positive case
        # (i.e. the second column returned by `predict_proba`):
    
        return self.clf.predict_proba(X)[:,1]
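
A quick aside on the `return self` at the end of `fit`: it means that, once we have some training and test data (we'll create `X_train`, `y_train` and `X_test` in a moment), the two steps could be chained into a single expression if you prefer:

# equivalent to calling fit() and then predict() separately
predictions = CustomModel().fit(X_train, y_train).predict(X_test)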

Now we're ready to use our model!

In the next section we're going to load the sample data discussed above, and divide it into two portions:

  • 75% for model fitting
  • 25% for predictions

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility

Now that we have everything we need, let's load up our model, fit it with the training data, and generate some predictions:

In [4]:
# load our model
model = CustomModel()

# fit our model
model.fit(X_train, y_train)

# generate some predictions
model.predict(X_test)
Out[4]:
array([  9.25168417e-01,   9.99922130e-01,   9.53635418e-01,
         9.88416588e-01,   9.97542577e-01,   9.95232506e-01,
         4.60659258e-02,   9.98390194e-01,   6.59002902e-10,
         2.76899836e-06,   8.30718694e-10,   9.63993586e-01,
         9.94157890e-01,   9.50980576e-01,   9.96974859e-01,
         6.97038792e-10,   9.99809391e-01,   9.96431765e-01,
         9.99363563e-01,   8.43800531e-06,   9.95502414e-01,
         7.77576547e-03,   1.12727716e-09,   3.40904102e-17,
         3.68627970e-09,   6.55649762e-01,   3.51723839e-03,
         9.97326888e-01,   9.98785233e-01,   9.97552026e-01,
         9.86350517e-01,   9.98844211e-01,   5.70842717e-04,
         9.87742427e-01,   9.19814189e-01,   9.78443649e-01,
         9.92882821e-01,   1.14676290e-02,   1.48817234e-01,
         9.98733024e-01,   4.13813658e-05,   9.93177003e-01,
         1.72319657e-10,   8.54534408e-01,   8.81187668e-01,
         9.97568264e-01,   9.98086681e-01,   8.32784885e-01,
         4.49929586e-11,   8.89087737e-01,   9.28259947e-01,
         9.91244116e-01,   9.94876558e-01,   1.51106510e-08,
         2.60668778e-01,   9.99597520e-01,   9.98940073e-01,
         9.99968817e-01,   9.91318570e-01,   8.29369844e-03,
         9.93238377e-01,   9.92431535e-01,   9.29775117e-01,
         9.99271713e-01,   9.96474598e-01,   2.41572863e-04,
         1.51376226e-11,   9.97330558e-01,   9.98831771e-01,
         4.79400697e-01,   9.99798779e-01,   3.57307727e-07,
         9.99656809e-01,   7.03641088e-01,   9.98247027e-01,
         9.96093354e-01,   9.99588791e-01,   2.58369708e-08,
         9.98136922e-01,   7.97865310e-03,   9.99065333e-01,
         9.98470351e-01,   9.94581260e-01,   9.29328694e-01,
         1.41996390e-02,   1.43214384e-04,   3.71155631e-05,
         4.45838811e-06,   9.13207438e-01,   8.56295696e-01,
         9.99467328e-01,   9.74324559e-01,   9.99328632e-01,
         2.91312374e-12,   1.00998256e-01,   9.86992421e-01,
         9.97149193e-01,   9.13815924e-01,   9.98807818e-01,
         9.84005486e-01,   3.17865443e-08,   2.30937811e-11,
         9.98036358e-01,   9.99532884e-01,   1.24075526e-03,
         9.98819765e-01,   9.99752279e-01,   8.53677349e-04,
         1.53192255e-01,   9.30832406e-01,   1.49723823e-05,
         5.28688983e-01,   1.48786146e-03,   9.92804571e-51,
         8.86447353e-01,   9.95516043e-01,   9.98554149e-01,
         1.75078944e-03,   9.99922978e-01,   4.67159833e-01,
         9.99825913e-01,   9.57716419e-01,   9.95069689e-01,
         9.98728887e-01,   7.49375338e-14,   9.92513330e-01,
         1.49918676e-02,   1.63977226e-02,   9.95785292e-01,
         9.56124754e-01,   3.53639065e-01,   9.96011137e-01,
         7.27728677e-33,   9.97779030e-01,   7.77872222e-02,
         9.90058068e-01,   9.80367925e-01,   2.92408222e-01,
         9.98164180e-01,   1.67926421e-01,   9.99996297e-01,
         6.35631576e-10,   1.06440027e-01])
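
Keep in mind these are probabilities of the positive class, not hard 0/1 class labels. As a rough sketch (the 0.5 cut-off is just an illustrative choice, and proper evaluation is a topic for other posts), you could threshold them and compare against the held-out `y_test`:

probs = model.predict(X_test)

# threshold the probabilities at 0.5 to get hard 0/1 labels
labels = (probs >= 0.5).astype(int)

# rough accuracy against the held-out labels
print((labels == y_test).mean())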